[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-585908704

LGTM. There's one thing to be aware of: `path.getFileSystem(conf)` maps to `FileSystem.get(path.toUri(), conf)`, which looks in a shared cache for an existing instance (good), then instantiates one if there isn't one for that user + URI. If that FS instance takes a while to start up (i.e. `FileSystem.initialize()` is slow), then multiple threads may all end up creating instances, with all but one discarded afterwards. Hence https://github.com/apache/hadoop/pull/1838, which removes some network IO from `S3AFileSystem.initialize()` by giving you the option of not probing for the bucket's existence.

Does that mean there's anything wrong with this PR? No; only that performance is best if the relevant FS instances have already been preloaded into the FS cache, and that those of us implementing filesystem connectors should do a better job of low-latency instantiation, even if it means async network startup threads and moving the blocking to the first FS API call instead.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
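A hypothetical Python sketch (names are mine, not Hadoop's) of why a slow `initialize()` hurts under a get-then-cache pattern of this kind: every thread that misses the cache pays the full instantiation cost, even though only one instance is kept.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class SlowFS:
    """Stand-in for a FileSystem whose initialize() is slow (e.g. network IO)."""
    created = 0
    _count_lock = threading.Lock()

    def __init__(self, uri):
        with SlowFS._count_lock:
            SlowFS.created += 1
        time.sleep(0.05)  # model a slow FileSystem.initialize()
        self.uri = uri

cache = {}
cache_lock = threading.Lock()

def get_fs(uri):
    # Check-then-create, as in a FileSystem.get()-style cache: the slow
    # construction happens outside the lock, so concurrent cache misses
    # each build an instance and all but the winner are discarded.
    fs = cache.get(uri)
    if fs is None:
        fs = SlowFS(uri)                    # slow initialize() happens here
        with cache_lock:
            fs = cache.setdefault(uri, fs)  # losers throw their instance away
    return fs

with ThreadPoolExecutor(max_workers=8) as pool:
    instances = list(pool.map(lambda _: get_fs("s3a://bucket"), range(8)))

# Every caller ends up sharing the single cached instance, but SlowFS.created
# is typically > 1: the extra instantiations were wasted work.
assert len({id(fs) for fs in instances}) == 1
```

Preloading the cache (one `get_fs` call per unique URI before fanning out the parallel work) avoids the wasted instantiations, which is the "preloaded into the FS cache" point above.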
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-559112759

Quick follow-up to the "how many connections" discussion. It turns out that a sufficiently large number of S3 DNS lookups can trigger 503 throttling on DNS, and there isn't anything in the AWS library to react to this. The bigger the connection pool, the more connections you are likely to see on worker start-up: probably workers * S3A URLs * "fs.s3a.connection.maximum". Don't go overboard. And if you do see the problem, file a HADOOP JIRA with stack traces, configs and anything else which could help in implementing resilience here.

*that doesn't make any difference to this PR, just something to know*
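As a back-of-envelope illustration of that worst-case formula (every number here is invented for the example, not a measurement):

```python
# Hypothetical worst case for simultaneous connections (and hence DNS
# lookups) at worker start-up, per the estimate above:
#   workers * S3A URLs * fs.s3a.connection.maximum
workers = 200
s3a_urls = 3
fs_s3a_connection_maximum = 96

worst_case_connections = workers * s3a_urls * fs_s3a_connection_maximum
assert worst_case_connections == 57_600  # easily enough lookups to risk throttling
```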
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-552456614 This code LGTM: skips needless probes on the globbed paths; parallel checks on the others.
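A hypothetical Python sketch of the pattern being approved here (the function and names are mine, not Spark's): glob expansion already proves existence for wildcard paths, so only the plain paths need explicit probes, and those run on a thread pool.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor

# Hadoop's glob syntax also supports {a,b} alternation; Python's glob does not,
# but for this sketch we only need to detect "looks like a glob".
GLOB_CHARS = set("*?[{")

def check_and_glob(paths, max_threads=40):
    globbed = [p for p in paths if GLOB_CHARS & set(p)]
    literal = [p for p in paths if not (GLOB_CHARS & set(p))]

    resolved = []
    for pattern in globbed:
        matches = glob.glob(pattern)  # a non-empty result *is* the existence check
        if not matches:
            raise FileNotFoundError(f"Path does not exist: {pattern}")
        resolved.extend(sorted(matches))

    # the blocking exists() probes on plain paths run in parallel
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for path, ok in zip(literal, pool.map(os.path.exists, literal)):
            if not ok:
                raise FileNotFoundError(f"Path does not exist: {path}")
    resolved.extend(literal)
    return resolved
```

Against a slow object store, the win comes from the `pool.map` over the literal paths: each `exists()` is one or more blocking HTTP round trips, and they no longer run serially.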
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-552454930 I'd say 40 sounds good; people can tune it.
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-549465360

Nice experiment! I guess in EC2 you're limited by the number of cores, but latency is also nice and low. Remotely, latency is worse, so if there is anything we can do in parallel threads, there are some tangible benefits.

In both local and remote S3 interaction, rename() is faked with a COPY, which runs at 6-10MB/s; that can be done via the thread pool too, if you can configure the AWS SDK to split up a large copy into parallel parts. That shares the same pools, so it's useful to have some capacity there on any process renaming things.
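The "split a large copy into parallel parts" idea can be sketched like this (purely illustrative Python; in reality the server-side COPY calls are issued by the AWS SDK's transfer machinery, and each in-flight part ties up one connection from the shared pool):

```python
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 128 * 1024 * 1024  # illustrative part size: 128 MB

def plan_parts(total_bytes, part_size=PART_SIZE):
    """Split a copy of total_bytes into (start, end) byte ranges."""
    return [(start, min(start + part_size, total_bytes))
            for start in range(0, total_bytes, part_size)]

def parallel_copy(total_bytes, max_workers=8):
    def copy_part(byte_range):
        start, end = byte_range
        # a real implementation would issue one part-COPY request per range
        # here; this stub just reports how many bytes the part covers
        return end - start

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(copy_part, plan_parts(total_bytes)))

assert parallel_copy(1_000_000_000) == 1_000_000_000
assert len(plan_parts(1_000_000_000)) == 8  # eight parts of at most 128 MB
```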
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-536993443

> Update: I tried increasing fs.s3a.connection.maximum and it did improve performance of the filesystem calls.
> I still need to set up a benchmark that runs on EC2 instead of a remote dev laptop; will update in a couple of days.

Log the toString value of the FS instance at the end to see what the counters say.
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-536223420

> fs.s3a.connection.maximum

It's 30 AFAIK. I should revisit that, you know; there was never a reason for it other than that some uses can overload things (e.g. Hive creating many per-user instances).

* Impala runs with thousands; you need to bump up the thread pool too.
* If you have Spark workers and they all work with the same few buckets: go big.
* If you have Spark workers working with different buckets, balance the capacity.

Generally, for metadata ops (HEAD, LIST) and for the COPY ops used in rename, those connections don't overload the client; they are waiting for things to happen. It's the GETs and PUTs which use up bandwidth.

Why don't you submit a Hadoop PR which bumps the default value to some higher number which you can all agree on, and I'll review it. We could certainly go somewhere in the 64-100 range; above that gets harder to defend. (I'm ignoring actual throttling of S3 REST calls; they span your entire cluster.)
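A hypothetical model of what a bound like `fs.s3a.connection.maximum` does: the HTTP pool behaves like a semaphore, so when more tasks want a connection than the pool holds, the extras block and wait rather than overloading anything; undersizing costs concurrency, not correctness.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ConnectionPool:
    """Toy stand-in for an HTTP connection pool of bounded size."""
    def __init__(self, maximum):
        self._sem = threading.Semaphore(maximum)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak = 0

    def request(self, work):
        with self._sem:  # blocks here when all connections are in use
            with self._lock:
                self.in_use += 1
                self.peak = max(self.peak, self.in_use)
            try:
                return work()  # e.g. a HEAD/LIST that mostly waits on the server
            finally:
                with self._lock:
                    self.in_use -= 1

pool = ConnectionPool(maximum=4)
with ThreadPoolExecutor(max_workers=16) as ex:
    results = list(ex.map(lambda i: pool.request(lambda i=i: i), range(100)))

assert results == list(range(100))
assert pool.peak <= 4  # the cap held, however many threads competed
```

In this toy, 16 worker threads compete for 4 "connections": all 100 requests complete, but at most 4 are ever in flight, which is exactly the waiting-threads behaviour the comment describes for metadata ops.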
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-535903963

> it seems like the sweet spot is somewhere between 20-30 threads (for my environment anyways, 2015 macbook pro, i7/w 8 cores).

Interesting. You may get different numbers running in EC2; it's always best to benchmark perf there. Remote dev amplifies some performance issues (cost of reopening an HTTP connection, general latency) while hiding others (how easy it is for Spark jobs to overload S3 shards and so get throttled, cause delays, trigger speculative task execution, more throttling, etc.).

Try changing "fs.s3a.connection.maximum" from the default of 48 to something bigger. That's the limit on the HTTP pool size. It's a small number to stop a single S3A instance from overloading the system, but you may want to increase it. There's also "fs.s3a.max.total.tasks", which controls the thread pool size used for background writing of blocks of large files; in Hadoop trunk, parallel delete/rename operations; plus stuff in the AWS SDK itself.

* "fs.s3a.connection.maximum" should be > "fs.s3a.max.total.tasks"
* bump "fs.s3a.threads.keepalivetime" from 60 to 300 to keep those connections around for longer (avoids that HTTPS overhead)

Try with some bigger numbers and see if you get the same results; your scanning threads may just be blocking on the HTTP connection pool.

For bonus fun, force random IO for ORC/Parquet perf, but with remote reads set the minimum block to read to 256K or bigger:

```
spark.hadoop.fs.s3a.readahead.range 256K
spark.hadoop.fs.s3a.input.fadvise random
```

Note: Java 8's default SSL encryption underperforms. We've been doing work there, but it's too early to think about backporting it. I'm planning to do a refresh of the s3a connector for Hadoop 3.2.2 which should include it (https://github.com/apache/hadoop/pull/970). For now, look at [stack overflow](https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20).
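Pulling the knobs from this thread together, one possible starting point for `spark-defaults.conf` (the values are illustrative, not recommendations; benchmark in your own environment, and note the constraint that the connection maximum should exceed the task count):

```
spark.hadoop.fs.s3a.connection.maximum    96
spark.hadoop.fs.s3a.max.total.tasks       64
spark.hadoop.fs.s3a.threads.keepalivetime 300
spark.hadoop.fs.s3a.readahead.range       256K
spark.hadoop.fs.s3a.input.fadvise         random
```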