[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2020-02-13 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-585908704
 
 
   LGTM,
   
   
   There's one thing to be aware of, which is that `Path.getFileSystem(conf)` maps to `FileSystem.get(path.toUri(), conf)`, which looks in a shared cache for existing instances (good), then instantiates one if there isn't one for that user + URI. And if that FS instance takes a while to start up (i.e. `FileSystem.initialize()` is slow), then multiple threads will all end up trying to create instances, then discard all but one afterwards.
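   A minimal sketch of that cache-miss race (not Hadoop's actual code; `slowCreate` is a hypothetical stand-in for constructing and initializing a FileSystem instance):

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of the get-or-create caching pattern described above:
// every thread that misses the cache builds its own instance, only the
// first to publish wins, and the losers discard their freshly built copy.
public class FsCacheSketch {
    static final AtomicInteger instancesCreated = new AtomicInteger();
    static final Map<URI, Object> cache = new ConcurrentHashMap<>();

    // Stand-in for a slow FileSystem.initialize().
    static Object slowCreate(URI uri) {
        instancesCreated.incrementAndGet();
        return new Object();
    }

    static Object get(URI uri) {
        Object fs = cache.get(uri);
        if (fs != null) return fs;            // cache hit: cheap
        Object created = slowCreate(uri);     // cache miss: every racing thread pays this
        Object winner = cache.putIfAbsent(uri, created);
        return winner != null ? winner : created; // losers throw theirs away
    }

    public static void main(String[] args) {
        URI uri = URI.create("s3a://bucket/");
        Thread[] ts = new Thread[8];
        for (int i = 0; i < ts.length; i++) ts[i] = new Thread(() -> get(uri));
        for (Thread t : ts) t.start();
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        // instancesCreated may be anywhere from 1 to 8 depending on timing;
        // the cache always ends up holding exactly one instance.
        System.out.println("created: " + instancesCreated.get() + ", cached: " + cache.size());
    }
}
```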
   
   Hence https://github.com/apache/hadoop/pull/1838, which removes some network IO from `S3AFileSystem.initialize()` by giving you the option of not bothering to check whether the bucket exists.
   
   Does that mean there's anything wrong with this PR? No; only that performance is best if the relevant FS instances have already been preloaded into the FS cache, and that those of us implementing filesystem connectors should do a better job of low-latency instantiation, even if it means async network startup threads and moving the blocking to the first FS API call instead.
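   That async-startup idea could look roughly like this; a hypothetical sketch, with `probeBucket` standing in for whatever network IO `initialize()` currently blocks on:

```java
import java.util.concurrent.CompletableFuture;

// Sketch only, not the S3A implementation: initialize() kicks off the
// expensive probe in the background and returns immediately; the first
// real API call pays (at most) the remaining probe latency.
public class LazyInitFs {
    private CompletableFuture<String> bucketProbe;

    void initialize() {
        // returns immediately; the probe runs on a background thread
        bucketProbe = CompletableFuture.supplyAsync(LazyInitFs::probeBucket);
    }

    static String probeBucket() {
        // stand-in for a network call such as a bucket-existence check
        return "bucket-ok";
    }

    String getFileStatus(String path) {
        // join() blocks only if the probe hasn't finished yet
        return bucketProbe.join() + ":" + path;
    }

    public static void main(String[] args) {
        LazyInitFs fs = new LazyInitFs();
        fs.initialize(); // no blocking here
        System.out.println(fs.getFileStatus("/table/part-0000")); // prints "bucket-ok:/table/part-0000"
    }
}
```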
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-11-27 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-559112759
 
 
   Quick follow up to the "how many connections" discussion. 
   
   It turns out that a sufficiently large number of S3 DNS lookups can trigger 503 throttling on DNS, and there isn't anything in the AWS library to react to this. The bigger the connection pool, the more connections you are likely to see on worker start-up: probably workers * S3A URLs * `fs.s3a.connection.maximum`. Don't go overboard. And if you do see the problem, file a HADOOP JIRA with stack traces, configs and anything else which could help implement resilience here.
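   That worst-case product, spelled out (the formula is from the comment above; the 200 workers / 2 buckets / pool-of-96 figures below are made-up example values):

```java
// Back-of-envelope estimate for the worst-case connection count:
// workers * distinct S3A URIs * fs.s3a.connection.maximum.
public class ConnEstimate {
    static long worstCaseConnections(int workers, int s3aUris, int connectionMaximum) {
        return (long) workers * s3aUris * connectionMaximum;
    }

    public static void main(String[] args) {
        // hypothetical cluster: 200 workers, 2 buckets, pool of 96 per FS instance
        System.out.println(worstCaseConnections(200, 2, 96)); // prints 38400
    }
}
```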
   
   *that doesn't make any difference to this PR, just something to know*





[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-11-11 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-552456614
 
 
   This code LGTM: skips needless probes on the globbed paths; parallel checks 
on the others.





[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-11-11 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-552454930
 
 
   I'd say 40 sounds good; people can tune it





[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-11-04 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-549465360
 
 
   Nice experiment!
   
   I guess in-EC2 you're limited by the number of cores, but latency is also nice and low. Remotely, latency is worse, so if there is anything we can do in parallel threads, there are some tangible benefits.
   
   In both local and remote S3 interaction, rename() is faked with a COPY, which runs at 6-10 MB/s; that can be done via the thread pool too, if you can configure the AWS SDK to split a large copy into parallel parts. That shares the same pools, so it's useful to have some capacity there on any process renaming things.





[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-10-01 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536993443
 
 
   > Update: I tried increasing fs.s3a.connection.maximum and it did improve 
performance of the filesystem calls.
   
    
   
   > I still need to set up a benchmark that runs on EC2 instead of remote dev 
laptop, will update in a couple days.
   
   Log the toString value of the FS instance at the end to see what the counters say.





[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-09-28 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536223420
 
 
   > fs.s3a.connection.maximum 
   
   It's 30, AFAIK. I should revisit that, you know; there was never a reason for it other than that some uses can overload things (e.g. Hive creating many per-user instances).
   
   * Impala runs with thousands; you need to bump up the thread pool too. 
   * if you have spark workers and they all work with the same few buckets: go 
big
   * if you have spark workers working with different buckets, balance the 
capacity
   
   Generally, for metadata ops (HEAD, LIST) and for the copy ops used in rename, those connections don't overload the client; they are just waiting for things to happen. It's the GET and PUT which use up bandwidth.
   
   Why don't you submit a Hadoop PR which bumps the default value to some higher number which you can all agree on, and I'll review it. We could certainly go somewhere in the 64-100 range; above that gets harder to defend.
   
   (I'm ignoring actual throttling of S3 REST calls; they span your entire 
cluster)





[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-09-27 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize 
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-535903963
 
 
   >  it seems like the sweet spot is somewhere between 20-30 threads (for my 
environment anyways, 2015 macbook pro, i7/w 8 cores).
   
   Interesting. You may get different numbers running in EC2; it's always best to benchmark performance there. Remote dev amplifies some performance issues (the cost of reopening an HTTP connection, general latency) while hiding others (how easy it is for Spark jobs to overload S3 shards and so get throttled, cause delays, trigger speculative task execution, more throttling, and so on).
   
   Try changing "fs.s3a.connection.maximum" from the default of 48 to something bigger. That's the limit on the HTTP pool size. It's a small number to stop a single S3A instance from overloading the system, but you may want to increase it. There's also "fs.s3a.max.total.tasks", which controls the thread pool size used for background writing of blocks of large files; in Hadoop trunk it is also used for parallel delete/rename operations, plus stuff in the AWS SDK itself.
   
   * "fs.s3a.connection.maximum" should be > than "fs.s3a.max.total.tasks"
   * "fs.s3a.threads.keepalivetime" from 60 to 300 to keep those connections 
around for longer (avoids that https overhead)
   
   Try with some bigger numbers and see if you get the same results; your scanning threads may just be blocking on the HTTP connection pool.
   
   For bonus fun, force random IO for ORC/Parquet performance; with remote reads, set the minimum block to read to 256K or bigger:
   
   ```
   spark.hadoop.fs.s3a.readahead.range 256K
   spark.hadoop.fs.s3a.input.fadvise random
   ```
   
   Note: Java 8's default SSL encryption underperforms. We've been doing work there, but it's too early to think about backporting it. I'm planning a refresh of the s3a connector for Hadoop 3.2.2 which should include it (https://github.com/apache/hadoop/pull/970).
   For now, look at [stack overflow](https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20).

