[GitHub] [spark] holdenk commented on pull request #29179: [WIP][SPARK-32381][CORE][SQL] Explore allowing parallel listing & non-location sensitive listing in core

GitBox Thu, 23 Jul 2020 15:07:17 -0700


holdenk commented on pull request #29179:
URL: https://github.com/apache/spark/pull/29179#issuecomment-663255596



   > There's potential here, I'm curious about the numbers
   > 
   > * try to do incremental result generation through remote iterators, yield, etc. That way, the ability to asynchronously fetch the next page of results from the store while the app is going through its first page (https://issues.apache.org/jira/browse/HADOOP-17074) can deliver tangible benefits. You also avoid the out-of-memory problems that come from representing directories with a few million files as arrays of a few million FileStatus entries.
   > * use listLocatedStatus everywhere and I'll see about getting the Azure developers to speed up the abfs implementation
   So I also want to support querying HDFS where we know it's disaggregated.
   > * in [apache/hadoop#2069](https://github.com/apache/hadoop/pull/2069) the S3A remote iterators implement the new IOStatistics API; LocatedFileStatusFetcher will collect and aggregate the results. If you use the iterator APIs it should be possible to do the same thing
   Interesting. Is this specific to the S3A impl or is there a higher base class? I want to make it work with multiple file formats if possible.
   > * even without that, if you can collect/report listing times that could be useful.
   So we'll semi-implicitly have that from the job statistics, but not in a very easy-to-access form. I could use an accumulator to keep track of it, which would also work with the multi-worker fan-out.
   > 
   > Finally, is the idea here to actually push the scan out across the cluster or just to do it multithreaded in the Spark driver process?
   
   The idea here is to push it out to the workers (in part for per-host rate limiting), but also to match the code we have on the SQL side so we have less maintenance cost.
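
For reference, the incremental listing suggested in the first quoted point could look roughly like the sketch below. It is a minimal illustration, not code from this PR: `PrefetchingLister` and `fakeStore` are hypothetical names, the "store" is an in-memory list split into pages, and the class only mimics the shape of a paged remote iterator rather than using the Hadoop `RemoteIterator` API. The assumption is that the store signals the end of the listing with an empty page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/**
 * Hypothetical paged lister: while the caller walks page N, page N+1 is
 * already being fetched in the background. Entries are plain strings here;
 * a real listing would carry file status objects.
 */
class PrefetchingLister implements Iterator<String> {
    // Assumed contract: returns the entries of the given page, or an empty list past the end.
    private final IntFunction<List<String>> fetchPage;
    private Iterator<String> current;
    private CompletableFuture<List<String>> next;
    private int nextIndex = 1;

    PrefetchingLister(IntFunction<List<String>> fetchPage) {
        this.fetchPage = fetchPage;
        this.current = fetchPage.apply(0).iterator(); // first page is fetched synchronously
        this.next = prefetch(nextIndex);               // second page starts loading right away
    }

    private CompletableFuture<List<String>> prefetch(int page) {
        return CompletableFuture.supplyAsync(() -> fetchPage.apply(page));
    }

    @Override
    public boolean hasNext() {
        while (!current.hasNext()) {
            List<String> page = next.join();   // wait for the page requested earlier
            if (page.isEmpty()) {
                return false;                  // empty page marks the end of the listing
            }
            current = page.iterator();
            nextIndex++;
            next = prefetch(nextIndex);        // overlap the next fetch with consumption
        }
        return true;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    /** In-memory stand-in for an object store that returns listings page by page. */
    static IntFunction<List<String>> fakeStore(List<String> names, int pageSize) {
        return page -> {
            int from = Math.min(page * pageSize, names.size());
            int to = Math.min(from + pageSize, names.size());
            return new ArrayList<>(names.subList(from, to));
        };
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "b", "c", "d", "e");
        Iterator<String> it = new PrefetchingLister(fakeStore(names, 2));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Only one page plus the prefetched one is held in memory at a time, which is what avoids the multi-million-entry arrays mentioned in the quote. Timing each `fetchPage` call at this point would also be one place to collect the listing times discussed above.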


