[ https://issues.apache.org/jira/browse/SPARK-21056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950467#comment-17950467 ]
Ohad Raviv commented on SPARK-21056:
------------------------------------

Hi! I know it's been a while :D, but we've recently discovered significant resource waste related to this issue.

The root cause appears to be the folder structure in S3 ({{/year/month/day/hour}}), which led the driver to perform repeated serial listings over an extended period. This resulted in prolonged job durations and inefficient use of resources.

We found that adjusting the {{threshold}} parameter to offload some of the listing work to executors does help. However, there's still considerable back-and-forth between the driver and executors, as the driver's serial loops continue to generate a large number of jobs.

Any idea why this solution wasn't accepted?

> InMemoryFileIndex.listLeafFiles should create at most one spark job when listing files in parallel
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21056
>                 URL: https://issues.apache.org/jira/browse/SPARK-21056
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.1, 2.2.0
>            Reporter: Bertrand Bossy
>            Priority: Major
>              Labels: bulk-closed
>
> Given partitioned file relation (e.g. parquet):
> {code}
> root/a=../b=../c=..
> {code}
> InMemoryFileIndex.listLeafFiles runs numberOfPartitions(a) times
> numberOfPartitions(b) spark jobs sequentially to list leaf files, if both
> numberOfPartitions(a) and numberOfPartitions(b) are below
> {{spark.sql.sources.parallelPartitionDiscovery.threshold}} and
> numberOfPartitions(c) is above
> {{spark.sql.sources.parallelPartitionDiscovery.threshold}}
> Since the jobs are run sequentially, the overhead of the jobs dominates and
> the file listing operation can become significantly slower than listing the
> files from the driver.
> I propose that InMemoryFileIndex.listLeafFiles should launch at most one
> spark job for listing leaf files.
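
To make the job count in the quoted description concrete, here is a minimal pure-Python sketch of the listing behaviour as the issue describes it. It is a simplified model, not Spark's actual code: the function name `count_listing_jobs` and the `fanouts` argument are hypothetical. It assumes that a parallel job, once launched, lists everything beneath its paths without further driver-side jobs. The default of 32 is, to my knowledge, the documented default of {{spark.sql.sources.parallelPartitionDiscovery.threshold}}; please verify it for your Spark version.

```python
def count_listing_jobs(fanouts, threshold=32):
    """Count Spark jobs launched under a simplified model of the listing.

    fanouts[i] is the number of child directories under each directory at
    depth i (hypothetical input, e.g. [nA, nB, nC] for root/a=../b=../c=..).
    """

    def bulk_list(num_paths, depth):
        # More sibling paths than the threshold: the driver launches one
        # Spark job for them. Assumption: that job covers the whole subtree.
        if num_paths > threshold:
            return 1
        # No deeper directory level: the driver lists files itself, no job.
        if depth >= len(fanouts):
            return 0
        # Otherwise the driver walks each path serially and repeats the
        # decision for that path's children.
        return num_paths * bulk_list(fanouts[depth], depth + 1)

    return bulk_list(1, 0)


# a=10 and b=20 are below the threshold, c=100 is above it: 10 * 20 jobs.
jobs_abc = count_listing_jobs([10, 20, 100])

# Illustrative year/month/day/hour layout: every level is below 32, so the
# driver lists everything serially and launches no job at all.
jobs_default = count_listing_jobs([3, 12, 30, 24])

# Lowering the threshold to 20 moves the day level to executors, but the
# driver still loops over 3 * 12 months and launches one job per month.
jobs_lowered = count_listing_jobs([3, 12, 30, 24], threshold=20)
```

Under this model, lowering the threshold trades serial driver listing for many small sequential jobs, which matches the back-and-forth described in the comment. A single job over all leaf directories, as the issue proposes, would avoid both.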