[jira] [Commented] (SPARK-21056) InMemoryFileIndex.listLeafFiles should create at most one spark job when listing files in parallel

Ohad Raviv (Jira) Fri, 09 May 2025 00:54:25 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-21056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950467#comment-17950467
 ]


Ohad Raviv commented on SPARK-21056:
------------------------------------

Hi! I know it's been a while :D, but we've recently discovered significant 
resource waste related to this issue. The root cause appears to be the folder 
structure in S3 ({{/year/month/day/hour}}), which led the driver to 
perform repeated serial listings over an extended period. This resulted in 
prolonged job durations and inefficient use of resources.

We found that adjusting the 
{{spark.sql.sources.parallelPartitionDiscovery.threshold}} parameter to offload 
some of the listing work to executors does help. However, there’s still 
considerable back-and-forth between the driver and executors, as the driver's 
serial loops continue to generate a large number of jobs.
 
Any idea why this solution wasn't accepted?

> InMemoryFileIndex.listLeafFiles should create at most one spark job when 
> listing files in parallel
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21056
>                 URL: https://issues.apache.org/jira/browse/SPARK-21056
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.1, 2.2.0
>            Reporter: Bertrand Bossy
>            Priority: Major
>              Labels: bulk-closed
>
> Given partitioned file relation (e.g. parquet):
> {code}
> root/a=../b=../c=..
> {code}
> InMemoryFileIndex.listLeafFiles runs numberOfPartitions(a) times 
> numberOfPartitions(b) Spark jobs sequentially to list leaf files if both 
> numberOfPartitions(a) and numberOfPartitions(b) are below 
> {{spark.sql.sources.parallelPartitionDiscovery.threshold}} and 
> numberOfPartitions(c) is above 
> {{spark.sql.sources.parallelPartitionDiscovery.threshold}}.
> Since the jobs run sequentially, their overhead dominates and 
> the file listing operation can become significantly slower than listing the 
> files from the driver.
> I propose that InMemoryFileIndex.listLeafFiles should launch at most one 
> spark job for listing leaf files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
