Romain Manni-Bucau created SPARK-54923:
------------------------------------------

             Summary: 
org.apache.spark.util.HadoopFSUtils.parallelListLeafFilesInternal parallelism 
configuration not respected
                 Key: SPARK-54923
                 URL: https://issues.apache.org/jira/browse/SPARK-54923
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 4.0.1
            Reporter: Romain Manni-Bucau


Hi,

 

assuming there is no parallelism threshold (or at least nothing preventing parallelism)

and that an S3 bucket has 10k+ files

but with a hierarchy which looks like root/subpath/{year}/...



then org.apache.spark.util.HadoopFSUtils.parallelListLeafFilesInternal will not 
parallelize, because root has a single child (and once parallelize is called, the 
workers don't get a SparkContext and receive a threshold of Int.MaxValue, two 
criteria preventing any parallelism on the workers).

While this seems intentional, it is a big performance killer: it prevents using 
SQL directly and forces one to replace Spark's listing with a manual one in the 
driver and pass the paths explicitly, as sketched below.
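For reference, the workaround looks like this: expand the first levels in the 
driver with the Hadoop FileSystem API and pass the resulting paths explicitly. 
A minimal sketch; the bucket name, the Parquet format and the two expanded 
levels are assumptions for illustration:

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val root = new Path("s3a://bucket/root/subpath")
    val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // Expand {year}/{month} manually so Spark receives many input paths
    // instead of a single root with one child.
    val leafDirs = fs.listStatus(root)               // {year} level
      .filter(_.isDirectory)
      .flatMap(y => fs.listStatus(y.getPath))        // {month} level
      .filter(_.isDirectory)
      .map(_.getPath.toString)

    // With many paths, the parallel listing can actually kick in.
    val df = spark.read.parquet(leafDirs: _*)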

Any hope the design gets rethought to add a config forcing the driver to visit 
multiple levels before calling parallelize()? In the case above, I could say: 
visit 2 levels (3 actually, since there is the month level after the year), then 
delegate to the workers.
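For example (purely illustrative, this configuration key does not exist today):

    // Hypothetical knob, named here for illustration only: tell the driver
    // to descend 2 levels itself before delegating listing to the workers.
    spark.conf.set("spark.sql.hypothetical.driverListingDepth", "2")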

If needed, I would work on a patch.


