Romain Manni-Bucau created SPARK-54923:
------------------------------------------
Summary:
org.apache.spark.util.HadoopFSUtils.parallelListLeafFilesInternal parallelism
configuration not respected
Key: SPARK-54923
URL: https://issues.apache.org/jira/browse/SPARK-54923
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 4.0.1
Reporter: Romain Manni-Bucau
Hi,
assuming the parallelism threshold is not the blocker (or at least that
nothing else prevents parallelism),
and that an S3 bucket holds 10k+ files
but under a hierarchy shaped like root/subpath/{year}/...
then org.apache.spark.util.HadoopFSUtils.parallelListLeafFilesInternal will not
parallelize the listing: root has a single child, and once parallelize is
called, the workers get no SparkContext and a threshold of Int.MaxValue, two
criteria preventing any further parallelism on the workers.
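For illustration, a minimal way to hit this (bucket name and layout are made up; {{spark}} is an existing SparkSession):

{code:scala}
// Hypothetical layout: s3a://my-bucket/root/subpath/{year}/{month}/part-*.parquet
// Reading from "root" hands the file index a single path whose only
// child is "subpath", so the listing never crosses the parallelism
// threshold at the top level and stays serial on the driver.
val df = spark.read.parquet("s3a://my-bucket/root")
{code}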
While this seems intentional, it is a big performance killer: it prevents
using plain SQL over such layouts and forces you to supersede the listing
manually in the driver and pass explicit paths, along the lines of the sketch
below.
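A sketch of that workaround, assuming a SparkSession named {{spark}} and made-up bucket/prefix names:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Glob the {year} level ourselves on the driver so that Spark receives
// a list of paths wide enough for its own parallel listing to kick in.
val root = new Path("s3a://my-bucket/root/subpath")
val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
val yearDirs = fs.globStatus(new Path(root, "*"))
  .filter(_.isDirectory)
  .map(_.getPath.toString)
val df = spark.read.parquet(yearDirs: _*)
{code}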
Any hope the design gets rethought to add a config that forces the driver to
visit multiple levels before calling parallelize()? In the case above I would
say: visit 2 levels (3 actually, since the month level comes after the year),
then delegate to the workers, as sketched below.
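A rough illustration of what I mean; the helper below is hypothetical, not existing Spark API, and a real patch would hook something like it into parallelListLeafFilesInternal behind a new config (key name to be decided):

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: expand the given paths `depth` directory levels
// down before making the parallelize decision, so root/subpath/{year}
// becomes visible to the threshold check even though root has a single
// child. Files at intermediate levels are ignored for brevity.
def expandLevels(fs: FileSystem, paths: Seq[Path], depth: Int): Seq[Path] =
  if (depth <= 0) paths
  else expandLevels(
    fs,
    paths.flatMap(p => fs.listStatus(p).toSeq.filter(_.isDirectory).map(_.getPath)),
    depth - 1)
{code}

With depth = 2 (or 3 including the month level), the driver would collect the {year} (or {year}/{month}) directories before deciding whether to delegate to the workers.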
If needed, I would work on a patch.