Enrico Minack created SPARK-50417:
-------------------------------------

             Summary: Limit number of subdirectories that FallbackStorage 
creates per shuffle
                 Key: SPARK-50417
                 URL: https://issues.apache.org/jira/browse/SPARK-50417
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Enrico Minack


The {{FallbackStorage}} copies shuffle data during executor decommissioning to 
a distributed or cloud storage like S3 or HDFS. In 
https://github.com/apache/spark/pull/34762, a hash has been added to the path 
of a file in order to reduce the number of files per directory (prefix in S3 
terms). This creates as many directories per shuffle as files are transferred, 
while each directory contains a single file.

While this might be useful for S3, it may pose challenges for other 
filesystems. A shuffle of 100,000 partitions creates 100,000 directories, each 
containing a single file. The number of directories should be configurable to 
be able to adjust this behavior for the specific filesystem used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to