[ https://issues.apache.org/jira/browse/SPARK-50417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-50417:
-----------------------------------
    Labels: pull-request-available  (was: )

> Limit number of subdirectories that FallbackStorage creates per shuffle
> -----------------------------------------------------------------------
>
>                 Key: SPARK-50417
>                 URL: https://issues.apache.org/jira/browse/SPARK-50417
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Enrico Minack
>            Priority: Major
>              Labels: pull-request-available
>
> The {{FallbackStorage}} copies shuffle data during executor decommissioning
> to a distributed or cloud storage like S3 or HDFS. In
> https://github.com/apache/spark/pull/34762, a hash has been added to the path
> of a file in order to reduce the number of files per directory (prefix in S3
> terms). This creates as many directories per shuffle as files are
> transferred, while each directory contains a single file.
> While this might be useful for S3, it may pose challenges for other
> filesystems. A shuffle of 100,000 partitions creates 100,000 directories,
> each containing a single file. The number of directories should be
> configurable to be able to adjust this behavior for the specific filesystem
> used.
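
For illustration only, the idea of capping the number of directories per
shuffle can be sketched as follows. This is not Spark's code and not the
change proposed in the linked pull request: the function names, the path
layout, the num_subdirs parameter, and the use of CRC32 as the hash are all
assumptions made for this example. Reducing the file-name hash modulo a
configurable bucket count bounds the number of directories regardless of how
many files a shuffle transfers.

```python
import zlib


def fallback_subdir(filename: str, num_subdirs: int) -> str:
    """Map a shuffle file name to one of num_subdirs bucket directories.

    Hypothetical helper: CRC32 stands in for whatever stable hash the
    real implementation uses.
    """
    if num_subdirs <= 0:
        raise ValueError("num_subdirs must be positive")
    return str(zlib.crc32(filename.encode("utf-8")) % num_subdirs)


def fallback_path(base: str, app_id: str, shuffle_id: int,
                  filename: str, num_subdirs: int) -> str:
    """Build an illustrative path: base/app/shuffle/bucket/file."""
    bucket = fallback_subdir(filename, num_subdirs)
    return f"{base}/{app_id}/{shuffle_id}/{bucket}/{filename}"


# 100,000 hypothetical shuffle files spread over at most 256 directories.
names = [f"shuffle_0_{map_id}_0.data" for map_id in range(100_000)]
dirs = {fallback_subdir(name, 256) for name in names}
print(len(dirs))  # never more than 256
```

With num_subdirs set to 1, every file of a shuffle lands in a single
directory; a larger value spreads files over more prefixes, which may suit
object stores such as S3 better.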