Enrico Minack created SPARK-50417:
-------------------------------------

             Summary: Limit number of subdirectories that FallbackStorage creates per shuffle
                 Key: SPARK-50417
                 URL: https://issues.apache.org/jira/browse/SPARK-50417
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Enrico Minack

The {{FallbackStorage}} copies shuffle data to distributed or cloud storage such as HDFS or S3 during executor decommissioning. In https://github.com/apache/spark/pull/34762, a hash was added to the path of each file in order to reduce the number of files per directory (prefix, in S3 terms). As a result, each shuffle gets as many directories as files are transferred, and each directory contains a single file.

While this might be useful for S3, it may pose challenges for other filesystems: a shuffle of 100,000 partitions creates 100,000 directories, each containing a single file.

The number of directories should be configurable so that this behavior can be adjusted for the specific filesystem in use.
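
To illustrate the kind of behavior being requested, here is a minimal sketch in Python (not Spark code) of one way to cap the number of subdirectories: reduce a stable hash of the file name modulo a configurable limit. The function name {{fallback_path}}, the parameter {{num_subdirs}}, the use of CRC32 and the {{app_id/shuffle_id/bucket/filename}} layout are all assumptions made for this example, not the existing implementation or a proposed API.

```python
import zlib


def fallback_path(app_id: str, shuffle_id: int, filename: str, num_subdirs: int) -> str:
    """Build a hypothetical fallback-storage path with a bounded subdirectory count.

    Assumed layout (illustration only): app_id/shuffle_id/bucket/filename,
    where bucket is always in the range [0, num_subdirs).
    """
    if num_subdirs < 1:
        raise ValueError("num_subdirs must be at least 1")
    # CRC32 is stable across processes and runs, unlike Python's built-in
    # hash() for strings, so the same file always maps to the same bucket.
    bucket = zlib.crc32(filename.encode("utf-8")) % num_subdirs
    return f"{app_id}/{shuffle_id}/{bucket}/{filename}"
```

With {{num_subdirs=100}}, the files of a shuffle are spread over at most 100 directories, no matter how many files are transferred. Setting the limit to 1 places all files of a shuffle in a single directory, while a very large limit approaches the current one-directory-per-file behavior.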