[ 
https://issues.apache.org/jira/browse/SPARK-50417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-50417:
-----------------------------------
    Labels: pull-request-available  (was: )

> Limit number of subdirectories that FallbackStorage creates per shuffle
> -----------------------------------------------------------------------
>
>                 Key: SPARK-50417
>                 URL: https://issues.apache.org/jira/browse/SPARK-50417
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Enrico Minack
>            Priority: Major
>              Labels: pull-request-available
>
> The {{FallbackStorage}} copies shuffle data during executor decommissioning 
> to a distributed or cloud storage like S3 or HDFS. In 
> https://github.com/apache/spark/pull/34762, a hash has been added to the path 
> of a file in order to reduce the number of files per directory (prefix in S3 
> terms). This creates as many directories per shuffle as files are 
> transferred, while each directory contains a single file.
> While this might be useful for S3, it may pose challenges for other 
> filesystems. A shuffle of 100,000 partitions creates 100,000 directories, 
> each containing a single file. The number of directories should be 
> configurable to be able to adjust this behavior for the specific filesystem 
> used.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to