[ 
https://issues.apache.org/jira/browse/SPARK-38005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-38005:
----------------------------------
    Description: 
Currently, merged shuffle files and state are not cleaned up until an 
application ends. 

But shuffle files will still stick around until an application completes. 
Dynamic allocation is commonly used for long runnin

  was:
Currently shuffle data is not cleaned up when an external shuffle service is 
used and the associated executor has been deallocated before the shuffle is 
cleaned up. Shuffle data is only cleaned up once the application ends.

There have been various issues filed for this:

https://issues.apache.org/jira/browse/SPARK-26020

https://issues.apache.org/jira/browse/SPARK-17233

https://issues.apache.org/jira/browse/SPARK-4236

But shuffle files still remain on disk until the application completes. 
Dynamic allocation is commonly used for long-running jobs (such as structured 
streaming), so any long-running job with a large shuffle will eventually 
fill up local disk space. The shuffle service already supports cleaning up 
shuffle-service-persisted RDDs, so it should be able to support cleaning up 
shuffle blocks as well once the shuffle is removed by the ContextCleaner. 
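
To make the requested behaviour concrete, here is a minimal toy model in Python. It is not Spark's implementation: the class `ToyShuffleService`, its methods, and the file naming are all hypothetical and exist only for this sketch. It shows a service that tracks the files written for each (application, shuffle) pair and deletes the files of a single shuffle on request, while the application and its other shuffles stay untouched.

```python
import os
import tempfile
from collections import defaultdict


class ToyShuffleService:
    """Toy stand-in for an external shuffle service; NOT Spark's implementation."""

    def __init__(self, local_dir):
        self.local_dir = local_dir
        # (app_id, shuffle_id) -> paths of the files written for that shuffle
        self.files = defaultdict(list)

    def write_block(self, app_id, shuffle_id, map_id, data):
        # Hypothetical file naming, chosen only for this sketch.
        name = f"{app_id}_shuffle_{shuffle_id}_{map_id}.data"
        path = os.path.join(self.local_dir, name)
        with open(path, "wb") as f:
            f.write(data)
        self.files[(app_id, shuffle_id)].append(path)
        return path

    def remove_shuffle(self, app_id, shuffle_id):
        """Delete the files of one shuffle while the application keeps running."""
        removed = 0
        for path in self.files.pop((app_id, shuffle_id), []):
            if os.path.exists(path):
                os.remove(path)
                removed += 1
        return removed


with tempfile.TemporaryDirectory() as local_dir:
    service = ToyShuffleService(local_dir)
    service.write_block("app-1", 0, 0, b"a")
    service.write_block("app-1", 0, 1, b"b")
    service.write_block("app-1", 1, 0, b"c")
    # Shuffle 0 is no longer referenced, so only its files are removed.
    removed = service.remove_shuffle("app-1", 0)
    remaining = sorted(os.listdir(local_dir))
    print(removed, remaining)
```

In this sketch the removal request plays the role of the notification that would follow the ContextCleaner removing a shuffle; how such a request would actually reach the shuffle service is outside what the description specifies.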

The current alternative is to use shuffle tracking instead of an external 
shuffle service, but this is less efficient from a resource perspective, as 
all executors must be kept alive until the shuffle has been fully consumed and 
cleaned up (and with the default GC interval of 30 minutes, executors can be 
held for a long time without doing any work).
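
For reference, the two setups compared above can be written down as configuration. The sketch below uses Spark configuration keys as plain Python dictionaries; the values are illustrative, and the helper `to_submit_args` is a hypothetical convenience for this example that renders them as `spark-submit` style `--conf` arguments.

```python
# Spark configuration keys; the values are illustrative only.
EXTERNAL_SHUFFLE_SERVICE = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
}

SHUFFLE_TRACKING = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # How often the driver triggers a GC so that unreferenced shuffles
    # can be noticed and cleaned up (30min is the default).
    "spark.cleaner.periodicGC.interval": "30min",
}


def to_submit_args(conf):
    """Hypothetical helper: render a config dict as --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(EXTERNAL_SHUFFLE_SERVICE)))
print(" ".join(to_submit_args(SHUFFLE_TRACKING)))
```

With shuffle tracking, executors holding shuffle data stay allocated until that data is cleaned up, which is why the GC interval matters in the second setup.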


> Support cleaning up merged shuffle files and state from external shuffle 
> service
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-38005
>                 URL: https://issues.apache.org/jira/browse/SPARK-38005
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Chandni Singh
>            Priority: Major
>
> Currently, merged shuffle files and state are not cleaned up until an 
> application ends. 
> But shuffle files will still stick around until an application completes. 
> Dynamic allocation is commonly used for long runnin



