Bo Zhang created SPARK-47764:
--------------------------------

             Summary: Cleanup shuffle dependencies for Spark Connect SQL 
executions
                 Key: SPARK-47764
                 URL: https://issues.apache.org/jira/browse/SPARK-47764
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 4.0.0
            Reporter: Bo Zhang


Shuffle dependencies are created by shuffle map stages; each consists of files 
on disk and the corresponding references in Spark JVM heap memory. Currently 
Spark cleans up unused shuffle dependencies through JVM GCs, and periodic GCs 
are triggered once every 30 minutes (see ContextCleaner). However, we still 
found cases in which the shuffle data files grow too large before they are 
cleaned up, which makes shuffle data migration slow.
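For reference, the periodic-GC cadence mentioned above is already tunable; a sketch of the relevant spark-defaults.conf entry (the 30-minute value shown is the current default used by ContextCleaner, not a recommendation):

```properties
# Interval at which the driver triggers a JVM GC so that ContextCleaner
# can reclaim unreferenced shuffle dependencies (default: 30min).
spark.cleaner.periodicGC.interval  30min
```

Shortening this interval only makes GC-driven cleanup more frequent; it does not address the eager, per-query cleanup proposed in this ticket.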

 

We do have opportunities to clean up shuffle dependencies earlier, especially 
for SQL queries submitted through Spark Connect, since we have better control 
over the DataFrame instances there. Even if DataFrame instances are reused on 
the client side, the instances are still recreated on the server side.

 

We might also provide the option to 1. clean up eagerly after each query 
execution, or 2. only mark the shuffle dependencies and skip migrating them at 
node decommission.
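As context for option 2: whether shuffle blocks are migrated at decommission is currently governed by the storage-decommission settings. A sketch of the existing spark-defaults.conf entries (option 2 would add a finer-grained choice than this all-or-nothing flag):

```properties
# Existing knobs controlling block migration when a node is decommissioned.
spark.storage.decommission.enabled                true
# Today this is all-or-nothing; option 2 would let Spark skip migration
# for shuffle data that has been marked as no longer needed.
spark.storage.decommission.shuffleBlocks.enabled  true
spark.storage.decommission.rddBlocks.enabled      true
```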



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
