Nicholas Chammas created SPARK-33000:
----------------------------------------

             Summary: cleanCheckpoints config does not clean all checkpointed 
RDDs on shutdown
                 Key: SPARK-33000
                 URL: https://issues.apache.org/jira/browse/SPARK-33000
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.6
            Reporter: Nicholas Chammas


Maybe it's just that the documentation needs to be updated, but I found this 
surprising:
{code:python}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]
>>> exit(){code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
Spark to clean it up on shutdown.

The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:

> Controls whether to clean checkpoint files if the reference is out of scope.

When Spark shuts down, everything goes out of scope, so I'd expect all 
checkpointed RDDs to be cleaned up.
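
Until the behavior or the documentation changes, a workaround along these lines might be enough for a local session. This is only a sketch: it assumes the checkpoint directory is on the local filesystem (not HDFS or an object store), and the helper names {{remove_checkpoint_dir}} and {{register_checkpoint_cleanup}} are made up for illustration and are not part of Spark. It does not involve Spark's cleaner at all; it just deletes the directory when the Python interpreter exits.

```python
import atexit
import shutil
from pathlib import Path


def remove_checkpoint_dir(path):
    """Delete a local checkpoint directory and everything under it.

    Returns True if the directory existed and was removed, False otherwise.
    Hypothetical helper, not a Spark API.
    """
    target = Path(path)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


def register_checkpoint_cleanup(path):
    """Arrange for `path` to be removed when the Python interpreter exits."""
    atexit.register(remove_checkpoint_dir, path)
```

In the session above, this would mean calling {{register_checkpoint_cleanup('/tmp/spark/checkpoint/')}} right after {{setCheckpointDir}}. Because it removes the whole directory, it should only be pointed at a directory that no other application is using for checkpoints.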


