[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215085#comment-17215085 ]
Nicholas Chammas commented on SPARK-33000:
------------------------------------------

Thanks for the explanation! I'm happy to leave this to you if you'd like to get back into open source work. If you're not sure you'll get to it in the next couple of weeks, let me know and I can take care of this since it's just a documentation update.

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> ------------------------------------------------------------------------
>
>                 Key: SPARK-33000
>                 URL: https://issues.apache.org/jira/browse/SPARK-33000
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]
>
> >>> exit()
> {code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}}
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
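As a stopgap until the docs (or the cleaner) are updated, one workaround is to remove the checkpoint directory yourself when the application exits. A minimal sketch, assuming the checkpoint path is known up front; the {{atexit}} hook here is plain Python, not part of Spark's API:

{code:python}
import atexit
import shutil

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')

# Same path as in the report above; any writable directory works.
checkpoint_dir = '/tmp/spark/checkpoint/'
spark.sparkContext.setCheckpointDir(checkpoint_dir)

# The cleaner only removes checkpoint files whose RDD references are
# garbage collected while the app is still running, so register our own
# shutdown hook to clear whatever is left behind at exit.
atexit.register(shutil.rmtree, checkpoint_dir, ignore_errors=True)

df = spark.range(10)
df = df.checkpoint()  # checkpoint files land under checkpoint_dir
{code}

Note this deletes everything under the directory, so it's only safe when the directory is dedicated to this one application.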