[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215469#comment-17215469 ]
Nicholas Chammas commented on SPARK-33000:
------------------------------------------

I've tested this out a bit more, and I think the original issue I reported is valid. Either that, or I'm still missing something.

I built Spark at the latest commit from {{master}}:
{code:java}
commit 3ae1520185e2d96d1bdbd08c989f0d48ad3ba578 (HEAD -> master, origin/master, origin/HEAD, apache/master)
Author: ulysses <youxi...@weidian.com>
Date:   Fri Oct 16 11:26:27 2020 +0000
{code}
One thing that has changed is that Spark now prevents you from setting {{cleanCheckpoints}} after startup:
{code:java}
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/conf.py", line 36, in set
    self._jconf.set(key, value)
  File ".../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.cleaner.referenceTracking.cleanCheckpoints;
{code}
So that's good! This makes it clear to the user that the setting cannot be changed at runtime (though the error could be more helpful if it explained why).
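For reference, a config that is rejected at runtime can still be supplied when the session is created. This is a minimal, untested sketch of the SparkSession-builder form, which should be equivalent to passing {{--conf}} on the command line:
{code:python}
from pyspark.sql import SparkSession

# The cleaner config must be supplied before the session starts;
# calling spark.conf.set() on a running session raises AnalysisException.
spark = (
    SparkSession.builder
    .config('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
    .getOrCreate()
)
{code}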
However, if I try to set the config as part of invoking PySpark, I still don't see any checkpointed data get cleaned up on shutdown:
{code:java}
$ rm -rf /tmp/spark/checkpoint/
$ ./bin/pyspark --conf spark.cleaner.referenceTracking.cleanCheckpoints=true
<snipped>
>>> spark.conf.get('spark.cleaner.referenceTracking.cleanCheckpoints')
'true'
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]
>>> <Ctrl-D>
$ du -sh /tmp/spark/checkpoint/*
 32K	/tmp/spark/checkpoint/57b0a413-9d47-4bcd-99ef-265e9f5c0f3b
{code}

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> ------------------------------------------------------------------------
>
>                 Key: SPARK-33000
>                 URL: https://issues.apache.org/jira/browse/SPARK-33000
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
> {quote}Controls whether to clean checkpoint files if the reference is out of scope.{quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
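Until this behaves as documented, one workaround (a sketch using only the Python standard library, not any Spark API; it assumes the checkpoint directory is dedicated to this one job) is to register your own cleanup hook that removes the checkpoint directory when the Python process exits:
{code:python}
import atexit
import shutil

CHECKPOINT_DIR = '/tmp/spark/checkpoint/'  # assumed private to this job

def _cleanup_checkpoints(path=CHECKPOINT_DIR):
    # Remove everything Spark left behind; ignore_errors keeps this
    # from raising if the directory was already removed.
    shutil.rmtree(path, ignore_errors=True)

# Runs on normal interpreter shutdown (not on a hard kill).
atexit.register(_cleanup_checkpoints)
{code}
Note this only fires on a clean exit of the driver process, so it is a stopgap rather than a substitute for the cleaner doing its job.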
-- This message was sent by Atlassian Jira (v8.3.4#803005)