[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322333#comment-17322333 ]

Nicholas Chammas commented on SPARK-33000:

Per the discussion [on the dev list|http://apache-spark-developers-list.1001551.n3.nabble.com/Shutdown-cleanup-of-disk-based-resources-that-Spark-creates-td30928.html] and [PR|https://github.com/apache/spark/pull/31742], it seems we just want to update the documentation to clarify that {{cleanCheckpoints}} does not impact shutdown behavior. That is, checkpoints are not meant to be cleaned up on shutdown (whether planned or unplanned), and the config is currently working as intended.

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.6
> Reporter: Nicholas Chammas
> Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]
>
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
> {quote}Controls whether to clean checkpoint files if the reference is out of scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
> Evidence the current behavior is confusing:
> * [https://stackoverflow.com/q/52630858/877069]
> * [https://stackoverflow.com/q/60009856/877069]
> * [https://stackoverflow.com/q/61454740/877069]

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
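Since the conclusion of the comment above is that checkpoint data is expected to survive shutdown, an application that wants an empty checkpoint directory afterwards has to remove it itself. The following is a minimal sketch of one way to do that, assuming the checkpoint directory lives on the local filesystem. It uses only the Python standard library; the helper names `remove_checkpoint_dir` and `register_checkpoint_cleanup` are hypothetical and are not part of Spark.

```python
import atexit
import os
import shutil


def remove_checkpoint_dir(path):
    """Delete a local checkpoint directory tree; return True if it existed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


def register_checkpoint_cleanup(path):
    """Ask Python to remove `path` when the interpreter exits normally.

    Assumption: `path` is a local directory that this application owns.
    A directory on HDFS or an object store needs a different mechanism.
    """
    atexit.register(remove_checkpoint_dir, path)
    return path
```

A driver script might call `register_checkpoint_cleanup('/tmp/spark/checkpoint/')` right after `setCheckpointDir`. Note that `atexit` handlers only run on a normal interpreter exit, so this does not cover a killed process.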
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295612#comment-17295612 ]

Haoyuan Wang commented on SPARK-33000:

[~nchammas] absolutely...I left a comment, I think you've found something interesting...this may not be a "minor" improvement if your finding is true...
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295531#comment-17295531 ]

Nicholas Chammas commented on SPARK-33000:

[~caowang888] - If you're still interested in this issue, take a look at my PR and let me know what you think. Hopefully, I've understood the issue correctly and proposed an appropriate fix.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295529#comment-17295529 ]

Apache Spark commented on SPARK-33000:

User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/31742
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295528#comment-17295528 ]

Apache Spark commented on SPARK-33000:

User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/31742
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215577#comment-17215577 ]

Nicholas Chammas commented on SPARK-33000:

Ctrl-D gracefully shuts down the Python REPL, so that should trigger the appropriate cleanup. I repeated my test and did {{spark.stop()}} instead of Ctrl-D and waited 2 minutes. Same result. No cleanup.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215515#comment-17215515 ]

Haoyuan Wang commented on SPARK-33000:

I'm not sure what Ctrl-D does. It closes bash, but does it gracefully terminate the SparkContext? Try calling {{spark.stop()}} and waiting a minute or so.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215469#comment-17215469 ]

Nicholas Chammas commented on SPARK-33000:

I've tested this out a bit more, and I think the original issue I reported is valid. Either that, or I'm still missing something.

I built Spark at the latest commit from {{master}}:
{code:java}
commit 3ae1520185e2d96d1bdbd08c989f0d48ad3ba578 (HEAD -> master, origin/master, origin/HEAD, apache/master)
Author: ulysses
Date: Fri Oct 16 11:26:27 2020 +
{code}
One thing that has changed is that Spark now prevents you from setting {{cleanCheckpoints}} after startup:
{code:java}
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/conf.py", line 36, in set
    self._jconf.set(key, value)
  File ".../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.cleaner.referenceTracking.cleanCheckpoints;
{code}
So that's good! This makes it clear to the user that this setting cannot be changed at this point (though the error would be more helpful if it explained why).

However, if I set the config as part of invoking PySpark, I still don't see any checkpointed data get cleaned up on shutdown:
{code:java}
$ rm -rf /tmp/spark/checkpoint/
$ ./bin/pyspark --conf spark.cleaner.referenceTracking.cleanCheckpoints=true
>>> spark.conf.get('spark.cleaner.referenceTracking.cleanCheckpoints')
'true'
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]
>>>
$ du -sh /tmp/spark/checkpoint/*
32K /tmp/spark/checkpoint/57b0a413-9d47-4bcd-99ef-265e9f5c0f3b{code}
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215086#comment-17215086 ]

Haoyuan Wang commented on SPARK-33000:

Please proceed with the change, don't wait for me :)
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215085#comment-17215085 ]

Nicholas Chammas commented on SPARK-33000:

Thanks for the explanation! I'm happy to leave this to you if you'd like to get back into open source work. If you're not sure you'll get to it in the next couple of weeks, let me know and I can take care of this since it's just a documentation update.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215015#comment-17215015 ]

Haoyuan Wang commented on SPARK-33000:

[~nchammas] That'd be good. I was planning to take this opportunity to finally touch some open source work. But realistically, I don't know if I'll get time.

Cleaning up checkpointed files is done by [ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala]. It's triggered by a JVM hook; in this case, a WeakReference hook. These hooks need to be registered before the JVM is launched. As for the checkpoint dir, it's just a property of SparkContext, so it can (and has to) be set after the object is initialized.

There are a few more properties like this that have to be set before SparkContext initialization, mostly because they rely on JVM hooks. These are not documented either, and I don't have a full list. But we are fixing one of them!
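The weak-reference mechanism described in the comment above can be illustrated outside of Spark. The sketch below is not Spark's `ContextCleaner`; it is a Python analogy built on `weakref.finalize`, with a hypothetical `CheckpointHandle` class standing in for a checkpointed dataset that owns a directory. It shows cleanup firing when the last reference to the handle goes away, and (by setting `atexit = False`) not firing merely because the process exits, which mirrors the behavior reported in this ticket.

```python
import os
import shutil
import tempfile
import weakref


class CheckpointHandle:
    """Hypothetical stand-in for a checkpointed dataset owning a directory."""

    def __init__(self, root):
        self.path = tempfile.mkdtemp(dir=root)
        # Run shutil.rmtree(path, ignore_errors=True) once this handle is
        # garbage collected. The callback holds only the path string, not
        # the handle itself, so it does not keep the handle alive.
        self._finalizer = weakref.finalize(self, shutil.rmtree, self.path, True)
        # Do not run the callback at interpreter exit: data that is still
        # referenced at shutdown is left on disk.
        self._finalizer.atexit = False


def checkpoint_and_release(root):
    """Create a handle, drop the only reference, and return its path."""
    handle = CheckpointHandle(root)
    path = handle.path
    assert os.path.isdir(path)
    del handle  # last reference gone; CPython runs the finalizer here
    return path
```

In CPython the finalizer runs as soon as the reference count reaches zero; other Python implementations, like the JVM, may defer it until a garbage-collection pass.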
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214904#comment-17214904 ]

Nicholas Chammas commented on SPARK-33000:

Thanks for the pointer! No need for a new ticket. I can adjust the title of this ticket and open a PR for the doc update.

Separate question: Do you know why there is such a restriction on when this configuration needs to be set? It's surprising since you can set the checkpoint directory after job submission.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214840#comment-17214840 ]

Haoyuan Wang commented on SPARK-33000:

This config will not be picked up when set after SparkContext is initialized. It has to be set during submission, before the SparkContext object is created. I'm able to repro this with Spark 2.3.0; setting the configuration during job submission resolved the issue.

What needs to be improved is the documentation. Let me open a ticket for that and close this one.
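
Setting the config "during submission" corresponds to the `--conf key=value` form used in the `./bin/pyspark` invocation elsewhere in this thread. As a small sketch, the launch command can be assembled programmatically; the helper names `conf_args` and `pyspark_command` are hypothetical, and the `./bin/pyspark` path assumes the command is run from a Spark checkout or distribution directory.

```python
CLEANER_CONF = {
    "spark.cleaner.referenceTracking.cleanCheckpoints": "true",
}


def conf_args(conf):
    """Turn a dict of Spark settings into repeated `--conf key=value` args."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


def pyspark_command(pyspark_bin="./bin/pyspark", conf=None):
    """Build the argument list for launching PySpark with launch-time conf."""
    if conf is None:
        conf = CLEANER_CONF
    return [pyspark_bin, *conf_args(conf)]
```

The resulting list can be passed to `subprocess.run`, so the setting is in place before the SparkContext exists rather than being applied with `spark.conf.set` afterwards.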