[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215469#comment-17215469 ]

Nicholas Chammas commented on SPARK-33000:
------------------------------------------

I've tested this out a bit more, and I think the original issue I reported is 
valid. Either that, or I'm still missing something.

I built Spark at the latest commit from {{master}}:
{code:java}
commit 3ae1520185e2d96d1bdbd08c989f0d48ad3ba578 (HEAD -> master, origin/master, origin/HEAD, apache/master)
Author: ulysses <youxi...@weidian.com>
Date:   Fri Oct 16 11:26:27 2020 +0000 {code}
One thing that has changed is that Spark now prevents you from setting 
{{cleanCheckpoints}} after startup:
{code:java}
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/conf.py", line 36, in set
    self._jconf.set(key, value)
  File ".../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1304, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.cleaner.referenceTracking.cleanCheckpoints; {code}
So that's good! This makes it clear to the user that the setting cannot be changed at 
runtime (though the message could be more helpful if it explained why).
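For reference, the same flag can also be set programmatically, as long as it happens 
before the session (and JVM) starts. A minimal sketch, not part of the original repro; 
the local master is just an assumption for illustration:
{code:python}
# Sketch: set cleanCheckpoints on the builder before the SparkSession is created,
# since it can no longer be changed at runtime. Local master assumed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
    .getOrCreate()
)
print(spark.conf.get("spark.cleaner.referenceTracking.cleanCheckpoints"))  # 'true'
{code}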

However, if I set the config when launching PySpark, I still don't see any 
checkpointed data get cleaned up on shutdown:
{code:java}
$ rm -rf /tmp/spark/checkpoint/
$ ./bin/pyspark --conf spark.cleaner.referenceTracking.cleanCheckpoints=true
<snipped>
>>> spark.conf.get('spark.cleaner.referenceTracking.cleanCheckpoints')
'true'
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]                                                           
>>> <Ctrl-D>
$ du -sh /tmp/spark/checkpoint/*
32K     /tmp/spark/checkpoint/57b0a413-9d47-4bcd-99ef-265e9f5c0f3b{code}
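Since the docs tie cleanup to the reference going out of scope, one follow-up experiment 
(a sketch only, unverified, and it pokes at the internal {{_jvm}} gateway) would be to drop 
the reference explicitly and nudge garbage collection on both sides before shutting down, 
to see whether the cleaner removes the checkpoint while the app is still alive. This assumes 
a shell launched with the same {{--conf}} flag as above:
{code:python}
# Unverified sketch: does the ContextCleaner remove the checkpoint once the
# reference is dropped, rather than at shutdown?
# Assumes: pyspark started with
#   --conf spark.cleaner.referenceTracking.cleanCheckpoints=true
import gc
import time

spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
df = spark.range(10).checkpoint()     # eager checkpoint; keep a reference
del df                                # drop the only Python-side reference
gc.collect()                          # let Python release the py4j proxy
spark.sparkContext._jvm.System.gc()   # ask the JVM to GC so the cleaner can notice
time.sleep(5)                         # the cleaner runs asynchronously
# then check from another terminal: du -sh /tmp/spark/checkpoint/*
{code}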
 

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> ------------------------------------------------------------------------
>
>                 Key: SPARK-33000
>                 URL: https://issues.apache.org/jira/browse/SPARK-33000
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.


