[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-04-15 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322333#comment-17322333 ]

Nicholas Chammas commented on SPARK-33000:
--

Per the discussion [on the dev 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Shutdown-cleanup-of-disk-based-resources-that-Spark-creates-td30928.html]
 and the [PR|https://github.com/apache/spark/pull/31742], it seems we just want to 
update the documentation to clarify that {{cleanCheckpoints}} does not affect 
shutdown behavior. That is, checkpoints are not meant to be cleaned up on shutdown 
(whether planned or unplanned), and the config is currently working as intended.
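
In other words, the config governs cleanup of checkpoints whose references are garbage collected while the application is still running. A rough sketch of that intended behavior (illustrative only; the actual timing depends on JVM garbage collection and the ContextCleaner):
{code:python}
$ pyspark --conf spark.cleaner.referenceTracking.cleanCheckpoints=true
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10).checkpoint()
>>> a = None  # drop the reference while the app is running
# once the JVM collects it, the ContextCleaner may remove the checkpoint files
>>> exit()    # anything still on disk at shutdown is left in place
{code}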

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
> Evidence the current behavior is confusing:
>  * [https://stackoverflow.com/q/52630858/877069]
>  * [https://stackoverflow.com/q/60009856/877069]
>  * [https://stackoverflow.com/q/61454740/877069]
>  






[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-03-04 Thread Haoyuan Wang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295612#comment-17295612 ]

Haoyuan Wang commented on SPARK-33000:
--

[~nchammas] Absolutely. I left a comment. I think you've found something 
interesting; this may not be a "minor" improvement if your finding holds.




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-03-04 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295531#comment-17295531 ]

Nicholas Chammas commented on SPARK-33000:
--

[~caowang888] - If you're still interested in this issue, take a look at my PR 
and let me know what you think. Hopefully, I've understood the issue correctly 
and proposed an appropriate fix.




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-03-04 Thread Apache Spark (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295529#comment-17295529 ]

Apache Spark commented on SPARK-33000:
--

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/31742







[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-16 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215577#comment-17215577 ]

Nicholas Chammas commented on SPARK-33000:
--

Ctrl-D gracefully shuts down the Python REPL, so that should trigger the 
appropriate cleanup.

I repeated my test with {{spark.stop()}} instead of Ctrl-D and waited two 
minutes. Same result: no cleanup.
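
For concreteness, the variant I tested looks roughly like this (a sketch reusing the setup from the issue description):
{code:python}
$ pyspark
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]
>>> spark.stop()  # graceful shutdown instead of Ctrl-D
# ...wait ~2 minutes, then check from the shell:
$ du -sh /tmp/spark/checkpoint/*  # checkpoint data is still there
{code}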




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-16 Thread Haoyuan Wang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215515#comment-17215515 ]

Haoyuan Wang commented on SPARK-33000:
--

I'm not sure what Ctrl-D does. It closes the shell, but does it gracefully 
terminate the SparkContext? Try calling {{spark.stop()}} and waiting a minute or so.




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-16 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215469#comment-17215469 ]

Nicholas Chammas commented on SPARK-33000:
--

I've tested this out a bit more, and I think the original issue I reported is 
valid. Either that, or I'm still missing something.

I built Spark at the latest commit from {{master}}:
{code:java}
commit 3ae1520185e2d96d1bdbd08c989f0d48ad3ba578 (HEAD -> master, origin/master, 
origin/HEAD, apache/master)
Author: ulysses 
Date:   Fri Oct 16 11:26:27 2020 + {code}
One thing that has changed is that Spark now prevents you from setting 
{{cleanCheckpoints}} after startup:
{code:java}
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/conf.py", line 36, in set
    self._jconf.set(key, value)
  File ".../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1304, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: 
spark.cleaner.referenceTracking.cleanCheckpoints; {code}
So that's good! This makes it clear to the user that this setting cannot be set 
at this time (though the message could be more helpful if it explained why).

However, if I try to set the config as part of invoking PySpark, I still don't 
see any checkpointed data get cleaned up on shutdown:
{code:java}
$ rm -rf /tmp/spark/checkpoint/
$ ./bin/pyspark --conf spark.cleaner.referenceTracking.cleanCheckpoints=true

>>> spark.conf.get('spark.cleaner.referenceTracking.cleanCheckpoints')
'true'
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]                                                           
>>> 
$ du -sh /tmp/spark/checkpoint/*
32K /tmp/spark/checkpoint/57b0a413-9d47-4bcd-99ef-265e9f5c0f3b{code}
 




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-15 Thread Haoyuan Wang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215086#comment-17215086 ]

Haoyuan Wang commented on SPARK-33000:
--

Please proceed with the change, don't wait for me :)




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-15 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215085#comment-17215085 ]

Nicholas Chammas commented on SPARK-33000:
--

Thanks for the explanation! I'm happy to leave this to you if you'd like to get 
back into open source work.

If you're not sure you'll get to it in the next couple of weeks, let me know 
and I can take care of this since it's just a documentation update.




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-15 Thread Haoyuan Wang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215015#comment-17215015 ]

Haoyuan Wang commented on SPARK-33000:
--

[~nchammas] That'd be good. I was planning to take this opportunity to finally 
do some open source work, but realistically I don't know if I'll get the time.

Cleaning up checkpointed files is done by the 
[ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala].
 It's triggered by a JVM hook, in this case a WeakReference hook, and these 
hooks need to be registered before the JVM is launched.

As for the checkpoint dir, it's just a property of SparkContext, so it can (in 
fact, has to) be set after the object is initialized.

There are a few more properties like this that have to be set before 
SparkContext initialization, mostly due to the use of JVM hooks. These aren't 
documented either, and I don't have a full list. But we are fixing one of them!
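
As a rough analogy to the mechanism above (Spark's actual ContextCleaner is Scala and polls a JVM ReferenceQueue; this is only a minimal Python sketch of weak-reference-driven cleanup, with hypothetical names like {{Checkpointed}} and {{register_cleanup}}):
{code:python}
import os
import shutil
import weakref

class Checkpointed:
    """Hypothetical stand-in for an RDD that owns a checkpoint directory."""
    def __init__(self, path):
        self.path = path

def register_cleanup(obj, path):
    # The callback fires when `obj` is garbage collected, analogous to the
    # ContextCleaner noticing a dead RDD reference via its reference queue.
    # If the process exits before collection happens, the callback never
    # fires and the checkpoint files are left behind.
    return weakref.ref(obj, lambda _ref: shutil.rmtree(path, ignore_errors=True))

ckpt_dir = '/tmp/spark/checkpoint/demo'
os.makedirs(ckpt_dir, exist_ok=True)
rdd = Checkpointed(ckpt_dir)
cleanup_ref = register_cleanup(rdd, ckpt_dir)

del rdd  # the reference goes out of scope; CPython collects it immediately
print(os.path.exists(ckpt_dir))  # False: cleanup ran while the app was alive
{code}
The key point: cleanup is driven by references dying while the process is alive, not by shutdown.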




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-15 Thread Nicholas Chammas (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214904#comment-17214904 ]

Nicholas Chammas commented on SPARK-33000:
--

Thanks for the pointer!

No need for a new ticket. I can adjust the title of this ticket and open a PR 
for the doc update.

Separate question: Do you know why there is such a restriction on when this 
configuration needs to be set? It's surprising since you can set the checkpoint 
directory after job submission.




[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-15 Thread Haoyuan Wang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214840#comment-17214840 ]

Haoyuan Wang commented on SPARK-33000:
--

This config will not be picked up when set after the SparkContext is 
initialized. It has to be set during submission, before the SparkContext object 
is created.

I'm able to repro this with Spark 2.3.0, and setting the configuration during 
job submission resolved the issue. What needs improvement is the documentation; 
let me open a ticket for that and close this one.
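
For example, setting the config at submission time, before the SparkContext exists, would look something like this (the job file name is illustrative):
{code:python}
$ pyspark --conf spark.cleaner.referenceTracking.cleanCheckpoints=true
# or, for a packaged job:
$ spark-submit --conf spark.cleaner.referenceTracking.cleanCheckpoints=true my_job.py
{code}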
