Re: Shutdown cleanup of disk-based resources that Spark creates

2021-04-06 Thread Steve Loughran
On Thu, 11 Mar 2021 at 19:58, Attila Zsolt Piros <piros.attila.zs...@gmail.com> wrote:
> I agree with you about extending the documentation around this. Moreover, I support having specific unit tests for this.
>
> > There is clearly some demand for Spark to automatically clean up checkpoints on

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Attila Zsolt Piros
I agree with you about extending the documentation around this. Moreover, I support having specific unit tests for this.
> There is clearly some demand for Spark to automatically clean up checkpoints on shutdown
What about what I suggested on the PR? To clean up the checkpoint directory at shutdown one can

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Nicholas Chammas
OK, perhaps the best course of action is to leave the current behavior as-is but clarify the documentation for `.checkpoint()` and/or `cleanCheckpoints`. I personally find it confusing that `cleanCheckpoints` doesn't address shutdown behavior, and the Stack Overflow links I shared

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Attila Zsolt Piros
> Checkpoint data is left behind after a normal shutdown, not just an unexpected shutdown. The PR description includes a simple demonstration of this.
I think I might have overemphasized the "unexpected" adjective a bit to show you the value in the current behavior. The feature configured with

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Checkpoint data is left behind after a normal shutdown, not just an unexpected shutdown. The PR description includes a simple demonstration of this. If the current behavior is truly intended -- which I find difficult to believe given how confusing it

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Attila Zsolt Piros
Hi Nick! I am not sure you are fixing a problem here. I think what you see as a problem is actually intended behaviour. Checkpoint data should outlive unexpected shutdowns. So there is a very important difference between the reference going out of scope during a normal execution (in this

Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Hello people, I'm working on a fix for SPARK-33000. Spark does not clean up checkpointed RDDs/DataFrames on shutdown, even if the appropriate configs are set. In the course of developing a fix, another contributor pointed out
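
For readers landing on this thread: since checkpoint data stays on disk after shutdown, one user-side option is to delete the checkpoint directory from the driver program itself. The following is only a minimal sketch and not the approach proposed in the thread or in the PR. It assumes the checkpoint directory is on the local filesystem (HDFS or object stores would need their own client), and the helper names `remove_checkpoint_dir` and `register_shutdown_cleanup` are hypothetical, not part of Spark.

```python
import atexit
import shutil
import tempfile
from pathlib import Path


def remove_checkpoint_dir(path):
    """Delete a local checkpoint directory tree.

    Returns True if the directory existed before the call, False otherwise.
    """
    directory = Path(path)
    existed = directory.is_dir()
    # ignore_errors=True keeps a shutdown hook from raising if files vanish mid-delete.
    shutil.rmtree(directory, ignore_errors=True)
    return existed


def register_shutdown_cleanup(path):
    """Ask the Python interpreter to remove `path` when it exits normally."""
    atexit.register(remove_checkpoint_dir, path)
    return path


if __name__ == "__main__":
    # Hypothetical driver script: create a scratch directory and schedule its removal.
    checkpoint_dir = register_shutdown_cleanup(
        tempfile.mkdtemp(prefix="spark-checkpoints-")
    )
    # With PySpark installed, the driver would point Spark at this directory, e.g.:
    #   spark.sparkContext.setCheckpointDir(checkpoint_dir)
    #   df = df.checkpoint()
    print("checkpoint directory:", checkpoint_dir)
```

Note that `atexit` handlers run only on a normal interpreter exit, so a killed driver would still leave the data behind, which matches the point made in the replies that checkpoint data is meant to outlive unexpected shutdowns.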