> Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.

I think I may have overemphasized the "unexpected" adjective a bit when
trying to show you the value in the current behavior.

The feature controlled by
"spark.cleaner.referenceTracking.cleanCheckpoints" is about references that
go out of scope without ANY shutdown.

It would be hard to distinguish, at that level (ShutdownHookManager), an
unexpected exit from an intentional one.
The user code (run by the driver) could contain a System.exit() added by
the developer for any number of reasons, so telling unexpected and
intentional exits apart there is not really an option.
Even a third-party library can contain a System.exit(). Would that be an
unexpected exit or an intentional one? You can see it is hard to tell.

To test the real feature
behind "spark.cleaner.referenceTracking.cleanCheckpoints", you can create a
reference within a scope that is then closed. For example, inside the body
of a function (without a return value), store the checkpointed RDD only in
a local variable. Once that scope is closed, i.e. when the caller gets
control back from our function, you have a chance to see the context
cleaner working (you might even need to trigger a GC too).
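
To make that concrete, here is a minimal sketch of such a test, roughly as
you might paste it into spark-shell (not from the PR; the local master, the
checkpoint path, and the function name are only illustrative choices):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("cleanCheckpoints-demo")
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoint-demo") // example path

// The RDD reference lives only inside this function's scope.
def checkpointInLocalScope(): Unit = {
  val rdd = sc.parallelize(1 to 1000)
  rdd.checkpoint()
  rdd.count() // materializes the checkpoint on disk
}

checkpointInLocalScope()
// Back in the caller the RDD is unreachable, so once a GC runs the
// ContextCleaner should asynchronously delete the checkpoint files.
System.gc()
Thread.sleep(10000) // give the async cleaner some time before shutdown
spark.stop()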

On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.
>
> If the current behavior is truly intended -- which I find difficult to
> believe given how confusing <https://stackoverflow.com/q/52630858/877069>
> it <https://stackoverflow.com/q/60009856/877069> is
> <https://stackoverflow.com/q/61454740/877069> -- then at the very least
> we need to update the documentation for both `.checkpoint()` and
> `cleanCheckpoints` to make that clear.
>
> > This way even after an unexpected exit the next run of the same app
> should be able to pick up the checkpointed data.
>
> The use case you are describing potentially makes sense. But preserving
> checkpoint data after an unexpected shutdown -- even when
> `cleanCheckpoints` is set to true -- is a new guarantee that is not
> currently expressed in the API or documentation. At least as far as I can
> tell.
>
> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> Hi Nick!
>>
>> I am not sure you are fixing a problem here. I think what you see as a
>> problem is actually intended behaviour.
>>
>> Checkpoint data should outlive unexpected shutdowns. So there is a very
>> important difference between a reference going out of scope during normal
>> execution (in that case cleanup is expected, depending on the config you
>> mentioned) and a reference going out of scope because of an unexpected
>> error (in that case the checkpoint data should be kept).
>>
>> This way even after an unexpected exit the next run of the same app
>> should be able to pick up the checkpointed data.
>>
>> Best Regards,
>> Attila
>>
>>
>>
>>
>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Hello people,
>>>
>>> I'm working on a fix for SPARK-33000
>>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>>> clean up checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>>> configs are set.
>>>
>>> In the course of developing a fix, another contributor pointed out
>>> <https://github.com/apache/spark/pull/31742#issuecomment-790987483>
>>> that checkpointed data may not be the only type of resource that needs a
>>> fix for shutdown cleanup.
>>>
>>> I'm looking for a committer who might have an opinion on how Spark
>>> should clean up disk-based resources on shutdown. The last people who
>>> contributed significantly to the ContextCleaner, where this cleanup
>>> happens, were @witgo <https://github.com/witgo> and @andrewor14
>>> <https://github.com/andrewor14>. But that was ~6 years ago, and I don't
>>> think they are active on the project anymore.
>>>
>>> Any takers to take a look and give their thoughts? The PR is small
>>> <https://github.com/apache/spark/pull/31742>. +39 / -2.
>>>
>>> Nick
>>>
>>>
