While checkpointing RDDs as part of an application that doesn't use
Spark Streaming, I observed that the checkpoint files are not cleaned
up even after the application completes successfully.
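For reference, the scenario is roughly the following minimal sketch (the
checkpoint directory path and the RDD contents are just placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  object CheckpointExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("checkpoint-example"))

      // checkpoint directory on HDFS (placeholder path)
      sc.setCheckpointDir("hdfs:///tmp/checkpoints")

      val rdd = sc.parallelize(1 to 1000).map(_ * 2)
      rdd.checkpoint() // mark the RDD for checkpointing
      rdd.count()      // first action triggers the checkpoint write

      sc.stop()
      // after the application exits successfully, the files written
      // under hdfs:///tmp/checkpoints are left behind
    }
  }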

Is this because we assume that checkpointing will primarily be used by
Spark Streaming applications, which run continuously?

Also, the current mechanism supports recovery only in Spark Streaming,
which can survive driver crashes. There is no support for recovering
from previously checkpointed RDDs in subsequent application attempts.
It would be consistent and useful to be able to recover across app
attempts in non-streaming jobs as well.

Is there any specific reason for the current behavior of not cleaning up
the files and for the lack of support across app attempts? If not, I can
raise a JIRA for this.

Thanks,
Dhruve
