Temp checkpoint directory for EMR (S3 or HDFS)

Everett Anderson Fri, 26 May 2017 09:09:11 -0700

Hi,

I need to set a checkpoint directory because I'm starting to use GraphFrames.
(Also, my regular DataFrame lineages occasionally get too long, so it'd be
nice to use checkpointing to truncate the lineage.)


I don't actually need this checkpointed data to live beyond the life of the
job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and reading
and writing non-transient data to S3.

Two questions:

1. Is there a Spark --conf option to set the checkpoint directory? Somehow
I couldn't find it, but surely it exists.

2. What's a good checkpoint directory for this use case? I imagine it'd be
on HDFS and presumably in a YARN application-specific temporary path that
gets cleaned up afterwards. Does anyone have a recommendation?
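
For concreteness, here is a minimal sketch of the programmatic route, in case
no --conf option turns up. It uses PySpark's SparkContext.setCheckpointDir and
SparkContext.applicationId. The base path hdfs:///tmp/spark-checkpoints and
both helper names are hypothetical, chosen only for illustration; this is not
an established EMR convention.

```python
def checkpoint_dir_for(app_id, base="hdfs:///tmp/spark-checkpoints"):
    """Build a per-application checkpoint path.

    The default base directory is an assumption for illustration only.
    """
    return base.rstrip("/") + "/" + app_id


def configure_checkpointing(spark, base="hdfs:///tmp/spark-checkpoints"):
    """Point the given SparkSession at an application-specific checkpoint dir.

    `spark` is expected to be a pyspark.sql.SparkSession; only its
    sparkContext.applicationId and sparkContext.setCheckpointDir are used.
    """
    sc = spark.sparkContext
    path = checkpoint_dir_for(sc.applicationId, base)
    sc.setCheckpointDir(path)
    return path
```

Calling configure_checkpointing(spark) once, right after the session is
created, would place checkpoints under a directory named after the YARN
application ID. Nothing in this sketch deletes that directory, so cleanup at
the end of the job would still have to be handled separately.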

Thanks!

- Everett
