Hi, I need to set a checkpoint directory as I'm starting to use GraphFrames. (Also, occasionally my regular DataFrame lineages get too long so it'd be nice to use checkpointing to squash the lineage.)
I don't actually need this checkpointed data to live beyond the life of the job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and reading and writing non-transient data to S3.

Two questions:

1. Is there a Spark --conf option to set the checkpoint directory? Somehow I couldn't find it, but surely it exists.

2. What's a good checkpoint directory for this use case? I imagine it'd be on HDFS and presumably in a YARN application-specific temporary path that gets cleaned up afterwards. Does anyone have a recommendation?

Thanks!

- Everett