Still haven't found a --conf option. Regarding a temporary HDFS checkpoint directory: it looks like when using --master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment variable. Thus, one could do the following when creating a SparkSession:
    import org.apache.hadoop.fs.Path

    val checkpointPath =
      new Path(System.getenv("SPARK_YARN_STAGING_DIR"), "checkpoints").toString
    sparkSession.sparkContext.setCheckpointDir(checkpointPath)

The staging directory is an HDFS path like /user/[user]/.sparkStaging/[YARN application ID] and is deleted at the end of the application <https://github.com/apache/spark/blob/branch-2.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L184>. So this is one option, though it's certainly abusing the staging directory. A more general one might be to find where Dataset.persist(DISK_ONLY) writes.

On Fri, May 26, 2017 at 9:08 AM, Everett Anderson <ever...@nuna.com> wrote:

> Hi,
>
> I need to set a checkpoint directory as I'm starting to use GraphFrames.
> (Also, occasionally my regular DataFrame lineages get too long, so it'd be
> nice to use checkpointing to squash the lineage.)
>
> I don't actually need this checkpointed data to live beyond the life of
> the job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and
> reading and writing non-transient data to S3.
>
> Two questions:
>
> 1. Is there a Spark --conf option to set the checkpoint directory? Somehow
> I couldn't find it, but surely it exists.
>
> 2. What's a good checkpoint directory for this use case? I imagine it'd be
> on HDFS and presumably in a YARN application-specific temporary path that
> gets cleaned up afterwards. Does anyone have a recommendation?
>
> Thanks!
>
> - Everett
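P.S. For anyone wiring this up, here's a minimal sketch of the path derivation above, with a fallback for when the job isn't running under YARN (the helper name and the /tmp fallback are my own convention, not anything Spark provides; "checkpoints" is likewise just a subdirectory name I picked):

```scala
// Derive a transient checkpoint directory from the YARN staging dir,
// falling back to a per-application path under /tmp otherwise.
// The staging dir is deleted when the YARN application ends, so
// checkpoints placed there don't outlive the job.
def transientCheckpointDir(stagingDir: Option[String], appId: String): String =
  stagingDir match {
    case Some(dir) => s"$dir/checkpoints"
    case None      => s"/tmp/$appId/checkpoints"
  }

// Wiring it into a SparkSession (only meaningful on a running cluster):
//   val dir = transientCheckpointDir(
//     Option(System.getenv("SPARK_YARN_STAGING_DIR")),
//     sparkSession.sparkContext.applicationId)
//   sparkSession.sparkContext.setCheckpointDir(dir)
```

Note this only picks the path; setCheckpointDir still creates the directory itself.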