You should actually be able to get to the underlying filesystem from your SparkContext:
String originalFs = sparkContext.hadoopConfiguration().get("fs.defaultFS"); and then you could just use that: String checkpointPath = String.format("%s/%s/", originalFs, checkpointDirectory); sparkContext.setCheckpointDir(checkpointPath); Asher Krim Senior Software Engineer On Tue, May 30, 2017 at 12:37 PM, Everett Anderson <ever...@nuna.com.invalid > wrote: > Still haven't found a --conf option. > > Regarding a temporary HDFS checkpoint directory, it looks like when using > --master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment > variable. Thus, one could do the following when creating a SparkSession: > > val checkpointPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"), > "checkpoints").toString > sparkSession.sparkContext.setCheckpointDir(checkpointPath) > > The staging directory is in an HDFS path like > > /user/[user]/.sparkStaging/[YARN application ID] > > and is deleted at the end of the application > <https://github.com/apache/spark/blob/branch-2.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L184> > . > > So this is one option, though certainly abusing the staging directory. > > A more general one might be to find where Dataset.persist(DISK_ONLY) > writes. > > > On Fri, May 26, 2017 at 9:08 AM, Everett Anderson <ever...@nuna.com> > wrote: > >> Hi, >> >> I need to set a checkpoint directory as I'm starting to use GraphFrames. >> (Also, occasionally my regular DataFrame lineages get too long so it'd be >> nice to use checkpointing to squash the lineage.) >> >> I don't actually need this checkpointed data to live beyond the life of >> the job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and >> reading and writing non-transient data to S3. >> >> Two questions: >> >> 1. Is there a Spark --conf option to set the checkpoint directory? >> Somehow I couldn't find it, but surely it exists. >> >> 2. What's a good checkpoint directory for this use case? I imagine it'd >> be on HDFS and presumably in a YARN application-specific temporary path >> that gets cleaned up afterwards. Does anyone have a recommendation? >> >> Thanks! >> >> - Everett >> >> >