Still haven't found a --conf option for the checkpoint directory.

Regarding a temporary HDFS checkpoint directory: it looks like, when running
with --master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR
environment variable. Thus, one could do the following after creating a
SparkSession:

import org.apache.hadoop.fs.Path

// SPARK_YARN_STAGING_DIR is only set when submitted with --master yarn.
val checkpointPath =
  new Path(System.getenv("SPARK_YARN_STAGING_DIR"), "checkpoints").toString
sparkSession.sparkContext.setCheckpointDir(checkpointPath)
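
With the checkpoint directory set, an overly long DataFrame lineage can then
be squashed with Dataset.checkpoint (available as of Spark 2.1). A minimal
sketch, where df stands in for whatever DataFrame has accumulated the lineage:

// Eager by default: materializes df under the checkpoint dir and
// returns a Dataset whose lineage is truncated.
val squashed = df.checkpoint()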

The staging directory is at an HDFS path like

/user/[user]/.sparkStaging/[YARN application ID]

and is deleted at the end of the application
<https://github.com/apache/spark/blob/branch-2.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L184>.

So this is one option, though it certainly abuses the staging directory.

A more general approach might be to find out where Dataset.persist(DISK_ONLY)
writes its data.
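
For what it's worth, a minimal sketch of that alternative (df again standing
in for an arbitrary DataFrame): persisted blocks land under the executors'
local directories (spark.local.dir, or the YARN-managed container dirs when
running on YARN) and go away when the application exits. Note that persist,
unlike checkpoint, leaves the lineage intact:

import org.apache.spark.storage.StorageLevel

val onDisk = df.persist(StorageLevel.DISK_ONLY)
onDisk.count()  // force an action so the blocks are actually written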


On Fri, May 26, 2017 at 9:08 AM, Everett Anderson <ever...@nuna.com> wrote:

> Hi,
>
> I need to set a checkpoint directory as I'm starting to use GraphFrames.
> (Also, occasionally my regular DataFrame lineages get too long so it'd be
> nice to use checkpointing to squash the lineage.)
>
> I don't actually need this checkpointed data to live beyond the life of
> the job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and
> reading and writing non-transient data to S3.
>
> Two questions:
>
> 1. Is there a Spark --conf option to set the checkpoint directory? Somehow
> I couldn't find it, but surely it exists.
>
> 2. What's a good checkpoint directory for this use case? I imagine it'd be
> on HDFS and presumably in a YARN application-specific temporary path that
> gets cleaned up afterwards. Does anyone have a recommendation?
>
> Thanks!
>
> - Everett
>
>
