Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Asher Krim
You should actually be able to get to the underlying filesystem from your

String originalFs = sparkContext.hadoopConfiguration().get("fs.defaultFS");

and then you could just use that:

String checkpointPath = String.format("%s/%s/", originalFs,
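
For illustration only, here is a minimal standalone sketch of joining a default-filesystem URI with a relative directory. The class name `CheckpointPaths`, the method `under`, and the `tmp/checkpoints` directory are hypothetical and do not come from the message above. The Spark calls are left out so the snippet runs without a cluster. It assumes the `fs.defaultFS` value has already been read from the Hadoop configuration as shown earlier.

```java
public class CheckpointPaths {

    /**
     * Joins a filesystem root such as "hdfs://namenode:8020" with a relative
     * directory and returns a URI string ending in "/".
     * A single trailing slash on the root and any leading or trailing slashes
     * on the directory are dropped so the result has no doubled separators.
     */
    static String under(String defaultFs, String relativeDir) {
        String base = defaultFs.endsWith("/")
                ? defaultFs.substring(0, defaultFs.length() - 1)
                : defaultFs;
        String rel = relativeDir.replaceAll("^/+|/+$", "");
        return base + "/" + rel + "/";
    }

    public static void main(String[] args) {
        // Hypothetical value; in a job it would come from
        // sparkContext.hadoopConfiguration().get("fs.defaultFS").
        String originalFs = "hdfs://namenode:8020";
        System.out.println(under(originalFs, "tmp/checkpoints"));
    }
}
```

The resulting string could then be passed to `setCheckpointDir`. Note that, unlike the YARN staging directory, a path built this way is not cleaned up automatically.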

Asher Krim
Senior Software Engineer

On Tue, May 30, 2017 at 12:37 PM, Everett Anderson wrote:

> Still haven't found a --conf option.
> Regarding a temporary HDFS checkpoint directory, it looks like when using
> --master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment
> variable. Thus, one could do the following when creating a SparkSession:
> val checkpointPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"),
> "checkpoints").toString
> sparkSession.sparkContext.setCheckpointDir(checkpointPath)
> The staging directory is in an HDFS path like
> /user/[user]/.sparkStaging/[YARN application ID]
> and is deleted at the end of the application.
> So this is one option, though certainly abusing the staging directory.
> A more general one might be to find where Dataset.persist(DISK_ONLY)
> writes.
> On Fri, May 26, 2017 at 9:08 AM, Everett Anderson wrote:
>> Hi,
>> I need to set a checkpoint directory as I'm starting to use GraphFrames.
>> (Also, occasionally my regular DataFrame lineages get too long so it'd be
>> nice to use checkpointing to squash the lineage.)
>> I don't actually need this checkpointed data to live beyond the life of
>> the job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and
>> reading and writing non-transient data to S3.
>> Two questions:
>> 1. Is there a Spark --conf option to set the checkpoint directory?
>> Somehow I couldn't find it, but surely it exists.
>> 2. What's a good checkpoint directory for this use case? I imagine it'd
>> be on HDFS and presumably in a YARN application-specific temporary path
>> that gets cleaned up afterwards. Does anyone have a recommendation?
>> Thanks!
>> - Everett

Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Everett Anderson
Still haven't found a --conf option.

Regarding a temporary HDFS checkpoint directory, it looks like when using
--master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment
variable. Thus, one could do the following when creating a SparkSession:

val checkpointPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"),
  "checkpoints").toString
sparkSession.sparkContext.setCheckpointDir(checkpointPath)

The staging directory is in an HDFS path like

/user/[user]/.sparkStaging/[YARN application ID]

and is deleted at the end of the application.

So this is one option, though certainly abusing the staging directory.

A more general one might be to find where Dataset.persist(DISK_ONLY) writes.
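
One caveat with the staging-directory approach: `System.getenv` returns null when the variable is not set (for example, when the job is not launched through YARN), so it may be worth guarding the lookup. Below is a minimal Java sketch of that guard. The class name `StagingCheckpointDir` and the method `fromEnv` are hypothetical, and the environment is passed in as a map so the logic can be exercised without a cluster.

```java
import java.util.Map;
import java.util.Optional;

public class StagingCheckpointDir {

    /**
     * Returns "<staging dir>/checkpoints" if SPARK_YARN_STAGING_DIR is present
     * and non-empty in the given environment, otherwise an empty Optional.
     */
    static Optional<String> fromEnv(Map<String, String> env) {
        String staging = env.get("SPARK_YARN_STAGING_DIR");
        if (staging == null || staging.isEmpty()) {
            return Optional.empty();
        }
        // Drop one trailing slash so the joined path has no doubled separator.
        String base = staging.endsWith("/")
                ? staging.substring(0, staging.length() - 1)
                : staging;
        return Optional.of(base + "/checkpoints");
    }

    public static void main(String[] args) {
        Optional<String> dir = fromEnv(System.getenv());
        // In a real job one might instead call:
        //   dir.ifPresent(d -> sparkContext.setCheckpointDir(d));
        System.out.println(dir
                .map(d -> "checkpoint dir: " + d)
                .orElse("SPARK_YARN_STAGING_DIR not set; choose an explicit path"));
    }
}
```

Returning an `Optional` leaves the fallback decision (an explicit HDFS path, or failing fast) to the caller rather than hiding it in the helper.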

On Fri, May 26, 2017 at 9:08 AM, Everett Anderson wrote:

> Hi,
> I need to set a checkpoint directory as I'm starting to use GraphFrames.
> (Also, occasionally my regular DataFrame lineages get too long so it'd be
> nice to use checkpointing to squash the lineage.)
> I don't actually need this checkpointed data to live beyond the life of
> the job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and
> reading and writing non-transient data to S3.
> Two questions:
> 1. Is there a Spark --conf option to set the checkpoint directory? Somehow
> I couldn't find it, but surely it exists.
> 2. What's a good checkpoint directory for this use case? I imagine it'd be
> on HDFS and presumably in a YARN application-specific temporary path that
> gets cleaned up afterwards. Does anyone have a recommendation?
> Thanks!
> - Everett