Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Asher Krim
You should actually be able to get to the underlying filesystem from your
SparkContext:

String originalFs = sparkContext.hadoopConfiguration().get("fs.defaultFS");


and then you could just use that:

// "checkpointDirectory" is whatever relative directory name you choose
String checkpointPath = String.format("%s/%s/", originalFs, checkpointDirectory);
sparkContext.setCheckpointDir(checkpointPath);
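
In Scala the same idea would look roughly like this (untested sketch; the
"tmp/checkpoints" name is just a placeholder, and `spark` here is assumed to be
your SparkSession):

import org.apache.hadoop.fs.Path

// Build the checkpoint path on the cluster's default filesystem (HDFS on EMR)
// instead of hard-coding an hdfs:// or s3:// URI.
val defaultFs = spark.sparkContext.hadoopConfiguration.get("fs.defaultFS")
val checkpointPath = new Path(defaultFs, "tmp/checkpoints").toString
spark.sparkContext.setCheckpointDir(checkpointPath)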


Asher Krim
Senior Software Engineer



Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Everett Anderson
Still haven't found a --conf option.

Regarding a temporary HDFS checkpoint directory, it looks like when using
--master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment
variable. Thus, one could do the following when creating a SparkSession:

import org.apache.hadoop.fs.Path

val checkpointPath =
  new Path(System.getenv("SPARK_YARN_STAGING_DIR"), "checkpoints").toString
sparkSession.sparkContext.setCheckpointDir(checkpointPath)

The staging directory is in an HDFS path like

/user/[user]/.sparkStaging/[YARN application ID]

and is deleted at the end of the application.

So this is one option, though certainly abusing the staging directory.
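
An alternative that doesn't touch the staging directory would be to pick an
explicit per-application HDFS path and delete it yourself when the job
finishes. A rough, untested sketch (the /tmp/spark-checkpoints prefix is just
an example):

import org.apache.hadoop.fs.{FileSystem, Path}

val sc = sparkSession.sparkContext
// Per-application directory on the default filesystem (HDFS on EMR).
val checkpointDir = new Path(s"/tmp/spark-checkpoints/${sc.applicationId}")
sc.setCheckpointDir(checkpointDir.toString)

try {
  // ... run the job ...
} finally {
  // The checkpointed data is transient, so remove it when the job is done.
  FileSystem.get(sc.hadoopConfiguration).delete(checkpointDir, true)
}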

A more general one might be to find where Dataset.persist(DISK_ONLY) writes.
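
I believe persist(DISK_ONLY) writes its block files to executor-local
directories (the YARN node manager local dirs on EMR, or spark.local.dir
elsewhere) rather than to a shared filesystem, so it may not translate
directly into a checkpoint location. A rough, untested way to see which
directories are in play on a given JVM:

// LOCAL_DIRS is set by YARN inside containers; spark.local.dir (default /tmp)
// applies outside YARN.
val localDirs = Option(System.getenv("LOCAL_DIRS"))
  .getOrElse(sparkSession.sparkContext.getConf.get("spark.local.dir", "/tmp"))
println(s"DISK_ONLY block files would land under: $localDirs")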




Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-26 Thread Everett Anderson
Hi,

I need to set a checkpoint directory as I'm starting to use GraphFrames.
(Also, occasionally my regular DataFrame lineages get too long so it'd be
nice to use checkpointing to squash the lineage.)

I don't actually need this checkpointed data to live beyond the life of the
job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and reading
and writing non-transient data to S3.

Two questions:

1. Is there a Spark --conf option to set the checkpoint directory? Somehow
I couldn't find it, but surely it exists.

2. What's a good checkpoint directory for this use case? I imagine it'd be
on HDFS and presumably in a YARN application-specific temporary path that
gets cleaned up afterwards. Does anyone have a recommendation?

Thanks!

- Everett