Re: eager? in dataframe's checkpoint

2017-02-02 Thread Jean Georges Perrin
I wrote this piece based on all of that; hopefully it will help:
http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/





Re: eager? in dataframe's checkpoint

2017-01-31 Thread Koert Kuipers
I thought RDD.checkpoint was asynchronous? checkpointData is indeed updated
synchronously, but checkpointData.isCheckpointed is false until the actual
checkpoint operation has completed, and until then any operation runs on the
original RDD.

For example, notice how the session below prints "not yet materialized" 6
times, instead of just 3 times as it would if the count had operated on the
checkpoint data.

scala> val x = sc.parallelize(1 to 3).map{ (i) => println("not yet materialized"); i }
x: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:24

scala> x.checkpoint(); println("is checkpointed? " + x.isCheckpointed); println("count " + x.count)
is checkpointed? false
not yet materialized
not yet materialized
not yet materialized
not yet materialized
not yet materialized
not yet materialized
count 3
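
A possible follow-up to the session above, offered as a sketch and not as something posted in this thread: it replaces the println with an accumulator, checks isCheckpointed again after the action, and repeats the run with cache() in front of checkpoint(), which the RDD.checkpoint documentation recommends to avoid recomputation. The object name, the run helper and the temporary checkpoint directory are all hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object RddCheckpointTiming {
  /** Returns (isCheckpointed before the action, isCheckpointed after it, map evaluations). */
  def run(spark: SparkSession, cacheFirst: Boolean): (Boolean, Boolean, Long) = {
    val sc = spark.sparkContext
    // Counts how often the map function runs (stands in for the println in the transcript).
    val evals = sc.longAccumulator(s"evals-cache-$cacheFirst")
    val mapped = sc.parallelize(1 to 3).map { i => evals.add(1); i }
    val x = if (cacheFirst) mapped.cache() else mapped

    x.checkpoint()                 // only marks the RDD; nothing is written yet
    val before = x.isCheckpointed
    x.count()                      // first action: runs the job, then writes the checkpoint
    (before, x.isCheckpointed, evals.value.longValue)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("rdd-checkpoint-timing").getOrCreate()
    // Assumption: a throwaway local directory; on a cluster this would be a shared path such as HDFS.
    spark.sparkContext.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    // Without cache() the checkpoint write recomputes the RDD, as in the transcript.
    println(s"no cache:   ${run(spark, cacheFirst = false)}")
    // With cache() the checkpoint write should reuse the cached partitions.
    println(s"with cache: ${run(spark, cacheFirst = true)}")
    spark.stop()
  }
}
```

If this behaves as the transcript suggests, the uncached run reports twice as many evaluations as elements, and the cached run reports fewer.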








Re: eager? in dataframe's checkpoint

2017-01-31 Thread Burak Yavuz
Hi Koert,

When eager is true, we return a new DataFrame that depends on the files
written out to the checkpoint directory. All previous operations on the
checkpointed DataFrame are gone forever; you basically start fresh. AFAIK,
when eager is true, the method will not return until the DataFrame is
completely checkpointed. If you look at the RDD.checkpoint implementation,
the checkpoint location is updated synchronously, so during the count
`isCheckpointed` will be true.
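
One way to observe the difference the eager flag makes, as a minimal sketch under stated assumptions: an accumulator counts how many rows have been evaluated by the time Dataset.checkpoint(eager) returns. The object name, the helper name and the temporary checkpoint directory are hypothetical, and the exact count in the eager case may differ between Spark versions, so the code prints it without asserting a specific value.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object EagerCheckpointSketch {
  /** Number of map evaluations that had happened when checkpoint(eager) returned. */
  def evaluationsAfterCheckpoint(spark: SparkSession, eager: Boolean): Long = {
    import spark.implicits._
    val evals = spark.sparkContext.longAccumulator(s"evals-eager-$eager")
    val ds = spark.range(0, 5).map { i => evals.add(1); i.longValue }

    // eager = true: the data is computed and written before this call returns.
    // eager = false: the Dataset is only marked; the work happens on the first later action.
    val checkpointed = ds.checkpoint(eager)
    val seen = evals.value.longValue

    checkpointed.count() // later operations read from the checkpointed data
    seen
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("eager-checkpoint-sketch").getOrCreate()
    // Assumption: a throwaway local directory; on a cluster this would be a shared path such as HDFS.
    spark.sparkContext.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    println(s"evaluations when checkpoint(eager = false) returned: ${evaluationsAfterCheckpoint(spark, eager = false)}")
    println(s"evaluations when checkpoint(eager = true) returned:  ${evaluationsAfterCheckpoint(spark, eager = true)}")
    spark.stop()
  }
}
```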

Best,
Burak



Re: eager? in dataframe's checkpoint

2017-01-31 Thread Koert Kuipers
I understand that checkpoint cuts the lineage, but I am not fully sure I
understand the role of eager.

Eager simply seems to materialize the RDD early with a count, right after
the RDD has been checkpointed. But why is that useful? rdd.checkpoint is
asynchronous, so when the rdd.count happens, most likely rdd.isCheckpointed
will be false, and the count will run on the RDD as it was before it was
checkpointed. What is the benefit of that?




Re: eager? in dataframe's checkpoint

2017-01-26 Thread Burak Yavuz
Hi,

One of the goals of checkpointing is to cut the RDD lineage. Otherwise you
can run into a StackOverflowError. If you eagerly checkpoint, you basically
cut the lineage there, and the next operations all depend on the
checkpointed DataFrame. If you don't checkpoint, you continue to build the
lineage, and while that lineage is being resolved you may hit the
StackOverflowError.
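
To make the lineage cut visible, here is a small sketch at the RDD level; for a DataFrame the same idea would go through Dataset.checkpoint. It chains many map steps, measures how long the dependency chain is, then checkpoints and measures again. The object name, the depth helper and the temporary checkpoint directory are hypothetical, and depth only follows the first dependency of each RDD, which is enough for a linear chain like this one.

```scala
import java.nio.file.Files
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object LineageCutSketch {
  /** Length of the chain found by following the first dependency back to a source RDD. */
  @annotation.tailrec
  def depth(rdd: RDD[_], acc: Int = 1): Int =
    rdd.dependencies.headOption match {
      case Some(dep) => depth(dep.rdd, acc + 1)
      case None      => acc
    }

  /** Returns (lineage depth before checkpointing, lineage depth after checkpointing). */
  def run(spark: SparkSession, steps: Int): (Int, Int) = {
    var rdd: RDD[Int] = spark.sparkContext.parallelize(1 to 3)
    // Every transformation adds one more link to the lineage.
    for (_ <- 1 to steps) rdd = rdd.map(_ + 1)
    val before = depth(rdd)

    rdd.checkpoint() // mark for checkpointing
    rdd.count()      // the first action actually writes the checkpoint
    val after = depth(rdd) // the parent is now the checkpoint data, not the map chain
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("lineage-cut-sketch").getOrCreate()
    // Assumption: a throwaway local directory; on a cluster this would be a shared path such as HDFS.
    spark.sparkContext.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    val (before, after) = run(spark, steps = 50)
    println(s"lineage depth before checkpoint: $before, after: $after")
    spark.stop()
  }
}
```

In an iterative job, checkpointing every few iterations in this way keeps the chain short enough that resolving it does not recurse deeply.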

HTH,
Burak



eager? in dataframe's checkpoint

2017-01-26 Thread Jean Georges Perrin
Hey Sparkers,

Trying to understand the DataFrame's checkpoint (not in the context of
streaming):
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)

What is the goal of the eager flag?

Thanks!

jg