Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-14 Thread Supun Nakandala
Hi Weichen,

Thank you very much for the explanation.


Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Weichen Xu
Hi Supun,

The DataFrame API does not use the old RDD implementation under the
covers; DataFrames have their own implementation (a binary row format,
and columnar storage when cached). So a DataFrame has no relationship to
the `RDD[Row]` you want to get.

When you call `df.rdd` and then cache, Spark has to turn the DataFrame
into an RDD: it extracts each row from the DataFrame, deserializes it,
and composes a new RDD.

Thanks!


Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Stephen Boesch
@Vadim   Would it be true to say that `.rdd` *may* create a new job,
depending on whether the DataFrame/Dataset had already been materialized
via an action or checkpoint? If the only prior operations on the
DataFrame had been transformations, then the DataFrame would still not
have been computed. In that case, would it also be true that a subsequent
action/checkpoint on the DataFrame (not the RDD) would then generate a
separate job?


Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Vadim Semenov
When you call `Dataset.rdd`, you actually create a new job.

Here you can see what it does internally:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828



Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Supun Nakandala
Hi Weichen,

Thank you for the reply.

My understanding was that the DataFrame API uses the old RDD
implementation under the covers, though it presents a different API, and
that calling `df.rdd` simply gives access to the underlying RDD. Is this
assumption wrong? I would appreciate it if you could shed more light on
this or point me to documentation where I can learn about it.

Thank you in advance.


Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Weichen Xu
You should use `df.cache()`.
`df.rdd.cache()` won't work, because `df.rdd` generates a new RDD from the
original `df`, and you would then be caching that new RDD instead.

On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala 
wrote:

> Hi all,
>
> I have been experimenting with the cache/persist/unpersist methods with
> respect to both the DataFrame and RDD APIs. However, I am seeing
> different behavior in the DataFrame API compared to the RDD API; for
> example, DataFrames are not getting cached when count() is called.
>
> Is there a difference in how these operations behave between the
> DataFrame and RDD APIs?
>
> Thank You.
> -Supun
>