@Vadim   Would it be true to say that `.rdd` *may* create a new job,
depending on whether the DataFrame/Dataset had already been materialized
via an action or checkpoint?   If the only prior operations on the
DataFrame were transformations, then the DataFrame would still not have
been computed.  In that case, would it also be true that a subsequent
action/checkpoint on the DataFrame (not the RDD) would then generate a
separate job?
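For reference, here is a minimal sketch of the distinction being discussed (assuming a local SparkSession; the behavior described in the comments reflects my reading of the thread and the linked `Dataset.scala` code, not a definitive spec):

```scala
import org.apache.spark.sql.SparkSession

// A local session purely for illustration.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("x")

df.cache()        // marks `df` itself for caching (lazy; nothing runs yet)
df.count()        // action: materializes `df` into the cache

// `df.rdd` produces a *new* RDD derived from the Dataset (internally it
// deserializes rows, which is why it can kick off work), so caching it
// does NOT cache the original DataFrame:
val rdd = df.rdd
rdd.cache()       // caches only this derived RDD of Rows, not `df`
```

So `df.cache()` followed by an action is the way to cache the DataFrame; `df.rdd.cache()` only caches the separately derived RDD.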

2017-10-13 14:50 GMT-07:00 Vadim Semenov <vadim.seme...@datadoghq.com>:

> When you do `Dataset.rdd` you actually create a new job
>
> here you can see what it does internally:
> https://github.com/apache/spark/blob/master/sql/core/src/
> main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828
>
>
>
> On Fri, Oct 13, 2017 at 5:24 PM, Supun Nakandala <
> supun.nakand...@gmail.com> wrote:
>
>> Hi Weichen,
>>
>> Thank you for the reply.
>>
>> My understanding was that the DataFrame API uses the old RDD
>> implementation under the covers, though it presents a different API, and
>> that calling df.rdd simply gives access to the underlying RDD. Is this
>> assumption wrong? I would appreciate it if you could shed more light on
>> this issue or point me to documentation where I can learn about it.
>>
>> Thank you in advance.
>>
>> On Fri, Oct 13, 2017 at 3:19 AM, Weichen Xu <weichen...@databricks.com>
>> wrote:
>>
>>> You should use `df.cache()`.
>>> `df.rdd.cache()` won't work, because `df.rdd` generates a new RDD from
>>> the original `df`, and it is that new RDD that gets cached.
>>>
>>> On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala <
>>> supun.nakand...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been experimenting with the cache/persist/unpersist methods on
>>>> both the DataFrame and RDD APIs. However, I am seeing different
>>>> behavior from the DataFrame API compared to the RDD API; for example,
>>>> DataFrames are not getting cached when count() is called.
>>>>
>>>> Is there a difference between how these operations behave with respect
>>>> to the DataFrame and RDD APIs?
>>>>
>>>> Thank You.
>>>> -Supun
>>>>
>>>
>>>
>>
>