When you call `Dataset.rdd`, you actually create a new RDD derived from the Dataset.

Here you can see what it does internally:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828
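
For example, here is a rough, untested sketch (assuming a SparkSession
named `spark` is in scope) of why caching through `df.rdd` does not
cache `df` itself:

val df = spark.range(0, 1000000).toDF("id")

// Caches only the deserialized RDD that `df.rdd` builds, not the
// DataFrame's own query plan:
df.rdd.cache()
df.count()   // still recomputes; count() does not run through df.rdd,
             // so the RDD cache is never even populated

// Caches the DataFrame itself:
df.cache()
df.count()   // materializes the cache
df.count()   // now served from the cache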



On Fri, Oct 13, 2017 at 5:24 PM, Supun Nakandala <supun.nakand...@gmail.com>
wrote:

> Hi Weichen,
>
> Thank you for the reply.
>
> My understanding was that the DataFrame API uses the old RDD
> implementation under the covers, even though it presents a different
> API, and that calling df.rdd simply gives access to the underlying RDD.
> Is this assumption wrong? I would appreciate it if you could shed more
> light on this issue or point me to documentation where I can learn
> about it.
>
> Thank you in advance.
>
> On Fri, Oct 13, 2017 at 3:19 AM, Weichen Xu <weichen...@databricks.com>
> wrote:
>
>> You should use `df.cache()`.
>> `df.rdd.cache()` won't work, because `df.rdd` generates a new RDD from
>> the original `df` and then caches that new RDD.
>>
>> On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala <
>> supun.nakand...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have been experimenting with the cache/persist/unpersist methods on
>>> both the DataFrame and RDD APIs. However, I am seeing different
>>> behavior from the DataFrame API compared to the RDD API; for example,
>>> DataFrames are not getting cached when count() is called.
>>>
>>> Is there a difference in how these operations behave between the
>>> DataFrame and RDD APIs?
>>>
>>> Thank You.
>>> -Supun
>>>
>>
>>
>
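
To make this concrete, you can compare storage levels after each
approach (again a rough, untested sketch; `Dataset.storageLevel` is
available since Spark 2.1):

val df = spark.range(0, 1000).toDF("id")

val rdd = df.rdd                 // a new RDD derived from `df`
rdd.cache()
println(rdd.getStorageLevel)     // MEMORY_ONLY: the derived RDD is marked for caching
println(df.storageLevel)         // NONE: the DataFrame itself is still uncached

df.cache()
println(df.storageLevel)         // MEMORY_AND_DISK: the default for Dataset.cache()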
