When you call `Dataset.rdd`, you actually create a new RDD; you can see what it does internally here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828
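
To make the difference concrete, here is a minimal sketch (not taken from the Spark source). It assumes a local SparkSession and Spark 2.1 or later, where `Dataset.storageLevel` is available; the object name, app name and sample data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; the name is not from the thread.
object CacheDataFrameVsRdd {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("cache-df-vs-rdd")
      .master("local[*]")
      .getOrCreate()

    // Made-up sample data.
    val df = spark.range(0, 1000).toDF("id")

    // This marks the RDD derived from df for caching, not df itself.
    df.rdd.cache()
    df.count() // runs df's own plan, which does not go through df.rdd
    val cachedAfterRddCache = df.storageLevel != StorageLevel.NONE
    println(s"df cached after df.rdd.cache(): $cachedAfterRddCache")

    // This marks the DataFrame itself for caching.
    df.cache()
    df.count() // the action materializes the cached data
    val cachedAfterDfCache = df.storageLevel != StorageLevel.NONE
    println(s"df cached after df.cache(): $cachedAfterDfCache")

    spark.stop()
  }
}
```

If the explanation above holds, the first check should report that `df` is not cached and the second that it is. The Storage tab of the Spark UI is another place to confirm which of the two objects ended up cached.
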
On Fri, Oct 13, 2017 at 5:24 PM, Supun Nakandala <supun.nakand...@gmail.com> wrote:

> Hi Weichen,
>
> Thank you for the reply.
>
> My understanding was that the Dataframe API uses the old RDD implementation
> under the covers, though it presents a different API, and that calling
> df.rdd simply gives access to the underlying RDD. Is this assumption
> wrong? I would appreciate it if you could shed more light on this issue or
> point me to documentation where I can learn more.
>
> Thank you in advance.
>
> On Fri, Oct 13, 2017 at 3:19 AM, Weichen Xu <weichen...@databricks.com>
> wrote:
>
>> You should use `df.cache()`.
>> `df.rdd.cache()` won't work, because `df.rdd` generates a new RDD from the
>> original `df`, and the cache call then applies to that new RDD.
>>
>> On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala <
>> supun.nakand...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have been experimenting with the cache/persist/unpersist methods in
>>> both the Dataframe and RDD APIs. However, I am seeing different
>>> behavior in the Dataframe API compared to the RDD API, such as
>>> Dataframes not getting cached when count() is called.
>>>
>>> Is there a difference between how these operations act with respect to
>>> the Dataframe and RDD APIs?
>>>
>>> Thank You.
>>> -Supun