Sean, do you mean that if a DF is used more than once in transformations, I
should cache it? Frankly, that does not seem to hold either, because in many
places a DF that is used only once gives the same result with and without
caching. How should we decide whether to use cache or not?


Thanks
Amit
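
One way to decide, following Theo's suggestion further down the thread, is to
compare the plan with and without the cache. The sketch below is only an
illustration under stated assumptions: the column names (key, value, label),
the in-memory sample rows standing in for the two CSV reads, the helper
buildDf3, and count() in place of the save call are all hypothetical and not
taken from the original job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the two CSV inputs mentioned in the thread.
    val df1 = Seq((1, 10), (1, 20), (2, 30)).toDF("key", "value")
    val other = Seq((1, "a"), (2, "b")).toDF("key", "label")

    // Same shape as the thread: DF2 is derived from DF1, and DF3 joins both,
    // so DF1 is referenced twice within a single action.
    def buildDf3(base: DataFrame): DataFrame = {
      val df2 = base.groupBy("key").agg(sum("value").as("total"))
      other.join(base, "key").join(df2, "key")
    }

    // Without cache: the source of df1 shows up once per reference in the plan.
    buildDf3(df1).explain()

    // cache() is lazy; it only marks df1 to be materialized on first use.
    val cached = df1.cache()

    // With cache: references to df1 appear as InMemoryTableScan in the plan.
    val df3 = buildDf3(cached)
    df3.explain()

    // The single action. The first use of df1 fills the cache and the
    // second use reads from it.
    println(df3.count())

    cached.unpersist()
    spark.stop()
  }
}
```

If the plan references a DataFrame only once, caching it mostly adds the cost
of storing it; if it is referenced several times and is expensive to compute,
caching is more likely to pay off.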

On Mon, Dec 7, 2020 at 1:01 PM Sean Owen <sro...@gmail.com> wrote:

> No, it's not true that one action means every DF is evaluated once. This
> is a good counterexample.
>
> On Mon, Dec 7, 2020 at 11:47 AM Amit Sharma <resolve...@gmail.com> wrote:
>
>> Thanks for the information. I am using Spark 2.3.3. I have a few more
>> questions.
>>
>> 1. Yes, I am using DF1 two times, but at the end there is only one action,
>> on DF3. In that case, should DF1 be evaluated just once, or does it depend
>> on how many times the dataframe is used in transformations?
>>
>> I believe that even if we use a dataframe multiple times in transformations,
>> the decision to cache should be based on actions. In my case the only action
>> is one save call on DF3. Please correct me if I am wrong.
>>
>> Thanks
>> Amit
>>
>> On Mon, Dec 7, 2020 at 11:54 AM Theodoros Gkountouvas <
>> theo.gkountou...@futurewei.com> wrote:
>>
>>> Hi Amit,
>>>
>>>
>>>
>>> One action might use the same DataFrame more than once. You can look at
>>> your query plan by executing DF3.explain (the arguments differ depending
>>> on the version of Spark you are using) and see how many times you need to
>>> compute DF2 or DF1. Given the information you have provided, I suspect that
>>> DF1 is used more than once (once in DF2 and again in DF3). So, when DF1 is
>>> cached, Spark is going to cache it the first time and load it from the
>>> cache the second time instead of computing it again.
>>>
>>>
>>>
>>> I hope this helped,
>>>
>>> Theo.
>>>
>>>
>>>
>>> *From:* Amit Sharma <resolve...@gmail.com>
>>> *Sent:* Monday, December 7, 2020 11:32 AM
>>> *To:* user@spark.apache.org
>>> *Subject:* Caching
>>>
>>>
>>>
>>> Hi All, I am using caching in my code. I have DataFrames like
>>>
>>> val DF1 = read csv
>>>
>>> val DF2 = DF1.groupBy().agg().select(.....)
>>>
>>>
>>>
>>> val DF3 = read csv .join(DF1).join(DF2)
>>>
>>> DF3.save
>>>
>>>
>>>
>>> If I do not cache DF2 or DF1, it takes longer. But I am running only
>>> one action, so why do I need to cache?
>>>
>>>
>>>
>>> Thanks
>>>
>>> Amit
>>>
>>>
>>>
>>>
>>>
>>
