Re: When should we cache / persist ? After or Before Actions?

Koert Kuipers Wed, 27 Apr 2022 09:16:45 -0700

we have quite a few persist statements in our codebase, wherever we are
reusing a dataframe.
we noticed that they slow things down quite a bit (sometimes doubling the
runtime) while providing little benefit, since spark already re-uses the
shuffle files underlying the dataframe efficiently even if you don't
persist.
so at this point i am considering removing those persist statements...
not sure what other people's experiences are with this
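
one rough way to check this per job is to time the same actions with and
without persist. below is a minimal sketch in Python, not a benchmark
harness. time_action and compare_reuse are hypothetical helper names, not
part of Spark. the only DataFrame methods assumed are persist() and
unpersist(), plus whatever actions you pass in. the table and column names
in the usage comment are made up.

```python
import time


def time_action(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def compare_reuse(build_df, actions, use_persist):
    """Time each action against one DataFrame, with or without persist().

    build_df: zero-argument callable returning a fresh DataFrame.
    actions:  callables that take the DataFrame and trigger a job.
    Returns the elapsed seconds of each action, in order.
    """
    df = build_df()
    if use_persist:
        # persist() is lazy: it only marks the plan. The first action
        # pays the cost of filling the cache; later ones can read it.
        df = df.persist()
    timings = []
    try:
        for action in actions:
            _, elapsed = time_action(lambda: action(df))
            timings.append(elapsed)
    finally:
        if use_persist:
            df.unpersist()
    return timings


# Hypothetical usage with a real SparkSession named `spark`:
#   build = lambda: spark.table("events").groupBy("user").count()
#   acts = [lambda d: d.count(), lambda d: d.count()]
#   with_cache = compare_reuse(build, acts, use_persist=True)
#   without = compare_reuse(build, acts, use_persist=False)
```

if the second and later timings are not clearly lower with persist, the
cache is probably not paying for the cost of filling it on that job.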


On Thu, Apr 21, 2022 at 9:41 AM "Yuri Oleynikov (יורי אולייניקוב)" <
yur...@gmail.com> wrote:

> Hi Sean
>
> Persisting/caching is useful when you’re going to reuse a dataframe. So in
> your case no persisting/caching is required. That covers the “when”.
>
> As for the “where”: persist as close as possible to the point where the
> calculations/transformations are reused.
>
> Btw, I’m not sure if caching is useful when you have a HUGE dataframe.
> Maybe persisting will be more useful
>
> Best regards
>
> On 21 Apr 2022, at 16:24, Sean Owen <sro...@gmail.com> wrote:
>
> 
> You persist before actions, not after, if you want the action's outputs to
> be persistent.
> If anything, swap lines 2 and 3. However, there's no point in the count()
> here, and because only one action (the write) would then remain, no
> caching is useful in that example.
>
> On Thu, Apr 21, 2022 at 2:26 AM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Folks,
>>
>> I am working with the Spark DataFrame API, where I am doing the following:
>>
>> 1) df = spark.sql("some sql on huge dataset").persist()
>> 2) df1 = df.count()
>> 3) df.repartition().write.mode().parquet("")
>>
>>
>> AFAIK, persist should be used after the count statement, if it is needed
>> at all: since Spark is lazily evaluated, any action I call will recompute
>> the code above, so there is no use in persisting before the action.
>>
>> Therefore, it should be something like the below that should give better
>> performance.
>> 1) df = spark.sql("some sql on huge dataset")
>> 2) df1 = df.count()
>> 3) df.persist()
>> 4) df.repartition().write.mode().parquet("")
>>
>> So please help me understand how exactly it should be done and why, if I
>> am not correct.
>>
>> Thanks,
>> Sid
>>
>>
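
to make the before/after point in the quoted thread concrete, here is a toy
model in plain Python. LazyFrame is a hypothetical stand-in for a lazily
evaluated dataframe, not Spark's API. it ignores Spark internals such as
shuffle file reuse and partitioning. it only counts how often the
underlying computation runs in the two orderings from the question.

```python
class LazyFrame:
    """Toy lazily evaluated frame; `runs` counts real computations."""

    def __init__(self, compute):
        self._compute = compute
        self._persist = False
        self._cache = None
        self.runs = 0

    def persist(self):
        # Lazy, like the real thing: nothing is computed here.
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        self.runs += 1
        rows = self._compute()
        if self._persist:
            self._cache = rows
        return rows

    def count(self):
        return len(self._materialize())

    def write(self, sink):
        sink.extend(self._materialize())


def persist_before_action():
    df = LazyFrame(lambda: list(range(5))).persist()
    df.count()    # computes once and fills the cache
    df.write([])  # served from the cache
    return df.runs


def persist_after_action():
    df = LazyFrame(lambda: list(range(5)))
    df.count()    # computes, nothing is kept
    df.persist()  # only marks the frame
    df.write([])  # computes again
    return df.runs


print(persist_before_action(), persist_after_action())
```

in this model, persisting before the first action gives one computation,
while persisting after the count gives two. dropping the count() leaves a
single action, so there is nothing to reuse and no persist is needed.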
