You certainly shouldn't just sprinkle persist calls in; that has never been
the idea. Persisting can help in some cases, but is just overhead in others.
Be thoughtful about why you are adding these statements.
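
For a concrete illustration (table name, filter, and paths below are
made-up placeholders): persist pays off when one expensive dataframe feeds
several actions, and is pure overhead when only a single action ever
touches it.

df = spark.table("events").filter("status = 'ok'")  # hypothetical input

# Helps: two actions reuse df, so persisting avoids running the scan
# and filter twice.
df.persist()
total = df.count()
df.write.mode("overwrite").parquet("/tmp/events_ok")
df.unpersist()

# Overhead: if the write were the only action, persist would just spend
# memory/disk on data that is never read back.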

On Wed, Apr 27, 2022 at 11:16 AM Koert Kuipers <ko...@tresata.com> wrote:

> we have quite a few persist statements in our codebase, one wherever we
> reuse a dataframe.
> we noticed that they slow things down quite a bit (sometimes doubling the
> runtime) while providing little benefit, since spark already reuses the
> shuffle files underlying the dataframe efficiently even if you don't
> persist.
> so at this point i am considering removing those persist statements...
> not sure what other people's experiences are on this
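>
> rough sketch of how we observe this (table name and paths are made up):
>
> agg = spark.table("events").groupBy("user_id").count()  # needs a shuffle
> agg.write.mode("overwrite").parquet("/tmp/agg_a")       # job 1
> agg.write.mode("overwrite").parquet("/tmp/agg_b")       # job 2
> # when the shuffle output is reused, the groupBy stage of job 2 shows
> # up as "skipped" in the spark ui even though agg was never persisted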
>
> On Thu, Apr 21, 2022 at 9:41 AM "Yuri Oleynikov (יורי אולייניקוב)" <
> yur...@gmail.com> wrote:
>
>> Hi Sean
>>
>> Persisting/caching is useful when you're going to reuse a dataframe. So
>> in your case no persisting/caching is required. That covers the "when".
>>
>> As for the "where": persist usually belongs at the point closest to where
>> the calculations/transformations are reused.
>>
>> Btw, I'm not sure caching is useful when you have a HUGE dataframe;
>> persisting with an explicit storage level may be more useful.
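>>
>> For example (just a sketch; df stands in for any huge dataframe):
>>
>> from pyspark import StorageLevel
>>
>> # cache() is shorthand for persist() with the default storage level;
>> # persist() lets you choose a level explicitly, e.g. spill to disk:
>> df.persist(StorageLevel.MEMORY_AND_DISK)
>> # or, to avoid memory pressure entirely:
>> # df.persist(StorageLevel.DISK_ONLY)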
>>
>> Best regards
>>
>> On 21 Apr 2022, at 16:24, Sean Owen <sro...@gmail.com> wrote:
>>
>> You persist before actions, not after, if you want the result those
>> actions compute to be persisted.
>> If anything, swap lines 2 and 3. However, there's no point in the count()
>> here, and because only one action (the write) then follows, no caching is
>> useful in that example.
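>>
>> Concretely, a sketch for the case where there really are two actions
>> (partition count, mode and output path are made-up placeholders):
>>
>> df = spark.sql("some sql on huge dataset")
>> df.persist()                       # lazy: only marks df for caching
>> df1 = df.count()                   # first action computes df, fills cache
>> df.repartition(100).write.mode("overwrite").parquet("/tmp/out")
>> df.unpersist()                     # release the cache when done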
>>
>> On Thu, Apr 21, 2022 at 2:26 AM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> Hi Folks,
>>>
>>> I am working with the Spark DataFrame API, where I am doing the
>>> following:
>>>
>>> 1) df = spark.sql("some sql on huge dataset").persist()
>>> 2) df1 = df.count()
>>> 3) df.repartition().write.mode().parquet("")
>>>
>>>
>>> AFAIK, persist should be used after the count statement, if it is needed
>>> at all, since Spark is lazily evaluated: any action will recompute the
>>> code above, so there is no use in persisting before an action.
>>>
>>> Therefore, something like the below should give better performance:
>>> 1) df= spark.sql("some sql on huge dataset")
>>> 2) df1 = df.count()
>>> 3) df.persist()
>>> 4) df.repartition().write.mode().parquet("")
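>>>
>>> (A runnable version of the above, with made-up placeholders for the
>>> query, partition count, mode and path:)
>>>
>>> df = spark.sql("SELECT * FROM some_huge_table")
>>> df1 = df.count()
>>> df.persist()
>>> df.repartition(200).write.mode("overwrite").parquet("/tmp/out")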
>>>
>>> So please help me understand how exactly it should be done, and why, if
>>> my understanding is not correct.
>>>
>>> Thanks,
>>> Sid
>>>
>>>
