Hi Sean,

Persisting/caching is useful when you’re going to reuse a dataframe, so in your 
case no persisting/caching is required. That covers the “when”.

The “where” is usually the point closest to where the 
calculations/transformations are reused.

Btw, I’m not sure caching is useful when you have a HUGE dataframe. Maybe 
persisting (e.g. with a disk-backed storage level) will be more useful.
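
To illustrate the point about lazy evaluation, here is a pure-Python toy (not Spark API; the `Lazy` class, `persist`, `count`, and `collect` names just mimic the Spark semantics discussed above): a pipeline does no work until an "action" asks for a result, and persisting before the first action means later actions hit the cache instead of recomputing.

```python
# Toy model of a lazy pipeline, mimicking Spark's lazy evaluation.
# Nothing here is Spark API; it only illustrates the semantics.

class Lazy:
    def __init__(self, compute):
        self._compute = compute       # deferred transformation
        self._cached = None           # filled only after persist() + an action
        self._persisted = False
        self.recomputations = 0       # how many times the work actually ran

    def persist(self):
        # Mark for caching; as in Spark, the cache fills on the NEXT action.
        self._persisted = True
        return self

    def _materialize(self):
        if self._persisted and self._cached is not None:
            return self._cached       # cache hit: no recomputation
        self.recomputations += 1
        result = self._compute()
        if self._persisted:
            self._cached = result
        return result

    def count(self):                  # an "action"
        return len(self._materialize())

    def collect(self):                # another "action"
        return self._materialize()


# Persist BEFORE the first action: the expensive work runs once.
df = Lazy(lambda: [x * x for x in range(5)]).persist()
df.count()
df.collect()
print(df.recomputations)  # 1 -- the second action hit the cache

# No persist: every action recomputes the whole pipeline.
df2 = Lazy(lambda: [x * x for x in range(5)])
df2.count()
df2.collect()
print(df2.recomputations)  # 2
```

This is why, in the example below, persisting after count() gains nothing: the count has already triggered (and thrown away) the computation, and the cache would only fill on a subsequent action anyway.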

Best regards

> On 21 Apr 2022, at 16:24, Sean Owen <sro...@gmail.com> wrote:
> 
> 
> You persist before actions, not after, if you want the action's outputs to be 
> persistent.
> If anything, swap lines 2 and 3. However, there's no point in the count() 
> here, and because only one action (the write) follows, no caching is useful 
> in that example.
> 
>> On Thu, Apr 21, 2022 at 2:26 AM Sid <flinkbyhe...@gmail.com> wrote:
>> Hi Folks,
>> 
>> I am working with the Spark Dataframe API, where I am doing the following:
>> 
>> 1) df = spark.sql("some sql on huge dataset").persist()
>> 2) df1 = df.count()
>> 3) df.repartition().write.mode().parquet("")
>> 
>> 
>> AFAIK, persist should be used after the count statement, if it is needed at 
>> all, since Spark is lazily evaluated: if I call any action it will recompute 
>> the code above, so there is no use persisting it before an action. 
>> 
>> Therefore, it should be something like the below that should give better 
>> performance.
>> 1) df= spark.sql("some sql on huge dataset")
>> 2) df1 = df.count()
>> 3) df.persist()
>> 4) df.repartition().write.mode().parquet("")
>> 
>> So please help me understand how it should be done exactly, and why, if I 
>> am not correct.
>> 
>> Thanks,
>> Sid
>> 
