Hi Sean,

Persisting/caching is useful when you're going to reuse a dataframe, so in this case no persisting/caching is required. That covers the "when".

As for the "where": persist at the point closest to where the calculations/transformations are reused. By the way, I'm not sure caching is useful when you have a HUGE dataframe. Maybe persisting will be more useful.

Best regards

> On 21 Apr 2022, at 16:24, Sean Owen <sro...@gmail.com> wrote:
>
> You persist before actions, not after, if you want the action's outputs to be persistent.
> If anything, swap lines 2 and 3. However, there's no point in the count() here, and because there is only one action following (the write), no caching is useful in that example.
>
>> On Thu, Apr 21, 2022 at 2:26 AM Sid <flinkbyhe...@gmail.com> wrote:
>> Hi Folks,
>>
>> I am working with the Spark DataFrame API, where I am doing the following:
>>
>> 1) df = spark.sql("some sql on huge dataset").persist()
>> 2) df1 = df.count()
>> 3) df.repartition().write.mode().parquet("")
>>
>> AFAIK, persist should be used after the count statement, if it is needed at all, since Spark is lazily evaluated; if I call any action it will recompute the code above, and hence there is no use in persisting before the action.
>>
>> Therefore, something like the below should give better performance:
>>
>> 1) df = spark.sql("some sql on huge dataset")
>> 2) df1 = df.count()
>> 3) df.persist()
>> 4) df.repartition().write.mode().parquet("")
>>
>> So please help me understand how it should be done, and why, if I am not correct.
>>
>> Thanks,
>> Sid
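To make the ordering point concrete without needing a Spark cluster, here is a small pure-Python sketch of how Spark's laziness interacts with persist(). The `LazyFrame` class, its method names, and the counter are all invented for illustration, not Spark's API; the one behavior it mimics is that persist() only marks a dataset for caching, and the cache is actually filled by the *next* action, then reused by later actions. That is why persist() must come before the first action you want to benefit from it.

```python
# Toy model of Spark's lazy evaluation (illustrative only, not Spark's API).
# A "dataframe" is a recipe that recomputes from scratch on every action,
# unless persist() was called before the action that fills the cache.

class LazyFrame:
    def __init__(self, compute):
        self._compute = compute    # expensive recipe, e.g. "some sql on huge dataset"
        self._cached = None        # None = not persisted; Ellipsis = persist requested
        self.recomputations = 0    # how many times the recipe actually ran

    def persist(self):
        # Like Spark, this only *marks* the frame for caching; the cache
        # is filled lazily by the next action and reused afterwards.
        if self._cached is None:
            self._cached = ...
        return self

    def _materialize(self):
        if self._cached not in (None, ...):
            return self._cached            # cache hit: no recomputation
        self.recomputations += 1
        data = self._compute()
        if self._cached is ...:
            self._cached = data            # persist was requested: fill cache
        return data

    def count(self):                       # action 1
        return len(self._materialize())

    def write(self):                       # action 2
        return list(self._materialize())

# persist() before the first action: count() fills the cache, write() reuses it.
df = LazyFrame(lambda: [1, 2, 3]).persist()
df.count()
df.write()
print(df.recomputations)  # 1

# persist() after the first action: count() already recomputed without
# caching, so write() triggers a second full recomputation.
df2 = LazyFrame(lambda: [1, 2, 3])
df2.count()
df2.persist()
df2.write()
print(df2.recomputations)  # 2
```

In real Spark the same reasoning applies: with only the write() as a useful action, neither ordering of persist() helps, which is the point Sean makes above.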