Re: When should we cache / persist ? After or Before Actions?

2022-04-27 Thread Sean Owen
You certainly shouldn't just sprinkle them in, no, that has never been the idea here. It can help in some cases, but is just overhead in others. Be thoughtful about why you are adding these statements. On Wed, Apr 27, 2022 at 11:16 AM Koert Kuipers wrote: > we have quite a few persists

Re: When should we cache / persist ? After or Before Actions?

2022-04-27 Thread Koert Kuipers
we have quite a few persist statements in our codebase, added wherever we reuse a dataframe. we noticed that they slow things down quite a bit (sometimes doubling the runtime) while providing little benefit, since spark already re-uses the shuffle files underlying the dataframe efficiently even if

Re: When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Hi Sean, persisting/caching is useful when you’re going to reuse a dataframe, so in your case no persisting/caching is required. That covers the “when”. As for the “where”, it usually belongs at the point closest to where the calculations/transformations are reused. Btw, I’m not sure if caching is useful when you
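To illustrate the reuse case described here, a minimal PySpark sketch follows. It assumes the same dataframe really does feed two actions (a count and a write); the function name `count_and_write` and its parameters are hypothetical, and the `"overwrite"` save mode is chosen only for illustration. Whether the persist pays off still depends on the workload, so treat this as a pattern to measure rather than a recommendation.

```python
def count_and_write(spark, query, out_path):
    """Run two actions on one dataframe, persisting it before the first.

    `spark` is expected to be a pyspark.sql.SparkSession; `query` and
    `out_path` are hypothetical placeholders for this sketch.
    """
    # persist() only marks the dataframe for caching; nothing runs yet.
    df = spark.sql(query).persist()
    try:
        # First action: computes the query and fills the cache.
        row_count = df.count()
        # Second action: can read the cached data instead of recomputing.
        df.write.mode("overwrite").parquet(out_path)
    finally:
        # Release the cached blocks once the dataframe is no longer reused.
        df.unpersist()
    return row_count
```

The persist is placed directly before the first of the reusing actions and released right after the last one, so the cache is held no longer than needed.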

Re: When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Sean Owen
You persist before actions, not after, if you want the action's outputs to be persistent. If anything, swap lines 2 and 3. However, there's no point in the count() here, and since the write is then the only action that follows, no caching is useful in that example. On Thu, Apr 21, 2022 at
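Read against the three-step example in the original post, this advice amounts to dropping both the persist() and the count(), leaving a single write action. A minimal sketch under that reading is below; the function name `write_once`, the `num_partitions` argument, and the `"overwrite"` save mode are assumptions, since the original post leaves those arguments empty.

```python
def write_once(spark, query, out_path, num_partitions):
    """Write a query result with a single action and no caching.

    `spark` is expected to be a pyspark.sql.SparkSession; the other
    parameters are hypothetical placeholders for this sketch.
    """
    # spark.sql() is lazy: it only builds the query plan.
    df = spark.sql(query)
    # The write is the only action, so the plan runs exactly once and
    # there is nothing for a cache to be reused by.
    df.repartition(num_partitions).write.mode("overwrite").parquet(out_path)
```

With only one action, a persist would add the cost of storing the data without any later read to benefit from it.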

When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Sid
Hi Folks, I am working with the Spark DataFrame API, where I am doing the following:

1) df = spark.sql("some sql on huge dataset").persist()
2) df1 = df.count()
3) df.repartition().write.mode().parquet("")

AFAIK, persist should be used after the count statement, if it is needed at all, since