You certainly shouldn't just sprinkle them in; that has never been the
idea. Persisting can help in some cases, but it is just overhead in others.
Be thoughtful about why you are adding these statements.
On Wed, Apr 27, 2022 at 11:16 AM Koert Kuipers wrote:
> we have quite a few persist statements in our codebase whenever we are
> reusing a dataframe.
> we noticed that it slows things down quite a bit (sometimes doubles the
> runtime), while providing little benefit, since spark already re-uses the
> shuffle files underlying the dataframe efficiently even if
Hi Sean
Persisting/caching is useful when you’re going to reuse a dataframe, so in
your case no persisting/caching is required. That covers the “when”.
The “where” is usually the point closest to where the
calculations/transformations are reused.
Btw, I’m not sure if caching is useful when you
You persist before actions, not after, if you want the action's outputs to
be persistent.
If anything, swap lines 2 and 3. However, there's no point in the count()
here, and because only one action (the write) follows, no
caching is useful in that example.
On Thu, Apr 21, 2022 at
Hi Folks,
I am working with the Spark Dataframe API, where I am doing the following:
1) df = spark.sql("some sql on huge dataset").persist()
2) df1 = df.count()
3) df.repartition().write.mode().parquet("")
AFAIK, persist should be used after the count statement, if it is needed
at all, since