Thanks to all for the quick replies; they helped a lot.  To answer a few of
the follow-up questions ...

> 1. How did you fix this performance issue, which I gather was done
> programmatically?

The main problem in my original code was that the "first time through"
branch of the data frame refresh logic was not being executed when it
should have been.  That 'if' clause creates a new data frame directly from
the contents of the Cassandra data source, discarding the old RDD lineage
of the existing data frame.  Because the 'else' clause was incorrectly
executing instead, the new data frame was being built from the existing
data frame again and again, extending the entire lineage with each
iteration and degrading query times.  Fixing the branch condition stopped
the lineage from constantly growing with each refresh.
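
To make the fix concrete, here is a minimal sketch of the pattern.  The
names (refresh, needsFreshRead, my_ks, my_table) and the Cassandra options
are hypothetical, not our actual application code; the point is only which
branch resets the lineage:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical sketch of the refresh logic.
    def refresh(spark: SparkSession, needsFreshRead: Boolean,
                existing: DataFrame): DataFrame = {
      if (needsFreshRead) {
        // Rebuild directly from the Cassandra source: the new frame's
        // lineage starts at the source scan, discarding the old lineage.
        spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
          .load()
      } else {
        // Derive from the existing frame: this extends the lineage.  The
        // bug was that this branch executed when the one above should
        // have, so the lineage grew with every refresh.
        existing // ... incremental transformations here ...
      }
    }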

> 2. In your code have you set spark.conf.set("spark.sql.adaptive.enabled",
> "true")

Yes, this is already set.

> 3. Depending on the size of your data, both source and new, do you have
> any indication that your data in the global temporary view is totally
> cached?  This should show in the Storage tab in the UI.  If you have data
> on disk then this will affect the performance.

Yes, the cached temporary view appears in the Storage tab with the
expected amount of memory usage, although it is difficult to inspect the
specific contents of that cache.
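
One quick programmatic check that also helped (the view name is
hypothetical; this confirms the view is cached, though not what the cached
blocks contain):

    // Global temp views live in the reserved "global_temp" database.
    val cached = spark.catalog.isCached("global_temp.my_view")
    println(s"global_temp.my_view cached: $cached")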


> You could try checkpointing occasionally and see if that helps

This was very helpful, as I was not familiar with Spark's checkpointing
feature.  Our application does not need fault-tolerant DataFrames/RDDs, so
I inserted localCheckpoint() calls just before the uncaching and re-caching
steps.  This seems to have made all queries run faster, with some improving
immediately by a factor of 3!
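
For reference, the change amounted to something like this (the frame and
view names are hypothetical):

    // localCheckpoint() truncates the lineage by materializing the data
    // on the executors rather than in a reliable checkpoint directory --
    // fast, but not fault tolerant, which is fine for our use case.  It
    // is eager by default.
    val snapshot = newDf.localCheckpoint()

    // Swap the cached global temp view over to the truncated frame.
    spark.catalog.uncacheTable("global_temp.my_view")
    snapshot.createOrReplaceGlobalTempView("my_view")
    spark.catalog.cacheTable("global_temp.my_view")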


Based on these tests, it is clear that RDD lineage can have a major effect
on the performance of Spark applications, whether or not the data has been
cached in memory.

Thanks again for the good advice!




