There is a trade-off involved here. If you have a Spark application with a
complicated logical graph, you can either cache data at certain points in the
DAG or not cache it. The side effect of caching data is higher memory
usage. The side effect of not caching data is higher CPU usage and perhaps
higher IO, because an uncached DataFrame is recomputed each time it is reused.
Ultimately, you can increase both memory and CPU by adding more workers to your
cluster, and adding workers costs money. So, your caching choices are reflected
in the overall cost of running your application. You need to do some analysis
to determine the caching configuration that will result in the lowest cost.
Usually, being selective about which DataFrames to cache results in a good
balance between memory usage and CPU usage.
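
As a rough illustration of selective caching, the sketch below persists only
the one DataFrame that two downstream branches reuse, then releases it. The
object name, the local master, and the synthetic `expensive` DataFrame are
assumptions made up for this example; they are not taken from the application
discussed in this thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object SelectiveCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("selective-caching-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Hypothetical stand-in for a DataFrame produced by a complicated DAG.
    val expensive: DataFrame = spark.range(0, 1000000)
      .withColumn("bucket", col("id") % 10)

    // Cache only the DataFrame that several downstream branches reuse.
    // MEMORY_AND_DISK spills to local disk instead of recomputing when
    // executor memory runs short.
    expensive.persist(StorageLevel.MEMORY_AND_DISK)

    // Two independent actions; the second one reads the cached data
    // rather than recomputing `expensive` from scratch.
    val perBucket = expensive.groupBy("bucket").count()
    perBucket.show()
    val evenIds = expensive.filter(col("id") % 2 === 0).count()
    println(s"even ids: $evenIds")

    // Release the storage once no later step needs the cached data.
    expensive.unpersist()
    spark.stop()
  }
}
```

DataFrames that are used only once are left uncached here, which is where the
memory saving comes from.
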
I would not write data back to S3 and read it back in as a practice.
Essentially, you are using S3 as a “cache”. However, reading from and writing
to S3 is not a scalable solution because it results in higher IO, and IO
doesn’t scale up as easily as CPU and memory. The only time I would use S3 as a
cache is when my cached data is in the terabyte+ range. If you are caching
gigabytes of data, then you are better off caching in memory. This is 2018.
Memory is cheap but limited.
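
For the loop in the original question, one possible way to keep the values on
the executors while dropping the accumulated definition is
`Dataset.localCheckpoint()`, available since Spark 2.3. This is offered as an
assumption, not as something recommended above: it materializes the data in
executor storage and returns a DataFrame whose plan no longer contains the
earlier unions. The `tinyNewData` helper and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AccumulateWithoutLineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("accumulate-without-lineage-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the slowly calculated tiny data portion.
    def tinyNewData(i: Int): DataFrame =
      Seq((i, s"row-$i")).toDF("id", "value")

    var accumulatedDataFrame: DataFrame = tinyNewData(0)

    for (i <- 1 to 100) {
      // localCheckpoint() is eager by default: it computes the union now,
      // keeps the result in executor storage, and truncates the plan, so
      // the next iteration starts from values rather than definitions.
      accumulatedDataFrame =
        accumulatedDataFrame.union(tinyNewData(i)).localCheckpoint()
    }

    println(s"rows accumulated: ${accumulatedDataFrame.count()}")
    spark.stop()
  }
}
```

Locally checkpointed data is not replicated to reliable storage, so it is lost
if an executor dies; `Dataset.checkpoint()` with a configured checkpoint
directory trades extra IO for that safety.
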
From: Valery Khamenya
Date: Tuesday, May 1, 2018 at 9:17 AM
To: "user@spark.apache.org"
Subject: smarter way to "forget" DataFrame definition and stick to its values
hi all
a short example before the long story:
var accumulatedDataFrame = ... // initialize
for (i <- 1 to 100) {
  val myTinyNewData = ... // my slowly calculated new data portion in tiny amounts
  accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
  // how to stick here to the values of accumulatedDataFrame only and forget definitions?!
}
This kind of code is likely to get slower and slower on each iteration, even if
myTinyNewData is quite compact. Usually I write accumulatedDataFrame to S3 and
then re-load it to clear the definition history. It makes the code ugly,
though. Is there a smarter way?
It happens very often that a DataFrame is created via complex definitions. The
DataFrame is then re-used in several places and sometimes it gets recalculated
triggering a heavy cascade of operations.
Of course one could use .persist or .cache modifiers, but the result is
unfortunately not transparent and instead of speeding up things it results in
slow-down or even lost jobs if storage resources are not enough.
Any advice?
best regards
--
Valery