There is a trade-off here. If you have a Spark application with a
complicated logical graph, you can either cache data at certain points in the
DAG or leave it uncached. The cost of caching is higher memory usage. The cost
of not caching is higher CPU usage, because every action that depends on an
uncached point in the DAG recomputes the lineage leading up to it.
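To make the trade-off concrete, here is a minimal sketch. It assumes you already have a `SparkSession`; the object name `CacheTradeOff`, the helper `countsWithCache`, and the column names are made up for illustration and are not taken from the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CacheTradeOff {
  // Hypothetical helper: builds one derived DataFrame and runs two actions on it.
  def countsWithCache(spark: SparkSession): (Long, Long) = {
    val base: DataFrame = spark.range(0, 1000000).toDF("id")
    val derived = base.withColumn("bucket", col("id") % 10)

    // cache() only marks the DataFrame for caching; the data is stored
    // when the first action runs. This is where the memory cost is paid.
    derived.cache()

    // First action: computes `derived` and fills the cache.
    val total = derived.count()

    // Second action: can read from the cache. Without cache(), the
    // withColumn step would be computed again here (the CPU cost).
    val evens = derived.filter(col("bucket") % 2 === 0).count()

    // Release the cached data once it is no longer needed.
    derived.unpersist()

    (total, evens)
  }
}
```

The computation here is trivial, so caching gains very little. It starts to pay off when the work leading up to the cached point is expensive and more than one action reuses it.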
Hi all,
A short example before the long story:
var accumulatedDataFrame = ... // initialize
for (i <- 1 to 100) {
  val myTinyNewData = ... // my slowly calculated new data portion, in tiny amounts
  accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
  // how to stick