You cannot assume that caching will always reduce execution time,
especially if the dataset is large. If too much memory is used for
caching, then less memory is left for the actual computation itself;
there has to be a balance between the two.
Page 33 of this thesis quotes the following thread from the Spark user list:
From: ...@spark.incubator.apache.org
Date: 06/18/2014 06:30 AM
Subject: Re: rdd.cache() is not faster?
If I have big data (40 GB on disk, 60 GB when cached) and even a big
heap (192 GB), can I still fail to benefit from the RDD cache? Should I
instead persist to disk and leverage the filesystem cache?
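To see why a 60 GB cached RDD can still hurt on a 192 GB heap, here is a back-of-envelope sketch using the numbers from the question. The 0.6 and 0.9 factors are the `spark.storage.memoryFraction` and `spark.storage.safetyFraction` defaults from Spark 1.x (the era of this thread); they are assumptions for illustration, not measurements.

```scala
// Back-of-envelope memory budget, using the numbers from the question.
// Assumed Spark 1.x defaults: spark.storage.memoryFraction = 0.6,
// spark.storage.safetyFraction = 0.9 (assumptions for illustration).
val heapGb = 192.0
val storageBudgetGb = heapGb * 0.6 * 0.9   // heap usable for cached blocks
val cachedRddGb = 60.0                     // reported in-memory size of the 40 GB file

val leftForExecutionGb = heapGb - cachedRddGb
println(f"cache budget = $storageBudgetGb%.1f GB; RDD fits: ${cachedRddGb <= storageBudgetGb}")
println(f"heap left for execution, shuffle and GC headroom = $leftForExecutionGb%.1f GB")
```

So the cached RDD fits, but more than a quarter of the heap is now long-lived cached objects, which raises GC pressure; that is one way caching can end up slower than re-reading a file the OS page cache already holds.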
The answer to the question of whether to persist (spill over) data on
disk is not always immediately clear, because it generally depends on how
large the cached data is relative to the memory left for execution, and
on how expensive the RDD is to recompute; a middle ground such as
persist(StorageLevel.MEMORY_AND_DISK) keeps what fits in memory and
spills the rest to disk.
Hi, I have a 40 GB file which is a concatenation of multiple documents.
I want to extract two features (title and tables) from each doc, so the
program is like this:
val file = sc.textFile("/path/to/40G/file")
//file.cache() //to
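The snippet stops before the extraction step. Below is a minimal sketch of what the map stage might look like, using a plain Scala collection in place of the RDD so it runs standalone; `extractTitle`, `extractTables`, and the `<title>`/`<table>` markup are hypothetical stand-ins for whatever format the asker's documents actually use. On the real RDD, the same `.map` call applies to `file`.

```scala
// Hypothetical document format and parsers, for illustration only.
def extractTitle(doc: String): Option[String] =
  "<title>(.*?)</title>".r.findFirstMatchIn(doc).map(_.group(1))

def extractTables(doc: String): List[String] =
  "<table>(.*?)</table>".r.findAllMatchIn(doc).map(_.group(1)).toList

// Stand-in for the 40 GB RDD: two small concatenated documents.
val docs = List(
  "<doc><title>Report A</title><table>t1</table><table>t2</table></doc>",
  "<doc><title>Report B</title></doc>"
)

// On the real data this would be file.map(d => (extractTitle(d), extractTables(d))).
val features = docs.map(d => (extractTitle(d), extractTables(d)))
features.foreach(println)
```

Whether to `cache()` the result depends on how many times `features` is reused downstream; a one-pass job gains nothing from caching and only pays the memory cost discussed above.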