Re: rdd.cache() is not faster?

2014-06-18 Thread Gaurav Jain
You cannot assume that caching will always reduce execution time, especially if the dataset is large. If too much memory is used for caching, less memory is left for the computation itself, so there has to be a balance between the two. Page 33 of this thesis from
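As a rough illustration of the balance described in this reply, the sketch below estimates how an executor heap divides between cached RDD blocks and everything else. It assumes a single storage fraction in the style of the Spark 1.x setting `spark.storage.memoryFraction` (default 0.6); Spark's real accounting is more involved. The `MemoryBudget` object, its method names, and the heap and data sizes are hypothetical, chosen only for illustration.

```scala
// Rough estimate of the heap split between cached blocks and computation.
// Assumption: one fraction of the heap is reserved for cached blocks,
// as with spark.storage.memoryFraction (default 0.6) in Spark 1.x.
object MemoryBudget {
  // Heap available for cached RDD blocks, in GB.
  def storageGb(heapGb: Double, storageFraction: Double = 0.6): Double =
    heapGb * storageFraction

  // Heap left for everything else (shuffles, task objects, user code), in GB.
  def remainingGb(heapGb: Double, storageFraction: Double = 0.6): Double =
    heapGb - storageGb(heapGb, storageFraction)

  // True if the cached data fits in the storage region without eviction.
  def fitsInCache(cachedGb: Double, heapGb: Double, storageFraction: Double = 0.6): Boolean =
    cachedGb <= storageGb(heapGb, storageFraction)
}

object MemoryBudgetExample extends App {
  val cachedGb = 60.0 // hypothetical size of the deserialized, cached data
  for (heapGb <- Seq(64.0, 128.0)) { // hypothetical executor heap sizes
    println(f"heap=$heapGb%.0f GB storage=${MemoryBudget.storageGb(heapGb)}%.1f GB " +
      f"fits=${MemoryBudget.fitsInCache(cachedGb, heapGb)}")
  }
}
```

Raising the storage fraction lets more data stay cached but shrinks what `remainingGb` reports, which is the trade-off the reply points at.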

Re: rdd.cache() is not faster?

2014-06-18 Thread Wei Tan
...@spark.incubator.apache.org, Date: 06/18/2014 06:30 AM, Subject: Re: rdd.cache() is not faster?
> You cannot assume that caching would always reduce the execution time, especially if the data-set is large. It appears that if too much memory is used for caching, then less memory is left for the actual

Re: rdd.cache() is not faster?

2014-06-18 Thread Gaurav Jain
> if I do have big data (40GB, cached size is 60GB) and even big memory (192 GB), I cannot benefit from RDD cache, and should persist on disk and leverage filesystem cache?
The answer to the question of whether to persist (spill) data to disk is not always immediately clear, because generally
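The options weighed in this reply map onto Spark's storage levels. Below is a minimal sketch, assuming a Spark 1.x application with `spark-core` on the classpath and a local master for experimentation. The `PersistChoice` object, the `persistWithSpill` helper, and the input path passed as the first argument are hypothetical and not taken from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistChoice {
  // rdd.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): partitions
  // that do not fit in memory are dropped and recomputed on the next access.
  // MEMORY_AND_DISK_SER instead keeps serialized partitions (smaller, but
  // costlier to read) and spills those that do not fit to local disk.
  def persistWithSpill[T](rdd: RDD[T]): RDD[T] =
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, for experimentation only.
    val conf = new SparkConf().setAppName("persist-choice").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = persistWithSpill(sc.textFile(args(0)))
      println(lines.count())                    // first action materializes the persisted data
      println(lines.filter(_.nonEmpty).count()) // second action can reuse it
    } finally sc.stop()
  }
}
```

Whether this beats re-reading the file through the filesystem cache depends on the workload, so timing both variants on the real data is the safer guide.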

rdd.cache() is not faster?

2014-06-17 Thread Wei Tan
Hi, I have a 40G file which is a concatenation of multiple documents. I want to extract two features (title and tables) from each doc, so the program looks like this:
val file = sc.textFile("/path/to/40G/file")
//file.cache() //to