Sean,

Yes, I know that I can use persist() to persist to disk, but persisting a huge RDD to disk is still a big extra cost. I hope that I can do one pass to get the count as well as rdd.saveAsObjectFile(file2), but I don’t know how.
Maybe use an accumulator to count the total?

Ningjun

From: Mark Hamstra
Sent: Thursday, March 26, 2015 12:37 PM
To: Sean Owen
Cc: Wang, Ningjun (LNG-NPV); [email protected]
Subject: Re: How to get rdd count() without double evaluation of the RDD?

You can also always take the more extreme approach of using SparkContext#runJob (or submitJob) to write a custom Action that does what you want in one pass. Usually that's not worth the extra effort.

On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen wrote:

To avoid computing twice you need to persist the RDD, but that need not be in memory. You can persist to disk with persist(StorageLevel.DISK_ONLY).

On Mar 26, 2015 4:11 PM, "Wang, Ningjun (LNG-NPV)" wrote:

I have an RDD that is expensive to compute. I want to save it as an object file and also print the count. How can I avoid double computation of the RDD?

    val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
    val count = rdd.count()      // this forces computation of the RDD
    println(count)
    rdd.saveAsObjectFile(file2)  // this computes the RDD again

I can avoid double computation by using cache:

    val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
    rdd.cache()
    val count = rdd.count()      // this computes the RDD and caches it
    println(count)
    rdd.saveAsObjectFile(file2)  // this reads from the cache instead of recomputing

This computes the RDD only once. However, the RDD has millions of items, and caching it will cause an out-of-memory error.

Question: how can I avoid double computation without using cache?

Ningjun
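A minimal sketch of the accumulator idea raised at the top of this thread, assuming Spark 2.0 or later, where SparkContext#longAccumulator is available (on the 1.x releases current when this thread was written, sc.accumulator(0L) played the same role). The object name CountWhileSaving, the helper saveAndCount, and the body of expensiveCalculation are hypothetical stand-ins, not code from the thread. The counter is incremented inside the map, so the single saveAsObjectFile action both writes the output and produces the count:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountWhileSaving {
  // Hypothetical stand-in for the expensive per-line computation in the thread.
  def expensiveCalculation(line: String): String = line.toUpperCase

  // Saves the transformed RDD and returns the number of records, in one pass.
  def saveAndCount(sc: SparkContext, someFile: String, file2: String): Long = {
    val counter = sc.longAccumulator("records")

    val rdd = sc.textFile(someFile).map { line =>
      val result = expensiveCalculation(line)
      counter.add(1L) // counted as a side effect of the only evaluation
      result
    }

    // The only action: evaluates the RDD once and writes it out.
    rdd.saveAsObjectFile(file2)

    // Read on the driver, after the action has finished.
    val total: Long = counter.value
    total
  }

  def main(args: Array[String]): Unit = {
    // Assumed usage: <input path> <output path>
    val conf = new SparkConf()
      .setAppName("CountWhileSaving")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    try println(saveAndCount(sc, args(0), args(1)))
    finally sc.stop()
  }
}
```

One caveat: Spark only guarantees exactly-once accumulator updates for updates made inside actions. Because this update happens inside a transformation, a retried or speculatively executed task can increment the counter more than once, so the count should be treated as approximate unless the job ran without task re-execution.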
