I have an RDD that is expensive to compute. I want to save it as an object file and also print its count. How can I avoid computing the RDD twice?

    val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
    val count = rdd.count() // this forces computation of the RDD
    println(count)
    rdd.saveAsObjectFile(file2) // this computes the RDD again

I can avoid the double computation by using cache:

    val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
    rdd.cache()
    val count = rdd.count() // this computes the RDD and caches it
    println(count)
    rdd.saveAsObjectFile(file2) // this reads the cached RDD instead of recomputing it

This computes the RDD only once. However, the RDD has millions of items, and caching it in memory will cause an out-of-memory error.

Question: how can I avoid the double computation without using cache?

Ningjun
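
One possible direction, offered as a sketch rather than a definitive answer: for an RDD, `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`, so a different storage level such as `StorageLevel.DISK_ONLY` may keep the computed partitions on the executors' local disks instead of in memory. The example below assumes Spark is on the classpath and that the executors have enough local disk space. The object name `CountAndSave`, the body of `expensiveCalculation`, and the command-line paths are hypothetical stand-ins for the names in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CountAndSave {
  // Hypothetical stand-in for the expensive per-line work in the question.
  def expensiveCalculation(line: String): String = line.toUpperCase

  def main(args: Array[String]): Unit = {
    require(args.length >= 2, "usage: CountAndSave <inputFile> <outputPath>")
    val someFile = args(0) // input text file (assumed path)
    val file2 = args(1)    // output location for the object file (assumed path)

    // Fall back to a local master when none is supplied, e.g. by spark-submit.
    val conf = new SparkConf()
      .setAppName("count-and-save")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)

    try {
      val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

      // Keep computed partitions on local disk rather than in memory.
      rdd.persist(StorageLevel.DISK_ONLY)

      val count = rdd.count()     // first action: computes and persists the RDD
      println(count)
      rdd.saveAsObjectFile(file2) // second action: reads the persisted partitions

      rdd.unpersist()             // release the persisted data once both actions finish
    } finally {
      sc.stop()
    }
  }
}
```

`StorageLevel.MEMORY_AND_DISK` is a possible middle ground: it keeps as many partitions in memory as fit and spills the rest to disk. Either level trades recomputation for disk I/O, so whether it helps depends on how expensive `expensiveCalculation` is compared with writing and reading the results.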