From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Thursday, March 26, 2015 12:37 PM
To: Sean Owen
Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: How to get rdd count() without double evaluation of the RDD?
You can also always take the more extreme approach of using
SparkContext#runJob (or submitJob) to write a custom Action that does what
you want in one pass. Usually that's not worth the extra effort.
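A minimal sketch of that one-pass idea, assuming a local SparkContext; the
per-element write step is a placeholder you would replace with your actual
sink (saveAsObjectFile itself cannot be reused this way, so this is an
illustration of runJob, not the author's exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object OnePassCount {
  // A hand-rolled "action": one runJob pass that applies `write` to each
  // element while counting, so no second job is needed for the count.
  def processAndCount[T](sc: SparkContext, rdd: RDD[T])(write: T => Unit): Long = {
    val perPartition: Array[Long] = sc.runJob(rdd, (iter: Iterator[T]) => {
      var n = 0L
      iter.foreach { x => write(x); n += 1 }
      n
    })
    perPartition.sum
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("one-pass").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100) // stand-in for the expensive RDD
    val count = processAndCount(sc, rdd)(_ => ()) // write step is a no-op here
    println(s"count = $count")
    sc.stop()
  }
}
```

The trade-off Mark notes: you lose the built-in output-committing logic of
actions like saveAsObjectFile, which is usually why this isn't worth it.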
On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen so...@cloudera.com wrote:
I have an RDD that is expensive to compute. I want to save it as an object file and
also print the count. How can I avoid double computation of the RDD?

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
val count = rdd.count() // this forces computation of the RDD
To avoid computing twice you need to persist the RDD but that need not be
in memory. You can persist to disk with persist().
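Sean's advice in code, as a minimal sketch assuming a local SparkContext; the
second action uses sum() purely to demonstrate reuse (in the original question
it would be saveAsObjectFile with a real output path):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistOnce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-once").setMaster("local[2]"))

    // Stand-in for the expensive map; the real job would call
    // expensiveCalculation on each line of a text file.
    val rdd = sc.parallelize(1 to 100).map(_ * 2)

    // DISK_ONLY spills the computed partitions to local disk instead of
    // memory, so the map function runs exactly once.
    rdd.persist(StorageLevel.DISK_ONLY)

    val count = rdd.count() // first action: computes and caches the RDD
    val sum = rdd.sum()     // second action: reads cached blocks, no recompute
    println(s"count = $count, sum = $sum")
    sc.stop()
  }
}
```

persist() with no arguments defaults to MEMORY_ONLY; passing
StorageLevel.DISK_ONLY is what makes this safe for an RDD too large to cache
in memory.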
On Mar 26, 2015 4:11 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I have an RDD that is expensive to compute. I want to save it as an object
file and also