Re: How to get rdd count() without double evaluation of the RDD?

2015-04-13 Thread Imran Rashid
...@clearstorydata.com] *Sent:* Thursday, March 26, 2015 12:37 PM *To:* Sean Owen *Cc:* Wang, Ningjun (LNG-NPV); user@spark.apache.org *Subject:* Re: How to get rdd count() without double evaluation of the RDD? You can also always take the more extreme approach of using SparkContext#runJob (or submitJob

RE: How to get rdd count() without double evaluation of the RDD?

2015-03-30 Thread Wang, Ningjun (LNG-NPV)
From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Thursday, March 26, 2015 12:37 PM To: Sean Owen Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: How to get rdd count() without double evaluation of the RDD? You can also always take the more extreme approach of using SparkContext

Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Mark Hamstra
You can also always take the more extreme approach of using SparkContext#runJob (or submitJob) to write a custom Action that does what you want in one pass. Usually that's not worth the extra effort. On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen so...@cloudera.com wrote: To avoid computing twice
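In the same one-pass spirit, a lighter alternative to a full custom `runJob` action is to thread an accumulator through the pipeline, so the single save action also produces the count. A sketch, assuming the Spark 1.x accumulator API of the time and an illustrative output path:

```scala
// One pass: the save action drives the computation, and every element
// bumps the accumulator on its way through the map.
val acc = sc.accumulator(0L)
val counted = rdd.map { x => acc += 1; x }
counted.saveAsObjectFile("/tmp/rdd-out")  // illustrative path
val count = acc.value  // safe to read here: the action above has completed
```

One caveat: accumulator updates made inside transformations can be re-applied if a task is retried, so this count is best-effort under failures, whereas `count()` on a persisted RDD is exact.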

How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Wang, Ningjun (LNG-NPV)
I have a rdd that is expensive to compute. I want to save it as object file and also print the count. How can I avoid double computation of the RDD? val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line)) val count = rdd.count() // this force computation of the rdd
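The snippet above is truncated by the archive; the pattern being asked about is roughly the following (the input, the output path, and `expensiveCalculation` are the poster's placeholders):

```scala
// Each action triggers a full evaluation of the lineage unless the RDD is persisted.
val rdd = sc.textFile("someFile").map(line => expensiveCalculation(line))
val count = rdd.count()               // first full evaluation
rdd.saveAsObjectFile("/tmp/rdd-out")  // second full evaluation
```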

Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Sean Owen
To avoid computing twice you need to persist the RDD but that need not be in memory. You can persist to disk with persist(). On Mar 26, 2015 4:11 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have a rdd that is expensive to compute. I want to save it as object file and also
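A sketch of the disk-persistence suggestion (output path illustrative):

```scala
import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.DISK_ONLY)   // cache blocks on local disk, not memory
val count = rdd.count()               // computes once, writing blocks as it goes
rdd.saveAsObjectFile("/tmp/rdd-out")  // replays the persisted blocks, no recomputation
rdd.unpersist()                       // free the disk blocks when done
```

`persist()` with no arguments defaults to memory storage; passing `StorageLevel.DISK_ONLY` keeps the expensive result out of executor memory while still avoiding the second evaluation.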