Re: How to get rdd count() without double evaluation of the RDD?

2015-04-13 Thread Imran Rashid
Yes, it sounds like a good use of an accumulator to me:

val counts = sc.accumulator(0L)
rdd.map { x =>
  counts += 1
  x
}.saveAsObjectFile(file2)
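
You can read the total on the driver after the action finishes. One caveat:
accumulator updates made inside a transformation like map can be applied more
than once if a task is retried or a stage is recomputed, so treat the count
as best-effort rather than exact:

println(counts.value)  // total elements written; may over-count on retries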


On Mon, Mar 30, 2015 at 12:08 PM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  Sean



 Yes, I know that I can use persist() to persist to disk, but persisting a
 huge RDD to disk is still a big extra cost. I hope I can get the count and
 do rdd.saveAsObjectFile(file2) in a single pass, but I don't know how.



 Maybe use an accumulator to count the total?



 Ningjun



 From: Mark Hamstra [mailto:m...@clearstorydata.com]
 Sent: Thursday, March 26, 2015 12:37 PM
 To: Sean Owen
 Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
 Subject: Re: How to get rdd count() without double evaluation of the
 RDD?



 You can also always take the more extreme approach of using
 SparkContext#runJob (or submitJob) to write a custom Action that does what
 you want in one pass.  Usually that's not worth the extra effort.



 On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen so...@cloudera.com wrote:

 To avoid computing twice you need to persist the RDD but that need not be
 in memory. You can persist to disk with persist().

 On Mar 26, 2015 4:11 PM, Wang, Ningjun (LNG-NPV) 
 ningjun.w...@lexisnexis.com wrote:

  I have an RDD that is expensive to compute. I want to save it as an
 object file and also print the count. How can I avoid computing the RDD
 twice?



 val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))



 val count = rdd.count()  // this forces computation of the RDD

 println(count)

 rdd.saveAsObjectFile(file2) // this computes the RDD again



 I can avoid the double computation by using cache():



 val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

 rdd.cache()

 val count = rdd.count()

 println(count)

 rdd.saveAsObjectFile(file2) // this now reads from the cache



 This computes the RDD only once. However, the RDD has millions of items,
 and caching it in memory will cause an out-of-memory error.



 Question: how can I avoid the double computation without using cache()?





 Ningjun





RE: How to get rdd count() without double evaluation of the RDD?

2015-03-30 Thread Wang, Ningjun (LNG-NPV)
Sean

Yes, I know that I can use persist() to persist to disk, but persisting a huge
RDD to disk is still a big extra cost. I hope I can get the count and do
rdd.saveAsObjectFile(file2) in a single pass, but I don't know how.

Maybe use an accumulator to count the total?

Ningjun

From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Thursday, March 26, 2015 12:37 PM
To: Sean Owen
Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: How to get rdd count() without double evaluation of the RDD?

You can also always take the more extreme approach of using SparkContext#runJob 
(or submitJob) to write a custom Action that does what you want in one pass.  
Usually that's not worth the extra effort.

On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen so...@cloudera.com wrote:

To avoid computing twice you need to persist the RDD but that need not be in 
memory. You can persist to disk with persist().
On Mar 26, 2015 4:11 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I have an RDD that is expensive to compute. I want to save it as an object
file and also print the count. How can I avoid computing the RDD twice?

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

val count = rdd.count()  // this forces computation of the RDD
println(count)
rdd.saveAsObjectFile(file2) // this computes the RDD again

I can avoid the double computation by using cache():

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
rdd.cache()
val count = rdd.count()
println(count)
rdd.saveAsObjectFile(file2) // this now reads from the cache

This computes the RDD only once. However, the RDD has millions of items, and
caching it in memory will cause an out-of-memory error.

Question: how can I avoid the double computation without using cache()?


Ningjun



Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Mark Hamstra
You can also always take the more extreme approach of using
SparkContext#runJob (or submitJob) to write a custom Action that does what
you want in one pass.  Usually that's not worth the extra effort.
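
For illustration, a minimal sketch of the runJob mechanism. This one only
counts per partition (assuming an RDD[String]); a real one-pass
count-and-save would also have to perform the write inside the same
per-partition function:

val partitionCounts: Array[Long] =
  sc.runJob(rdd, (iter: Iterator[String]) => {
    var n = 0L
    while (iter.hasNext) { iter.next(); n += 1 }
    n
  })
println(partitionCounts.sum)  // one pass over the data, no caching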

On Thu, Mar 26, 2015 at 9:27 AM, Sean Owen so...@cloudera.com wrote:

 To avoid computing twice you need to persist the RDD but that need not be
 in memory. You can persist to disk with persist().
 On Mar 26, 2015 4:11 PM, Wang, Ningjun (LNG-NPV) 
 ningjun.w...@lexisnexis.com wrote:

  I have an RDD that is expensive to compute. I want to save it as an
 object file and also print the count. How can I avoid computing the RDD
 twice?



 val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))



 val count = rdd.count()  // this forces computation of the RDD

 println(count)

 rdd.saveAsObjectFile(file2) // this computes the RDD again



 I can avoid the double computation by using cache():



 val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

 rdd.cache()

 val count = rdd.count()

 println(count)

 rdd.saveAsObjectFile(file2) // this now reads from the cache



 This computes the RDD only once. However, the RDD has millions of items,
 and caching it in memory will cause an out-of-memory error.



 Question: how can I avoid the double computation without using cache()?





 Ningjun




How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Wang, Ningjun (LNG-NPV)
I have an RDD that is expensive to compute. I want to save it as an object
file and also print the count. How can I avoid computing the RDD twice?

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

val count = rdd.count()  // this forces computation of the RDD
println(count)
rdd.saveAsObjectFile(file2) // this computes the RDD again

I can avoid the double computation by using cache():

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
rdd.cache()
val count = rdd.count()
println(count)
rdd.saveAsObjectFile(file2) // this now reads from the cache

This computes the RDD only once. However, the RDD has millions of items, and
caching it in memory will cause an out-of-memory error.

Question: how can I avoid the double computation without using cache()?


Ningjun


Re: How to get rdd count() without double evaluation of the RDD?

2015-03-26 Thread Sean Owen
To avoid computing twice you need to persist the RDD but that need not be
in memory. You can persist to disk with persist().
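
For example, a sketch using the variables from the original post; DISK_ONLY
keeps the computed elements out of executor memory:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))
rdd.persist(StorageLevel.DISK_ONLY)  // materialize to local disk, not memory
val count = rdd.count()              // first pass computes and persists
println(count)
rdd.saveAsObjectFile(file2)          // second pass reads the persisted blocks
rdd.unpersist()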
On Mar 26, 2015 4:11 PM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  I have an RDD that is expensive to compute. I want to save it as an
 object file and also print the count. How can I avoid computing the RDD
 twice?



 val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))



 val count = rdd.count()  // this forces computation of the RDD

 println(count)

 rdd.saveAsObjectFile(file2) // this computes the RDD again



 I can avoid the double computation by using cache():



 val rdd = sc.textFile(someFile).map(line => expensiveCalculation(line))

 rdd.cache()

 val count = rdd.count()

 println(count)

 rdd.saveAsObjectFile(file2) // this now reads from the cache



 This computes the RDD only once. However, the RDD has millions of items,
 and caching it in memory will cause an out-of-memory error.



 Question: how can I avoid the double computation without using cache()?





 Ningjun