count vs countByValue in for/yield

2014-07-15 Thread Ognen Duzlevski

Hello,

I am curious about something:

val result = for {
  (dt, evrdd) <- evrdds
  ct = evrdd.count
} yield dt -> ct

works.

val result = for {
  (dt, evrdd) <- evrdds
  ct = evrdd.countByValue
} yield dt -> ct

does not work. I get:
14/07/15 16:46:33 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/15 16:46:33 WARN TaskSetManager: Loss was due to java.lang.NullPointerException

java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

What is the difference? Is it that countByValue returns a Map while count
returns a Long?
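
For concreteness, here is a minimal self-contained version of what I am
doing. The sample data and the Map[String, RDD[String]] type for evrdds
are just stand-ins for illustration; the real point is that count is an
action returning a single Long per RDD, while countByValue is an action
returning a Map of value -> count materialized on the driver:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountVsCountByValue {
  def main(args: Array[String]): Unit = {
    // local[*] so the sketch runs without a cluster; the failure above
    // only shows up on a standalone cluster.
    val conf = new SparkConf().setAppName("count-vs-countByValue").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Stand-in for the real evrdds: a date string mapped to an RDD of events.
    val evrdds: Map[String, RDD[String]] = Map(
      "2014-07-14" -> sc.parallelize(Seq("a", "b", "a")),
      "2014-07-15" -> sc.parallelize(Seq("b", "c"))
    )

    // count returns one Long per RDD.
    val counts = for ((dt, evrdd) <- evrdds) yield dt -> evrdd.count()
    // Map(2014-07-14 -> 3, 2014-07-15 -> 2)

    // countByValue returns a Map[value, count] per RDD, collected to the driver.
    val valueCounts = for ((dt, evrdd) <- evrdds) yield dt -> evrdd.countByValue()
    // Map(2014-07-14 -> Map(a -> 2, b -> 1), 2014-07-15 -> Map(b -> 1, c -> 1))

    println(counts)
    println(valueCounts)
    sc.stop()
  }
}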


Thanks!
Ognen


Re: count vs countByValue in for/yield

2014-07-16 Thread Ognen Duzlevski

Hello all,

Can anyone offer any insight into the question above?

Both are "legal" Spark but the first one works, the latter one does not. 
They both work on a local machine but in a standalone cluster the one 
with countByValue fails.
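
A possible way to narrow it down (just a sketch, using the same names as
above): call countByValue on a single RDD on the cluster, outside the for
comprehension, to see whether the comprehension or the action itself is
at fault:

val (dt, evrdd) = evrdds.head
// If this also throws the NullPointerException on the standalone cluster,
// the for/yield is not the culprit; the action itself is failing there.
val byValue = evrdd.countByValue()
println(s"$dt -> $byValue")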


Thanks!
Ognen
