Hello all,

Can anyone offer any insight into the issue below?

Both are "legal" Spark but the first one works, the latter one does not. They both work on a local machine but in a standalone cluster the one with countByValue fails.

Thanks!
Ognen

On 7/15/14, 2:23 PM, Ognen Duzlevski wrote:
Hello,

I am curious about something:

val result = for {
  (dt, evrdd) <- evrdds
  val ct = evrdd.count
} yield dt -> ct

works.
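
(For context, that comprehension desugars to roughly a plain map over evrdds; the concrete type of evrdds, assumed here to be something like Map[String, RDD[String]], is my guess:)

// Roughly equivalent desugaring of the comprehension above.
// The element type of evrdds is assumed for illustration only.
val result = evrdds.map { case (dt, evrdd) =>
  val ct = evrdd.count()
  dt -> ct
}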

val result = for {
  (dt, evrdd) <- evrdds
  val ct = evrdd.countByValue
} yield dt -> ct

does not work. I get:
14/07/15 16:46:33 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/15 16:46:33 WARN TaskSetManager: Loss was due to java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

What is the difference? Is it that countByValue returns a Map while count returns a Long?
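
For reference, a minimal local sketch of the two return types (the "local[2]" master and the sample data are assumptions for illustration, not from the job above):

import org.apache.spark.SparkContext

// Local sketch only; master URL and data are illustrative assumptions.
val sc = new SparkContext("local[2]", "count-vs-countByValue")
val rdd = sc.parallelize(Seq("a", "b", "a"))

val total: Long = rdd.count()   // count returns a single Long
val perValue: scala.collection.Map[String, Long] =
  rdd.countByValue()            // countByValue returns a Map of counts
                                // collected back to the driver

println(total)     // 3
println(perValue)  // Map(a -> 2, b -> 1)

sc.stop()

Both are actions, but countByValue has to build and collect a Map of per-value counts to the driver rather than a single Long.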

Thanks!
Ognen
