Hello all,

Can anyone offer any insight into the issue below?

Both are "legal" Spark but the first one works, the latter one does not. They both work on a local machine but in a standalone cluster the one with countByValue fails.

Thanks!
Ognen

On 7/15/14, 2:23 PM, Ognen Duzlevski wrote:
Hello,

I am curious about something:

val result = for {
  (dt, evrdd) <- evrdds
  val ct = evrdd.count
} yield dt -> ct

works.
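
(For context, that comprehension desugars to roughly a plain map over evrdds; the concrete type of evrdds, assumed here to be something like Map[String, RDD[String]], is my guess:)

// Roughly equivalent desugaring of the comprehension above.
// The element type of evrdds is assumed for illustration only.
val result = evrdds.map { case (dt, evrdd) =>
  val ct = evrdd.count()
  dt -> ct
}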

val result = for {
  (dt, evrdd) <- evrdds
  val ct = evrdd.countByValue
} yield dt -> ct

does not work. I get:
14/07/15 16:46:33 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/07/15 16:46:33 WARN TaskSetManager: Loss was due to java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

What is the difference? Is it that countByValue returns a Map while count returns a Long?
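
For reference, a minimal local sketch of the two return types (the "local[2]" master and the sample data are assumptions for illustration, not from the job above):

import org.apache.spark.SparkContext

// Local sketch only; master URL and data are illustrative assumptions.
val sc = new SparkContext("local[2]", "count-vs-countByValue")
val rdd = sc.parallelize(Seq("a", "b", "a"))

val total: Long = rdd.count()   // count returns a single Long
val perValue: scala.collection.Map[String, Long] =
  rdd.countByValue()            // countByValue returns a Map of counts
                                // collected back to the driver

println(total)     // 3
println(perValue)  // Map(a -> 2, b -> 1)

sc.stop()

Both are actions, but countByValue has to build and collect a Map of per-value counts to the driver rather than a single Long.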

Thanks!
Ognen
