I am getting a strange NullPointerException when trying to fetch the first entry of a JavaPairRDD after calling groupByKey on it. Here is my code:
    JavaPairRDD<Tuple3<String, String, String>, List<String>> KeyToAppList =
        KeyToApp.distinct().groupByKey();

    // System.out.println("First member of the key-val list: " + KeyToAppList.first());
    // The call to .first() above causes a NullPointerException.

    JavaRDD<Integer> KeyToAppCount = KeyToAppList.map(
        new Function<Tuple2<Tuple3<String, String, String>, List<String>>, Integer>() {
            @Override
            public Integer call(
                    Tuple2<Tuple3<String, String, String>, List<String>> tupleOfTupAndList)
                    throws Exception {
                List<String> apps = tupleOfTupAndList._2;
                Set<String> uniqueApps = new HashSet<String>(apps);
                return uniqueApps.size();
            }
        });

    System.out.println("First member of the key-val list: " + KeyToAppCount.first());
    // The call to .first() above prints the first element just fine.

The first call to .first(), made immediately after groupByKey, results in a NullPointerException. However, if I comment out that call and instead proceed to apply the map function, calling .first() on the mapped RDD doesn't raise any exception. Why the NullPointerException immediately after applying groupByKey?
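To rule out the mapper itself, here is the same per-key distinct-count logic extracted into plain Java with no Spark involved (the sample data is made up); it behaves as expected, so the mapper body should not be the source of the NPE:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueCount {
    // Same logic as the call() body in the map above: count the distinct
    // entries in the grouped value list for one key.
    static int countUnique(List<String> apps) {
        Set<String> uniqueApps = new HashSet<String>(apps);
        return uniqueApps.size();
    }

    public static void main(String[] args) {
        // Hypothetical grouped values for a single key after groupByKey.
        List<String> apps = Arrays.asList("mail", "maps", "mail", "chat");
        System.out.println(countUnique(apps)); // prints 3
    }
}
```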
The NullPointerException looks as follows:

    Exception in thread "main" org.apache.spark.SparkException: Job aborted: Exception while deserializing and fetching task: java.lang.NullPointerException
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Calling-JavaPairRDD-first-after-calling-JavaPairRDD-groupByKey-results-in-NullPointerException-tp7318.html Sent from the Apache Spark User List mailing list archive at Nabble.com.