Re: Spark streaming 1.6.0-RC4 NullPointerException using mapWithState

2015-12-31 Thread Jan Uyttenhove
Thanks for your answers, and I'm sorry for the slight delay.

I was trying to narrow it down first, since reproducing it was very
unpredictable. In the end the unpredictability seemed related to the format
of the messages on Kafka, so I also came to suspect it had something to do
with serialization.

I was using the KryoSerializer, and I can confirm that the problem is in
fact related to it (there is no exception when using the JavaSerializer).
It is unrelated to Kafka.
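For reference, the serializer switch mentioned above is a single
configuration property; a spark-defaults.conf fragment like the following
(illustrative, using the stock Spark serializer class names) selects
between the two code paths:

```
# spark-defaults.conf fragment (illustrative)
# Kryo, which triggers the NPE on checkpoint restore:
spark.serializer    org.apache.spark.serializer.KryoSerializer
# The default Java serializer, which does not:
# spark.serializer  org.apache.spark.serializer.JavaSerializer
```

The same property can also be set programmatically on a SparkConf or via
--conf on spark-submit.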

Apparently the exception occurs when restoring state from previous
checkpoints while using Kryo serialization. When the job has checkpoints
from a previous run (containing state) and is started with new messages
already available on e.g. a Kafka topic (or via nc), the NPE occurs. And
this is, of course, a typical use case when combining Kafka, Spark
streaming and checkpointing.
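That behaviour would be consistent with a known difference between the two
serializers: Java serialization invokes a class's private readObject()
hook during deserialization, while Kryo's default field-based serialization
does not, so a field that is only re-assigned inside readObject() can come
back null after a Kryo round trip. A minimal sketch of the hook itself,
using only plain JDK Java serialization (the class and field names here
are made up for illustration):

```scala
import java.io._

// A class whose 'helper' field is deliberately excluded from the byte
// stream and re-assigned inside readObject() instead.
class Restorable extends Serializable {
  @transient var helper: String = "live"

  // Java serialization calls this hook during deserialization;
  // Kryo's default FieldSerializer does not.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject() // 'helper' is still null at this point
    helper = "restored"    // the hook repairs the transient field
  }
}

object SerDemo {
  // Round-trip an object through plain Java serialization.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[T]
  }
}
```

Round-tripping a Restorable through Java serialization yields a copy whose
helper is "restored"; under Kryo's default configuration the hook would be
skipped and the field left null, which would match the NPE appearing only
with the KryoSerializer.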

As requested I created issue SPARK-12591 (
https://issues.apache.org/jira/browse/SPARK-12591). It contains the
procedure to reproduce the error with the test case, which is again a
modified version of the StatefulNetworkWordCount Spark streaming example (
https://gist.github.com/juyttenh/9b4a4103699a7d5f698f).
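For context on the stack trace quoted further down: the failing call is a
state map delegating a missed key to its parent map. A simplified,
self-contained sketch of that delta-chain pattern (an illustration modeled
on the idea, not the actual OpenHashMapBasedStateMap code) shows how a
parent reference lost during checkpoint recovery surfaces as exactly this
kind of NPE:

```scala
import scala.collection.mutable

// Each map stores only its own updates (a delta) and falls back to its
// parent for older keys; Spark chains its state maps this way.
class ChainedStateMap[K, V](val parentStateMap: ChainedStateMap[K, V]) {
  private val deltaMap = mutable.Map.empty[K, V]

  def put(key: K, value: V): Unit = deltaMap.put(key, value)

  def get(key: K): Option[V] = deltaMap.get(key) match {
    case found @ Some(_) => found
    // No null check here: if deserialization failed to restore the
    // parent, this line throws a NullPointerException.
    case None => parentStateMap.get(key)
  }
}

// Sentinel that terminates a healthy chain (Spark uses an empty state
// map for the same purpose).
object EmptyChain {
  def apply[K, V](): ChainedStateMap[K, V] =
    new ChainedStateMap[K, V](null) {
      override def get(key: K): Option[V] = None
    }
}
```

A healthy chain ends in the empty sentinel and resolves old keys through
its parents; a map whose parent came back null from deserialization throws
the NPE on the first key miss.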

Best regards,
Jan

On Thu, Dec 31, 2015 at 2:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> I went through StateMap.scala a few times but didn't find any logic error
> yet.
>
> According to the call stack, the following was executed in get(key):
>
> } else {
>   parentStateMap.get(key)
> }
> This implies that parentStateMap was null.
> But it seems parentStateMap is properly assigned in readObject().
>
> Jan:
> Which serializer did you use?
>
> Thanks
>
> On Tue, Dec 29, 2015 at 3:42 AM, Jan Uyttenhove <j...@insidin.com> wrote:
>
>> Hi guys,
>>
>> I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new
>> mapWithState API, after previously reporting issue SPARK-11932 (
>> https://issues.apache.org/jira/browse/SPARK-11932).
>>
>> My Spark streaming job involves reading data from a Kafka topic (using
>> KafkaUtils.createDirectStream), stateful processing (using checkpointing
>> & mapWithState) & publishing the results back to Kafka.
>>
>> I'm now facing the NullPointerException below when restoring from a
>> checkpoint in the following scenario:
>> 1/ run job (with local[2]), process data from Kafka while creating &
>> keeping state
>> 2/ stop the job
>> 3/ generate some extra messages on the input Kafka topic
>> 4/ start the job again (and restore offsets & state from the checkpoints)
>>
>> The problem is caused by (or at least related to) step 3, i.e. publishing
>> data to the input topic while the job is stopped.
>> The above scenario has been tested successfully when:
>> - step 3 is excluded, so restoring state from a checkpoint is successful
>> when no messages are added while the job is stopped
>> - after step 2, the checkpoints are deleted
>>
>> Any clues? Am I doing something wrong here, or is there still a problem
>> with the mapWithState impl?
>>
>> Thanx,
>>
>> Jan
>>
>>
>>
>> 15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage
>> 3.0 (TID 24)
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at
>> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
... 3 more
15/12/29 11:56:12 INFO streaming.DStreamGraph: Updating checkpoint data for
time 145138655 ms
15/12/29 11:56:12 INFO executor.Executor: Executor killed task 1.0 in stage
3.0 (TID 25)
15/12/29 11:56:12 INFO spark.MapOutputTrackerMaster: Size of output
statuses for shuffle 2 is 153 bytes
15/12/29 11:56:12 INFO spark.MapOutputTrackerMaster: Size of output
statuses for shuffle 1 is 153 bytes



-- 
Jan Uyttenhove
Streaming data & digital solutions architect @ Insidin bvba

j...@insidin.com

https://twitter.com/xorto
https://www.linkedin.com/in/januyttenhove
