We are using Kinesis with Spark Streaming 1.5 on a YARN cluster. When we
enable checkpointing in Spark, where in the Kinesis stream should a
restarted driver continue from? I ran a simple experiment as follows:

1. In the first driver run, the Spark driver processes 1 million records
starting from InitialPositionInStream.TRIM_HORIZON, with a 5-second batch
interval and a 10-second Kinesis receiver checkpoint interval. (The
checkpoint interval has been purposely set low to see where a restarted
driver would pick up; a sketch of this setup follows the list below.)

2. We stop pushing events to the Kinesis stream until the driver has been
pulling zero events for a few minutes. Then the first driver is killed
manually with "yarn application --kill".

3. The driver is relaunched a second time, and the logs show it successfully
restored from the DFS checkpoint directory. Because the first driver had
completely processed all the entries in the stream, I would expect the
second driver to pick up at the end of the stream, or at worst within the
last 10-second checkpoint interval. However, the second driver launch (and
every subsequent driver launch) re-processes about 30 seconds' worth of
events (roughly 100,000), which does not appear to be related to the Kinesis
checkpoint interval.
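
For concreteness, our setup looks roughly like the sketch below. The
application name, stream name, endpoint, region, and checkpoint path are
placeholders rather than our actual values, and the processing is reduced
to a simple count.

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    object KinesisCheckpointExperiment {
      // Placeholder names/paths for illustration only.
      val checkpointDir = "hdfs:///checkpoints/kinesis-experiment"
      val appName       = "kinesis-checkpoint-experiment"
      val streamName    = "my-kinesis-stream"
      val endpointUrl   = "https://kinesis.us-east-1.amazonaws.com"
      val regionName    = "us-east-1"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName(appName)
        // 5-second batch interval, as in step 1.
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint(checkpointDir)

        // Kinesis receiver starting from TRIM_HORIZON, with a 10-second
        // Kinesis (KCL/DynamoDB) checkpoint interval.
        val stream = KinesisUtils.createStream(
          ssc, appName, streamName, endpointUrl, regionName,
          InitialPositionInStream.TRIM_HORIZON, Seconds(10),
          StorageLevel.MEMORY_AND_DISK_2)

        // Simple stand-in for our actual processing.
        stream.map(bytes => new String(bytes, "UTF-8")).count().print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On a restart, the driver should recover from the DFS checkpoint
        // directory instead of calling createContext again.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }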

Also, with a Kinesis driver, does it make sense to use Write Ahead Logs and
incur the cost of writing to DFS, when you could remember the previous-to-last
checkpoint and just reprocess/refetch directly from the stream?
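
For reference, the WAL I mean is the receiver write-ahead log enabled via
spark.streaming.receiver.writeAheadLog.enable, roughly as below (the
application name is again a placeholder):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    // Receiver write-ahead log: received blocks are also persisted to the
    // DFS checkpoint directory before being acknowledged.
    val conf = new SparkConf()
      .setAppName("kinesis-checkpoint-experiment") // placeholder name
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    // With the WAL on, a serialized, non-replicated storage level is the
    // usual suggestion, since the log itself provides the durability.
    val storageLevel = StorageLevel.MEMORY_AND_DISK_SER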

Any input is highly appreciated.

Thanks,
Heji
