Are you actually losing messages then?

On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang <suchenz...@gmail.com> wrote:

> No; first batch only contains messages received after the second job
> starts (messages come in at a steady rate of about 400/second).
>
> On Tue, Aug 25, 2015 at 11:07 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Does the first batch after restart contain all the messages received
>> while the job was down?
>>
>> On Tue, Aug 25, 2015 at 12:53 PM, suchenzang <suchenz...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I'm using direct spark streaming (from kafka) with checkpointing, and
>>> everything works well until a restart. When I shut down (^C) the first
>>> streaming job, wait 1 minute, then re-submit, there is somehow a series
>>> of 0
>>> event batches that get queued (corresponding to the 1 minute when the job
>>> was down). Eventually, the batches would resume processing, and I would
>>> see
>>> that each batch has roughly 2000 events.
>>>
>>> I see that at the beginning of the second launch, the checkpoint dirs are
>>> found and "loaded", according to console output.
>>>
>>> Is this expected behavior? It seems like I might've configured something
>>> incorrectly, since I would expect with checkpointing that the streaming
>>> job
>>> would resume from checkpoint and continue processing from there (without
>>> seeing 0 event batches corresponding to when the job was down).
>>>
>>> Also, if I were to wait > 10 minutes or so before re-launching, there
>>> would
>>> be so many 0 event batches that the job would hang. Is this merely
>>> something
>>> to be "waited out", or should I set up some restart behavior/make a
>>> config
>>> change to discard checkpointing if the elapsed time has been too long?
>>>
>>> Thanks!
>>>
>>> <
>>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n24450/Screen_Shot_2015-08-25_at_10.png
>>> >
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Checkpointing-Restarts-with-0-Event-Batches-tp24450.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>

Reply via email to