If you checkpoint, the restarted job will resume from the offsets it had
successfully consumed.  If you don't checkpoint, by default it will start
from the highest available offset, and you will potentially lose data.

Is the link I posted, or for that matter the scaladoc, really not clear on
that point?

The scaladoc says:

 To recover from driver failures, you have to enable checkpointing in the
StreamingContext
<http://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/StreamingContext.html>.
The information on consumed offset can be recovered from the checkpoint.
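
To make that concrete, here is a minimal sketch of the recovery pattern with
the 1.3+ direct API, assuming a hypothetical checkpoint directory, broker
list, and topic name:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///tmp/log-consumer-checkpoint"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("log-consumer")
  val ssc = new StreamingContext(conf, Seconds(60))
  ssc.checkpoint(checkpointDir)

  val kafkaParams = Map(
    "metadata.broker.list" -> "broker1:9092,broker2:9092",  // hypothetical brokers
    // Only consulted when there is no checkpoint to recover from:
    // "largest" (the default) starts at the latest offset, "smallest"
    // at the earliest.
    "auto.offset.reset" -> "largest"
  )
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("logs"))  // hypothetical topic

  stream.foreachRDD { rdd =>
    // save the batch of logs here
  }
  ssc
}

// If a checkpoint exists, the context (including consumed offsets) is rebuilt
// from it; otherwise createContext is called and the stream starts fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()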

On Tue, May 19, 2015 at 2:38 PM, Bill Jay <bill.jaypeter...@gmail.com>
wrote:

> If a Spark streaming job stops at 12:01 and I resume it at 12:02, will it
> still consume the data that were produced to Kafka at 12:01? Or will it
> just start consuming from the current time?
>
>
> On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Have you read
>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md ?
>>
>> 1.  There's nothing preventing that.
>>
>> 2. Checkpointing will give you at-least-once semantics, provided you have
>> sufficient kafka retention.  Be aware that checkpoints aren't recoverable
>> if you upgrade code.
>>
>> On Tue, May 19, 2015 at 12:42 PM, Bill Jay <bill.jaypeter...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am currently using Spark streaming to consume and save logs every hour
>>> in our production pipeline. The current setup is a crontab job that checks
>>> every minute whether the streaming job is still running and, if not,
>>> resubmits it. I am using the direct approach for the Kafka consumer. I
>>> have two questions:
>>>
>>> 1. In the direct approach, no offsets are stored in ZooKeeper and no group
>>> id is specified. Can two consumers (one being Spark streaming and the other
>>> a Kafka console consumer from the Kafka package) read from the same topic
>>> on the brokers together (I would like both of them to get all messages,
>>> i.e. publish-subscribe mode)? What about two Spark streaming jobs reading
>>> from the same topic?
>>>
>>> 2. How can data loss be avoided if a Spark job is killed? Does
>>> checkpointing serve this purpose? The default behavior of Spark streaming
>>> is to read the latest logs. However, if a job is killed, can the new job
>>> resume from where the old one left off to avoid losing logs?
>>>
>>> Thanks!
>>>
>>> Bill
>>>
>>
>>
>
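
Regarding the caveat in the quoted reply that checkpoints aren't recoverable
if you upgrade code: the blog post linked above describes tracking offsets
yourself instead of (or in addition to) relying on the checkpoint. A minimal
sketch of pulling the offset ranges out of the direct stream, with
hypothetical saveLogs / saveOffset helpers standing in for your own storage:

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // The direct stream exposes the exact Kafka offset range backing each RDD.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  saveLogs(rdd)  // hypothetical sink for the hour's logs
  offsetRanges.foreach { o =>
    // Storing results and offsets together (ideally in one transaction) is
    // what the blog post walks through.
    saveOffset(o.topic, o.partition, o.untilOffset)  // hypothetical offset store
  }
}

On restart you can feed the stored offsets back in via the createDirectStream
overload that takes a fromOffsets map, which also sidesteps the code-upgrade
limitation of checkpoints.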
