Hi Cody,

That is clear. Thanks!

Bill

On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger <c...@koeninger.org> wrote:

> If you checkpoint, the job will start from the successfully consumed
> offsets.  If you don't checkpoint, by default it will start from the
> highest available offset, and you will potentially lose data.
>
> Is the link I posted, or for that matter the scaladoc, really not clear on
> that point?
>
> The scaladoc says:
>
>  To recover from driver failures, you have to enable checkpointing in the
> StreamingContext
> <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/StreamingContext.html>.
> The information on consumed offset can be recovered from the checkpoint.
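>
> As a minimal sketch of checkpoint-based recovery with the direct stream
> (the checkpoint path, output path, app name, broker list, and topic below
> are all placeholders):
>
>   import kafka.serializer.StringDecoder
>   import org.apache.spark.SparkConf
>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>   import org.apache.spark.streaming.kafka.KafkaUtils
>
>   val checkpointDir = "hdfs:///path/to/checkpoints"  // placeholder
>   val outputDir = "hdfs:///path/to/logs"             // placeholder
>
>   def createContext(): StreamingContext = {
>     val ssc = new StreamingContext(
>       new SparkConf().setAppName("log-consumer"), Seconds(60))
>     ssc.checkpoint(checkpointDir)
>     val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
>     val stream = KafkaUtils.createDirectStream[
>       String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("logs"))
>     stream.map(_._2).foreachRDD { rdd =>
>       // write each batch somewhere durable
>       rdd.saveAsTextFile(s"$outputDir/batch-${System.currentTimeMillis}")
>     }
>     ssc
>   }
>
>   // On restart, getOrCreate rebuilds the context, including the consumed
>   // offsets, from the checkpoint instead of calling createContext again.
>   val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
>   ssc.start()
>   ssc.awaitTermination()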
>
> On Tue, May 19, 2015 at 2:38 PM, Bill Jay <bill.jaypeter...@gmail.com>
> wrote:
>
>> If a Spark streaming job stops at 12:01 and I resume it at 12:02, will it
>> still consume the data that were produced to Kafka at 12:01, or will it
>> just start consuming from the current time?
>>
>>
>> On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Have you read
>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>> ?
>>>
>>> 1.  There's nothing preventing that.
>>>
>>> 2. Checkpointing will give you at-least-once semantics, provided you
>>> have sufficient kafka retention.  Be aware that checkpoints aren't
>>> recoverable if you upgrade code.
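>>>
>>> If you need restarts to survive a code upgrade, the usual alternative
>>> (covered in the blog post above) is to read the offset ranges yourself
>>> via HasOffsetRanges and store them together with your results. A rough
>>> sketch, where stream is assumed to come from
>>> KafkaUtils.createDirectStream and the actual storage is left as a
>>> placeholder:
>>>
>>>   import org.apache.spark.streaming.kafka.HasOffsetRanges
>>>
>>>   stream.foreachRDD { rdd =>
>>>     val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>>     // placeholder: save the batch's output and these offsets in the
>>>     // same transaction, then seed the next run from the saved offsets
>>>     offsetRanges.foreach { o =>
>>>       println(s"${o.topic} ${o.partition}: ${o.fromOffset} to ${o.untilOffset}")
>>>     }
>>>   }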
>>>
>>> On Tue, May 19, 2015 at 12:42 PM, Bill Jay <bill.jaypeter...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am currently using Spark streaming to consume and save logs every
>>>> hour in our production pipeline. The current setup runs a crontab job
>>>> that checks every minute whether the streaming job is still running
>>>> and, if not, resubmits it. I am using the direct approach for the
>>>> Kafka consumer. I have two questions:
>>>>
>>>> 1. In the direct approach, no offset is stored in ZooKeeper and no
>>>> group id is specified. Can two consumers (one being Spark streaming and
>>>> the other a Kafka console consumer from the Kafka package) read from
>>>> the same topic on the brokers together (I would like both of them to
>>>> get all messages, i.e. publish-subscribe mode)? What about two Spark
>>>> streaming jobs reading from the same topic?
>>>>
>>>> 2. How can I avoid data loss if a Spark job is killed? Does
>>>> checkpointing serve this purpose? The default behavior of Spark
>>>> streaming is to read the latest logs. However, if a job is killed, can
>>>> the new job resume from where the old one left off to avoid losing
>>>> logs?
>>>>
>>>> Thanks!
>>>>
>>>> Bill
>>>>
>>>
>>>
>>
>
