Thanks Cody.

Here are the events:

- Spark app connects to Kafka for the first time and starts consuming
- Messages 1 - 10 arrive at Kafka, and the Spark app receives them
- The driver then dies
- Messages 11 - 15 arrive at Kafka
- The Spark driver program restarts and reconnects to Kafka
- Messages 16 - 20 then arrive at Kafka

What I want is for Spark to ignore messages 11 - 15
but process 16 - 20, since those arrived after the driver reconnected
to Kafka.

Is this what happens by default in your suggestion?
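
For reference, here is a minimal sketch of the createDirectStream setup Cody suggests in the quoted message below. The broker address and group id are placeholders I made up, not values from the thread:

```scala
// Minimal sketch of the Spark 1.3+ direct-stream Kafka parameters
// (Kafka 0.8-era config names). Broker list and group id are placeholders.
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker1:9092",        // placeholder broker address
  "auto.offset.reset"    -> "largest",             // default: start at the newest message
  "group.id"             -> "fresh-consumer-group" // new group => no stored offsets
)
// With these params, KafkaUtils.createDirectStream(ssc, kafkaParams, topics)
// begins at the most recent offsets, so messages that arrived while the
// driver was down (11 - 15 in the example above) would be skipped.
```

Since "largest" is already the default, simply not setting auto.offset.reset (and using a group with no saved offsets) should give the same behavior.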





On Tue, May 12, 2015 at 3:52 PM, Cody Koeninger <c...@koeninger.org> wrote:

> I don't think it's accurate for Akhil to claim that the linked library is
> "much more flexible/reliable" than what's available in Spark at this point.
>
> James, what you're describing is the default behavior of the
> createDirectStream API, available in Spark since 1.3. The Kafka
> parameter auto.offset.reset defaults to largest, i.e. start at the most
> recent available message.
>
> This is described at
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html  The
> createDirectStream api implementation is described in detail at
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>
> If for some reason you're stuck on an earlier version of Spark, you can
> accomplish what you want simply by starting the job with a new consumer
> group (there will be no prior state in ZooKeeper, so it will start
> consuming according to auto.offset.reset).
>
> On Tue, May 12, 2015 at 7:26 AM, James King <jakwebin...@gmail.com> wrote:
>
>> Very nice! will try and let you know, thanks.
>>
>> On Tue, May 12, 2015 at 2:25 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> Yep, you can try this low-level Kafka receiver:
>>> https://github.com/dibbhatt/kafka-spark-consumer. It's much more
>>> flexible/reliable than the one that comes with Spark.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Tue, May 12, 2015 at 5:15 PM, James King <jakwebin...@gmail.com>
>>> wrote:
>>>
>>>> What I want is this: if the driver dies for some reason and is
>>>> restarted, I want to read only the messages that arrived in Kafka after
>>>> the driver program restarted and reconnected to Kafka.
>>>>
>>>> Has anyone done this? Any links or resources that can help explain this?
>>>>
>>>> Regards
>>>> jk
>>>>
>>>>
>>>>
>>>
>>
>
