Yes, that's what happens by default. If you want to be super accurate about it, you can also specify the exact starting offsets for every topic/partition.
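For example, an untested sketch against the Spark 1.3 createDirectStream API; the topic name, partition numbers, and offset values are placeholders, and ssc and kafkaParams are assumed to be defined already:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Placeholder topic/partition -> offset pairs; in practice you'd read
    // these back from wherever you stored them yourself.
    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> 1100L,
      TopicAndPartition("mytopic", 1) -> 980L
    )

    // Decide what to keep from each message; here, just the key and value.
    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)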
On Tue, May 12, 2015 at 9:01 AM, James King <jakwebin...@gmail.com> wrote:

> Thanks Cody.
>
> Here are the events:
>
> - Spark app connects to Kafka for the first time and starts consuming
> - Messages 1 - 10 arrive at Kafka, then the Spark app gets them
> - Now the driver dies
> - Messages 11 - 15 arrive at Kafka
> - The Spark driver program reconnects
> - Then messages 16 - 20 arrive at Kafka
>
> What I want is for Spark to ignore 11 - 15 but process 16 - 20, since
> they arrived after the driver reconnected to Kafka.
>
> Is this what happens by default in your suggestion?
>
> On Tue, May 12, 2015 at 3:52 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> I don't think it's accurate for Akhil to claim that the linked library is
>> "much more flexible/reliable" than what's available in Spark at this point.
>>
>> James, what you're describing is the default behavior for the
>> createDirectStream API available as part of Spark since 1.3. The Kafka
>> parameter auto.offset.reset defaults to largest, i.e. start at the most
>> recent available message.
>>
>> This is described at
>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>> The createDirectStream implementation is described in detail at
>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>
>> If for some reason you're stuck using an earlier version of Spark, you
>> can accomplish what you want simply by starting the job with a new
>> consumer group (there will be no prior state in ZooKeeper, so it will
>> start consuming according to auto.offset.reset).
>>
>> On Tue, May 12, 2015 at 7:26 AM, James King <jakwebin...@gmail.com> wrote:
>>
>>> Very nice! Will try and let you know, thanks.
>>>
>>> On Tue, May 12, 2015 at 2:25 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>
>>>> Yep, you can try this low-level Kafka receiver:
>>>> https://github.com/dibbhatt/kafka-spark-consumer. It's much more
>>>> flexible/reliable than the one that comes with Spark.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, May 12, 2015 at 5:15 PM, James King <jakwebin...@gmail.com> wrote:
>>>>
>>>>> What I want is: if the driver dies for some reason and is restarted,
>>>>> I want to read only messages that arrived in Kafka after the restart
>>>>> of the driver program and the reconnection to Kafka.
>>>>>
>>>>> Has anyone done this? Any links or resources that can help explain this?
>>>>>
>>>>> Regards
>>>>> jk
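For reference, a minimal sketch of the default setup Cody describes above
(Spark 1.3's createDirectStream with auto.offset.reset shown explicitly,
even though largest is already the default). The broker list and topic
name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("LatestOffsetsOnly")
    val ssc = new StreamingContext(conf, Seconds(5))

    // "largest" is already the default for the direct stream; shown here
    // explicitly. A restarted driver with no stored offsets begins at the
    // latest available message, skipping anything that arrived while it
    // was down (messages 11 - 15 in James's scenario).
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      "auto.offset.reset" -> "largest")

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()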