Akhil, I hope I'm misreading the tone of this. If you have personal issues
at stake, please take them up outside of the public list. If you have
actual factual concerns about the Kafka integration, please share them in a
JIRA.

Regarding reliability, here's a screenshot of a current production job with
three weeks of uptime. It ran for a month before that; I only took it down
to change code.

http://tinypic.com/r/2e4vkht/8

Regarding flexibility, both of the APIs available in Spark will do what
James needs, as I described.
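For reference, here's a minimal sketch of the direct-stream setup I'm
describing, against the Spark 1.3 createDirectStream API. The broker list
and topic name are placeholders; with no previously committed offsets for
the group, the stream starts from auto.offset.reset, and "largest" (the
default) means only messages that arrive after startup are read:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // No stored offsets for this job means the direct stream falls back to
    // auto.offset.reset; "largest" (the default, shown explicitly here)
    // starts at the most recent available message.
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092", // placeholder brokers
      "auto.offset.reset" -> "largest"
    )
    val topics = Set("mytopic") // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).print() // print message values each batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```

On pre-1.3 versions the same effect comes from pointing the receiver-based
API at a fresh consumer group, as noted below in the thread.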



On Tue, May 12, 2015 at 8:55 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Hi Cody,
>
> If you are so sure, can you share a benchmark (one you ran for days,
> maybe?) that you have done with the Kafka APIs provided by Spark?
>
> Thanks
> Best Regards
>
> On Tue, May 12, 2015 at 7:22 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> I don't think it's accurate for Akhil to claim that the linked library is
>> "much more flexible/reliable" than what's available in Spark at this point.
>>
>> James, what you're describing is the default behavior for the
>> createDirectStream API, available as part of Spark since 1.3. The Kafka
>> parameter auto.offset.reset defaults to largest, i.e. start at the most
>> recent available message.
>>
>> This is described at
>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>> The createDirectStream API implementation is described in detail at
>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>
>> If for some reason you're stuck using an earlier version of Spark, you
>> can accomplish what you want simply by starting the job using a new
>> consumer group (there will be no prior state in ZooKeeper, so it will
>> start consuming according to auto.offset.reset).
>>
>> On Tue, May 12, 2015 at 7:26 AM, James King <jakwebin...@gmail.com>
>> wrote:
>>
>>> Very nice! will try and let you know, thanks.
>>>
>>> On Tue, May 12, 2015 at 2:25 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> Yep, you can try this low-level Kafka receiver
>>>> https://github.com/dibbhatt/kafka-spark-consumer. It's much more
>>>> flexible/reliable than the one that comes with Spark.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, May 12, 2015 at 5:15 PM, James King <jakwebin...@gmail.com>
>>>> wrote:
>>>>
>>>>> What I want is this: if the driver dies for some reason and is
>>>>> restarted, I want to read only messages that arrived in Kafka after the
>>>>> restart of the driver program and re-connection to Kafka.
>>>>>
>>>>> Has anyone done this? Any links or resources that can help explain
>>>>> this?
>>>>>
>>>>> Regards
>>>>> jk
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>