Hi Cody,

I was just saying that I found more success and higher throughput with the low-level Kafka API prior to KafkaRDDs, which seem to be the future. My apologies if it came across that way. :)

On 12 May 2015 19:47, "Cody Koeninger" <c...@koeninger.org> wrote:
> Akhil, I hope I'm misreading the tone of this. If you have personal issues
> at stake, please take them up outside of the public list. If you have
> actual factual concerns about the Kafka integration, please share them in a
> JIRA.
>
> Regarding reliability, here's a screenshot of a current production job
> with a 3-week uptime. It was up a month before that; I only took it down
> to change code.
>
> http://tinypic.com/r/2e4vkht/8
>
> Regarding flexibility, both of the APIs available in Spark will do what
> James needs, as I described.
>
>
> On Tue, May 12, 2015 at 8:55 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> Hi Cody,
>>
>> If you are so sure, can you share a benchmark (which you ran for days,
>> maybe?) that you have done with the Kafka APIs provided by Spark?
>>
>> Thanks
>> Best Regards
>>
>> On Tue, May 12, 2015 at 7:22 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>>> I don't think it's accurate for Akhil to claim that the linked library
>>> is "much more flexible/reliable" than what's available in Spark at this
>>> point.
>>>
>>> James, what you're describing is the default behavior for the
>>> createDirectStream API, available as part of Spark since 1.3. The Kafka
>>> parameter auto.offset.reset defaults to largest, i.e. start at the most
>>> recent available message.
>>>
>>> This is described at
>>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>>> The createDirectStream implementation is described in detail at
>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>
>>> If for some reason you're stuck on an earlier version of Spark, you
>>> can accomplish what you want simply by starting the job with a new
>>> consumer group (there will be no prior state in ZooKeeper, so it will
>>> start consuming according to auto.offset.reset).
>>>
>>> On Tue, May 12, 2015 at 7:26 AM, James King <jakwebin...@gmail.com> wrote:
>>>
>>>> Very nice! Will try and let you know, thanks.
>>>>
>>>> On Tue, May 12, 2015 at 2:25 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>>
>>>>> Yep, you can try this low-level Kafka receiver:
>>>>> https://github.com/dibbhatt/kafka-spark-consumer. It's much more
>>>>> flexible/reliable than the one that comes with Spark.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Tue, May 12, 2015 at 5:15 PM, James King <jakwebin...@gmail.com> wrote:
>>>>>
>>>>>> What I want is: if the driver dies for some reason and is restarted,
>>>>>> I want to read only the messages that arrived in Kafka after the
>>>>>> restart of the driver program and its reconnection to Kafka.
>>>>>>
>>>>>> Has anyone done this? Any links or resources that can help explain
>>>>>> this?
>>>>>>
>>>>>> Regards
>>>>>> jk
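[Editor's note: as a sketch of the behavior Cody describes above, a minimal Spark 1.3-era createDirectStream job might look like the following. The broker address, topic name, and batch interval are placeholders, not from the thread; the snippet assumes the spark-streaming-kafka artifact matching your Spark version is on the classpath.]

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LatestOnlyExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("latest-only")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, String](
      // placeholder broker list
      "metadata.broker.list" -> "broker1:9092",
      // "largest" is already the default: with no checkpoint or stored
      // offsets, a restarted driver begins at the most recent offset,
      // skipping messages that arrived while it was down
      "auto.offset.reset" -> "largest"
    )

    // Direct stream: offsets are tracked by Spark itself, not ZooKeeper.
    // (The "start with a new consumer group" trick Cody mentions applies
    // to the older receiver-based createStream API, which stores offsets
    // in ZooKeeper keyed by group.id.)
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    stream.foreachRDD { rdd =>
      println(s"batch size: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With these settings, a fresh start of the driver (no checkpoint directory) consumes only messages produced after the stream starts, which matches the behavior James asked for.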