I mean to say it is simpler in the case of failures, restarts, upgrades, etc., not just failures.
But they did do a lot of work on streaming from Kafka in Spark 1.3.x to make it simpler (the streaming side simply calls KafkaRDD for every batch if you use KafkaUtils.createDirectStream), so maybe I am wrong and streaming is just as good an approach. Not sure...

On Sat, Apr 18, 2015 at 3:13 PM, Koert Kuipers <[email protected]> wrote:

> Yeah, I think I would pick the second approach because it is simpler
> operationally in case of any failures. But of course the smaller the
> window gets, the more attractive the streaming solution gets.
>
> We do daily extracts, not every 2 hours.
>
> On Sat, Apr 18, 2015 at 2:57 PM, Shushant Arora <[email protected]> wrote:
>
>> Thanks Koert.
>>
>> So in short, for the high-level API I'll have to go with Spark Streaming
>> only, and there the issue is handling cluster restarts. Is that why you
>> opted for the second approach of a batch job, or was it due to the batch
>> interval (2 hours is large for a streaming job), or some other reason?
>>
>> On Sun, Apr 19, 2015 at 12:20 AM, Koert Kuipers <[email protected]> wrote:
>>
>>> KafkaRDD uses the simple consumer API, and I think you need to handle
>>> offsets yourself, unless things changed since I last looked.
>>>
>>> I would do the second approach.
>>>
>>> On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora <[email protected]> wrote:
>>>
>>>> Thanks!
>>>>
>>>> I have a few more doubts:
>>>>
>>>> Does KafkaRDD use the simple API or the high-level API for the Kafka
>>>> consumer? I mean, do I need to handle partition offsets myself, or will
>>>> that be taken care of by KafkaRDD? Also, which one is better for batch
>>>> programming? I have a requirement to read Kafka messages with a Spark
>>>> job at a 2-hour interval.
>>>>
>>>> 1. One approach is to use Spark Streaming (with a stream duration of
>>>> 2 hours) + Kafka. My doubt is: is Spark Streaming stable enough to
>>>> handle a cluster outage? If the Spark cluster gets restarted, will the
>>>> streaming application be able to handle it, or do I need to restart the
>>>> streaming application and pass the last offsets, or how is it going to
>>>> work? Also, will the executor nodes be different in each run of the
>>>> stream interval, or once decided will the same nodes be used throughout
>>>> the application's life? Does Spark Streaming use the high-level API for
>>>> Kafka integration?
>>>>
>>>> 2. The second approach is to use a Spark batch job and fire a new job
>>>> at every 2-hour interval, using KafkaRDD to read from Kafka. Now the
>>>> doubt is: who will maintain the offsets of the last read messages? Does
>>>> my application need to maintain them, or can I use the high-level API
>>>> here somehow?
>>>>
>>>> Thanks
>>>> Shushant
>>>>
>>>> On Sat, Apr 18, 2015 at 9:09 PM, Ilya Ganelin <[email protected]> wrote:
>>>>
>>>>> That's a much better idea :)
>>>>>
>>>>> On Sat, Apr 18, 2015 at 11:22 AM, Koert Kuipers <[email protected]> wrote:
>>>>>
>>>>>> Use KafkaRDD directly. It is in the spark-streaming-kafka package.
>>>>>>
>>>>>> On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora <[email protected]> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I want to consume messages from a Kafka queue using a Spark batch
>>>>>>> program, not Spark Streaming. Is there any way to achieve this,
>>>>>>> other than using the low-level (simple API) Kafka consumer?
>>>>>>>
>>>>>>> Thanks
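A minimal sketch of the second (batch) approach discussed above, assuming the Spark 1.3.x spark-streaming-kafka API. The broker list, topic name, offsets, and output path are placeholders; the point is that with KafkaUtils.createRDD the application itself decides and persists the offset ranges for each run:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchExtract {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-extract"))

    // Direct (simple consumer) API: brokers are contacted directly,
    // there are no group-managed offsets.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // Placeholder offsets: in practice the previous run's untilOffset for each
    // partition would be read back from wherever it was stored (HDFS, a DB, ZooKeeper).
    val offsetRanges = Array(
      OffsetRange("mytopic", 0, 0L, 1000L), // topic, partition, fromOffset, untilOffset
      OffsetRange("mytopic", 1, 0L, 1000L)
    )

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    // Do the 2-hourly extract, then persist the untilOffsets somewhere durable
    // so the next run knows where to start.
    rdd.map(_._2).saveAsTextFile("/data/kafka-extract/run-1")

    sc.stop()
  }
}
```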

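And a sketch of the first (streaming) approach using the direct stream added in Spark 1.3.x, again with placeholder broker/topic values and a 2-hour batch interval. Each batch RDD from the direct stream exposes the offset ranges it covers via HasOffsetRanges, so the application can record them itself for recovery after a restart:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object KafkaDirectStreamExtract {
  def main(args: Array[String]): Unit = {
    // A 2-hour batch interval, matching the extract window discussed above.
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-direct-stream"), Seconds(2 * 60 * 60))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("mytopic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // The direct stream's batch RDDs expose exactly which offsets they cover,
      // so the application can store them and replay from there after a restart.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r =>
        println(s"${r.topic} partition ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))

      rdd.map(_._2).saveAsTextFile(s"/data/kafka-stream/${System.currentTimeMillis}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```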