I mean to say it is simpler in the case of failures, restarts, upgrades, etc., not just failures.
But they did do a lot of work on streaming from Kafka in Spark 1.3.x to make it simpler (the streaming side simply calls KafkaRDD for every batch if you use KafkaUtils.createDirectStream), so maybe I am wrong and streaming is just as good an approach. Not sure...

On Sat, Apr 18, 2015 at 3:13 PM, Koert Kuipers <[email protected]> wrote:

> Yeah, I think I would pick the second approach because it is simpler
> operationally in case of any failures. But of course the smaller the
> window gets, the more attractive the streaming solution gets.
>
> We do daily extracts, not every 2 hours.
>
> On Sat, Apr 18, 2015 at 2:57 PM, Shushant Arora <[email protected]> wrote:
>
>> Thanks Koert.
>>
>> So in short, for the high-level API I'll have to go with Spark Streaming
>> only, and there the issue is handling cluster restarts. Is that why you
>> opted for the second approach of a batch job, or was it due to the batch
>> interval (2 hours is large for a streaming job), or some other reason?
>>
>> On Sun, Apr 19, 2015 at 12:20 AM, Koert Kuipers <[email protected]> wrote:
>>
>>> KafkaRDD uses the simple consumer API, and I think you need to handle
>>> offsets yourself, unless things changed since I last looked.
>>>
>>> I would do the second approach.
>>>
>>> On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora <[email protected]> wrote:
>>>
>>>> Thanks!
>>>>
>>>> I have a few more doubts:
>>>>
>>>> Does KafkaRDD use the simple API or the high-level API for the Kafka
>>>> consumer? I mean, do I need to handle partition offsets myself, or will
>>>> that be taken care of by KafkaRDD? Also, which one is better for batch
>>>> programming? I have a requirement to read Kafka messages with a Spark
>>>> job at a 2-hour interval.
>>>>
>>>> 1. One approach is to use Spark Streaming (with a stream duration of
>>>> 2 hours) + Kafka. My doubt is: is Spark Streaming stable enough to
>>>> handle a cluster outage? If the Spark cluster gets restarted, will the
>>>> streaming application be able to handle it, or do I need to restart the
>>>> streaming application and pass the last offsets, or how is it going to
>>>> work? Also, will the executor nodes be different in each run of the
>>>> stream interval, or once decided will the same nodes be used throughout
>>>> the application's life? Does Spark Streaming use the high-level API for
>>>> Kafka integration?
>>>>
>>>> 2. The second approach is to use a Spark batch job and fire a new job
>>>> at every 2-hour interval, using KafkaRDD to read from Kafka. Now the
>>>> doubt is: who will maintain the offsets of the last read messages? Does
>>>> my application need to maintain them, or can I use the high-level API
>>>> here somehow?
>>>>
>>>> Thanks
>>>> Shushant
>>>>
>>>> On Sat, Apr 18, 2015 at 9:09 PM, Ilya Ganelin <[email protected]> wrote:
>>>>
>>>>> That's a much better idea :)
>>>>>
>>>>> On Sat, Apr 18, 2015 at 11:22 AM, Koert Kuipers <[email protected]> wrote:
>>>>>
>>>>>> Use KafkaRDD directly. It is in the spark-streaming-kafka package.
>>>>>>
>>>>>> On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora <[email protected]> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I want to consume messages from a Kafka queue using a Spark batch
>>>>>>> program, not Spark Streaming. Is there any way to achieve this,
>>>>>>> other than using the low-level (simple API) Kafka consumer?
>>>>>>>
>>>>>>> Thanks
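A minimal sketch of the second (batch) approach discussed above, assuming the Spark 1.3.x spark-streaming-kafka API. The broker list, topic name, offsets, and output path are placeholders; the point is that with KafkaUtils.createRDD the application itself decides and persists the offset ranges for each run:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchExtract {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-extract"))

    // Direct (simple consumer) API: brokers are contacted directly,
    // there are no group-managed offsets.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // Placeholder offsets: in practice the previous run's untilOffset for each
    // partition would be read back from wherever it was stored (HDFS, a DB, ZooKeeper).
    val offsetRanges = Array(
      OffsetRange("mytopic", 0, 0L, 1000L), // topic, partition, fromOffset, untilOffset
      OffsetRange("mytopic", 1, 0L, 1000L)
    )

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    // Do the 2-hourly extract, then persist the untilOffsets somewhere durable
    // so the next run knows where to start.
    rdd.map(_._2).saveAsTextFile("/data/kafka-extract/run-1")

    sc.stop()
  }
}
```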

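And a sketch of the first (streaming) approach using the direct stream added in Spark 1.3.x, again with placeholder broker/topic values and a 2-hour batch interval. Each batch RDD from the direct stream exposes the offset ranges it covers via HasOffsetRanges, so the application can record them itself for recovery after a restart:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object KafkaDirectStreamExtract {
  def main(args: Array[String]): Unit = {
    // A 2-hour batch interval, matching the extract window discussed above.
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-direct-stream"), Seconds(2 * 60 * 60))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("mytopic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // The direct stream's batch RDDs expose exactly which offsets they cover,
      // so the application can store them and replay from there after a restart.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r =>
        println(s"${r.topic} partition ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))

      rdd.map(_._2).saveAsTextFile(s"/data/kafka-stream/${System.currentTimeMillis}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```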