Many thanks both, appreciate the help.

On Tue, May 12, 2015 at 4:18 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Yes, that's what happens by default.
>
> If you want to be super accurate about it, you can also specify the exact
> starting offsets for every topic/partition.
>
> On Tue, May 12, 2015 at 9:01 AM, James King <jakwebin...@gmail.com> wrote:
>
>> Thanks Cody.
>>
>> Here are the events:
>>
>> - Spark app connects to Kafka for the first time and starts consuming
>> - Messages 1 - 10 arrive at Kafka, then the Spark app gets them
>> - Now the driver dies
>> - Messages 11 - 15 arrive at Kafka
>> - The Spark driver program reconnects
>> - Then messages 16 - 20 arrive at Kafka
>>
>> What I want is for Spark to ignore 11 - 15 but process 16 - 20, since
>> they arrived after the driver reconnected to Kafka.
>>
>> Is this what happens by default in your suggestion?
>>
>> On Tue, May 12, 2015 at 3:52 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>>> I don't think it's accurate for Akhil to claim that the linked library
>>> is "much more flexible/reliable" than what's available in Spark at this
>>> point.
>>>
>>> James, what you're describing is the default behavior for the
>>> createDirectStream api, available as part of Spark since 1.3. The Kafka
>>> parameter auto.offset.reset defaults to largest, i.e. start at the most
>>> recent available message.
>>>
>>> This is described at
>>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>>> The createDirectStream api implementation is described in detail at
>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>
>>> If for some reason you're stuck on an earlier version of Spark, you can
>>> accomplish what you want simply by starting the job with a new consumer
>>> group (there will be no prior state in ZooKeeper, so it will start
>>> consuming according to auto.offset.reset).
>>>
>>> On Tue, May 12, 2015 at 7:26 AM, James King <jakwebin...@gmail.com> wrote:
>>>
>>>> Very nice! Will try and let you know, thanks.
>>>>
>>>> On Tue, May 12, 2015 at 2:25 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>>
>>>>> Yep, you can try this low-level Kafka receiver:
>>>>> https://github.com/dibbhatt/kafka-spark-consumer. It's much more
>>>>> flexible/reliable than the one that comes with Spark.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Tue, May 12, 2015 at 5:15 PM, James King <jakwebin...@gmail.com> wrote:
>>>>>
>>>>>> What I want is: if the driver dies for some reason and is restarted,
>>>>>> I want to read only the messages that arrived in Kafka following the
>>>>>> restart of the driver program and the re-connection to Kafka.
>>>>>>
>>>>>> Has anyone done this? Any links or resources that can help explain this?
>>>>>>
>>>>>> Regards
>>>>>> jk
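For concreteness, the default behavior Cody describes maps roughly to the
following Spark 1.3+ Scala code. This is a minimal sketch, not a complete
application: the broker address and the topic name "mytopic" are hypothetical
placeholders, not anything from the thread.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamDefault {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("DirectStreamDefault"), Seconds(10))

        // With no checkpoint and no stored offsets, the direct stream starts
        // wherever auto.offset.reset points; "largest" (the default) means
        // the most recent available message, so messages that arrived while
        // the driver was down are skipped.
        val kafkaParams = Map(
          "metadata.broker.list" -> "broker1:9092", // placeholder broker
          "auto.offset.reset"    -> "largest")

        val stream = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("mytopic")) // placeholder topic

        stream.map(_._2).print() // records arrive as (key, message) pairs

        ssc.start()
        ssc.awaitTermination()
      }
    }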
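Cody's refinement about specifying exact starting offsets uses the
createDirectStream overload that takes a fromOffsets map plus a message
handler. Again a sketch, reusing ssc and kafkaParams from above; the partition
numbers and offset values are made up for illustration.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Start each topic/partition at an explicit offset (illustrative values;
    // in practice you would load these from your own offset store).
    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> 16L,
      TopicAndPartition("mytopic", 1) -> 16L)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))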
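For the pre-1.3 fallback Cody mentions, the receiver-based createStream keeps
its offsets in ZooKeeper keyed by group.id, so a never-before-used group id
means there is no prior state and auto.offset.reset takes effect. A sketch,
with a hypothetical ZooKeeper address and the same placeholder topic:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // A fresh group.id has no offsets stored in ZooKeeper, so consumption
    // starts according to auto.offset.reset ("largest" = newest messages).
    val oldApiParams = Map(
      "zookeeper.connect" -> "zk1:2181", // placeholder ZooKeeper quorum
      "group.id"          -> s"my-app-${System.currentTimeMillis}", // new group each run
      "auto.offset.reset" -> "largest")

    val receiverStream = KafkaUtils.createStream[
      String, String, StringDecoder, StringDecoder](
      ssc, oldApiParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)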