Yes, that's what happens by default. If you want to be super accurate about it, you can also specify the exact starting offsets for every topic/partition.
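For example, an untested sketch against the Spark 1.3 createDirectStream API; the topic name, partition numbers, and offset values are placeholders, and ssc and kafkaParams are assumed to be defined already:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Placeholder topic/partition -> offset pairs; in practice you'd read
    // these back from wherever you stored them yourself.
    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> 1100L,
      TopicAndPartition("mytopic", 1) -> 980L
    )

    // Decide what to keep from each message; here, just the key and value.
    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)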
On Tue, May 12, 2015 at 9:01 AM, James King <jakwebin...@gmail.com> wrote:

> Thanks Cody.
>
> Here are the events:
>
> - Spark app connects to Kafka for the first time and starts consuming
> - Messages 1 - 10 arrive at Kafka, then the Spark app gets them
> - Now the driver dies
> - Messages 11 - 15 arrive at Kafka
> - The Spark driver program reconnects
> - Then messages 16 - 20 arrive at Kafka
>
> What I want is for Spark to ignore 11 - 15 but process 16 - 20, since
> they arrived after the driver reconnected to Kafka.
>
> Is this what happens by default in your suggestion?
>
> On Tue, May 12, 2015 at 3:52 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> I don't think it's accurate for Akhil to claim that the linked library is
>> "much more flexible/reliable" than what's available in Spark at this point.
>>
>> James, what you're describing is the default behavior for the
>> createDirectStream API available as part of Spark since 1.3. The Kafka
>> parameter auto.offset.reset defaults to largest, i.e. start at the most
>> recent available message.
>>
>> This is described at
>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>> The createDirectStream implementation is described in detail at
>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>
>> If for some reason you're stuck using an earlier version of Spark, you
>> can accomplish what you want simply by starting the job with a new
>> consumer group (there will be no prior state in ZooKeeper, so it will
>> start consuming according to auto.offset.reset).
>>
>> On Tue, May 12, 2015 at 7:26 AM, James King <jakwebin...@gmail.com> wrote:
>>
>>> Very nice! Will try and let you know, thanks.
>>>
>>> On Tue, May 12, 2015 at 2:25 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>
>>>> Yep, you can try this low-level Kafka receiver:
>>>> https://github.com/dibbhatt/kafka-spark-consumer. It's much more
>>>> flexible/reliable than the one that comes with Spark.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, May 12, 2015 at 5:15 PM, James King <jakwebin...@gmail.com> wrote:
>>>>
>>>>> What I want is: if the driver dies for some reason and is restarted,
>>>>> I want to read only messages that arrived in Kafka after the restart
>>>>> of the driver program and the reconnection to Kafka.
>>>>>
>>>>> Has anyone done this? Any links or resources that can help explain this?
>>>>>
>>>>> Regards
>>>>> jk
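For reference, a minimal sketch of the default setup Cody describes above
(Spark 1.3's createDirectStream with auto.offset.reset shown explicitly,
even though largest is already the default). The broker list and topic
name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("LatestOffsetsOnly")
    val ssc = new StreamingContext(conf, Seconds(5))

    // "largest" is already the default for the direct stream; shown here
    // explicitly. A restarted driver with no stored offsets begins at the
    // latest available message, skipping anything that arrived while it
    // was down (messages 11 - 15 in James's scenario).
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      "auto.offset.reset" -> "largest")

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()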