Thanks Cody. Here are the events:
- Spark app connects to Kafka for the first time and starts consuming
- Messages 1 - 10 arrive at Kafka, then the Spark app gets them
- Now the driver dies
- Messages 11 - 15 arrive at Kafka
- The Spark driver program reconnects
- Then messages 16 - 20 arrive at Kafka

What I want is for Spark to ignore 11 - 15 but process 16 - 20, since they arrived after the driver reconnected to Kafka. Is this what happens by default in your suggestion?

On Tue, May 12, 2015 at 3:52 PM, Cody Koeninger <c...@koeninger.org> wrote:

> I don't think it's accurate for Akhil to claim that the linked library is
> "much more flexible/reliable" than what's available in Spark at this point.
>
> James, what you're describing is the default behavior for the
> createDirectStream api available as part of Spark since 1.3. The Kafka
> parameter auto.offset.reset defaults to largest, i.e. start at the most
> recent available message.
>
> This is described at
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html and the
> createDirectStream api implementation is described in detail at
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>
> If for some reason you're stuck using an earlier version of Spark, you can
> accomplish what you want simply by starting the job using a new consumer
> group (there will be no prior state in ZooKeeper, so it will start
> consuming according to auto.offset.reset).
>
> On Tue, May 12, 2015 at 7:26 AM, James King <jakwebin...@gmail.com> wrote:
>
>> Very nice! Will try it and let you know, thanks.
>>
>> On Tue, May 12, 2015 at 2:25 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> Yep, you can try this low-level Kafka receiver:
>>> https://github.com/dibbhatt/kafka-spark-consumer. It's much more
>>> flexible/reliable than the one that comes with Spark.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Tue, May 12, 2015 at 5:15 PM, James King <jakwebin...@gmail.com>
>>> wrote:
>>>
>>>> What I want is: if the driver dies for some reason and is restarted, I
>>>> want to read only messages that arrived in Kafka after the restart of
>>>> the driver program and the re-connection to Kafka.
>>>>
>>>> Has anyone done this? Any links or resources that can help explain this?
>>>>
>>>> Regards
>>>> jk
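For anyone reading this thread in the archives, here is a minimal sketch of the offset semantics Cody describes. This is plain Python with hypothetical class and method names, not the actual Spark or Kafka API: a consumer group with no committed offset and auto.offset.reset=largest starts at the current end of the log, so messages that arrived while the driver was down are skipped.

```python
# Toy model of Kafka's auto.offset.reset=largest behavior for a
# brand-new consumer group. All names here are hypothetical; this
# illustrates the semantics only, it is not the Spark/Kafka API.

class TopicLog:
    """An append-only log standing in for a Kafka topic-partition."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def end_offset(self):
        return len(self.messages)


class Consumer:
    """A consumer group with no previously committed offset.

    With auto.offset.reset=largest it begins at the current end of
    the log, so anything already in the log is ignored.
    """
    def __init__(self, log):
        self.log = log
        self.offset = log.end_offset()  # "largest": start at most recent

    def poll(self):
        msgs = self.log.messages[self.offset:]
        self.offset = self.log.end_offset()
        return msgs


log = TopicLog()
driver = Consumer(log)                 # app connects before any messages
for i in range(1, 11):                 # messages 1-10 arrive
    log.append(i)
assert driver.poll() == list(range(1, 11))   # 1-10 processed

# Driver dies; messages 11-15 arrive while it is down.
for i in range(11, 16):
    log.append(i)

# Restarted driver uses a fresh consumer group: no stored offset, so
# auto.offset.reset=largest applies and it starts at the log's end.
restarted = Consumer(log)
for i in range(16, 21):                # messages 16-20 arrive afterwards
    log.append(i)
print(restarted.poll())                # [16, 17, 18, 19, 20] - 11-15 skipped
```

This mirrors Cody's suggestion for pre-1.3 Spark as well: starting the job under a new consumer group means there is no prior offset in ZooKeeper, so the reset policy decides where consumption begins.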