We have not tried the direct approach. We are using the receiver-based approach (we use ZooKeeper to connect from Spark).
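Our setup looks roughly like this (a minimal sketch; the ZooKeeper quorum, group id, topic, and checkpoint path below are placeholders, not our real values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf()
      .setAppName("KafkaCounts")
      // WAL for receiver fault tolerance, as described below
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-sec batches
    ssc.checkpoint("hdfs:///checkpoints/kafka-counts")

    // Receiver-based stream: the Kafka high-level consumer tracks offsets
    // in ZooKeeper, separately from Spark's checkpoint/WAL.
    val messages = KafkaUtils.createStream(
      ssc,
      "zk1:2181,zk2:2181,zk3:2181", // ZooKeeper quorum
      "counting-app",               // consumer group id
      Map("events" -> 4))           // topic -> receiver thread count

    messages.map(_._2).countByValue().print() // counts per message value
    ssc.start()
    ssc.awaitTermination()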
We have around 20+ Kafka brokers, and sometimes we replace brokers (they go down). Each time that happens I need to change the broker list in the Spark application and restart the streaming app.

Thanks
Sandesh

On Wed, Jun 22, 2016 at 4:25 PM, Denys Cherepanin <denusk...@gmail.com> wrote:

> Hi Sandesh,
>
> As I understand it, you are using the "receiver based" approach to
> integrate Kafka with Spark Streaming.
>
> Have you tried the "direct" approach
> <http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers>?
> In that case offsets are tracked by the streaming app via checkpointing,
> and you should achieve exactly-once semantics.
>
> On Wed, Jun 22, 2016 at 5:58 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Spark Streaming does not guarantee exactly-once for output actions; it
>> only guarantees that each item is processed once within an RDD. You can
>> achieve at-most-once or at-least-once.
>> You could, however, do at-least-once (via checkpointing), record which
>> messages have already been processed (is some identifier available?), and
>> skip reprocessing them. You could also safely store which ranges have
>> already been processed, etc.
>>
>> Think about the business case: is exactly-once really needed, or can it
>> be replaced by one of the other semantics? Exactly-once, if needed,
>> requires more effort in any system, including Spark, and usually lowers
>> throughput. A risk evaluation from a business point of view has to be
>> done anyway...
>>
>> > On 22 Jun 2016, at 09:09, sandesh deshmane <sandesh.v...@gmail.com>
>> wrote:
>> >
>> > Hi,
>> >
>> > I am writing a Spark Streaming application which reads messages from
>> Kafka.
>> >
>> > I am using checkpointing and write-ahead logs (WAL) to achieve fault
>> tolerance.
>> >
>> > I use a batch interval of 10 sec for reading messages from Kafka.
>> >
>> > I read the messages from Kafka and generate counts based on the values
>> received in the Kafka messages.
>> >
>> > When there is a failure and my Spark Streaming application is
>> restarted, I see duplicate messages processed (close to 2 batches' worth).
>> >
>> > The problem is that I receive around 300k messages per second, so when
>> the application is restarted I see around 3-5 million duplicate counts.
>> >
>> > How can I avoid such duplicates?
>> >
>> > What is the best way to recover from such failures?
>> >
>> > Thanks
>> > Sandesh

> --
> Yours faithfully, Denys Cherepanin
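For reference, the direct approach from the linked docs looks roughly like this (a minimal sketch against the spark-streaming-kafka 0.8 API; the broker list, topic, and checkpoint path are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(
      new SparkConf().setAppName("KafkaCounts"), Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/kafka-counts")

    // Direct stream: no receivers and no WAL. Spark computes the offset
    // range for each batch itself and stores it in the checkpoint, so a
    // restart resumes from exact offsets instead of replaying ~2 batches.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val messages = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("events"))

    messages.map(_._2).countByValue().print()
    ssc.start()
    ssc.awaitTermination()

Even then, exactly-once for the output action still needs idempotent or transactional writes, per Jörn's point above.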