Hi, I am writing a Spark Streaming application that reads messages from Kafka.
I am using checkpointing and write-ahead logs (WAL) to achieve fault tolerance, with a batch interval of 10 seconds for reading from Kafka. For each batch I generate counts of the messages based on the values received in them.

When there is a failure and my Spark Streaming application is restarted, I see duplicate messages being processed, roughly two batches' worth. Since I receive around 300k messages per second, a restart produces around 3-5 million duplicate counts.

How can I avoid such duplicates? What is the best way to recover from such failures?

Thanks,
Sandesh
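P.S. For reference, here is a simplified sketch of the kind of setup I am describing. I am showing the receiver-based KafkaUtils.createStream API here, since that is the path the WAL applies to; the topic, consumer group, and host names are placeholders, not my real values.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaMessageCounter {

  def createContext(checkpointDir: String): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("KafkaMessageCounter")
      // Persist received blocks to the write-ahead log so they survive a driver failure
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    // 10-second batch interval, as described above
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Placeholder connection settings -- replace with real topic/group/host
    val kafkaParams = Map(
      "zookeeper.connect" -> "zk-host:2181",
      "group.id"          -> "message-counter")
    val topics = Map("my-topic" -> 1)

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)

    // Count messages per batch, keyed by the value received in the Kafka message
    stream.map { case (_, value) => (value, 1L) }
      .reduceByKey(_ + _)
      .print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs:///checkpoints/kafka-message-counter"
    // Recover from the checkpoint after a restart, or build a fresh context the first time
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
    ssc.start()
    ssc.awaitTermination()
  }
}

After a restart, StreamingContext.getOrCreate recovers the context from the checkpoint directory and replays the data saved in the WAL, and it is during this replay that I see the duplicate counts.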