Hi Experts,

I am using Spark Structured Streaming to read messages from Kafka, with a producer that provides an at-least-once guarantee. The streaming job runs on a YARN cluster with Hadoop 2.7 and Spark 2.3.
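For context, the read side of the job looks roughly like the sketch below (the broker list, topic name, and app name are placeholders, not my actual values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-dedup-job")
  .getOrCreate()

// Source: the upstream producer is at-least-once, so this stream can
// contain duplicate messages after producer retries.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")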
What is the most reliable strategy for avoiding duplicate data within the stream across fail-overs and job restarts/re-submits, so that the result is an exactly-once, duplicate-free stream?

1. One of the strategies I have read other people using is to maintain an external KV store holding a unique key/checksum for each incoming message, and to write a message to a 2nd Kafka topic only if its checksum is not already present in the KV store (a rough sketch of what I mean follows below).

My doubt with this approach is how to ensure a safe write to both the 2nd topic and the KV store in the case of unexpected failures: if the job dies after recording the checksum but before producing the message (or the other way around), how does that guarantee exactly-once with restarts?
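To make the question concrete, here is a minimal sketch of the idea, assuming Redis (via Jedis) as the KV store; the hosts, topic names, and the DedupWriter class are all hypothetical:

import java.security.MessageDigest
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

class DedupWriter extends ForeachWriter[Row] {
  private var jedis: Jedis = _
  private var producer: KafkaProducer[String, String] = _

  // Called on the executor when a partition/epoch starts processing.
  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis("redis-host", 6379)
    val props = new java.util.Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit = {
    val value = row.getAs[String]("value")
    val checksum = MessageDigest.getInstance("MD5")
      .digest(value.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString
    // SETNX succeeds only for the first writer of this checksum,
    // so duplicates from producer retries or job restarts are skipped.
    if (jedis.setnx(s"dedup:$checksum", "1").longValue == 1L) {
      // A crash between setnx() and send() is exactly the failure
      // window I am worried about: the checksum is recorded but the
      // message never reaches the 2nd topic.
      producer.send(new ProducerRecord("output-topic", checksum, value))
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
    if (jedis != null) jedis.close()
  }
}

val query = input.writeStream
  .foreach(new DedupWriter)
  .option("checkpointLocation", "/tmp/checkpoints/dedup-job")
  .start()

As the comment in process() notes, the two writes (Redis and the 2nd Kafka topic) are not atomic, and I do not see how to make that pair of writes safe across failures.

Any suggestions are highly appreciated.

Akshay Bhardwaj
+91-97111-33849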