Hi Experts,

I am using Spark Structured Streaming to read messages from Kafka, with a producer that provides an at-least-once guarantee. The streaming job runs on a YARN cluster with Hadoop 2.7 and Spark 2.3.
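For context, the read side of the job looks roughly like the sketch below (the broker list, topic name, and app name are placeholders, not my actual values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-dedup-job")
  .getOrCreate()

// Source: the upstream producer is at-least-once, so this stream can
// contain duplicate messages after producer retries.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "input-topic")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")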
What is the most reliable strategy for avoiding duplicate data within the stream across fail-overs and job restarts/re-submits, so that the result is an exactly-once, duplicate-free stream?

1. One of the strategies I have read other people using is to maintain an external KV store holding a unique key/checksum for each incoming message, and to write a message to a 2nd Kafka topic only if its checksum is not already present in the KV store (a rough sketch of what I mean follows below).

My doubt with this approach is how to ensure a safe write to both the 2nd topic and the KV store in the case of unexpected failures: if the job dies after recording the checksum but before producing the message (or the other way around), how does that guarantee exactly-once with restarts?
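To make the question concrete, here is a minimal sketch of the idea, assuming Redis (via Jedis) as the KV store; the hosts, topic names, and the DedupWriter class are all hypothetical:

import java.security.MessageDigest
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

class DedupWriter extends ForeachWriter[Row] {
  private var jedis: Jedis = _
  private var producer: KafkaProducer[String, String] = _

  // Called on the executor when a partition/epoch starts processing.
  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis("redis-host", 6379)
    val props = new java.util.Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit = {
    val value = row.getAs[String]("value")
    val checksum = MessageDigest.getInstance("MD5")
      .digest(value.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString
    // SETNX succeeds only for the first writer of this checksum,
    // so duplicates from producer retries or job restarts are skipped.
    if (jedis.setnx(s"dedup:$checksum", "1").longValue == 1L) {
      // A crash between setnx() and send() is exactly the failure
      // window I am worried about: the checksum is recorded but the
      // message never reaches the 2nd topic.
      producer.send(new ProducerRecord("output-topic", checksum, value))
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
    if (jedis != null) jedis.close()
  }
}

val query = input.writeStream
  .foreach(new DedupWriter)
  .option("checkpointLocation", "/tmp/checkpoints/dedup-job")
  .start()

As the comment in process() notes, the two writes (Redis and the 2nd Kafka topic) are not atomic, and I do not see how to make that pair of writes safe across failures.

Any suggestions are highly appreciated.

Akshay Bhardwaj
+91-97111-33849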