Hi,

Have you checked the docs? I.e.,
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication

You can derive a unique id column in your streaming DataFrame (for example an id carried in the message, or a checksum of the payload) and drop duplicate messages with a single line of code via dropDuplicates. If you also define a watermark on an event-time column, Spark bounds the deduplication state it has to keep across the stream.
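Roughly like the sketch below. The broker address, topic name, checkpoint path, and the 10-minute watermark are placeholders for your setup, and the sha2 checksum of the Kafka value stands in for the "guid" column used in the guide's example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaDedupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-dedup-sketch").getOrCreate()

    // Kafka source; bootstrap servers and topic are placeholders.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Derive a deterministic unique id per message (a sha2 checksum of the
    // payload) and keep the Kafka timestamp as the event-time column.
    val withId = raw.select(
      sha2(col("value").cast("string"), 256).as("eventId"),
      col("timestamp").as("eventTime"),
      col("value").cast("string").as("payload"))

    // The watermark lets Spark discard dedup state older than 10 minutes,
    // so duplicates are caught only if they arrive within that bound.
    val deduped = withId
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("eventId", "eventTime")

    // Console sink for illustration; the checkpoint location is what lets
    // the dedup state survive restarts and re-submits.
    deduped.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/dedup-checkpoint")
      .start()
      .awaitTermination()
  }
}

One caveat: deduplicating on both the id and the event-time column follows the guide's example, so this only catches duplicates that carry the same timestamp. If your producer's re-sends can produce different Kafka timestamps, prefer an event-time field taken from the payload itself.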
Best,
Anastasios

On Wed, May 1, 2019 at 11:15 AM Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:

> Hi All,
>
> Floating this again. Any suggestions?
>
> Akshay Bhardwaj
> +91-97111-33849
>
> On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:
>
>> Hi Experts,
>>
>> I am using Spark Structured Streaming to read messages from Kafka, with a
>> producer that provides an at-least-once guarantee. This streaming job runs
>> on a YARN cluster with Hadoop 2.7 and Spark 2.3.
>>
>> What is the most reliable strategy for avoiding duplicate data within the
>> stream across fail-overs and job restarts/re-submits, so as to guarantee an
>> exactly-once, duplicate-free stream?
>>
>> 1. One strategy I have read about is to maintain an external KV store
>> holding a unique key/checksum for each incoming message, and to write to a
>> second Kafka topic only if the checksum is not already present in the KV
>> store.
>>    - My doubt with this approach is how to ensure a safe write to both the
>> second topic and the KV store in the case of unexpected failures. How does
>> that guarantee exactly-once across restarts?
>>
>> Any suggestions are highly appreciated.
>>
>> Akshay Bhardwaj
>> +91-97111-33849

--
Anastasios Zouzias <a...@zurich.ibm.com>