Hi Anastasios,

Thanks for this.

I have a few doubts about this approach. The dropDuplicates operation will keep all the data across triggers.
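
To make the concern concrete, here is a minimal plain-Python sketch of what keyed deduplication state has to remember between triggers. It is a toy model and not Spark's implementation: the `DedupState` class, the `process_batch` method and the `watermark` argument are hypothetical names used only for illustration. The optional watermark is loosely modeled on the watermark variant described in the linked docs, where old state may be dropped.

```python
from dataclasses import dataclass, field


@dataclass
class DedupState:
    """Toy model of keyed deduplication state; NOT Spark's implementation."""

    # key -> event time of the first record seen with that key
    seen: dict = field(default_factory=dict)

    def process_batch(self, batch, watermark=None):
        """Return the records in `batch` whose key has not been seen before.

        batch: list of (key, event_time) tuples for one trigger.
        watermark: hypothetical cutoff; keys first seen before it are
        evicted after the batch, which bounds the size of the state.
        """
        fresh = []
        for key, event_time in batch:
            if key not in self.seen:
                self.seen[key] = event_time
                fresh.append((key, event_time))
        if watermark is not None:
            # Keep only keys whose first event time is at or after the cutoff.
            self.seen = {k: t for k, t in self.seen.items() if t >= watermark}
        return fresh


state = DedupState()
first = state.process_batch([("a", 1), ("b", 2), ("a", 3)])
second = state.process_batch([("b", 4), ("c", 5)])
print(first)            # [('a', 1), ('b', 2)]
print(second)           # [('c', 5)]
print(len(state.seen))  # 3
```

Without a cutoff, `seen` grows with every new key, and if the process restarts with an empty `seen`, every key looks new again. That is why the location and durability of this state matter for the questions below.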
1. Where is this data stored?
   - IN_MEMORY state means the data is not persisted when the job is resubmitted.
   - Persistence on disk, such as HDFS, has proved unreliable: I have encountered corrupted files that cause errors on job restarts.

Akshay Bhardwaj
+91-97111-33849


On Wed, May 1, 2019 at 3:20 PM Anastasios Zouzias <zouz...@gmail.com> wrote:

> Hi,
>
> Have you checked the docs, i.e.,
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
>
> You can generate a uuid column in your streaming DataFrame and drop
> duplicate messages with a single line of code.
>
> Best,
> Anastasios
>
> On Wed, May 1, 2019 at 11:15 AM Akshay Bhardwaj <
> akshay.bhardwaj1...@gmail.com> wrote:
>
>> Hi All,
>>
>> Floating this again. Any suggestions?
>>
>>
>> Akshay Bhardwaj
>> +91-97111-33849
>>
>>
>> On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <
>> akshay.bhardwaj1...@gmail.com> wrote:
>>
>>> Hi Experts,
>>>
>>> I am using Spark Structured Streaming to read messages from Kafka, with a
>>> producer that works with an at-least-once guarantee. This streaming job is
>>> running on a YARN cluster with Hadoop 2.7 and Spark 2.3.
>>>
>>> What is the most reliable strategy for avoiding duplicate data within the
>>> stream in the scenarios of fail-over or job restarts/re-submits, and
>>> guaranteeing an exactly-once, non-duplicate stream?
>>>
>>>
>>>    1. One of the strategies I have read other people using is to
>>>    maintain an external KV store for unique-key/checksum of the incoming
>>>    message, and write to a 2nd Kafka topic only if the checksum is not
>>>    present in the KV store.
>>>    - My doubt with this approach is how to ensure a safe write to both
>>>    the 2nd topic and the KV store for storing the checksum, in the case of
>>>    unexpected failures. How does that guarantee exactly-once with restarts?
>>>
>>> Any suggestions are highly appreciated.
>>>
>>>
>>> Akshay Bhardwaj
>>> +91-97111-33849
>>>
>>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>