Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Akshay Bhardwaj
Hi Anastasios, Thanks for this. I have a few doubts about this approach. The dropDuplicates operation will keep all the data across triggers.
1. Where is this data stored?
   - IN_MEMORY state means the data is not persisted during job resubmit.
   - Persistence on disk like HDFS has
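On where this state lives: deduplication state is kept in the executors' state store and, when a checkpoint location is configured, snapshotted to that HDFS-compatible path so it survives restarts; a watermark bounds how long keys are retained. A minimal Scala sketch using the built-in rate source as a stand-in for a real Kafka-derived DataFrame (the checkpoint path is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("dedup-state-sketch").getOrCreate()

// The rate source emits (timestamp, value) rows; it stands in here
// for the real parsed Kafka stream.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// The watermark bounds deduplication state: keys whose event time falls
// more than 10 minutes behind the max observed timestamp are evicted.
val deduped = events
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("value", "timestamp")

// State is snapshotted to the checkpoint location (an HDFS-compatible
// path), so it is not lost on job resubmit.
deduped.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/dedup-state") // hypothetical path
  .start()
  .awaitTermination()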

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Anastasios Zouzias
Hi, Have you checked the docs, i.e., https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication ? You can generate a uuid column in your streaming DataFrame and drop duplicate messages with a single line of code.

Best,
Anastasios

On Wed, May 1, 2019
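For reference, the deduplication pattern in that guide looks roughly like this in Scala (assuming an existing SparkSession named spark; the guid/eventTime columns mirror the guide's example and are placeholders, not the poster's schema):

// Stand-in streaming DataFrame with a unique-id column "guid" and an
// event-time column "eventTime", as in the guide's example.
val streamingDf = spark.readStream
  .format("rate").load()
  .withColumnRenamed("value", "guid")
  .withColumnRenamed("timestamp", "eventTime")

// Without a watermark: dedupe on guid; state grows without bound, since
// every guid ever seen must be remembered across triggers.
streamingDf.dropDuplicates("guid")

// With a watermark: rows arriving later than the watermark are dropped,
// and state older than the watermark is evicted, bounding the store.
streamingDf
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("guid", "eventTime")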

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Akshay Bhardwaj
Hi All, Floating this again. Any suggestions?

Akshay Bhardwaj
+91-97111-33849

On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:
> Hi Experts,
>
> I am using Spark Structured Streaming to read messages from Kafka, with a
> producer that works with at-least

Spark Structured Streaming | Highly reliable de-duplication strategy

2019-04-30 Thread Akshay Bhardwaj
Hi Experts,

I am using Spark Structured Streaming to read messages from Kafka, with a producer that works with an at-least-once guarantee. This streaming job is running on a YARN cluster with Hadoop 2.7 and Spark 2.3. What is the most reliable strategy for avoiding duplicate data within the stream in the
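For the setup described above, the Kafka source read would look roughly like this (a sketch; broker address and topic name are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-reader-sketch").getOrCreate()

// Kafka source, available in Spark 2.3 via the spark-sql-kafka-0-10 package.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
  .option("subscribe", "input-topic")                // hypothetical topic
  .option("startingOffsets", "latest")
  .load()

// With an at-least-once producer, the same logical message can be written
// (and therefore read) more than once under different Kafka offsets,
// which is why deduplication is needed downstream.
val messages = kafkaDf.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value")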