Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Akshay Bhardwaj
Hi Anastasios, Thanks for this. I have a few doubts about this approach. The dropDuplicates operation will keep all the data across triggers.
1. Where is this data stored?
   - IN_MEMORY state means the data is not persisted during job resubmit.
   - Persistence on disk like HDFS has
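On where this state lives: deduplication state is kept in the executors' state store and, when a checkpoint location is configured, snapshotted to that HDFS-compatible path so it survives restarts; a watermark bounds how long keys are retained. A minimal Scala sketch using the built-in rate source as a stand-in for a real Kafka-derived DataFrame (the checkpoint path is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("dedup-state-sketch").getOrCreate()

// The rate source emits (timestamp, value) rows; it stands in here
// for the real parsed Kafka stream.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// The watermark bounds deduplication state: keys whose event time falls
// more than 10 minutes behind the max observed timestamp are evicted.
val deduped = events
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("value", "timestamp")

// State is snapshotted to the checkpoint location (an HDFS-compatible
// path), so it is not lost on job resubmit.
deduped.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/dedup-state") // hypothetical path
  .start()
  .awaitTermination()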

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Anastasios Zouzias
Hi, Have you checked the docs, i.e., https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication ? You can generate a uuid column in your streaming DataFrame and drop duplicate messages with a single line of code.

Best,
Anastasios

On Wed, May 1, 2019
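For reference, the deduplication pattern in that guide looks roughly like this in Scala (assuming an existing SparkSession named spark; the guid/eventTime columns mirror the guide's example and are placeholders, not the poster's schema):

// Stand-in streaming DataFrame with a unique-id column "guid" and an
// event-time column "eventTime", as in the guide's example.
val streamingDf = spark.readStream
  .format("rate").load()
  .withColumnRenamed("value", "guid")
  .withColumnRenamed("timestamp", "eventTime")

// Without a watermark: dedupe on guid; state grows without bound, since
// every guid ever seen must be remembered across triggers.
streamingDf.dropDuplicates("guid")

// With a watermark: rows arriving later than the watermark are dropped,
// and state older than the watermark is evicted, bounding the store.
streamingDf
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("guid", "eventTime")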

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Akshay Bhardwaj
Hi All, Floating this again. Any suggestions?

Akshay Bhardwaj
+91-97111-33849

On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:
> Hi Experts,
>
> I am using Spark Structured Streaming to read messages from Kafka, with a
> producer that works with at-least

Spark Structured Streaming | Highly reliable de-duplication strategy

2019-04-30 Thread Akshay Bhardwaj
Hi Experts,

I am using Spark Structured Streaming to read messages from Kafka, with a producer that works with an at-least-once guarantee. This streaming job is running on a YARN cluster with Hadoop 2.7 and Spark 2.3. What is the most reliable strategy for avoiding duplicate data within the stream in the
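For the setup described above, the Kafka source read would look roughly like this (a sketch; broker address and topic name are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-reader-sketch").getOrCreate()

// Kafka source, available in Spark 2.3 via the spark-sql-kafka-0-10 package.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
  .option("subscribe", "input-topic")                // hypothetical topic
  .option("startingOffsets", "latest")
  .load()

// With an at-least-once producer, the same logical message can be written
// (and therefore read) more than once under different Kafka offsets,
// which is why deduplication is needed downstream.
val messages = kafkaDf.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value")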