Hi Anastasios,
Thanks for this.
I have a few doubts about this approach. The dropDuplicates operation will
keep all the data across triggers.
1. Where is this data stored?
- IN_MEMORY state means the data is not persisted across a job resubmit.
- Persistence on disk, like HDFS, has
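
For concreteness, here is roughly the setup I mean, as a sketch with
placeholder paths and the built-in rate source standing in for Kafka. My
understanding is that between triggers the state sits in executor memory,
and each trigger snapshots it to the checkpoint directory, so reusing the
same checkpointLocation on resubmit recovers it:

    // Sketch only: assumes Spark 2.x's default HDFS-backed state store and
    // placeholder paths. dropDuplicates state is held in executor memory
    // between triggers and snapshotted to the checkpointLocation, so it is
    // recovered when the query restarts with the same checkpoint directory.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dedup-state-demo").getOrCreate()

    val deduped = spark.readStream
      .format("rate")                           // built-in test source: timestamp, value
      .load()
      .withWatermark("timestamp", "10 minutes") // lets Spark drop old dedup state
      .dropDuplicates("value", "timestamp")

    deduped.writeStream
      .format("console")
      .option("checkpointLocation", "hdfs:///tmp/checkpoints/dedup-demo") // placeholder
      .start()
      .awaitTermination()

Without a watermark the dedup state would grow without bound, which is the
bigger worry about keeping data across triggers.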
Hi,
Have you checked the docs? E.g.:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
You can deduplicate on a unique id column (e.g., a uuid attached by the
producer) in your streaming DataFrame and drop duplicate messages with a
single line of code.
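Roughly like this, assuming your messages are JSON and the producer attaches
the uuid plus an event-time field (broker, topic, and field names below are
placeholders, not your actual schema):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("kafka-dedup").getOrCreate()

    // Assumed payload schema: a producer-generated uuid plus an event time.
    val schema = new StructType()
      .add("uuid", StringType)
      .add("eventTime", TimestampType)
      .add("payload", StringType)

    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "events")                     // placeholder topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("m"))
      .select("m.uuid", "m.eventTime", "m.payload")

    // The one-liner from the guide: exact dedup, but unbounded state.
    val deduped = messages.dropDuplicates("uuid")

    // Variant with bounded state: duplicates are only tracked while their
    // event time is within the watermark.
    val dedupedBounded = messages
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("uuid", "eventTime")

Note that the uuid has to come from the producer (or be derived
deterministically from the message); generating a fresh uuid on the Spark
side would make every row unique and defeat the deduplication.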
Best,
Anastasios
On Wed, May 1, 2019, Akshay Bhardwaj wrote:
Hi All,
Floating this again. Any suggestions?
Akshay Bhardwaj
+91-97111-33849
On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:
Hi Experts,
I am using Spark Structured Streaming to read messages from Kafka, with a
producer that works with an at-least-once guarantee. The streaming job is
running on a YARN cluster with Hadoop 2.7 and Spark 2.3.
What is the most reliable strategy for avoiding duplicate data within the
stream in this setup?