Hi,

Have you checked the docs? I.e.,
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication

You can derive a unique id column in your streaming DataFrame (for example an id carried in the message, or a checksum of the payload) and drop duplicate messages with a single line of code via dropDuplicates. If you also define a watermark on an event-time column, Spark bounds the deduplication state it has to keep across the stream.
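Roughly like the sketch below. The broker address, topic name, checkpoint path, and the 10-minute watermark are placeholders for your setup, and the sha2 checksum of the Kafka value stands in for the "guid" column used in the guide's example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaDedupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-dedup-sketch").getOrCreate()

    // Kafka source; bootstrap servers and topic are placeholders.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Derive a deterministic unique id per message (a sha2 checksum of the
    // payload) and keep the Kafka timestamp as the event-time column.
    val withId = raw.select(
      sha2(col("value").cast("string"), 256).as("eventId"),
      col("timestamp").as("eventTime"),
      col("value").cast("string").as("payload"))

    // The watermark lets Spark discard dedup state older than 10 minutes,
    // so duplicates are caught only if they arrive within that bound.
    val deduped = withId
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("eventId", "eventTime")

    // Console sink for illustration; the checkpoint location is what lets
    // the dedup state survive restarts and re-submits.
    deduped.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/dedup-checkpoint")
      .start()
      .awaitTermination()
  }
}

One caveat: deduplicating on both the id and the event-time column follows the guide's example, so this only catches duplicates that carry the same timestamp. If your producer's re-sends can produce different Kafka timestamps, prefer an event-time field taken from the payload itself.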
Best,
Anastasios

On Wed, May 1, 2019 at 11:15 AM Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:

> Hi All,
>
> Floating this again. Any suggestions?
>
> Akshay Bhardwaj
> +91-97111-33849
>
> On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:
>
>> Hi Experts,
>>
>> I am using Spark Structured Streaming to read messages from Kafka, with a
>> producer that provides an at-least-once guarantee. This streaming job runs
>> on a YARN cluster with Hadoop 2.7 and Spark 2.3.
>>
>> What is the most reliable strategy for avoiding duplicate data within the
>> stream across fail-overs and job restarts/re-submits, so as to guarantee an
>> exactly-once, duplicate-free stream?
>>
>> 1. One strategy I have read about is to maintain an external KV store
>> holding a unique key/checksum for each incoming message, and to write to a
>> second Kafka topic only if the checksum is not already present in the KV
>> store.
>>    - My doubt with this approach is how to ensure a safe write to both the
>> second topic and the KV store in the case of unexpected failures. How does
>> that guarantee exactly-once across restarts?
>>
>> Any suggestions are highly appreciated.
>>
>> Akshay Bhardwaj
>> +91-97111-33849

--
Anastasios Zouzias <a...@zurich.ibm.com>