Hi Anastasios,

Thanks for this.
I have a few doubts about this approach. The dropDuplicates operation will
keep all the data across triggers.

   1. Where is this data stored?
      - IN_MEMORY state means the data is not persisted across job resubmits.
      - Persistence on disk, such as HDFS, has proved unreliable: I have
      encountered corrupted files that cause errors on job restarts.
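
For reference, here is a minimal sketch of how a watermark could bound the
deduplication state rather than keeping every key forever. It is a toy model
in plain Python, not Spark's implementation: the class name WatermarkDedup,
the (id, event_time) tuple layout and the max_lateness parameter are all
hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # (unique id, event time in seconds); assumed layout


@dataclass
class WatermarkDedup:
    """Toy model of watermark-bounded deduplication (not Spark's code)."""

    max_lateness: int  # how late a duplicate may arrive, in seconds
    seen: Dict[str, int] = field(default_factory=dict)  # id -> event time
    max_event_time: int = 0

    def process_batch(self, batch: Iterable[Event]) -> List[Event]:
        # The watermark for this batch comes from event times seen so far.
        watermark = self.max_event_time - self.max_lateness
        out: List[Event] = []
        for key, ts in batch:
            if ts < watermark:
                continue  # older than the watermark: dropped as too late
            if key in self.seen:
                continue  # duplicate of a key still held in state
            self.seen[key] = ts
            out.append((key, ts))
            self.max_event_time = max(self.max_event_time, ts)
        # Evict keys that fall behind the new watermark, so state stays bounded.
        new_watermark = self.max_event_time - self.max_lateness
        self.seen = {k: t for k, t in self.seen.items() if t >= new_watermark}
        return out


dedup = WatermarkDedup(max_lateness=10)
first = dedup.process_batch([("a", 100), ("a", 100), ("b", 105)])
second = dedup.process_batch([("a", 100), ("c", 130)])
print(first, second, dedup.seen)
```

In Spark itself, the streaming deduplication section of the programming guide
describes the corresponding pattern as calling withWatermark on an event-time
column before dropDuplicates, so that old state can be dropped. Whether that
resolves the checkpoint corruption on HDFS is a separate question.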



Akshay Bhardwaj
+91-97111-33849


On Wed, May 1, 2019 at 3:20 PM Anastasios Zouzias <zouz...@gmail.com> wrote:

> Hi,
>
> Have you checked the docs, i.e.,
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
>
> You can generate a uuid column in your streaming DataFrame and drop
> duplicate messages with a single line of code.
>
> Best,
> Anastasios
>
> On Wed, May 1, 2019 at 11:15 AM Akshay Bhardwaj <
> akshay.bhardwaj1...@gmail.com> wrote:
>
>> Hi All,
>>
>> Floating this again. Any suggestions?
>>
>>
>> Akshay Bhardwaj
>> +91-97111-33849
>>
>>
>> On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <
>> akshay.bhardwaj1...@gmail.com> wrote:
>>
>>> Hi Experts,
>>>
>>> I am using Spark Structured Streaming to read messages from Kafka, with a
>>> producer that works with an at-least-once guarantee. This streaming job is
>>> running on a YARN cluster with Hadoop 2.7 and Spark 2.3.
>>>
>>> What is the most reliable strategy for avoiding duplicate data within the
>>> stream in the scenarios of fail-over or job restarts/re-submits, and for
>>> guaranteeing an exactly-once, duplicate-free stream?
>>>
>>>
>>>    1. One of the strategies I have read about other people using is to
>>>    maintain an external KV store for the unique key/checksum of each
>>>    incoming message, and write to a 2nd Kafka topic only if the checksum
>>>    is not present in the KV store.
>>>    - My doubt with this approach is how to ensure a safe write to both
>>>       the 2nd topic and the KV store that holds the checksums in the
>>>       case of unexpected failures. How does that guarantee exactly-once
>>>       across restarts?
>>>
>>> Any suggestions are highly appreciated.
>>>
>>>
>>> Akshay Bhardwaj
>>> +91-97111-33849
>>>
>>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>
>
