What happens if the write to HDFS succeeds before the HBase put?

-Joey
On Wed, Dec 3, 2014 at 2:35 PM, Mike Keane <[email protected]> wrote:
> We effectively mitigated this problem by using the UUID interceptor and
> customizing the HDFS Sink to do a check-and-put of the UUID to HBase. In the
> customized sink we check HBase to see if we have seen the UUID before; if we
> have, it is a duplicate, so we log a new duplicate metric alongside the
> existing sink metrics and throw the event away. If we have not seen the UUID
> before, we write the event to HDFS and do a put of the UUID to HBase.
>
> Because of our volume, to minimize the number of check/puts to HBase, we put
> multiple logs in a single FlumeEvent.
>
> -Mike
>
> ________________________________________
> From: Guillermo Ortiz [[email protected]]
> Sent: Wednesday, December 03, 2014 4:15 PM
> To: [email protected]
> Subject: Re: Deal with duplicates in Flume with a crash.
>
> I didn't know anything about a Hive Sink; I'll check the JIRA about it,
> thanks. The pipeline is Flume-Kafka-SparkStreaming-XXX.
>
> So I guess I should deal with it in Spark Streaming, right? I guess it would
> be easy to do with a UUID interceptor, or is there an easier way?
>
> 2014-12-03 22:56 GMT+01:00 Roshan Naik <[email protected]>:
>> Using the UUID interceptor at the source closest to data origination will
>> help identify duplicate events after they are delivered.
>>
>> If it satisfies your use case, the upcoming Hive Sink will mitigate the
>> problem a little bit (since it uses transactions to write to the
>> destination).
>>
>> -roshan
>>
>> On Wed, Dec 3, 2014 at 8:44 AM, Joey Echeverria <[email protected]> wrote:
>>> There's nothing built into Flume to deal with duplicates; it only
>>> provides at-least-once delivery semantics.
>>>
>>> You'll have to handle it in your data processing applications or add
>>> an ETL step to deal with duplicates before making the data available
>>> for other queries.
>>>
>>> -Joey
>>>
>>> On Wed, Dec 3, 2014 at 5:46 AM, Guillermo Ortiz <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > I would like to know if there's an easy way to deal with data
>>> > duplication when an agent crashes and resends the same data.
>>> >
>>> > Is there any mechanism in Flume to deal with it?
>>>
>>> --
>>> Joey Echeverria

--
Joey Echeverria
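
[A minimal sketch of the "record this UUID if unseen" step Mike describes,
not his actual sink code, assuming an HBase 0.98-era client API. The table
name "event_uuids", column family "f", and qualifier "seen" are illustrative
placeholders. Whether this is called before or after the HDFS write is
exactly the ordering question Joey raises above.]

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UuidDeduper {
      private static final byte[] FAMILY = Bytes.toBytes("f");
      private static final byte[] QUALIFIER = Bytes.toBytes("seen");

      private final HTable table;

      public UuidDeduper(Configuration conf, String tableName) throws IOException {
        this.table = new HTable(conf, tableName);
      }

      // Returns true if the UUID was not present and has now been recorded;
      // returns false if it was already there, i.e. the event is a duplicate.
      public boolean markIfNew(String uuid) throws IOException {
        byte[] row = Bytes.toBytes(uuid);
        Put put = new Put(row);
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(System.currentTimeMillis()));
        // checkAndPut with a null expected value applies the Put only when the
        // cell does not yet exist; the check and the write are atomic per row.
        return table.checkAndPut(row, FAMILY, QUALIFIER, null, put);
      }
    }

    // Hypothetical usage inside a customized HDFS sink's event loop
    // ("id" is whatever header the UUID interceptor stamped on the event):
    //   if (deduper.markIfNew(event.getHeaders().get("id"))) {
    //     // first time this UUID is seen: write the event to HDFS
    //   } else {
    //     // duplicate: bump a duplicate counter and drop the event
    //   }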

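[For the source-side approach Roshan describes, a hedged configuration
sketch, assuming the UUID interceptor that ships with Flume's morphline
Solr sink module; the agent name "agent" and source name "src1" are
placeholders.]

    # Stamp every event with a unique id header as close to the data origin as
    # possible, so duplicates can be identified downstream (e.g. in Spark
    # Streaming or a custom sink).
    agent.sources.src1.interceptors = uuid
    agent.sources.src1.interceptors.uuid.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
    agent.sources.src1.interceptors.uuid.headerName = id
    agent.sources.src1.interceptors.uuid.preserveExisting = true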