What happens if the write to HDFS succeeds before the HBase put?

-Joey
On Wed, Dec 3, 2014 at 2:35 PM, Mike Keane <[email protected]> wrote:
> We effectively mitigated this problem by using the UUID interceptor and
> customizing the HDFS Sink to do a check-and-put of the UUID to HBase. In the
> customized sink we check HBase to see if we have seen the UUID before; if we
> have, it is a duplicate, so we log a new duplicate metric alongside the
> existing sink metrics and throw the event away. If we have not seen the UUID
> before, we write the event to HDFS and do a put of the UUID to HBase.
>
> Because of our volume, to minimize the number of check/puts to HBase, we put
> multiple logs in a single FlumeEvent.
>
> -Mike
>
> ________________________________________
> From: Guillermo Ortiz [[email protected]]
> Sent: Wednesday, December 03, 2014 4:15 PM
> To: [email protected]
> Subject: Re: Deal with duplicates in Flume with a crash.
>
> I didn't know anything about a Hive Sink; I'll check the JIRA about it,
> thanks. The pipeline is Flume-Kafka-SparkStreaming-XXX.
>
> So I guess I should deal with it in Spark Streaming, right? I guess it would
> be easy to do with a UUID interceptor, or is there an easier way?
>
> 2014-12-03 22:56 GMT+01:00 Roshan Naik <[email protected]>:
>> Using the UUID interceptor at the source closest to data origination will
>> help identify duplicate events after they are delivered.
>>
>> If it satisfies your use case, the upcoming Hive Sink will mitigate the
>> problem a little bit (since it uses transactions to write to the
>> destination).
>>
>> -roshan
>>
>> On Wed, Dec 3, 2014 at 8:44 AM, Joey Echeverria <[email protected]> wrote:
>>> There's nothing built into Flume to deal with duplicates; it only
>>> provides at-least-once delivery semantics.
>>>
>>> You'll have to handle it in your data processing applications or add
>>> an ETL step to deal with duplicates before making the data available
>>> for other queries.
>>>
>>> -Joey
>>>
>>> On Wed, Dec 3, 2014 at 5:46 AM, Guillermo Ortiz <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > I would like to know if there's an easy way to deal with data
>>> > duplication when an agent crashes and resends the same data.
>>> >
>>> > Is there any mechanism in Flume to deal with it?
>>>
>>> --
>>> Joey Echeverria

--
Joey Echeverria
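
[A minimal sketch of the "record this UUID if unseen" step Mike describes,
not his actual sink code, assuming an HBase 0.98-era client API. The table
name "event_uuids", column family "f", and qualifier "seen" are illustrative
placeholders. Whether this is called before or after the HDFS write is
exactly the ordering question Joey raises above.]

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UuidDeduper {
      private static final byte[] FAMILY = Bytes.toBytes("f");
      private static final byte[] QUALIFIER = Bytes.toBytes("seen");

      private final HTable table;

      public UuidDeduper(Configuration conf, String tableName) throws IOException {
        this.table = new HTable(conf, tableName);
      }

      // Returns true if the UUID was not present and has now been recorded;
      // returns false if it was already there, i.e. the event is a duplicate.
      public boolean markIfNew(String uuid) throws IOException {
        byte[] row = Bytes.toBytes(uuid);
        Put put = new Put(row);
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(System.currentTimeMillis()));
        // checkAndPut with a null expected value applies the Put only when the
        // cell does not yet exist; the check and the write are atomic per row.
        return table.checkAndPut(row, FAMILY, QUALIFIER, null, put);
      }
    }

    // Hypothetical usage inside a customized HDFS sink's event loop
    // ("id" is whatever header the UUID interceptor stamped on the event):
    //   if (deduper.markIfNew(event.getHeaders().get("id"))) {
    //     // first time this UUID is seen: write the event to HDFS
    //   } else {
    //     // duplicate: bump a duplicate counter and drop the event
    //   }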

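[For the source-side approach Roshan describes, a hedged configuration
sketch, assuming the UUID interceptor that ships with Flume's morphline
Solr sink module; the agent name "agent" and source name "src1" are
placeholders.]

    # Stamp every event with a unique id header as close to the data origin as
    # possible, so duplicates can be identified downstream (e.g. in Spark
    # Streaming or a custom sink).
    agent.sources.src1.interceptors = uuid
    agent.sources.src1.interceptors.uuid.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
    agent.sources.src1.interceptors.uuid.headerName = id
    agent.sources.src1.interceptors.uuid.preserveExisting = true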