I didn't know anything about a Hive Sink; I'll check the JIRA about it, thanks. The pipeline is Flume-Kafka-SparkStreaming-XXX.

So I guess I should deal with it in Spark Streaming, right? I guess it would be easy to do with a UUID interceptor, or is there an easier way? (Rough sketches of both pieces are below; the Spark Streaming one follows the quoted thread.)
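Tagging events on the Flume side is just interceptor configuration. A minimal sketch, assuming Flume 1.4+ (where the UUID interceptor ships with the morphline-solr-sink module); the agent and source names a1/r1 are placeholders, not from this thread:

# a1/r1 are hypothetical agent/source names
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
# header to stamp with a UUID; events that already carry one are left alone
a1.sources.r1.interceptors.i1.headerName = eventId
a1.sources.r1.interceptors.i1.preserveExisting = true

One caveat: the UUID lands in a Flume event header, so whether it survives the hop into Kafka depends on how the Kafka sink serializes events; if headers are dropped on that hop, the id would have to be embedded in the event body instead.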
2014-12-03 22:56 GMT+01:00 Roshan Naik <[email protected]>:
> Using the UUID interceptor at the source closest to data origination
> will help identify duplicate events after they are delivered.
>
> If it satisfies your use case, the upcoming Hive Sink will mitigate the
> problem a little bit (since it uses transactions to write to the
> destination).
>
> -roshan
>
> On Wed, Dec 3, 2014 at 8:44 AM, Joey Echeverria <[email protected]> wrote:
>>
>> There's nothing built into Flume to deal with duplicates; it only
>> provides at-least-once delivery semantics.
>>
>> You'll have to handle it in your data processing applications or add
>> an ETL step to deal with duplicates before making the data available
>> for other queries.
>>
>> -Joey
>>
>> On Wed, Dec 3, 2014 at 5:46 AM, Guillermo Ortiz <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > I would like to know if there's an easy way to deal with data
>> > duplication when an agent crashes and resends the same data again.
>> >
>> > Is there any mechanism to deal with it in Flume?
>>
>> --
>> Joey Echeverria
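Picking up Joey's point about handling duplicates in the data processing application, here is a minimal dedup sketch for the Spark Streaming stage. Everything in it is illustrative rather than taken from the thread: parseIdAndBody is a hypothetical helper (how the id is carried across the Kafka hop depends on the sink's serialization), and the socket input stands in for the real KafkaUtils stream. It keeps per-id state and emits a payload only in the first batch where its id appears.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object DedupSketch {
  // Hypothetical: split a raw message into (eventId, payload). The real
  // format depends on how the Kafka sink wrote the UUID into the record.
  def parseIdAndBody(raw: String): (String, String) = {
    val Array(id, body) = raw.split("\t", 2)
    (id, body)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("dedup-sketch"), Seconds(10))
    ssc.checkpoint("/tmp/dedup-checkpoint") // stateful ops need a checkpoint dir

    // Stand-in for the real Kafka input (KafkaUtils in an actual job).
    val raw: DStream[String] = ssc.socketTextStream("localhost", 9999)
    val events: DStream[(String, String)] = raw.map(parseIdAndBody)

    // State per id: (payload, alreadyEmitted). A key is emitted only in the
    // batch where it first appears; afterwards the flag flips to true and
    // the key is filtered out, so resent duplicates are suppressed.
    val withState = events.updateStateByKey[(String, Boolean)] {
      (newVals: Seq[String], state: Option[(String, Boolean)]) =>
        state match {
          case Some((payload, _)) => Some((payload, true))        // seen before
          case None               => newVals.headOption.map((_, false)) // first sighting
        }
    }
    val deduped = withState
      .filter { case (_, (_, emitted)) => !emitted }
      .map { case (id, (payload, _)) => (id, payload) }

    deduped.print() // replace with the real sink (the "XXX" stage)
    ssc.start()
    ssc.awaitTermination()
  }
}

Note that the state here grows without bound; a real job would need to expire old ids (for example with a TTL) or the dedup state would eventually exhaust memory.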
