Thanks for the answer. I was talking more about possible failures of a Flume agent. There's a small possibility of getting duplicates even when the source is not producing any: it's true that they should be a very small percentage of the data size, but if the agent crashes you can get duplicates when you start it again.
I guess you need a third player if you want to handle this kind of duplicates, and it's not possible to use a CircularFifoQueue in the same JVM as Flume; that's why I thought about Redis or something similar. Ideally, that system should be independent of Flume and have HA. (Rough sketches of both the Redis check and the in-JVM queue are at the bottom of this mail.)

2015-08-07 13:20 GMT+02:00 Majid Alfifi <[email protected]>:
> It's not clear if you are referring to duplicates that result from the
> source or duplicates that result from Flume itself trying to maintain the
> at-least-once delivery of events.
>
> I had a case where the source was producing duplicates, but the network
> bandwidth was almost fully utilized by the regular de-duplicated stream, so
> we couldn't afford to have duplicates travel all the way to the final
> destination (HDFS in our case). We ultimately just used a CircularFifoQueue
> in a Flume interceptor. It was a good fit because in our case all
> duplicates arrived within roughly a 30-second window. We were receiving
> about 600 events per second, so a CircularFifoQueue of size 18,000, for
> example, was an easy way to remove duplicates, at the expense of having a
> single Flume agent do the de-duplication (a SPOF).
>
> However, we still see duplicates at the final destination that are a
> result of the Flume architecture, or occasional duplicates that arrive more
> than 30 seconds apart from the source, but they were a very small
> percentage of the data size. We had a MapReduce job that removed those
> remaining duplicates in HDFS.
>
> -Majid
>
> > On Aug 7, 2015, at 1:23 PM, Guillermo Ortiz <[email protected]> wrote:
> >
> > Hi,
> >
> > I would like to remove duplicates in Flume with interceptors.
> > The idea is to calculate an MD5 (or similar) of each event and store it
> > in Redis or another database. I just want to check the performance cost
> > and which is the best way of dealing with it.
> >
> > As I understand it, the maximum number of events that could be
> > duplicates depends on the batchSize, so you only need to store that
> > number of keys in your database. I don't know if Redis has a feature
> > like capped collections in Mongo.
> >
> > Has someone done something similar and knows the performance cost? What
> > would be the best place to store the keys for really fast access?
> > Mongo, Redis, ...? I think HBase or Cassandra could be worse, since
> > Redis or something similar could run on the same host as Flume and you
> > don't lose time on the network.
> > Any other solution to deal with duplicates in real time?
> >
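
For the Redis "third player" route, something like this is roughly what I had in mind for the shared check (just a sketch; the Jedis client, the key prefix and the TTL are my assumptions, not tested code):

import redis.clients.jedis.Jedis;

// Sketch only: a shared "have we seen this MD5?" check backed by Redis, so
// the de-dup state lives outside the agent's JVM and survives an agent
// restart, and several agents can share it.
public class RedisDedupCheck {

    private final Jedis jedis;
    private final int ttlSeconds;

    public RedisDedupCheck(String host, int port, int ttlSeconds) {
        this.jedis = new Jedis(host, port);   // one connection per interceptor instance
        this.ttlSeconds = ttlSeconds;
    }

    // Returns true the first time a given MD5 is seen; the event should pass.
    public boolean firstTimeSeen(String md5Hex) {
        String key = "flume:dedup:" + md5Hex;          // key prefix is arbitrary
        boolean first = jedis.setnx(key, "1") == 1L;   // SETNX: only the first writer wins
        if (first) {
            // Expire the key so the set of stored hashes stays bounded,
            // similar in spirit to a capped collection in Mongo.
            jedis.expire(key, ttlSeconds);
        }
        return first;
    }
}

The interceptor would compute the MD5 of the event body, call firstTimeSeen(), and drop the event when it returns false. SETNX followed by EXPIRE is not strictly atomic; Redis' SET command with the NX and EX options does the same thing in one call if that matters. The obvious cost is one Redis round trip per event (or per batch, if you pipeline), which is exactly the performance hit I wanted to measure.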

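And for reference, the in-JVM variant Majid describes could look roughly like this as a Flume interceptor (again only a sketch, assuming commons-collections4; the class name, the queue size and the MD5 keying are mine, not his actual code):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.collections4.queue.CircularFifoQueue;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class FifoDedupInterceptor implements Interceptor {

  // ~30 s of history at ~600 events/s. Not thread-safe; fine for a
  // single-threaded source, otherwise the accesses need synchronization.
  private final CircularFifoQueue<String> recent = new CircularFifoQueue<>(18000);

  @Override
  public void initialize() { }

  @Override
  public Event intercept(Event event) {
    String key = md5Hex(event.getBody());
    if (recent.contains(key)) {   // O(n) scan; acceptable at this queue size
      return null;                // returning null drops the event
    }
    recent.add(key);              // the oldest entry is evicted automatically
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<>(events.size());
    for (Event e : events) {
      Event kept = intercept(e);
      if (kept != null) {
        out.add(kept);
      }
    }
    return out;
  }

  @Override
  public void close() { }

  private static String md5Hex(byte[] body) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(body)) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new FifoDedupInterceptor();
    }

    @Override
    public void configure(Context context) { }
  }
}

Wiring it in would be the usual interceptor config on the source, e.g. (assuming the class ends up in a com.example package):

a1.sources.r1.interceptors = dedup
a1.sources.r1.interceptors.dedup.type = com.example.FifoDedupInterceptor$Builder

but, as Majid points out, it only removes duplicates that a single agent sees, so that agent remains a SPOF.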