It's not clear whether you mean duplicates that come from the source or duplicates 
introduced by Flume itself while maintaining its at-least-once delivery of events.

I had a case where the source was producing duplicates, but the network 
bandwidth was almost fully utilized by the regular de-duplicated stream, so we 
couldn't afford to have duplicates travel all the way to the final destination 
(HDFS in our case). We ultimately just used a CircularFifoQueue in a Flume 
interceptor. It was a good fit because, in our case, all duplicates arrived within 
about a 30-second window. We were receiving about 600 events per second, so a 
CircularFifoQueue of size 18,000, for example, was an easy way to remove 
duplicates, at the expense of funneling everything through a single Flume agent 
to do the de-duplication (a single point of failure).
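
To make that concrete, a minimal sketch of that kind of interceptor is below. It is 
not our actual code: the class name is illustrative, it assumes commons-collections4 
is on the classpath, and it assumes a single source thread is calling it (a 
multi-threaded source would need synchronization around the queue).

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.collections4.queue.CircularFifoQueue;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class DedupInterceptor implements Interceptor {

  // ~600 events/s * ~30 s window => keep the last 18,000 fingerprints in memory
  private final CircularFifoQueue<String> recent = new CircularFifoQueue<>(18000);

  @Override
  public void initialize() { }

  @Override
  public Event intercept(Event event) {
    String fingerprint = md5Hex(event.getBody());
    if (recent.contains(fingerprint)) {   // O(n) scan; acceptable at this window size
      return null;                        // returning null drops the duplicate
    }
    recent.add(fingerprint);              // oldest fingerprint is evicted automatically
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<>(events.size());
    for (Event e : events) {
      Event kept = intercept(e);
      if (kept != null) {
        out.add(kept);
      }
    }
    return out;
  }

  @Override
  public void close() { }

  private static String md5Hex(byte[] body) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(body);
      StringBuilder sb = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new DedupInterceptor();
    }

    @Override
    public void configure(Context context) { }
  }
}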

However, we still saw duplicates at the final destination, caused either by Flume's 
at-least-once architecture or by occasional source duplicates arriving more than 
30 seconds apart, but they were a very small percentage of the data. We had a 
MapReduce job that removed those remaining duplicates in HDFS.
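
That clean-up job was essentially a group-by on a record fingerprint. Something 
along these lines would do it; the real job's keying logic isn't shown in this 
thread, so hashing the whole record is an assumption, and the class names here 
are just for illustration.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {

  // Map each record to (md5-of-record, record)
  public static class DedupMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      try {
        byte[] digest = MessageDigest.getInstance("MD5")
            .digest(value.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        context.write(new Text(hex.toString()), value);
      } catch (java.security.NoSuchAlgorithmException e) {
        throw new IOException(e);
      }
    }
  }

  // Keep only one record per fingerprint
  public static class DedupReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      context.write(values.iterator().next(), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "dedup");
    job.setJarByClass(DedupJob.class);
    job.setMapperClass(DedupMapper.class);
    job.setReducerClass(DedupReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}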

-Majid

> On Aug 7, 2015, at 1:23 PM, Guillermo Ortiz <[email protected]> wrote:
> 
> Hi, 
> 
> I would like to remove duplicates in Flume with interceptors. 
> The idea is to calculate an MD5 (or similar) hash for each event and store it in 
> Redis or another database. I mainly want to measure the performance hit and find 
> out which solution is best for dealing with it. 
> 
> As I understand it, the maximum number of events that could be duplicates depends 
> on the batchSize, so you only need to store that many keys in your 
> database. I don't know if Redis has a feature like capped collections in 
> Mongo.
> 
> Has anyone done something similar and knows what the performance cost is? Where 
> would be the best place to store the keys for really fast access: 
> Mongo, Redis, ...? I think HBase or Cassandra would be worse, since Redis or 
> something similar can run on the same host as Flume, so you don't lose time 
> on the network.
> Any other solution for dealing with duplicates in real time?
> 
> 
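
For reference, a bare-bones sketch of the Redis check Guillermo describes might look 
like the following, using the Jedis client. The key prefix, the TTL, and the wiring 
into an interceptor are assumptions rather than a tested setup; the per-key TTL is 
what keeps the number of stored keys bounded, playing roughly the role of a Mongo 
capped collection.

import java.security.MessageDigest;

import redis.clients.jedis.Jedis;

public class RedisDedupCheck {

  // Redis on the same host as the Flume agent to avoid a network round trip
  private final Jedis jedis = new Jedis("localhost", 6379);

  /** Returns true the first time a body is seen, false for duplicates. */
  public boolean firstSeen(byte[] body) throws Exception {
    String key = "dedup:" + md5Hex(body);
    // SETNX succeeds only if the key is new; the TTL bounds how long (and how many)
    // fingerprints are kept. A crash between the two calls could leave a key
    // without a TTL, which a production version would want to handle.
    if (jedis.setnx(key, "1") == 1) {
      jedis.expire(key, 60);
      return true;
    }
    return false;
  }

  private static String md5Hex(byte[] body) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5").digest(body);
    StringBuilder sb = new StringBuilder(digest.length * 2);
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}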
