It's not clear whether you are referring to duplicates produced by the source or duplicates produced by Flume itself as it maintains its at-least-once delivery of events.
I had a case where the source was producing duplicates, but the network bandwidth was almost fully utilized by the regular de-duplicated stream, so we couldn't afford to have duplicates travel all the way to the final destination (HDFS in our case). We ultimately just used a CircularFifoQueue in a Flume interceptor. It was a good fit because, in our case, all duplicates arrive within roughly a 30-second window. We were receiving about 600 events per second, so a CircularFifoQueue of size 18,000, for example, was an easy way to remove duplicates, at the expense of routing everything through a single de-duplicating Flume agent (a SPOF). A rough sketch of such an interceptor is included after the quoted message below.

However, we still saw duplicates at the final destination, caused either by the Flume architecture itself or by the occasional duplicates that arrive more than 30 seconds apart at the source, but they were a very small percentage of the data size. We had a MapReduce job that removed those remaining duplicates in HDFS.

-Majid

> On Aug 7, 2015, at 1:23 PM, Guillermo Ortiz <[email protected]> wrote:
>
> Hi,
>
> I would like to delete duplicates in Flume with Interceptors.
> The idea is to calculate an MD5 or similar for the event and store it in Redis
> or another database. I just want to check the loss of performance and which
> is the best solution for dealing with it.
>
> As I understand it, the maximum number of events that could be duplicates depends
> on the batchSize. So you only need to store that number of keys in your
> database. I don't know if Redis has a feature like capped collections in Mongo.
>
> Has someone done something similar and knows the loss of performance? Which
> would be the best place to store the keys for really fast access?
> Mongo, Redis, ...? I think that HBase or Cassandra could be worse, since Redis
> or similar could run on the same host as Flume and you don't lose time
> on the network.
> Any other solution to deal with duplicates in real time?
>
>
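Here is a rough sketch of that kind of interceptor, not the exact code we ran in production. It assumes Apache Commons Collections 4 (CircularFifoQueue) and commons-codec are on the Flume classpath, that the event body alone identifies a duplicate, and the class and property names (DedupInterceptor, windowSize) are just placeholders:

package com.example.flume;  // hypothetical package name

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.codec.binary.Hex;
import org.apache.commons.collections4.queue.CircularFifoQueue;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class DedupInterceptor implements Interceptor {

  private final int windowSize;
  // Oldest digests fall off the queue; the set gives O(1) duplicate checks.
  private CircularFifoQueue<String> window;
  private Set<String> seen;

  private DedupInterceptor(int windowSize) {
    this.windowSize = windowSize;
  }

  @Override
  public void initialize() {
    window = new CircularFifoQueue<>(windowSize);
    seen = new HashSet<>(windowSize * 2);
  }

  @Override
  public Event intercept(Event event) {
    String digest = md5Hex(event.getBody());
    if (seen.contains(digest)) {
      return null;                 // drop the duplicate event
    }
    if (window.isAtFullCapacity()) {
      seen.remove(window.poll());  // evict the oldest digest from both structures
    }
    window.add(digest);
    seen.add(digest);
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<>(events.size());
    for (Event e : events) {
      Event kept = intercept(e);
      if (kept != null) {
        out.add(kept);
      }
    }
    return out;
  }

  @Override
  public void close() {
    // nothing to release
  }

  private static String md5Hex(byte[] body) {
    try {
      return Hex.encodeHexString(MessageDigest.getInstance("MD5").digest(body));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  public static class Builder implements Interceptor.Builder {
    private int windowSize = 18000;

    @Override
    public void configure(Context context) {
      // e.g. agent.sources.r1.interceptors.i1.windowSize = 18000
      windowSize = context.getInteger("windowSize", 18000);
    }

    @Override
    public Interceptor build() {
      return new DedupInterceptor(windowSize);
    }
  }
}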

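For the Redis idea in your message, something along these lines could work from inside an interceptor. This is only an untested sketch assuming the Jedis client, with an arbitrary key prefix and a per-key TTL standing in as the capped-collection behaviour you mention (SETNX followed by EXPIRE is not atomic, so a missed expire is possible, but for best-effort de-duplication that is usually acceptable):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.commons.codec.binary.Hex;
import redis.clients.jedis.Jedis;

public class RedisDedup {

  private static final int TTL_SECONDS = 60;  // keep keys only as long as duplicates can arrive
  private final Jedis jedis;

  public RedisDedup(String host, int port) {
    this.jedis = new Jedis(host, port);       // Redis on the Flume host avoids a network hop
  }

  /** Returns true if the event body has not been seen within the TTL window. */
  public boolean firstSeen(byte[] body) {
    String key = "dedup:" + md5Hex(body);
    // SETNX returns 1 only when the key did not exist, i.e. the event is new.
    boolean isNew = jedis.setnx(key, "1") == 1L;
    if (isNew) {
      jedis.expire(key, TTL_SECONDS);          // expire the key so Redis stays bounded
    }
    return isNew;
  }

  private static String md5Hex(byte[] body) {
    try {
      return Hex.encodeHexString(MessageDigest.getInstance("MD5").digest(body));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }
}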