Thanks for the response. I’ve been sitting on this for a few months, as fixing the typo resolved the issue. Still, I must have read your response like 40 times because I just wasn’t getting the logic behind the duplicates. Well, I think I’ve almost got it, but I’m still unclear on something.
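To make the question concrete, here is roughly the shape of config I have in mind. This is only a sketch: the agent/channel/sink names, the spooling-directory source, and the paths are made-up placeholders, not my real setup.

agent1.sources = src1
agent1.channels = mem1
agent1.sinks = roll1

# Channel with a transaction capacity of 25 (for this scenario I'm also
# assuming the overall capacity is 25, so the channel fills up as described below)
agent1.channels.mem1.type = memory
agent1.channels.mem1.capacity = 25
agent1.channels.mem1.transactionCapacity = 25

# Source that hands events to the channel in batches of 50
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/flume/spool
agent1.sources.src1.batchSize = 50
agent1.sources.src1.channels = mem1

# Rolling file sink that tries to take 50 events per transaction
agent1.sinks.roll1.type = file_roll
agent1.sinks.roll1.sink.directory = /var/flume/out
agent1.sinks.roll1.batchSize = 50
agent1.sinks.roll1.channel = mem1

With that shape of config in mind: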
- Let’s say we’re using a Memory Channel and a Rolling File Sink.
- Let’s say the Batch Size for the Source and Sink equals 50.
- Let’s say the Transaction Capacity for the Channel = 25.

So, as Data A1 through A20 come in from the Source, the Rolling File Sink starts writing them to the file immediately instead of waiting for the Sink Batch Size of 50 to be filled. However, A1..A20 are still added to the Take List, since the Sink’s batch of 50 hasn’t been filled and the Transaction is not yet marked as committed.

Now, let’s say there is a traffic surge, and Data A21..A70 are sent to the Source. Data A21..A25 are accepted into the Channel, added to the Take List, and written to the File Sink. The Sink Batch Size of 50 still isn’t reached, and the Transaction still isn’t committed. Now, A26..A70 will not make it into the Channel because it is currently filled by A1..A25, and nothing else will ever get into the Channel: the Batch Size of 50 will never be reached, the Transaction will never be marked as complete, and the Channel will never be emptied.

Where is the part where the Transaction of writing to the Sink fails? Is there some timeout that causes a retry of the Transaction on the Sink side? Will transaction A1..A25 be written to the Sink over and over?

Thanks again for all the help!

Cesar M. Quintana

From: Johny Rufus [mailto:[email protected]]
Sent: Wednesday, June 17, 2015 6:52 PM
To: [email protected]
Subject: Re: Flume duplicating a set of events many (hundreds of) times!

A transaction in Flume consists of 1 or more batches, so the minimum requirement is that your channel’s transaction capacity >= the batchSize of the source/sink. Since Flume supports “at least once” transaction semantics, all events that are part of the current transaction are stored internally in a Take List that Flume maintains, so that in case of transaction failure the events can be put back into the channel.

Typically, when batchSize > transactionCapacity, the transaction will never succeed and will keep retrying. Since not a single batch went through, there should be no duplicates. But RollingFileSink writes every event it takes from the channel immediately, so every time Flume retries the transaction, a partial set of the events in the current transaction/batch still makes it to the destination file, and those events are duplicated when the transaction fails, is rolled back, and is retried.

Thanks,
Rufus

On Wed, Jun 17, 2015 at 4:48 PM, Quintana, Cesar (C) <[email protected]> wrote:

Oh man! Thanks for spotting that. Whoever modified this config must have copied and pasted, because EVERY Memory Channel has the same typo. I’ve corrected it.

Now, I’m still not understanding how having the TransactionCapacity = 100 and the BatchSize = 1000 would cause duplicates. Can someone walk me through that logic?

Thanks for all the help so far. And, FYI, I am RTFMing it as well.

Cesar M. Quintana

From: Hari Shreedharan [mailto:[email protected]]
Sent: Wednesday, June 17, 2015 4:15 PM
To: [email protected]
Subject: Re: Flume duplicating a set of events many (hundreds of) times!

On Wed, Jun 17, 2015 at 3:54 PM, Quintana, Cesar (C) <[email protected]> wrote:

agent1.channels.PASPAChannel.transactionCapactiy = 1000

This line has a typo, so the channel is starting up at the default transaction capacity. Change it to:

agent1.channels.PASPAChannel.transactionCapacity = 1000

Thanks,
Hari
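For completeness, once the typo is fixed, a channel definition that also satisfies the “transaction capacity >= batchSize” rule from Rufus’s reply would look roughly like the sketch below. The capacity value and the PASPASink name are illustrative assumptions, not values taken from the actual config.

agent1.channels.PASPAChannel.type = memory
# overall capacity must be >= transactionCapacity; 10000 is just an example value
agent1.channels.PASPAChannel.capacity = 10000
# correct spelling, and sized to match the batch size used by the sink
agent1.channels.PASPAChannel.transactionCapacity = 1000
# hypothetical sink name; the point is that batchSize <= the channel's transactionCapacity
agent1.sinks.PASPASink.batchSize = 1000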
