[ https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749570#comment-13749570 ]

Gabriel Commeau edited comment on FLUME-2173 at 8/25/13 8:22 AM:
-----------------------------------------------------------------

I would approach the problem from a different angle. The way I see it, there
are two main places where duplicates can occur: when one source writes to
multiple channels (using a replicating channel selector), and when the
"output" of a sink cannot guarantee whether an event has truly been committed
(as you pointed out, for example, HDFS writing the event but then throwing an
exception). I don’t think there is a general solution to the problem of output
systems (e.g. HDFS) that cannot guarantee whether an event is truly committed,
because we’d need to enforce that requirement on 3rd-party systems (relative
to Flume). I see it as a problem to be solved on a case-by-case basis for each
sink.

However, I would like to suggest a solution to the first problem. Here is an
example to illustrate it: suppose an agent has a source that writes to two
(required) channels. As part of a transaction, the channel processor commits
to the first channel, which succeeds, and then to the second channel, which
fails. The whole transaction fails, but the event has already been committed
once to the first channel, so when the transaction is retried, the event is
duplicated.
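
To make the failure mode concrete, here is a minimal sketch of that
single-phase flow. This is a simplification for illustration; the class and
method layout is my assumption, not the actual ChannelProcessor code, though
the Channel/Transaction calls are the standard Flume ones:

    import java.util.List;

    import org.apache.flume.Channel;
    import org.apache.flume.ChannelException;
    import org.apache.flume.Event;
    import org.apache.flume.Transaction;

    class SinglePhaseReplication {
        // Each channel commits independently, so an earlier channel may
        // have committed before a later one fails.
        static void replicate(Event event, List<Channel> requiredChannels) {
            for (Channel channel : requiredChannels) {
                Transaction tx = channel.getTransaction();
                try {
                    tx.begin();
                    channel.put(event);
                    tx.commit();   // the first channel commits here, irrevocably...
                } catch (Throwable t) {
                    tx.rollback(); // ...a failure on a later channel only
                    throw new ChannelException("put failed", t); // rolls that one back
                } finally {
                    tx.close();
                }
            }
            // A retry of the whole call re-delivers the event to every
            // channel that already committed -- hence the duplicate.
        }
    }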
The solution I discussed a few months back with Mike P. was to use a two-phase
commit when writing to channels. This ensures that an event is not actually
committed to any channel if a subsequent channel fails. It does, however,
require an API change on the Channel interface. I would suggest adding a
preparePut method returning a boolean, which would be the “voting” phase; the
put method becomes the commit phase. To keep it backward compatible, we'd
implement preparePut to always return true in AbstractChannel.
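
A rough sketch of what that could look like. Only the preparePut name comes
from the proposal above; the exact signature, the processor loop, and the
rest are my illustration, not existing Flume code:

    import org.apache.flume.Event;

    // Channel.java (sketch of the proposed interface change)
    interface Channel {
        /** Voting phase: return true iff this channel can accept the event. */
        boolean preparePut(Event event);

        /** Commit phase: the existing put, called only after a unanimous vote. */
        void put(Event event);
    }

    // AbstractChannel.java (sketch of the backward-compatible default)
    abstract class AbstractChannel implements Channel {
        @Override
        public boolean preparePut(Event event) {
            return true; // legacy channels always vote yes
        }
    }

    // Channel processor side (sketch): vote across all required channels
    // before any put, so no channel commits unless every channel can.
    class TwoPhaseChannelProcessor {
        static void process(Event event, java.util.List<Channel> channels) {
            for (Channel channel : channels) {
                if (!channel.preparePut(event)) {
                    throw new IllegalStateException("a channel vetoed the event");
                }
            }
            for (Channel channel : channels) {
                channel.put(event); // commit phase
            }
        }
    }

With the always-true default vote, existing channel implementations keep
today's at-least-once behavior, and only channels that implement a real
voting phase gain the stronger guarantee.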

I hope this helps.

                
> Exactly once semantics for Flume
> --------------------------------
>
>                 Key: FLUME-2173
>                 URL: https://issues.apache.org/jira/browse/FLUME-2173
>             Project: Flume
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>
> Currently Flume guarantees only at-least-once semantics. This jira is meant
> to track exactly-once semantics for Flume. My initial idea is to include
> UUID event IDs on events at the original source (using a config to mark a
> source as an original source) and to identify destination sinks. At the
> destination sinks, use a unique ZooKeeper znode to track the events; if an
> event has already been seen (and dedup is configured), pull the duplicate
> out.
> This might need some refactoring, but my belief is that we can do it in a
> backward-compatible way.
