[
https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gabriel Commeau updated FLUME-2173:
-----------------------------------
Comment: was deleted
(was: The way I see it, there are two main places where duplicates can occur:
when using multiple channels for one source, and when the "output" of a sink
cannot guarantee whether the event has truly been committed (as you pointed
out, for example, HDFS writing the event but still throwing an exception).
I find 3 drawbacks to the proposed approach:
1) Creating and removing a node in ZK for every event is going to hurt
performance.
2) My understanding is that this approach will ensure that an event that
"enters" Flume is output only once. I understand "exactly once semantics" as
an effort to remove duplicates that occur unintentionally, not the ones that
the user configures to happen (using a replicating channel selector).
3) I fail to see how using ZK to ensure that only one agent deals with an event
fixes the issue of the output system accepting the event and failing to report
it properly.
Actually, I don’t think there is a solution to the problem of output systems
(e.g. HDFS) that do not guarantee whether the event is truly committed,
because we’d need to enforce this requirement on third-party systems (relative
to Flume).
However, I would like to suggest a solution to the first problem. Here is a
simple example: Pretend an agent has a source that writes to two (required)
channels. As part of a transaction, the channel processor will commit to the
first channel, which succeeds, and then to the second channel, which fails. The
whole transaction will fail, but the event has already been committed once to
the first channel. When the transaction is retried, the event will be
duplicated.
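A minimal sketch of that failure mode follows. The names (ToyChannel, deliver)
are hypothetical and are not Flume's real Channel or ChannelProcessor classes;
the sketch only mimics a sequential commit that is not undone on failure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these types are NOT Flume's real API.
final class DuplicateOnRetrySketch {
    /** Toy channel that rejects puts while `failing` is true. */
    static final class ToyChannel {
        final List<String> committed = new ArrayList<>();
        boolean failing;

        void put(String event) {
            if (failing) {
                throw new IllegalStateException("channel unavailable");
            }
            committed.add(event);
        }
    }

    /** Commits to each channel in turn; earlier commits are not undone on failure. */
    static boolean deliver(String event, List<ToyChannel> channels) {
        try {
            for (ToyChannel channel : channels) {
                channel.put(event);
            }
            return true;
        } catch (IllegalStateException e) {
            return false; // the caller is expected to retry the whole event
        }
    }

    public static void main(String[] args) {
        ToyChannel first = new ToyChannel();
        ToyChannel second = new ToyChannel();
        second.failing = true;
        List<ToyChannel> channels = Arrays.asList(first, second);

        deliver("e1", channels);   // first commits, second fails
        second.failing = false;
        deliver("e1", channels);   // retry: both commit

        System.out.println(first.committed);  // [e1, e1]
        System.out.println(second.committed); // [e1]
    }
}
```

After the retry, the first channel holds the event twice while the second holds
it once.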
The solution I discussed a few months back with Mike P. was to use a 2-phase
commit when writing to channels. This ensures that the events are not actually
committed to a channel if a subsequent channel fails. This, however, will
require an API change on the Channel interface. I would suggest adding a
preparePut method (returning a boolean), which would be the “voting” phase,
while the put method becomes the commit phase. To make it backward compatible,
we'd implement preparePut to always return true in the AbstractChannel.
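A rough sketch of the two-phase put, under stated assumptions: the type and
method names TwoPhaseChannel, BoundedChannel and processEvent are hypothetical,
and only preparePut and put come from the suggestion above (Flume's existing
Channel interface has no preparePut). The voting phase here does not hold a
reservation, which a real implementation would likely need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are Flume's real API.
final class TwoPhasePutSketch {
    /** Proposed shape: preparePut is the voting phase, put is the commit phase. */
    interface TwoPhaseChannel {
        boolean preparePut(String event);
        void put(String event);
    }

    /** Toy channel that votes "no" once it is full. */
    static final class BoundedChannel implements TwoPhaseChannel {
        final List<String> committed = new ArrayList<>();
        private final int capacity;

        BoundedChannel(int capacity) {
            this.capacity = capacity;
        }

        @Override
        public boolean preparePut(String event) {
            // Simplification: the vote does not reserve space for the event.
            return committed.size() < capacity;
        }

        @Override
        public void put(String event) {
            committed.add(event);
        }
    }

    /** Commits to every channel only if all of them vote yes. */
    static boolean processEvent(String event, List<? extends TwoPhaseChannel> channels) {
        for (TwoPhaseChannel channel : channels) {
            if (!channel.preparePut(event)) {
                return false; // nothing has been committed yet
            }
        }
        for (TwoPhaseChannel channel : channels) {
            channel.put(event);
        }
        return true;
    }

    public static void main(String[] args) {
        BoundedChannel first = new BoundedChannel(10);
        BoundedChannel second = new BoundedChannel(0); // always votes no
        boolean ok = processEvent("e1", Arrays.asList(first, second));
        System.out.println(ok + " " + first.committed); // false []
    }
}
```

Because the second channel votes no, the first channel never commits the event,
so a retry cannot produce a duplicate there.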
I hope this helps.
)
> Exactly once semantics for Flume
> --------------------------------
>
> Key: FLUME-2173
> URL: https://issues.apache.org/jira/browse/FLUME-2173
> Project: Flume
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Hari Shreedharan
>
> Currently Flume guarantees only at least once semantics. This jira is meant
> to track exactly once semantics for Flume. My initial idea is to include UUID
> event ids on events at the original source (use a config to mark a source as
> an original source) and identify destination sinks. At the destination sinks,
> use a unique ZK Znode to track the events. If an event has already been seen
> (and this is configured), pull the duplicate out.
> This might need some refactoring, but my belief is we can do this in a
> backward compatible way.