[
https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Juhani Connolly updated FLUME-1030:
-----------------------------------
Attachment: FLUME-1030.4.patch
> Distinguish between temporary and longterm failure to avoid repeated beating
> on dead components
> -----------------------------------------------------------------------------------------------
>
> Key: FLUME-1030
> URL: https://issues.apache.org/jira/browse/FLUME-1030
> Project: Flume
> Issue Type: Improvement
> Components: Sinks+Sources
> Reporter: Juhani Connolly
> Assignee: Juhani Connolly
> Fix For: v1.2.0
>
> Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch,
> FLUME-1030.4.patch
>
>
> One may want to refer to FLUME-984 for some history of this.
> As it stands, a sink can have several outcomes:
> - OK - successfully transferred some data
> - TRY_LATER - no data to transfer
> - throw EventDeliveryException - Give the sink a short breather to recover,
> then try again
> - throw anything else - get logged and more or less ignored
> I don't think the last choice in particular is a good idea, as it encourages
> throwing sink-specific exceptions. Further, there is no distinction between
> temporary loss of connectivity (e.g. HBase timed out because of a compaction
> or something) and more permanent problems (e.g. cannot write to a file).
> One solution to this is to add a second type of exception that delivery
> mechanisms can throw, ConnectivityException/FatalException or something
> similar. For the purposes of any failover/load balancing mechanism this would
> signal that a component is out of order for a more significant amount of time
> and thus constant polling should be stopped (perhaps retry it every 5 minutes
> instead, or have an exponentially increasing retry time).
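A minimal sketch of how a runner might act on the two exception types proposed above. `LongTermFailureException`, `RetryPolicy`, and all delay values are hypothetical names and numbers chosen for illustration. `EventDeliveryException` is redeclared here only as a stand-in so the example compiles without Flume on the classpath.

```java
// Stand-in for Flume's EventDeliveryException so the sketch compiles on its own.
class EventDeliveryException extends Exception {
    EventDeliveryException(String message) { super(message); }
}

// Hypothetical second exception type signalling a longer outage.
class LongTermFailureException extends Exception {
    LongTermFailureException(String message) { super(message); }
}

// Hypothetical policy: a short fixed pause for temporary failures and an
// exponentially growing, capped pause for long-term failures.
class RetryPolicy {
    private final long shortDelayMs;
    private final long baseLongDelayMs;
    private final long maxLongDelayMs;
    private int consecutiveLongTermFailures = 0;

    RetryPolicy(long shortDelayMs, long baseLongDelayMs, long maxLongDelayMs) {
        this.shortDelayMs = shortDelayMs;
        this.baseLongDelayMs = baseLongDelayMs;
        this.maxLongDelayMs = maxLongDelayMs;
    }

    // Call after a successful delivery: resets the backoff state.
    void onSuccess() {
        consecutiveLongTermFailures = 0;
    }

    // Returns how long to wait before polling the sink again.
    long delayAfter(Exception failure) {
        if (failure instanceof LongTermFailureException) {
            consecutiveLongTermFailures++;
            long delay = baseLongDelayMs;
            // Double once per earlier consecutive long-term failure, stopping at the cap.
            for (int i = 1; i < consecutiveLongTermFailures && delay < maxLongDelayMs; i++) {
                delay *= 2;
            }
            return Math.min(delay, maxLongDelayMs);
        }
        // Temporary failure (or anything else): give the sink a short breather.
        return shortDelayMs;
    }
}
```

With this shape, a failover or load-balancing mechanism would only need the returned delay to decide when a sink becomes eligible again. Whether the cap should be the 5 minutes mentioned above is left open.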
> If adding another exception is not deemed acceptable, there is always the
> possibility of expecting SinkProcessors to figure out whether a sink is dead,
> e.g. by counting sequential failures, though I do not think this is ideal. I
> would prefer to see a clear contract defined by SinkRunner that well-behaved
> sinks could adhere to and get the benefits of graceful temporary/longterm
> failure handling from.
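For comparison, the fallback of counting sequential failures could look roughly like the sketch below. `ConsecutiveFailureTracker`, its method names, and the threshold are hypothetical and not part of Flume; sinks are identified by plain strings to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper a sink processor could use to infer that a sink is dead
// from consecutive failures alone, without any new exception type.
class ConsecutiveFailureTracker {
    private final int threshold;
    private final Map<String, Integer> failures = new HashMap<>();

    ConsecutiveFailureTracker(int threshold) {
        this.threshold = threshold;
    }

    // A success clears the streak for that sink.
    void recordSuccess(String sinkName) {
        failures.remove(sinkName);
    }

    // A failure extends the streak for that sink by one.
    void recordFailure(String sinkName) {
        failures.merge(sinkName, 1, Integer::sum);
    }

    // The sink is presumed dead once its streak reaches the threshold.
    boolean isPresumedDead(String sinkName) {
        return failures.getOrDefault(sinkName, 0) >= threshold;
    }
}
```

This heuristic only observes the sink from outside, so it cannot tell a long compaction from an unwritable file. An explicit contract on the sink side would carry that information directly.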
> If someone has other suggestions for distinguishing between temporary and
> longer-term failure, please let me know. As it stands, components that are
> unresponsive can and do get called constantly, and some components trigger
> retries and can actually block a SinkRunner thread for a fair while.