[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Juhani Connolly (Commented) (JIRA) Thu, 15 Mar 2012 01:18:04 -0700

    [ 
https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229974#comment-13229974
 ]


Juhani Connolly commented on FLUME-1030:
----------------------------------------

The method you describe is fine for a processor dealing with a single sink but 
seems a bit vague for multiple sinks that are being balanced or being used for 
failover.

One way of dealing with this with multiple sinks  is to just put sinks that had 
exceptions on a priority list with the time to reactivate them, passing events 
to other sinks until "recovery". Since balancing/failover processors have other 
alternatives, they can just get another sink to deal with it, using longer 
timeouts than would be applied by backoff. Would this be a better way to deal 
with balancing/failover?

This has made me curious of exactly what the intended use of 
EventDeliveryException is now. The distinction between it and other Exceptions 
is pretty blurred now that we just elect to log everything
                
> Distinguish between temporary and longterm failure to avoid repeated beating 
> on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.1.0
>
>
> One may want to refer to FLUME-984 for some history of this.
> As it stands, a sink can have several outcomes:
> - OK - succesfully transferred some data
> - TRY_LATER - no data to transfer
> - throw EventDeliveryException - Give the sink a short breather to recover, 
> then try again
> - throw anything else - get logged and more or less ignored
> I don't think the last choice in particular is a good idea as it encourages 
> throwing Sink specific exceptions. Further, there is no distinction between 
> temporary disconnectivity(e.g. HBase timed out because of a compaction or 
> something), and more permanent problems(e.g. cannot write to a file).
> One solution to this is to add a second type of exception that delivery 
> mechanisms can throw, ConnectivityException/FatalException or something 
> similar. For the purposes of any failover/load balancing mechanism this would 
> signal that a component is out of order for a more significant amount of time 
> and thus constant polling should be stopped(perhaps retry it every 5 minutes 
> instead, or have an exponentially increasing retry time).
> If adding another exception is not deemed acceptable, there is always the 
> possibility of expecting SinkProcessors to figure out if a sink is dead... 
> E.g. counting sequential failures, though I do not think this is ideal. I 
> would prefer to see a clear contract defined by SinkRunner that well behaved 
> sinks could adhere to and get the benefits of graceful temporary/longterm 
> failure from.
> If someone has other suggestions for distinguishing between temporary and 
> longer term failure please let me know. As it stands, components that are 
> unresponsive can and do get called constantly, and some components trigger 
> retries and can actually block a SinkRunner thread for a fair while.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Reply via email to