[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

Arvind Prabhakar (Commented) (JIRA) Mon, 19 Mar 2012 11:00:03 -0700

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232773#comment-13232773 ]


Arvind Prabhakar commented on FLUME-1030:
-----------------------------------------

bq. One way of dealing with this with multiple sinks is to just put sinks that 
had exceptions on a priority list with the time to reactivate them, passing 
events to other sinks until "recovery". Since balancing/failover processors 
have other alternatives, they can just get another sink to deal with it, using 
longer timeouts than would be applied by backoff. Would this be a better way to 
deal with balancing/failover?

Yes - that makes sense to me and will be a deterministic solution regardless of 
the underlying problem.
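
A minimal sketch of the approach quoted above, assuming sinks are identified by 
name and kept in priority order. SinkBackoffTracker and its methods are 
hypothetical names used for illustration only, not existing Flume classes, and 
the fixed penalty stands in for whatever timeout policy is eventually chosen:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Flume): tracks sinks that are backing off. */
class SinkBackoffTracker {
    private final List<String> sinks;                               // priority order
    private final Map<String, Long> reactivateAt = new HashMap<>(); // sink -> wake-up time
    private final long penaltyMillis;

    SinkBackoffTracker(List<String> sinksInPriorityOrder, long penaltyMillis) {
        this.sinks = new ArrayList<>(sinksInPriorityOrder);
        this.penaltyMillis = penaltyMillis;
    }

    /** Record a failure: the sink is skipped until nowMillis + penaltyMillis. */
    void markFailed(String sink, long nowMillis) {
        reactivateAt.put(sink, nowMillis + penaltyMillis);
    }

    /** Clear any pending penalty after a successful delivery. */
    void markSucceeded(String sink) {
        reactivateAt.remove(sink);
    }

    /** First sink whose penalty has expired, or null if all are backing off. */
    String nextAvailable(long nowMillis) {
        for (String sink : sinks) {
            Long until = reactivateAt.get(sink);
            if (until == null || until <= nowMillis) {
                return sink;
            }
        }
        return null;
    }
}
```

A failover or load-balancing processor could call nextAvailable() before each 
delivery attempt and markFailed() whenever the chosen sink throws. The current 
time is passed in explicitly so the behaviour is easy to test.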

bq. This has made me curious about exactly what the intended use of 
EventDeliveryException is now. The distinction between it and other Exceptions 
is pretty blurred now that we just elect to log everything.

As the name suggests, it indicates a failure to relay the event to its next 
hop destination. Ideally it should wrap the causal exception, which gives more 
of a clue as to what may have gone wrong. Eventually, if we see a pattern of 
failures emerging from this, we should modify the responsible component to 
deal with it rather than adding that logic to the exception handling code.
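
To illustrate carrying the causal exception, here is a small sketch. The nested 
EventDeliveryException is a local stand-in so that the example compiles without 
Flume on the classpath, and writeToNextHop() is a hypothetical method that 
always fails:

```java
import java.io.IOException;

class CauseWrappingExample {
    /** Local stand-in for Flume's EventDeliveryException (assumption for this sketch). */
    static class EventDeliveryException extends Exception {
        EventDeliveryException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical next-hop write that always fails. */
    static void writeToNextHop(String event) throws IOException {
        throw new IOException("connection reset");
    }

    /** Relays the event, wrapping any low-level failure with its cause. */
    static void deliver(String event) throws EventDeliveryException {
        try {
            writeToNextHop(event);
        } catch (IOException e) {
            // Keep the original exception as the cause so the log shows what went wrong.
            throw new EventDeliveryException("Failed to deliver event to next hop", e);
        }
    }
}
```

A caller that catches the wrapper can still reach the underlying failure through 
getCause().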
                
> Distinguish between temporary and longterm failure to avoid repeated beating 
> on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>
> One may want to refer to FLUME-984 for some history of this.
> As it stands, a sink can have several outcomes:
> - OK - successfully transferred some data
> - TRY_LATER - no data to transfer
> - throw EventDeliveryException - Give the sink a short breather to recover, 
> then try again
> - throw anything else - get logged and more or less ignored
> I don't think the last choice in particular is a good idea as it encourages 
> throwing Sink specific exceptions. Further, there is no distinction between 
> temporary loss of connectivity (e.g. HBase timed out because of a compaction 
> or something), and more permanent problems (e.g. cannot write to a file).
> One solution to this is to add a second type of exception that delivery 
> mechanisms can throw, ConnectivityException/FatalException or something 
> similar. For the purposes of any failover/load balancing mechanism this would 
> signal that a component is out of order for a more significant amount of time 
> and thus constant polling should be stopped (perhaps retry it every 5 minutes 
> instead, or have an exponentially increasing retry time).
> If adding another exception is not deemed acceptable, there is always the 
> possibility of expecting SinkProcessors to figure out if a sink is dead... 
> E.g. counting sequential failures, though I do not think this is ideal. I 
> would prefer to see a clear contract defined by SinkRunner that well behaved 
> sinks could adhere to and get the benefits of graceful temporary/longterm 
> failure from.
> If someone has other suggestions for distinguishing between temporary and 
> longer term failure please let me know. As it stands, components that are 
> unresponsive can and do get called constantly, and some components trigger 
> retries and can actually block a SinkRunner thread for a fair while.
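
The "exponentially increasing retry time" mentioned in the description could be 
computed roughly as follows. This is only a sketch: RetryDelay is a hypothetical 
name, the doubling factor is an assumption, and baseMillis and maxMillis are 
expected to be positive:

```java
/** Hypothetical helper (not part of Flume): capped exponential retry delay. */
class RetryDelay {
    /**
     * Delay before the next retry: baseMillis doubled once per failure after
     * the first, never exceeding maxMillis. Zero failures means no delay.
     */
    static long delayMillis(int consecutiveFailures, long baseMillis, long maxMillis) {
        if (consecutiveFailures <= 0) {
            return 0L;
        }
        long delay = baseMillis;
        for (int i = 1; i < consecutiveFailures && delay < maxMillis; i++) {
            // Jump straight to the cap when doubling would pass it (also avoids overflow).
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return Math.min(delay, maxMillis);
    }
}
```

With a one-second base and a cap of 300000 ms, a sink that keeps failing ends up 
being retried once every 5 minutes, which matches the fixed interval suggested 
in the description.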

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
