Distinguish between temporary and longterm failure to avoid repeated beating on 
dead components
-----------------------------------------------------------------------------------------------

                 Key: FLUME-1030
                 URL: https://issues.apache.org/jira/browse/FLUME-1030
             Project: Flume
          Issue Type: Improvement
          Components: Sinks+Sources
            Reporter: Juhani Connolly
            Assignee: Juhani Connolly
             Fix For: v1.1.0


One may want to refer to FLUME-984 for some history of this.

As it stands, a sink can have several outcomes:
- OK - successfully transferred some data
- TRY_LATER - no data to transfer
- throw EventDeliveryException - give the sink a short breather to recover, 
then try again
- throw anything else - gets logged and more or less ignored
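
To make the four outcomes concrete, here is a minimal sketch of how a runner 
might dispatch on them. The types below (Status, Sink, RunnerSketch and a local 
EventDeliveryException) are hypothetical stand-ins named after the list above, 
not Flume's actual classes, and the returned strings only label the action 
taken.

```java
// Hypothetical stand-ins that mirror the outcomes listed above; not Flume's real classes.
enum Status { OK, TRY_LATER }

class EventDeliveryException extends Exception {
    EventDeliveryException(String message) { super(message); }
}

interface Sink {
    Status process() throws EventDeliveryException;
}

class RunnerSketch {
    /** Maps a single process() call to the action a runner would take next. */
    static String pollOnce(Sink sink) {
        try {
            Status status = sink.process();
            return status == Status.OK ? "poll again immediately" : "back off: no data";
        } catch (EventDeliveryException e) {
            // Treated as a temporary problem: short pause, then retry.
            return "short pause, then retry";
        } catch (Exception e) {
            // Anything else is only logged, so the sink is polled again as usual.
            return "log and carry on";
        }
    }
}
```

Note that nothing in this dispatch lets a sink say "I will be down for a 
while", which is the gap this issue is about.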

I don't think the last choice in particular is a good idea, as it encourages 
throwing sink-specific exceptions. Further, there is no distinction between 
temporary loss of connectivity (e.g. HBase timed out because of a compaction 
or something) and more permanent problems (e.g. cannot write to a file).

One solution to this is to add a second type of exception that delivery 
mechanisms can throw: ConnectivityException, FatalException or something 
similar. For the purposes of any failover/load balancing mechanism, this would 
signal that a component is out of order for a more significant amount of time 
and thus constant polling should be stopped (perhaps retry it every 5 minutes 
instead, or have an exponentially increasing retry time).
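
As a rough illustration of this proposal, the sketch below pairs a second 
exception type with an exponentially increasing retry delay that is capped at a 
maximum. FatalException, RetryBackoff and delayFor are hypothetical names 
chosen for this example, and the 1 second / 5 minute values are only 
assumptions taken from the "every 5 minutes" suggestion above.

```java
// Hypothetical exception type; the issue only proposes "FatalException or something similar".
class FatalException extends Exception {
    FatalException(String message) { super(message); }
}

/** Exponentially increasing retry delay, capped at maxDelayMs. */
class RetryBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private long currentDelayMs;

    RetryBackoff(long initialDelayMs, long maxDelayMs) {
        this.initialDelayMs = Math.min(initialDelayMs, maxDelayMs);
        this.maxDelayMs = maxDelayMs;
        this.currentDelayMs = this.initialDelayMs;
    }

    /** Returns the wait before the next attempt and doubles the following one. */
    long onFatalFailure() {
        long delay = currentDelayMs;
        currentDelayMs = Math.min(currentDelayMs * 2, maxDelayMs);
        return delay;
    }

    /** A successful delivery resets the delay to its initial value. */
    void onSuccess() {
        currentDelayMs = initialDelayMs;
    }

    /** Long-term failures use the growing delay; anything else gets the short pause. */
    static long delayFor(Exception failure, RetryBackoff backoff, long shortPauseMs) {
        return (failure instanceof FatalException) ? backoff.onFatalFailure() : shortPauseMs;
    }
}
```

With new RetryBackoff(1000, 300000), successive long-term failures would wait 
1 s, 2 s, 4 s and so on, levelling off at 5 minutes until a success resets the 
delay.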

If adding another exception is not deemed acceptable, there is always the 
possibility of expecting SinkProcessors to figure out whether a sink is dead, 
e.g. by counting sequential failures, though I do not think this is ideal. I 
would prefer to see a clear contract defined by SinkRunner that well-behaved 
sinks could adhere to and get the benefits of graceful temporary/long-term 
failure handling from.
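
For comparison, the fallback of counting sequential failures could look 
something like the sketch below. FailureCounter and its threshold are 
hypothetical and not part of Flume; the sketch also shows the weakness of this 
approach, since the processor has to guess a threshold instead of being told by 
the sink what kind of failure occurred.

```java
/** Hypothetical per-sink counter that a SinkProcessor could keep. */
class FailureCounter {
    private final int threshold;
    private int sequentialFailures = 0;

    FailureCounter(int threshold) {
        this.threshold = threshold;
    }

    void recordFailure() {
        sequentialFailures++;
    }

    /** Any success breaks the run of failures. */
    void recordSuccess() {
        sequentialFailures = 0;
    }

    /** The sink is presumed dead once it has failed threshold times in a row. */
    boolean presumedDead() {
        return sequentialFailures >= threshold;
    }
}
```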

If someone has other suggestions for distinguishing between temporary and 
longer-term failure, please let me know. As it stands, components that are 
unresponsive can and do get called constantly, and some components trigger 
retries and can actually block a SinkRunner thread for a fair while.


        
