[jira] [Commented] (FLUME-1030) Distinguish between temporary and longterm failure to avoid repeated beating on dead components

[email protected] (Commented) (JIRA) Thu, 22 Mar 2012 22:22:52 -0700

    [ https://issues.apache.org/jira/browse/FLUME-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236342#comment-13236342 ]

[email protected] commented on FLUME-1030:
------------------------------------------------------

bq.  On 2012-03-23 03:48:14, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, line 193
bq.  > <https://reviews.apache.org/r/4445/diff/2/?file=94789#file94789line193>
bq.  >
bq.  >     Log the exception.

Done. I also changed EventDeliveryException to Exception.

bq.  On 2012-03-23 03:48:14, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, line 182
bq.  > <https://reviews.apache.org/r/4445/diff/2/?file=94789#file94789line182>
bq.  >
bq.  >     It will be good to log this exception so we have a trace of the failures that are happening.

I was only using it to guard against null and number exceptions at the same 
time. I have separated the two cases and now check for null explicitly. The 
number exception is logged, since it most likely means a typo in the config.
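
As a rough illustration of separating the two guards, here is a minimal sketch. 
It is not the code from the patch: a plain Map stands in for the Flume 
configuration, and the class name ConfigInts, the method readInt, and the 
fallback behaviour are all assumptions made for the example.

```java
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: a plain Map stands in for the sink processor's configuration.
class ConfigInts {
    private static final Logger LOG = Logger.getLogger(ConfigInts.class.getName());

    // Returns the parsed value, or fallback when the key is absent or malformed.
    static int readInt(Map<String, String> conf, String key, int fallback) {
        String raw = conf.get(key);
        if (raw == null) {
            // A missing key is an expected case, so fall back without logging.
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // A malformed number is most likely a typo in the config, so leave a trace.
            LOG.log(Level.WARNING,
                    "Invalid value '" + raw + "' for " + key + ", using " + fallback, e);
            return fallback;
        }
    }
}
```

Handling the null case explicitly keeps the log quiet for keys that are simply 
not set, while a value that fails to parse still shows up in the log.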

bq.  On 2012-03-23 03:48:14, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, line 138
bq.  > <https://reviews.apache.org/r/4445/diff/2/?file=94789#file94789line138>
bq.  >
bq.  >     This seems like a typo. Perhaps you want to do something like
bq.  >     
bq.  >     maxPenalty = context.getInteger(CONF_KEY_MAX_PENALTY, DEFAULT_MAX_PENALTY);
bq.  >     
bq.  >

Yes, probably not enough sleep. I checked in the debugger that the values are 
now being set and read correctly.

bq.  On 2012-03-23 03:48:14, Arvind Prabhakar wrote:
bq.  > flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java, lines 94-95
bq.  > <https://reviews.apache.org/r/4445/diff/2/?file=94789#file94789line94>
bq.  >
bq.  >     The max penalty calculated during configure of the sink processor should be applied here to enforce the ceiling.

Done
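
For readers following along, enforcing such a ceiling could look roughly like 
the sketch below. The class PenaltyCalculator, its fields, and the linear 
growth rule are assumptions for illustration only; the patch may compute the 
penalty differently.

```java
// Hypothetical sketch: names and the linear growth rule are assumptions.
class PenaltyCalculator {
    private final long stepMillis;
    private final long maxPenaltyMillis;

    PenaltyCalculator(long stepMillis, long maxPenaltyMillis) {
        if (stepMillis <= 0 || maxPenaltyMillis < 0) {
            throw new IllegalArgumentException("step must be positive and max non-negative");
        }
        this.stepMillis = stepMillis;
        this.maxPenaltyMillis = maxPenaltyMillis;
    }

    // The penalty grows with each sequential failure but never exceeds the ceiling.
    long penaltyFor(int sequentialFailures) {
        if (sequentialFailures <= 0) {
            return 0;
        }
        // Comparing against the quotient first avoids overflow in the multiplication.
        if (sequentialFailures > maxPenaltyMillis / stepMillis) {
            return maxPenaltyMillis;
        }
        return stepMillis * sequentialFailures;
    }
}
```

With a step of 1000 ms and a ceiling of 30000 ms, the first failure yields a 
1000 ms penalty and any failure count above 30 stays at 30000 ms.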

On 2012-03-23 03:48:14, Juhani Connolly wrote:
bq.  > Rest of the changes look good to me.

Wow, what a mess. Must've been tired or something.

The name of the max limit also had a trailing period, which I removed.

- Juhani

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4445/#review6265
-----------------------------------------------------------

On 2012-03-23 02:41:03, Juhani Connolly wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4445/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-03-23 02:41:03)
bq.  
bq.  
bq.  Review request for Flume.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  As discussed in the JIRA item, I modified FailoverSink to deal with all exceptions.
bq.  A sink that fails is now put onto a failed links queue, from which a recovery is attempted after a timeout. The timeout increases with each sequential failure. I am open to other methods of increasing the timeout (maybe add a ceiling?).
bq.  
bq.  
bq.  This addresses bug FLUME-1030.
bq.      https://issues.apache.org/jira/browse/FLUME-1030
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    flume-ng-core/src/test/java/org/apache/flume/sink/TestFailoverSinkProcessor.java 195c121 
bq.    flume-ng-core/src/main/java/org/apache/flume/sink/FailoverSinkProcessor.java 7eada57 
bq.  
bq.  Diff: https://reviews.apache.org/r/4445/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Modified the test for the new functionality; the new test passes.
bq.  
bq.  No other tests should be affected, but my environment was having some weird problems. I'll look into them tomorrow and confirm that the tests pass. I am leaving this up in the meantime so people can have a browse.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Juhani
bq.  
bq.
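
The queue described in the review summary above can be sketched roughly as 
follows. This is not the patch itself: the class FailedSinkQueue, its nested 
FailedSink type, and the rule of adding one step of waiting per sequential 
failure are assumptions, sink names stand in for real sinks, and the ceiling 
is left out to keep the example short.

```java
import java.util.PriorityQueue;

// Hypothetical sketch, not the patch itself. Sink names stand in for real sinks,
// and the clock is passed in as a parameter to keep the example deterministic.
class FailedSinkQueue {

    static final class FailedSink implements Comparable<FailedSink> {
        final String name;
        int sequentialFailures;
        long retryAtMillis;

        FailedSink(String name) {
            this.name = name;
        }

        @Override
        public int compareTo(FailedSink other) {
            // The sink whose timeout expires first sits at the head of the queue.
            return Long.compare(retryAtMillis, other.retryAtMillis);
        }
    }

    private final PriorityQueue<FailedSink> failed = new PriorityQueue<>();
    private final long stepMillis;

    FailedSinkQueue(long stepMillis) {
        this.stepMillis = stepMillis;
    }

    // Call only for a sink that is not currently queued: either an active sink
    // that just failed, or one returned by pollRecoverable whose retry failed.
    void recordFailure(FailedSink sink, long nowMillis) {
        sink.sequentialFailures++;
        // Assumed growth rule: one extra step of waiting per sequential failure.
        sink.retryAtMillis = nowMillis + stepMillis * sink.sequentialFailures;
        failed.add(sink);
    }

    // Returns a sink whose timeout has expired, or null if none is due yet.
    FailedSink pollRecoverable(long nowMillis) {
        FailedSink head = failed.peek();
        if (head == null || head.retryAtMillis > nowMillis) {
            return null;
        }
        return failed.poll();
    }

    // Call when a retried sink delivers successfully, so its next failure starts over.
    static void recordSuccess(FailedSink sink) {
        sink.sequentialFailures = 0;
    }
}
```

A processor would call pollRecoverable before each delivery attempt, retry 
the returned sink if there is one, and call recordFailure again if the retry 
fails, which pushes the next attempt further out.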

> Distinguish between temporary and longterm failure to avoid repeated beating 
> on dead components
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLUME-1030
>                 URL: https://issues.apache.org/jira/browse/FLUME-1030
>             Project: Flume
>          Issue Type: Improvement
>          Components: Sinks+Sources
>            Reporter: Juhani Connolly
>            Assignee: Juhani Connolly
>             Fix For: v1.2.0
>
>         Attachments: FLUME-1030.2.patch, FLUME-1030.3.patch
>
>
> One may want to refer to FLUME-984 for some history of this.
> As it stands, a sink can have several outcomes:
> - OK - successfully transferred some data
> - TRY_LATER - no data to transfer
> - throw EventDeliveryException - Give the sink a short breather to recover, 
> then try again
> - throw anything else - get logged and more or less ignored
> I don't think the last choice in particular is a good idea as it encourages 
> throwing Sink-specific exceptions. Further, there is no distinction between 
> temporary disconnectivity (e.g. HBase timed out because of a compaction or 
> something) and more permanent problems (e.g. cannot write to a file).
> One solution to this is to add a second type of exception that delivery 
> mechanisms can throw, ConnectivityException/FatalException or something 
> similar. For the purposes of any failover/load balancing mechanism this would 
> signal that a component is out of order for a more significant amount of time 
> and thus constant polling should be stopped (perhaps retry it every 5 minutes 
> instead, or have an exponentially increasing retry time).
> If adding another exception is not deemed acceptable, there is always the 
> possibility of expecting SinkProcessors to figure out if a sink is dead... 
> E.g. counting sequential failures, though I do not think this is ideal. I 
> would prefer to see a clear contract defined by SinkRunner that well-behaved 
> sinks could adhere to and get the benefits of graceful temporary/longterm 
> failure from.
> If someone has other suggestions for distinguishing between temporary and 
> longer term failure please let me know. As it stands, components that are 
> unresponsive can and do get called constantly, and some components trigger 
> retries and can actually block a SinkRunner thread for a fair while.
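
To make the proposal in the description above more concrete, here is one way 
the two kinds of failure could be told apart by exception type. Everything in 
it is an assumption for illustration: LongTermFailureException is a 
hypothetical name, the nested EventDeliveryException is only a stand-in for 
the Flume class, the 5-second breather is arbitrary, and treating unknown 
exceptions as long-term failures is one possible policy, not something the 
issue has decided.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of classifying failures by exception type.
class FailureClassifier {

    // Stand-in for Flume's EventDeliveryException: a temporary problem.
    static class EventDeliveryException extends Exception {
        EventDeliveryException(String message) {
            super(message);
        }
    }

    // Hypothetical second exception type: the component is out of order for a while.
    static class LongTermFailureException extends Exception {
        LongTermFailureException(String message) {
            super(message);
        }
    }

    // Assumed value for the "short breather"; not taken from the issue.
    static final long SHORT_BACKOFF_MILLIS = 5_000L;
    // The issue mentions retrying every 5 minutes as one option.
    static final long LONG_BACKOFF_MILLIS = 5L * 60L * 1000L;

    // Runs one delivery attempt and returns how long to wait before the next one.
    static long backoffAfter(Callable<?> delivery) {
        try {
            delivery.call();
            return 0L;
        } catch (EventDeliveryException e) {
            return SHORT_BACKOFF_MILLIS;
        } catch (LongTermFailureException e) {
            return LONG_BACKOFF_MILLIS;
        } catch (Exception e) {
            // Assumed policy: anything unexpected is treated as a long-term failure
            // instead of being logged and ignored.
            return LONG_BACKOFF_MILLIS;
        }
    }
}
```

A fixed long backoff could be replaced by an exponentially increasing one, as 
the description also suggests.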

