[ 
https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354068#comment-14354068
 ] 

Jun Rao commented on KAFKA-1461:
--------------------------------

[~sriharsha], thanks for the patch. Managing the backoff per partition is a bit 
more complicated than I was expecting. The most common case that we want to 
handle here is that the fetcher is trying to fetch from a broker that's already 
down. In this case, the simplest approach is to just back off the fetcher (for 
all partitions) a bit. 

Another common case is that we are doing a controlled shutdown by moving the 
leaders off a broker one at a time. The fetcher may get a NotLeader error 
code for some partitions. In this case, it's less critical to remove those 
partitions from the fetcher since those partitions will be removed from the 
fetcher quickly by the leaderAndIsrRequests from the controller.

My concern with managing the backoff at the partition level is that if the 
backoff is out of sync among the partitions, it may happen that different 
partitions become active at slightly different times and the fetcher doesn't 
actually back off. Also, the code becomes more complicated.

So, my recommendation is the following.
(1) Add the backoff config for the replica fetcher.
(2) In AbstractFetcherThread, simply back off for the configured time if 
it hits an exception when doing a fetch.
(3) In order to shut down AbstractFetcherThread quickly, the backoff can be 
implemented by waiting on a new condition. We will signal that new condition 
during the shutdown.
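
To make (2) and (3) concrete, here is a rough sketch in plain Java. It is not 
the actual AbstractFetcherThread code and not the patch. The class name 
BackoffFetcherSketch, the Fetch interface, and the backoffMs parameter are all 
made up for illustration; only the java.util.concurrent calls are real. The 
idea is that a failed fetch waits on a condition for the configured time, and 
shutdown signals that condition so the thread exits without finishing the wait.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch only; not Kafka's AbstractFetcherThread. */
public class BackoffFetcherSketch extends Thread {
    /** Stand-in for one fetch request to the leader broker (assumed name). */
    public interface Fetch { void run() throws Exception; }

    private final Fetch fetch;
    private final long backoffMs;   // assumed to come from the new backoff config in (1)
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition shutdownCond = lock.newCondition();
    private boolean running = true; // guarded by lock

    public BackoffFetcherSketch(Fetch fetch, long backoffMs) {
        this.fetch = fetch;
        this.backoffMs = backoffMs;
    }

    @Override
    public void run() {
        while (isRunning()) {
            try {
                fetch.run();
            } catch (Exception e) {
                // (2): any fetch failure backs off the whole fetcher, all partitions.
                backoff();
            }
        }
    }

    private boolean isRunning() {
        lock.lock();
        try { return running; } finally { lock.unlock(); }
    }

    /** (3): wait on the condition instead of sleeping, so shutdown can cut it short. */
    private void backoff() {
        lock.lock();
        try {
            long remainingNanos = TimeUnit.MILLISECONDS.toNanos(backoffMs);
            // Loop guards against spurious wakeups; stops early once shutdown is requested.
            while (running && remainingNanos > 0) {
                remainingNanos = shutdownCond.awaitNanos(remainingNanos);
            }
        } catch (InterruptedException ie) {
            // Treat an interrupt as a shutdown request and preserve the interrupt flag.
            Thread.currentThread().interrupt();
            running = false;
        } finally {
            lock.unlock();
        }
    }

    /** Clears the running flag and wakes the thread if it is currently backing off. */
    public void shutdown() {
        lock.lock();
        try {
            running = false;
            shutdownCond.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

Because the whole thread waits once per failed fetch, there is a single 
backoff timer and nothing for the partitions to drift out of sync on. A plain 
Thread.sleep would also back off, but shutdown would then have to interrupt 
the thread or wait out the remaining time.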

> Replica fetcher thread does not implement any back-off behavior
> ---------------------------------------------------------------
>
>                 Key: KAFKA-1461
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1461
>             Project: Kafka
>          Issue Type: Improvement
>          Components: replication
>    Affects Versions: 0.8.1.1
>            Reporter: Sam Meder
>            Assignee: Sriharsha Chintalapani
>              Labels: newbie++
>             Fix For: 0.8.3
>
>         Attachments: KAFKA-1461.patch
>
>
> The current replica fetcher thread will retry in a tight loop if any error 
> occurs during the fetch call. For example, we've seen cases where the fetch 
> continuously throws a connection refused exception leading to several replica 
> fetcher threads that spin in a pretty tight loop.
> To a much lesser degree this is also an issue in the consumer fetcher thread, 
> although the fact that erroring partitions are removed so a leader can be 
> re-discovered helps some.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
