[ https://issues.apache.org/jira/browse/KAFKA-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354068#comment-14354068 ]
Jun Rao commented on KAFKA-1461:
--------------------------------

[~sriharsha], thanks for the patch. Managing the backoff per partition is a bit more complicated than I was expecting. The most common case we want to handle here is a fetcher trying to fetch from a broker that's already down. In this case, the simplest approach is to just back off the fetcher (for all partitions) a bit. Another common case is a controlled shutdown, where the leaders are moved off a broker one at a time. The fetcher may get a NotLeader error code for some partitions. In this case, it's less critical to remove those partitions from the fetcher, since the leaderAndIsrRequests from the controller will remove them quickly anyway.

My concern with managing the backoff at the partition level is that if the backoff gets out of sync among the partitions, different partitions may become active at slightly different times and the fetcher never actually backs off. Also, the code becomes more complicated.

So, my recommendation is the following. (1) Add a backoff config for the replica fetcher. (2) In AbstractFetcherThread, simply back off for the configured time whenever a fetch hits an exception. (3) In order to still shut down AbstractFetcherThread quickly, implement the backoff by waiting on a new condition, and signal that condition during shutdown.

> Replica fetcher thread does not implement any back-off behavior
> ---------------------------------------------------------------
>
>                 Key: KAFKA-1461
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1461
>             Project: Kafka
>          Issue Type: Improvement
>          Components: replication
>    Affects Versions: 0.8.1.1
>            Reporter: Sam Meder
>            Assignee: Sriharsha Chintalapani
>              Labels: newbie++
>             Fix For: 0.8.3
>
>         Attachments: KAFKA-1461.patch
>
>
> The current replica fetcher thread will retry in a tight loop if any error
> occurs during the fetch call.
> For example, we've seen cases where the fetch
> continuously throws a connection refused exception, leading to several replica
> fetcher threads that spin in a pretty tight loop.
> To a much lesser degree this is also an issue in the consumer fetcher thread,
> although the fact that erroring partitions are removed so a leader can be
> re-discovered helps some.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
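The interruptible backoff suggested in steps (1)-(3) above can be sketched in Java with a CountDownLatch, whose timed await doubles as a condition that shutdown can signal. This is a minimal illustration, not the actual Kafka patch; the class and method names (BackoffFetcher, backoff, shutdown) and the config value are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of step (3): back off by waiting on a condition that is
// signaled at shutdown, so the thread never sleeps past a shutdown request.
class BackoffFetcher {
    private final CountDownLatch shutdownLatch = new CountDownLatch(1);
    private final long backoffMs; // step (1): configured backoff time

    BackoffFetcher(long backoffMs) {
        this.backoffMs = backoffMs;
    }

    // Step (2): called after a fetch throws an exception. Waits up to
    // backoffMs, but returns immediately (true) if shutdown was signaled.
    boolean backoffOrShutdown() throws InterruptedException {
        return shutdownLatch.await(backoffMs, TimeUnit.MILLISECONDS);
    }

    // Signals the condition so any in-progress backoff ends right away.
    void shutdown() {
        shutdownLatch.countDown();
    }
}
```

A fetcher loop would then call backoffOrShutdown() in its catch block and exit the loop when it returns true, which is what makes the thread stop quickly even in the middle of a long backoff.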