Justin Downing created KAFKA-5116:
-------------------------------------

             Summary: Controller updates to ISR holds the controller lock for a 
very long time
                 Key: KAFKA-5116
                 URL: https://issues.apache.org/jira/browse/KAFKA-5116
             Project: Kafka
          Issue Type: Bug
          Components: controller
    Affects Versions: 0.10.2.0, 0.10.1.0
            Reporter: Justin Downing
             Fix For: 0.11.0.0


Hello!

Lately, we have noticed slow (or no) results when monitoring the broker's ISR 
using JMX. Many of these requests appear to be 'hung' for a very long time 
(e.g. >2m). We dug into this and found that, in our case, the controllerLock 
can sometimes be held for multiple minutes in the IsrChangeNotifier callback.

Inside the lock, we are reading from Zookeeper for *each* partition in the 
changeset. With a large changeset (e.g. >500 partitions), [this 
operation|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1347]
 can take a long time to complete.
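To make the cost concrete, here is a minimal sketch of the pattern we are describing. The names (controllerLock, readIsrFromZk) are illustrative stand-ins, not the actual Kafka identifiers; the point is that one blocking Zookeeper round trip per partition happens entirely inside the lock:

```scala
import java.util.concurrent.locks.ReentrantLock

object IsrChangeSketch {
  // Stand-in for a blocking Zookeeper read of a partition's ISR state.
  // In the real controller this is a synchronous network round trip.
  def readIsrFromZk(partition: String): Seq[Int] = Seq(1, 2, 3)

  def main(args: Array[String]): Unit = {
    val controllerLock = new ReentrantLock()
    val changedPartitions = (1 to 500).map(i => s"topic-$i")

    controllerLock.lock() // lock held for the ENTIRE loop below
    try {
      // One Zookeeper read per partition while holding the lock: at even
      // a few ms per round trip, 500 partitions means seconds of lock time.
      val isrByPartition = changedPartitions.map { p =>
        p -> readIsrFromZk(p)
      }.toMap
      // ... controller state would be updated from isrByPartition here ...
      println(isrByPartition.size)
    } finally controllerLock.unlock()
  }
}
```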

In KAFKA-2406, throttling was introduced to prevent overwhelming the controller 
with many changesets at once. However, this does not take into consideration 
_large_ changesets.

We have identified two potential remediations we'd like to discuss further:

* Move the Zookeeper request outside of the lock. This would then only lock for 
the controller update and processing of the changeset.

* Send limited changesets to Zookeeper when calling 
maybePropagateIsrChanges. When dealing with lots of partitions (e.g. >1000) it 
may be useful to batch the changesets in groups of 100 rather than send the 
[entire 
list|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L204]
 to Zookeeper at once.
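The second proposal could look something like the sketch below. This is not the actual Kafka code; propagateInBatches and the propagate callback are hypothetical names, and the batch size of 100 is the value suggested above:

```scala
object IsrBatchingSketch {
  val BatchSize = 100 // suggested batch size from the proposal above

  // Split a large changeset into batches and propagate each batch
  // separately, instead of sending the entire list to Zookeeper at once.
  def propagateInBatches(changedPartitions: Seq[String],
                         propagate: Seq[String] => Unit): Unit =
    changedPartitions.grouped(BatchSize).foreach(propagate)

  def main(args: Array[String]): Unit = {
    val partitions = (1 to 1050).map(i => s"topic-$i")
    var batches = 0
    propagateInBatches(partitions, _ => batches += 1)
    println(batches) // 11 batches: ten of 100 plus one of 50
  }
}
```

Each batch would then hold the lock (and occupy Zookeeper) for a bounded amount of time, letting other controller work interleave between batches.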

We're happy to work on patches for either or both of these, but we are unsure 
about the safety of these two proposals. Specifically, moving the Zookeeper 
request out of the lock may be unsafe.

Holding the controller lock for long periods of time seems problematic - it 
means that broker failures won't be detected and acted upon quickly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
