[jira] [Commented] (KAFKA-5116) Controller updates to ISR holds the controller lock for a very long time

2019-02-17 Thread Matthias J. Sax (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770566#comment-16770566 ]

Matthias J. Sax commented on KAFKA-5116:


Moving all major/minor/trivial tickets that are not yet merged out of the 2.2
release.

> Controller updates to ISR holds the controller lock for a very long time
> 
>
> Key: KAFKA-5116
> URL: https://issues.apache.org/jira/browse/KAFKA-5116
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.10.1.0, 0.10.2.0
>Reporter: Justin Downing
>Priority: Major
> Fix For: 2.2.0
>
>
> Hello!
> Lately, we have noticed slow (or no) results when monitoring the broker's ISR
> using JMX. Many of these requests appear to be 'hung' for a very long time
> (e.g. >2m). We dug into this and found that, in our case, the controllerLock
> can sometimes be held for multiple minutes in the IsrChangeNotifier callback.
> Inside the lock, we read from Zookeeper for *each* partition in the changeset.
> With a large changeset (e.g. >500 partitions), [this
> operation|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1347]
> can take a long time to complete.
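> To make the problem concrete, here is a minimal sketch of the pattern described
> above (the names and helpers are illustrative stand-ins, not the actual code in
> KafkaController.scala):
> {code:scala}
> // Illustrative sketch only: one synchronous ZooKeeper read per partition,
> // all performed while the controller lock is held.
> import java.util.concurrent.locks.ReentrantLock
>
> object IsrChangeNotificationSketch {
>   private val controllerLock = new ReentrantLock()
>
>   // Hypothetical stand-ins for the real calls; each ZooKeeper read is a network round trip.
>   def readIsrFromZk(partition: String): Seq[Int] = ???
>   def updateControllerCache(partition: String, isr: Seq[Int]): Unit = ???
>
>   def onIsrChangeNotification(partitions: Seq[String]): Unit = {
>     controllerLock.lock()
>     try {
>       // With >500 partitions in one changeset, the lock is held for the sum of all round trips.
>       partitions.foreach { p =>
>         val isr = readIsrFromZk(p)
>         updateControllerCache(p, isr)
>       }
>     } finally {
>       controllerLock.unlock()
>     }
>   }
> }
> {code}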
> In KAFKA-2406, throttling was introduced to prevent overwhelming the
> controller with many changesets at once. However, it does not account for
> _large_ changesets.
> We have identified two potential remediations we'd like to discuss further:
> * Move the Zookeeper request outside of the lock. The lock would then be held
> only for the controller update and the processing of the changeset.
> * Send limited changesets to Zookeeper when calling maybePropagateIsrChanges.
> When dealing with many partitions (e.g. >1000), it may be useful to batch the
> changesets in groups of 100 rather than send the [entire
> list|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L204]
> to Zookeeper at once (see the sketch after this list).
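> The first remediation amounts to performing the per-partition reads before the
> lock is taken in the sketch above. The second could look roughly like the
> following, assuming a hypothetical helper for the ZooKeeper write (the real
> propagation happens in ReplicaManager.maybePropagateIsrChanges):
> {code:scala}
> // Illustrative sketch only, not the ReplicaManager API: split a large set of changed
> // partitions into fixed-size batches and write one ISR-change notification per batch,
> // instead of a single notification carrying the entire list.
> object IsrChangePropagationSketch {
>   val BatchSize = 100
>
>   // Hypothetical stand-in for the ZooKeeper write that creates an isr_change notification.
>   def writeIsrChangeNotification(batch: Seq[String]): Unit = ???
>
>   def propagateIsrChanges(changedPartitions: Set[String]): Unit = {
>     changedPartitions.toSeq
>       .grouped(BatchSize)               // e.g. 1000 partitions -> 10 notifications of 100 each
>       .foreach(writeIsrChangeNotification)
>   }
> }
> {code}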
> We're happy to work on patches for either or both of these, but we are unsure
> about the safety of the two proposals. Specifically, moving the Zookeeper
> request out of the lock may be unsafe.
> Holding these locks for long periods of time seems problematic - it means 
> that broker failure won't be detected and acted upon quickly.





[jira] [Commented] (KAFKA-5116) Controller updates to ISR holds the controller lock for a very long time

2018-10-02 Thread Dong Lin (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635780#comment-16635780 ]

Dong Lin commented on KAFKA-5116:

Moving this to 2.2.0 since the PR is not ready yet.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)