[
https://issues.apache.org/jira/browse/KAFKA-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336114#comment-16336114
]
Jeff Widman commented on KAFKA-6469:
------------------------------------
We hit a similar issue when doing partition re-assignments across the cluster
and the total payload was greater than 1MB... We solved it by raising the
jute.maxbuffer limit to several MB.
> ISR change notification queue can prevent controller from making progress
> -------------------------------------------------------------------------
>
> Key: KAFKA-6469
> URL: https://issues.apache.org/jira/browse/KAFKA-6469
> Project: Kafka
> Issue Type: Bug
> Reporter: Kyle Ambroff-Kao
> Assignee: Kyle Ambroff-Kao
> Priority: Major
>
> When writes to /isr_change_notification in ZooKeeper (which is effectively a
> queue of ISR change events for the controller) happen at a rate high enough
> that the node with a watch can't dequeue them, the trouble starts.
> The watcher kafka.controller.IsrChangeNotificationListener is fired in the
> controller when a new entry is written to /isr_change_notification, and the
> zkclient library sends a GetChildrenRequest to ZooKeeper to fetch all child
> znodes.
> We've seen failures in one of our test clusters as the partition count started to
> climb north of 60k per broker. We had brokers writing child nodes under
> /isr_change_notification that were larger than the jute.maxbuffer size in
> ZooKeeper (1MB), causing the ZooKeeper server to drop the controller's
> session, effectively bricking the cluster.
> This can be partially mitigated by chunking ISR notifications to increase the
> maximum number of partitions a broker can host.
>
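The chunking mitigation mentioned in the description can be sketched roughly
as follows. This is only an illustration, not the actual Kafka change: the
JSON layout of a notification, the function names and the 900,000 byte
default are all assumptions, chosen to leave some headroom under the 1MB
jute.maxbuffer limit cited above.

```python
import json

# Assumed envelope for one notification znode (hypothetical layout).
PREFIX = b'{"version":1,"partitions":['
SUFFIX = b']}'
ENVELOPE_BYTES = len(PREFIX) + len(SUFFIX)


def encode_entry(topic, partition):
    """Serialize one topic-partition as compact JSON bytes."""
    entry = {"topic": topic, "partition": partition}
    return json.dumps(entry, separators=(",", ":")).encode("utf-8")


def chunk_notifications(partitions, max_bytes=900_000):
    """Yield serialized payloads, each at most max_bytes long.

    partitions is an iterable of (topic, partition) tuples. The size is
    tracked incrementally so the batch is never re-serialized per entry.
    """
    batch, size = [], ENVELOPE_BYTES
    for topic, partition in partitions:
        entry = encode_entry(topic, partition)
        if ENVELOPE_BYTES + len(entry) > max_bytes:
            raise ValueError("a single entry does not fit in max_bytes")
        added = len(entry) + (1 if batch else 0)  # +1 for the comma
        if size + added > max_bytes:
            # Flush the current batch and start a new one with this entry.
            yield PREFIX + b",".join(batch) + SUFFIX
            batch, size = [], ENVELOPE_BYTES
            added = len(entry)
        batch.append(entry)
        size += added
    if batch:
        yield PREFIX + b",".join(batch) + SUFFIX
```

Each yielded payload would then be written as its own child znode, so no
single write exceeds the buffer limit. Note this only bounds the size of
individual znodes; it does not slow the rate at which they are written.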
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)