[ https://issues.apache.org/jira/browse/KAFKA-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haruki Okada updated KAFKA-17061:
---------------------------------
    Attachment: flame.html

> KafkaController takes long time to connect to newly added broker after registration on large cluster
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17061
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17061
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Haruki Okada
>            Assignee: Haruki Okada
>            Priority: Major
>         Attachments: flame-patched.html, flame.html, image-2024-07-02-17-22-06-100.png, image-2024-07-02-17-24-11-861.png
>
>
> h2. Environment
>  * Kafka version: 3.3.2
>  * Cluster: ~200 brokers
>  * Total number of partitions: 40k
>  * ZK-based cluster
>
> h2. Phenomenon
> When a broker left the cluster due to a long STW (stop-the-world) pause and came back after a while, the controller took 6 seconds to connect to the broker after its znode registration, which caused a significant message delivery delay.
> {code:java}
> [2024-06-22 23:59:38,202] INFO [Controller id=1] Newly added brokers: 2, deleted brokers: , bounced brokers: , all live brokers: 1,... (kafka.controller.KafkaController)
> [2024-06-22 23:59:38,203] DEBUG [Channel manager on controller 1]: Controller 1 trying to connect to broker 2 (kafka.controller.ControllerChannelManager)
> [2024-06-22 23:59:38,205] INFO [RequestSendThread controllerId=1] Starting (kafka.controller.RequestSendThread)
> [2024-06-22 23:59:38,205] INFO [Controller id=1] New broker startup callback for 2 (kafka.controller.KafkaController)
> [2024-06-22 23:59:44,524] INFO [RequestSendThread controllerId=1] Controller 1 connected to broker-2:9092 (id: 2 rack: rack-2) for sending state change requests (kafka.controller.RequestSendThread)
> {code}
>
> h2. Analysis
> The flame graph captured at that time shows that the [liveBrokerIds|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/controller/ControllerContext.scala#L217] calculation takes significant time during the `addUpdateMetadataRequestForBrokers` invocation on broker startup.
> !image-2024-07-02-17-24-11-861.png|width=541,height=303!
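
To illustrate why a derived broker-ID set can become a hot spot at this scale, here is a minimal Python sketch. It is not Kafka code: the class `ControllerContextSketch` and both helper functions are hypothetical, and it assumes (the report does not state this) that the live-broker set is rebuilt once per partition while metadata updates are being prepared. Under that assumption the work grows with brokers times partitions, whereas computing the set once per batch keeps it linear.

```python
class ControllerContextSketch:
    """Hypothetical stand-in for the controller's broker bookkeeping."""

    def __init__(self, broker_epochs, shutting_down):
        self.broker_epochs = broker_epochs    # broker id -> epoch
        self.shutting_down = shutting_down    # set of broker ids
        self.recomputations = 0               # how often the set was rebuilt

    def live_broker_ids(self):
        # Builds a fresh set on every call: O(number of brokers) each time.
        self.recomputations += 1
        return set(self.broker_epochs) - self.shutting_down


def live_replicas_recomputing(ctx, assignments):
    """Assumed slow pattern: rebuild the live set for every partition."""
    result = {}
    for partition, replicas in assignments.items():
        live = ctx.live_broker_ids()
        result[partition] = [b for b in replicas if b in live]
    return result


def live_replicas_hoisted(ctx, assignments):
    """Alternative: build the live set once and reuse it for the batch."""
    live = ctx.live_broker_ids()
    return {
        partition: [b for b in replicas if b in live]
        for partition, replicas in assignments.items()
    }


if __name__ == "__main__":
    # Sizes taken from the Environment section: ~200 brokers, 40k partitions.
    brokers, partitions = 200, 40_000
    ctx = ControllerContextSketch({b: 0 for b in range(brokers)}, set())
    assignments = {
        p: [p % brokers, (p + 1) % brokers, (p + 2) % brokers]
        for p in range(partitions)
    }

    slow = live_replicas_recomputing(ctx, assignments)
    print("set rebuilds when recomputing:", ctx.recomputations)

    ctx.recomputations = 0
    fast = live_replicas_hoisted(ctx, assignments)
    print("set rebuilds when hoisted:", ctx.recomputations)
    print("same result:", slow == fast)
```

Both functions return the same mapping; only the number of set rebuilds differs, which is the kind of cost a flame graph would attribute to the derived-set accessor.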