[ 
https://issues.apache.org/jira/browse/KAFKA-17061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haruki Okada updated KAFKA-17061:
---------------------------------
    Description: 
h2. Environment
 * Kafka version: 3.3.2
 * Cluster: 200~ brokers
 * Total num partitions: 40k
 * ZK-based cluster

h2. Phenomenon

When a broker left the cluster due to a long stop-the-world (STW) pause and 
came back after a while, the controller took 6 seconds to connect to the 
broker after znode registration, which caused a significant message delivery 
delay.
{code:java}
[2024-06-22 23:59:38,202] INFO [Controller id=1] Newly added brokers: 2, 
deleted brokers: , bounced brokers: , all live brokers: 1,... 
(kafka.controller.KafkaController)
[2024-06-22 23:59:38,203] DEBUG [Channel manager on controller 1]: Controller 1 
trying to connect to broker 2 (kafka.controller.ControllerChannelManager)
[2024-06-22 23:59:38,205] INFO [RequestSendThread controllerId=1] Starting 
(kafka.controller.RequestSendThread)
[2024-06-22 23:59:38,205] INFO [Controller id=1] New broker startup callback 
for 2 (kafka.controller.KafkaController)
[2024-06-22 23:59:44,524] INFO [RequestSendThread controllerId=1] Controller 1 
connected to broker-2:9092 (id: 2 rack: rack-2) for sending state change 
requests (kafka.controller.RequestSendThread)
{code}
h2. Analysis

From the flamegraph at that time, we can see that 
[liveBrokerIds|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/controller/ControllerContext.scala#L217]
calculation takes significant time.

!image-2024-07-02-17-24-11-861.png|width=541,height=303!

  was:
h2. Environment
 * Kafka version: 3.3.2
 * Cluster: 200~ brokers
 * Total num partitions: 40k
 * ZK-based cluster

h2. Phenomenon

When a broker left the cluster due to a long stop-the-world (STW) pause and 
came back after a while, the controller took 6 seconds to connect to the 
broker after znode registration, which caused a significant message delivery 
delay.
{code:java}
[2024-06-22 23:59:38,202] INFO [Controller id=1] Newly added brokers: 2, 
deleted brokers: , bounced brokers: , all live brokers: 1,... 
(kafka.controller.KafkaController)
[2024-06-22 23:59:38,203] DEBUG [Channel manager on controller 1]: Controller 1 
trying to connect to broker 2 (kafka.controller.ControllerChannelManager)
[2024-06-22 23:59:38,205] INFO [RequestSendThread controllerId=1] Starting 
(kafka.controller.RequestSendThread)
[2024-06-22 23:59:38,205] INFO [Controller id=1] New broker startup callback 
for 2 (kafka.controller.KafkaController)
[2024-06-22 23:59:44,524] INFO [RequestSendThread controllerId=1] Controller 1 
connected to broker-2:9092 (id: 2 rack: rack-2) for sending state change 
requests (kafka.controller.RequestSendThread)
{code}
h2. Analysis

From the flamegraph at that time, we can see that 
[liveBrokerIds|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/controller/ControllerContext.scala#L217]
calculation takes significant time.

!image-2024-07-02-17-24-11-861.png|width=541,height=303!

Since no concurrent modification against liveBrokerEpochs is expected, we can 
just cache the result to improve the performance.
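
The caching idea above can be sketched roughly as follows. This is a minimal, 
hypothetical illustration in Java, not Kafka's actual ControllerContext code 
(which is written in Scala). The class BrokerRegistry and all of its methods 
are invented names. The sketch assumes that the live set is derived from an 
epoch map minus a set of shutting-down broker ids, and that all access happens 
on a single thread, so no locking is shown.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the controller's broker bookkeeping.
// Assumption: every read and write happens on one thread.
final class BrokerRegistry {
    private final Map<Integer, Long> liveBrokerEpochs = new HashMap<>();
    private final Set<Integer> shuttingDownBrokerIds = new HashSet<>();

    // Cached derived set; null means "stale, recompute on the next read".
    private Set<Integer> cachedLiveBrokerIds = null;

    void addLiveBroker(int id, long epoch) {
        liveBrokerEpochs.put(id, epoch);
        cachedLiveBrokerIds = null; // invalidate on every mutation
    }

    void removeLiveBroker(int id) {
        liveBrokerEpochs.remove(id);
        cachedLiveBrokerIds = null;
    }

    void markShuttingDown(int id) {
        shuttingDownBrokerIds.add(id);
        cachedLiveBrokerIds = null;
    }

    // Recomputes the set difference only after a mutation; repeated reads
    // between mutations return the same cached instance.
    Set<Integer> liveBrokerIds() {
        if (cachedLiveBrokerIds == null) {
            Set<Integer> ids = new HashSet<>(liveBrokerEpochs.keySet());
            ids.removeAll(shuttingDownBrokerIds);
            cachedLiveBrokerIds = Collections.unmodifiableSet(ids);
        }
        return cachedLiveBrokerIds;
    }
}
```

With this shape, a hot path that reads the live set many times between broker 
changes pays for the set difference once per change rather than once per read. 
The trade-off is that every mutating method has to remember to invalidate the 
cache.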


> KafkaController takes long time to connect to newly added broker after 
> registration on large cluster
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17061
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17061
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Haruki Okada
>            Assignee: Haruki Okada
>            Priority: Major
>         Attachments: image-2024-07-02-17-22-06-100.png, 
> image-2024-07-02-17-24-11-861.png
>
>
> h2. Environment
>  * Kafka version: 3.3.2
>  * Cluster: 200~ brokers
>  * Total num partitions: 40k
>  * ZK-based cluster
> h2. Phenomenon
> When a broker left the cluster due to a long stop-the-world (STW) pause and 
> came back after a while, the controller took 6 seconds to connect to the 
> broker after znode registration, which caused a significant message delivery 
> delay.
> {code:java}
> [2024-06-22 23:59:38,202] INFO [Controller id=1] Newly added brokers: 2, 
> deleted brokers: , bounced brokers: , all live brokers: 1,... 
> (kafka.controller.KafkaController)
> [2024-06-22 23:59:38,203] DEBUG [Channel manager on controller 1]: Controller 
> 1 trying to connect to broker 2 (kafka.controller.ControllerChannelManager)
> [2024-06-22 23:59:38,205] INFO [RequestSendThread controllerId=1] Starting 
> (kafka.controller.RequestSendThread)
> [2024-06-22 23:59:38,205] INFO [Controller id=1] New broker startup callback 
> for 2 (kafka.controller.KafkaController)
> [2024-06-22 23:59:44,524] INFO [RequestSendThread controllerId=1] Controller 
> 1 connected to broker-2:9092 (id: 2 rack: rack-2) for sending state change 
> requests (kafka.controller.RequestSendThread)
> {code}
> h2. Analysis
> From the flamegraph at that time, we can see that 
> [liveBrokerIds|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/controller/ControllerContext.scala#L217]
>  calculation takes significant time.
> !image-2024-07-02-17-24-11-861.png|width=541,height=303!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
