[ 
https://issues.apache.org/jira/browse/KAFKA-13392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055936#comment-18055936
 ] 

hongwei.xiang commented on KAFKA-13392:
---------------------------------------

In our Kafka 3.8 setup, I noticed we can hit a TimeoutException during 
partition reassignment when one broker is down.

*Root cause*

When we run a reassignment using a plan file (e.g. xxx.json), the plan may 
still include replicas on the down broker.

During the execution, we try to apply throttling by calling:

 
{code:java}
adminClient.incrementalAlterConfigs(configs){code}
 

The problem is that this API needs to connect to each target broker to set 
its broker-level throttle configs.
If a broker is down, it is unreachable, so the client keeps retrying until the 
call fails with a TimeoutException.

 

*My proposed solution*

Add a new parameter:

 
{code:java}
--broker-list-without-throttle{code}
 

Value: a list of broker IDs, comma-separated
Example: 1001 or 1001,1002

Purpose: skip throttle config updates for known down brokers during 
reassignment execution.

So we still throttle normal brokers, but we don’t waste time trying (and 
failing) to throttle the down broker.

*Example*
{code:java}
/opt/kafka/bin/kafka-reassign-partitions.sh \
  --bootstrap-server xxx.xxx.xxx.xxx:9092 \
  --reassignment-json-file /tmp/reassignment-20211021130718.json \
  --throttle 100000000 \
  --execute \
  --broker-list-without-throttle 1001{code}
 


If broker 1001 is known to be down, and the reassignment plan includes it, then 
we exclude 1001 from throttle config changes.
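
A minimal sketch of the exclusion described above, using only the Java standard library. The class and method names (ThrottleTargets, parseBrokerIds, brokersToThrottle) are hypothetical and are not part of Kafka or of any existing patch. The sketch only shows how the option value could be parsed and how the set of brokers that still receive throttle configs could be derived from the brokers in the plan.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Kafka: illustrates the proposed filtering only.
final class ThrottleTargets {

    // Parses a comma-separated broker ID list such as "1001" or "1001,1002".
    // A null or blank value means that no broker is excluded.
    static Set<Integer> parseBrokerIds(String value) {
        if (value == null || value.trim().isEmpty()) {
            return new TreeSet<>();
        }
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::parseInt)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Returns the brokers that should still receive throttle configs:
    // every broker referenced by the plan, minus the excluded ones.
    static Set<Integer> brokersToThrottle(Set<Integer> brokersInPlan,
                                          Set<Integer> withoutThrottle) {
        Set<Integer> result = new TreeSet<>(brokersInPlan);
        result.removeAll(withoutThrottle);
        return result;
    }

    public static void main(String[] args) {
        // Assumed plan: replicas on brokers 1001-1004, with 1001 known to be down.
        Set<Integer> inPlan = new TreeSet<>(Arrays.asList(1001, 1002, 1003, 1004));
        Set<Integer> skipped = parseBrokerIds("1001");
        System.out.println(brokersToThrottle(inPlan, skipped)); // prints [1002, 1003, 1004]
    }
}
```

With this approach, the throttle call would be issued only for the brokers in the returned set, so the down broker is never contacted for config changes. Excluded IDs that do not appear in the plan are simply ignored.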

 

*Why this is needed*

If we don’t use '--throttle' at all, Kafka won’t set a throttle on any 
broker (including the down one). But that’s risky: unthrottled migrations can 
easily saturate network bandwidth or disk I/O.

If we only skip throttling for the known down broker, it doesn’t change the 
reassignment logic itself, and it avoids the timeout.
Meanwhile, healthy brokers still get throttled properly.

 

Please assign this ticket to me. I’m happy to implement and share the full 
solution. Thanks!

> Timeout Exception triggering reassign partitions with --bootstrap-server 
> option
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-13392
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13392
>             Project: Kafka
>          Issue Type: Bug
>          Components: admin
>    Affects Versions: 2.8.0
>            Reporter: Yevgeniy Korin
>            Priority: Minor
>
> *Scenario in which we faced this issue:*
>  One of three brokers is down. Add another (fourth) broker and try to 
> reassign partitions using the '--bootstrap-server' option.
> *What failed:*
> {code:java}
> /opt/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server 
> xxx.xxx.xxx.xxx:9092 --reassignment-json-file 
> /tmp/reassignment-20211021130718.json --throttle 100000000 --execute{code}
> failed with
> {code:java}
> Error: org.apache.kafka.common.errors.TimeoutException: 
> Call(callName=incrementalAlterConfigs, deadlineMs=1634811369255, tries=1, 
> nextAllowedTryMs=1634811369356) timed out at 1634811369256 after 1 attempt(s)
>  java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: 
> Call(callName=incrementalAlterConfigs, deadlineMs=1634811369255, tries=1, 
> nextAllowedTryMs=1634811369356) timed out at 1634811369256 after 1 attempt(s)
>  at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>  at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>  at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>  at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
>  at 
> kafka.admin.ReassignPartitionsCommand$.modifyInterBrokerThrottle(ReassignPartitionsCommand.scala:1435)
>  at 
> kafka.admin.ReassignPartitionsCommand$.modifyReassignmentThrottle(ReassignPartitionsCommand.scala:1412)
>  at 
> kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:974)
>  at 
> kafka.admin.ReassignPartitionsCommand$.handleAction(ReassignPartitionsCommand.scala:255)
>  at 
> kafka.admin.ReassignPartitionsCommand$.main(ReassignPartitionsCommand.scala:216)
>  at 
> kafka.admin.ReassignPartitionsCommand.main(ReassignPartitionsCommand.scala)
>  Caused by: org.apache.kafka.common.errors.TimeoutException: 
> Call(callName=incrementalAlterConfigs, deadlineMs=1634811369255, tries=1, 
> nextAllowedTryMs=1634811369356) timed out at 1634811369256 after 1 attempt(s)
>  Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out 
> waiting for a node assignment. Call: incrementalAlterConfigs{code}
>  *Expected behavior:*
>  The partition reassignment process starts.
> *Workaround:*
>  Trigger partition reassignment process using '--zookeeper' option:
> {code:java}
> /opt/kafka/bin/kafka-reassign-partitions.sh --zookeeper 
> zookeeper.my.company:2181/kafka-cluster --reassignment-json-file 
> /tmp/reassignment-20211021130718.json --throttle 100000000 --execute{code}
>  *Additional info:*
>  We are able to trigger partition reassignment using '--bootstrap-server' 
> option with no exceptions when all four brokers are alive.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
