[ https://issues.apache.org/jira/browse/KAFKA-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574331#comment-16574331 ]
huxihx commented on KAFKA-7241:
-------------------------------

Did you try running with `--verify` before doing the reassignment? It should fail the verification.
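A minimal sketch of that check, assuming the ZooKeeper address from the report below and a plan of the same shape as the attached [^reassignment_json.txt] (the JSON here is only illustrative, not the attachment's content):

{code}
# Illustrative plan: topic "test", partition 15, moved onto broker 4,
# which is not registered under /brokers/ids.
cat > reassignment_json.txt <<'EOF'
{"version":1,"partitions":[{"topic":"test","partition":15,"replicas":[4,1,2]}]}
EOF

# Per the suggestion above, verifying the plan before applying it is expected
# to report a failure for partitions that reference the missing broker.
bin/kafka-reassign-partitions.sh --zookeeper grid1219:3185 \
  --reassignment-json-file reassignment_json.txt --verify
{code}

The same JSON file can then be applied with `--execute`, which writes it to /admin/reassign_partitions for you, rather than writing the znode by hand.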
> Reassignment of partitions to a non-existent broker
> ----------------------------------------------------
>
>                 Key: KAFKA-7241
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7241
>             Project: Kafka
>          Issue Type: Task
>          Components: admin
>    Affects Versions: 1.1.1
>            Reporter: Igor Martemyanov
>            Priority: Major
>         Attachments: kafka-logs.zip, reassignment_json.txt, reassignment_partitions_path_after_finish_work.txt
>
> There is a problem with starting a partition reassignment that targets a non-existent broker.
> The Kafka cluster has 3 brokers with ids 1, 2, 3.
> We try to reassign some partitions to another broker (e.g. with id=4) and end up in a situation where the reassignment task never stops. We cannot start other reassignment tasks until that task finishes.
> Details:
>
> The broker list before the reassignment task is started:
> {code}
> [zk: grid1219:3185(CONNECTED) 0] ls /brokers/ids
> [1, 2, 3]
> {code}
> The admin path is:
> {code}
> [zk: grid1219:3185(CONNECTED) 1] ls /admin
> [delete_topics]
> [zk: grid1219:3185(CONNECTED) 2] get /admin/delete_topics
> null
> cZxid = 0xe
> ctime = Fri Aug 03 08:04:25 MSK 2018
> mZxid = 0xe
> mtime = Fri Aug 03 08:04:25 MSK 2018
> pZxid = 0xe
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 0
> numChildren = 0
> {code}
> There is one topic created with 20 partitions and replication factor 3. The reassignment JSON is attached as [^reassignment_json.txt]. We write this JSON to the path /admin/reassign_partitions, after which the partition reassignment starts.
> The result of the reassignment process can be seen in the Kafka controller logs (the full log is attached as [^kafka-logs.zip]):
> {code}
> [2018-08-03 08:52:21,329] INFO [Controller id=1] Handling reassignment of partition test-15 to new replicas 4,1,2 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,329] INFO [Controller id=1] New replicas 4,1,2 for partition test-15 being reassigned not yet caught up with the leader (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,330] INFO [Controller id=1] Updated assigned replicas for partition test-15 being reassigned to 4,1,2,3 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,330] DEBUG [Controller id=1] Updating leader epoch for partition test-15 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,331] INFO [Controller id=1] Updated leader epoch for partition test-15 to 1 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, leaderEpoch=1, isr=3,1
> [2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, leaderEpoch=0, isr=3,1
> [2018-08-03 08:52:21,331] INFO [Controller id=1] Waiting for new replicas 4,1,2 for partition test-15 being reassigned to catch up with the leader (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,332] INFO [Controller id=1] Handling reassignment of partition test-2 to new replicas 2,1,4 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,332] INFO [Controller id=1] New replicas 2,1,4 for partition test-2 being reassigned not yet caught up with the leader (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,333] INFO [Controller id=1] Updated assigned replicas for partition test-2 being reassigned to 2,1,4,3 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,333] DEBUG [Controller id=1] Updating leader epoch for partition test-2 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,333] INFO [Controller id=1] Updated leader epoch for partition test-2 to 1 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, leaderEpoch=1, isr=2,1,
> [2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, leaderEpoch=0, isr=2,1,
> [2018-08-03 08:52:21,334] INFO [Controller id=1] Waiting for new replicas 2,1,4 for partition test-2 being reassigned to catch up with the leader (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-14 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-6 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-17 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-11 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-10 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-19 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-0 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-7 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-18 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-5 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-8 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-1 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-13 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-4 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-16 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-9 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-3 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-12 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-15 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up with the leader for partition test-2 being reassigned. Replica(s) 4 still need to catch up (kafka.controller.KafkaController)
> {code}
> After the reassignment process has finished, the znode /admin/reassign_partitions contains the value from [^reassignment_partitions_path_after_finish_work.txt].
> So a reassignment task remains in the znode /admin/reassign_partitions, and we cannot start other reassignment tasks until the previous one has finished.
> We need a proper mechanism to detect such situations and stop such reassignment tasks.
> For example, a timeout parameter after which the reassignment task for a partition is dropped from the znode with a warning in the logs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)