[ https://issues.apache.org/jira/browse/KAFKA-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574331#comment-16574331 ]

huxihx commented on KAFKA-7241:
-------------------------------

Did you try running with `--verify` before doing the reassignment? This should 
have failed the verification.
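
For example, a sketch (the ZooKeeper connect string and the JSON file name are 
taken from the report below, not verified):

{code}
# Check the status of the reassignment described in the JSON file; per the
# comment above, a reassignment to a non-existent broker should fail here.
bin/kafka-reassign-partitions.sh --zookeeper grid1219:3185 \
  --reassignment-json-file reassignment_json.txt --verify
{code}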

> Reassignment of partitions to non-existent broker
> --------------------------------------------------
>
>                 Key: KAFKA-7241
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7241
>             Project: Kafka
>          Issue Type: Task
>          Components: admin
>    Affects Versions: 1.1.1
>            Reporter: Igor Martemyanov
>            Priority: Major
>         Attachments: kafka-logs.zip, reassignment_json.txt, 
> reassignment_partitions_path_after_finish_work.txt
>
>
> There is a problem when starting a partition reassignment process to a 
> non-existent broker.
> The Kafka cluster has 3 brokers with ids 1, 2, 3.
> We try to reassign some partitions to another broker (e.g. with id=4) and end 
> up in a situation where the reassignment task never stops. We cannot start 
> other reassignment tasks before that task finishes.
> Details:
>  
> This is the broker list before the partition reassignment task is started:
> {code}
> [zk: grid1219:3185(CONNECTED) 0] ls /brokers/ids
> [1, 2, 3]
> {code}
> Admin path is:
> {code}
> [zk: grid1219:3185(CONNECTED) 1] ls /admin
> [delete_topics]
> [zk: grid1219:3185(CONNECTED) 2] get /admin/delete_topics
> null
> cZxid = 0xe
> ctime = Fri Aug 03 08:04:25 MSK 2018
> mZxid = 0xe
> mtime = Fri Aug 03 08:04:25 MSK 2018
> pZxid = 0xe
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 0
> numChildren = 0
> {code}
> There is one topic, created with 20 partitions and replication factor 3. 
> The reassignment JSON is here  [^reassignment_json.txt] . We write the JSON 
> to the path /admin/reassign_partitions, after which the partition 
> reassignment starts.
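> For illustration, a reassignment JSON in the standard format (this excerpt 
> is an assumption reconstructed from the replica lists in the controller log 
> below; the actual file is attached as  [^reassignment_json.txt] ):
> {code}
> {"version": 1,
>  "partitions": [
>    {"topic": "test", "partition": 15, "replicas": [4, 1, 2]},
>    {"topic": "test", "partition": 2, "replicas": [2, 1, 4]}
>  ]}
> {code}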
> We can see the result of the reassignment process in the Kafka controller 
> logs (the full log is here -  [^kafka-logs.zip] ):
> {code}
> [2018-08-03 08:52:21,329] INFO [Controller id=1] Handling reassignment of 
> partition test-15 to new replicas 4,1,2 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,329] INFO [Controller id=1] New replicas 4,1,2 for 
> partition test-15 being reassigned not yet caught up with the leader 
> (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,330] INFO [Controller id=1] Updated assigned replicas 
> for partition test-15 being reassigned to 4,1,2,3 
> (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,330] DEBUG [Controller id=1] Updating leader epoch for 
> partition test-15 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,331] INFO [Controller id=1] Updated leader epoch for 
> partition test-15 to 1 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, leaderEpoch=1, isr=3,1
> [2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, leaderEpoch=0, isr=3,1
> [2018-08-03 08:52:21,331] INFO [Controller id=1] Waiting for new replicas 4,1,2 for partition test-15 being reassigned to catch up with the leader (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,332] INFO [Controller id=1] Handling reassignment of 
> partition test-2 to new replicas 2,1,4 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,332] INFO [Controller id=1] New replicas 2,1,4 for 
> partition test-2 being reassigned not yet caught up with the leader 
> (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,333] INFO [Controller id=1] Updated assigned replicas 
> for partition test-2 being reassigned to 2,1,4,3 
> (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,333] DEBUG [Controller id=1] Updating leader epoch for 
> partition test-2 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,333] INFO [Controller id=1] Updated leader epoch for 
> partition test-2 to 1 (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, leaderEpoch=1, isr=2,1,
> [2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, leaderEpoch=0, isr=2,1,
> [2018-08-03 08:52:21,334] INFO [Controller id=1] Waiting for new replicas 2,1,4 for partition test-2 being reassigned to catch up with the leader (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-14 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-6 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-17 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-11 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-10 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-19 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-0 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-7 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-18 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-5 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-8 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-1 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-13 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-4 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-16 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-9 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-3 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-12 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-15 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> [2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up 
> with the leader for partition test-2 being reassigned. Replica(s) 4 still 
> need to catch up (kafka.controller.KafkaController)
> {code}
> After the reassignment process has finished, the znode 
> /admin/reassign_partitions contains the value from file  
> [^reassignment_partitions_path_after_finish_work.txt] .
> A stale reassignment task remains in the znode /admin/reassign_partitions, 
> and we cannot start other reassignment tasks until the previous task has 
> finished.
> We need a supported mechanism to detect such situations and stop such 
> reassignment tasks.
> For example, a timeout parameter for cleaning reassignment tasks, after which 
> the reassignment task for a partition is dropped from the znode with a 
> warning in the logs.
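> A manual cleanup sketch, assuming direct ZooKeeper shell access (deleting 
> /controller forces a controller re-election so the cached in-memory 
> reassignment state is dropped; this workaround is an assumption, not part 
> of this report):
> {code}
> [zk: grid1219:3185(CONNECTED) 3] rmr /admin/reassign_partitions
> [zk: grid1219:3185(CONNECTED) 4] rmr /controller
> {code}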



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
