[ https://issues.apache.org/jira/browse/KAFKA-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Martemyanov updated KAFKA-7241:
------------------------------------
    Description: 
There is a problem when a partition reassignment is started to a non-existent 
broker.

The Kafka cluster has 3 brokers, with ids 1, 2 and 3.

We try to reassign some partitions to another broker (e.g. with id=4) and end 
up in a situation where the reassignment task never completes. We cannot start 
any other reassignment tasks until that task finishes.

Details:

 

This is the broker list before the partition reassignment task is started:
{code}
[zk: grid1219:3185(CONNECTED) 0] ls /brokers/ids
[1, 2, 3]
{code}
The admin path is:
{code}
[zk: grid1219:3185(CONNECTED) 1] ls /admin
[delete_topics]
[zk: grid1219:3185(CONNECTED) 2] get /admin/delete_topics
null
cZxid = 0xe
ctime = Fri Aug 03 08:04:25 MSK 2018
mZxid = 0xe
mtime = Fri Aug 03 08:04:25 MSK 2018
pZxid = 0xe
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 0
{code}

There is one topic, created with 20 partitions and replication factor 3. The 
reassignment JSON is attached as [^reassignment_json.txt]. We write this JSON 
to the znode /admin/reassign_partitions, after which the partition 
reassignment starts.
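The attached file is not reproduced here; purely as an illustration, a 
reassignment JSON that targets the non-existent broker 4 would look roughly 
like this (the replica sets match what the controller later logs for test-15 
and test-2):
{code}
{
  "version": 1,
  "partitions": [
    {"topic": "test", "partition": 15, "replicas": [4, 1, 2]},
    {"topic": "test", "partition": 2,  "replicas": [2, 1, 4]}
  ]
}
{code}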
The result of the reassignment process can be seen in the Kafka controller log 
(the full log is attached as [^kafka-logs.zip]):
{code}
[2018-08-03 08:52:21,329] INFO [Controller id=1] Handling reassignment of 
partition test-15 to new replicas 4,1,2 (kafka.controller.KafkaController)
[2018-08-03 08:52:21,329] INFO [Controller id=1] New replicas 4,1,2 for 
partition test-15 being reassigned not yet caught up with the leader 
(kafka.controller.KafkaController)
[2018-08-03 08:52:21,330] INFO [Controller id=1] Updated assigned replicas for 
partition test-15 being reassigned to 4,1,2,3 (kafka.controller.KafkaController)
[2018-08-03 08:52:21,330] DEBUG [Controller id=1] Updating leader epoch for 
partition test-15 (kafka.controller.KafkaController)
[2018-08-03 08:52:21,331] INFO [Controller id=1] Updated leader epoch for 
partition test-15 to 1 (kafka.controller.KafkaController)
[2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending 
request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, 
partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, 
leaderEpoch=1, isr=3,1
[2018-08-03 08:52:21,331] WARN [Channel manager on controller 1]: Not sending 
request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, 
partitionStates={test-15=PartitionState(controllerEpoch=1, leader=3, 
leaderEpoch=0, isr=3,1
[2018-08-03 08:52:21,331] INFO [Controller id=1] Waiting for new replicas 4,1,2 
for partition test-15 being reassigned to catch up with the leader 
(kafka.controller.KafkaController)
[2018-08-03 08:52:21,332] INFO [Controller id=1] Handling reassignment of 
partition test-2 to new replicas 2,1,4 (kafka.controller.KafkaController)
[2018-08-03 08:52:21,332] INFO [Controller id=1] New replicas 2,1,4 for 
partition test-2 being reassigned not yet caught up with the leader 
(kafka.controller.KafkaController)
[2018-08-03 08:52:21,333] INFO [Controller id=1] Updated assigned replicas for 
partition test-2 being reassigned to 2,1,4,3 (kafka.controller.KafkaController)
[2018-08-03 08:52:21,333] DEBUG [Controller id=1] Updating leader epoch for 
partition test-2 (kafka.controller.KafkaController)
[2018-08-03 08:52:21,333] INFO [Controller id=1] Updated leader epoch for 
partition test-2 to 1 (kafka.controller.KafkaController)
[2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending 
request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, 
partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, 
leaderEpoch=1, isr=2,1,
[2018-08-03 08:52:21,334] WARN [Channel manager on controller 1]: Not sending 
request (type=LeaderAndIsRequest, controllerId=1, controllerEpoch=1, 
partitionStates={test-2=PartitionState(controllerEpoch=1, leader=2, 
leaderEpoch=0, isr=2,1,
[2018-08-03 08:52:21,334] INFO [Controller id=1] Waiting for new replicas 2,1,4 
for partition test-2 being reassigned to catch up with the leader 
(kafka.controller.KafkaController)
[2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-14 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,337] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-6 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-17 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,338] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-11 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-10 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,339] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-19 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-0 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,340] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-7 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-18 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,341] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-5 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-8 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,342] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-1 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-13 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,343] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-4 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-16 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,344] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-9 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-3 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,345] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-12 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-15 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
[2018-08-03 08:52:21,346] INFO [Controller id=1] 2/3 replicas have caught up 
with the leader for partition test-2 being reassigned. Replica(s) 4 still need 
to catch up (kafka.controller.KafkaController)
{code}
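(We wrote the JSON to the znode directly; the same stuck state can also be 
observed by checking the task with the standard reassignment tool, assuming 
the usual 1.1.x invocation:)
{code}
bin/kafka-reassign-partitions.sh --zookeeper grid1219:3185 \
  --reassignment-json-file reassignment_json.txt --verify
{code}
For every partition that includes broker 4 in its target replicas, the verify 
step will keep reporting the reassignment as still in progress.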

After the reassignment run has finished, the znode /admin/reassign_partitions 
still contains the value shown in 
[^reassignment_partitions_path_after_finish_work.txt].

So a reassignment task remains stuck in the znode /admin/reassign_partitions, 
and we cannot start any other reassignment tasks until the previous one 
finishes.
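A commonly used manual workaround (not a proper fix) is to delete the stuck 
task by hand and force a controller re-election, so the controller also drops 
its in-memory reassignment state, e.g. from the same zkCli session:
{code}
[zk: grid1219:3185(CONNECTED) 3] delete /admin/reassign_partitions
[zk: grid1219:3185(CONNECTED) 4] delete /controller
{code}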

We need a proper mechanism to detect such situations and stop such 
reassignment tasks, for example a timeout parameter after which the pending 
reassignment of a partition is dropped from the znode, with a warning in the 
logs.
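A sketch of what such a timeout setting could look like (the key below is 
hypothetical and does not exist in any Kafka release):
{code}
# hypothetical controller-side setting: drop a partition's pending
# reassignment from /admin/reassign_partitions after this many milliseconds
# and log a warning
reassignment.pending.timeout.ms=600000
{code}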



> Reassignment of partitions to a non-existent broker
> ---------------------------------------------------
>
>                 Key: KAFKA-7241
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7241
>             Project: Kafka
>          Issue Type: Task
>          Components: admin
>    Affects Versions: 1.1.1
>            Reporter: Igor Martemyanov
>            Priority: Major
>         Attachments: kafka-logs.zip, reassignment_json.txt, 
> reassignment_partitions_path_after_finish_work.txt


