GitHub user apurvam opened a pull request: https://github.com/apache/kafka/pull/1910
KAFKA-4214:kafka-reassign-partitions fails all the time when brokers are bounced during reassignment There is a corner case bug, where during partition reassignment, if the controller and a broker receiving a new replica are bounced at the same time, the partition reassignment is failed. The cause of this bug is a block of code in the KafkaController which fails the reassignment if the aliveNewReplicas != newReplicas, ie. if some of the new replicas are offline at the time a controller fails over. The fix is to have the controller listen for ISR change events even for new replicas which are not alive when the controller boots up. Once the said replicas come online, they will be in the ISR set, and the new controller will detect this, and then mark the reassignment as successful. Interestingly, the block of code in question was introduced in KAFKA-990, where a concern about this exact scenario was raised :) This bug was revealed in the system tests in https://github.com/apache/kafka/pull/1904. The relevant tests will be enabled in either this or a followup PR when PR-1904 is merged. You can merge this pull request into a Git repository by running: $ git pull https://github.com/apurvam/kafka KAFKA-4214 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/kafka/pull/1910.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1910 ---- commit 7f9814466592f549e8fcde914815c35dcb5a046d Author: Apurva Mehta <apurva.1...@gmail.com> Date: 2016-09-26T21:11:44Z There is a corner case bug, where during partition reassignment, if the controller and a broker receiving a new replica are bounced at the same time, the partition reassignment is failed. The cause of this bug is a block of code in the KafkaController which fails the reassignment if the aliveNewReplicas != newReplicas, ie. if some of the new replicas are offline at the time a controller fails over. The fix is to have the controller listen for ISR change events even for new replicas which are not alive when the controller boots up. Once the said replicas come online, they will be in the ISR set, and the new controller will detect this, and then mark the reassignment as successful. Interestingly, the block of code in question was introduced in KAFKA-990, where a concern about this exact scenario was raised :) ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---