[
https://issues.apache.org/jira/browse/KAFKA-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neha Narkhede updated KAFKA-42:
-------------------------------
Attachment: kafka-42-v1.patch
This is a pretty tricky feature. Since it involves multiple state changes
before reassignment can be marked complete, there are many failure conditions
to think about, and recovery must be handled correctly for each of them.
1. Admin tool changes
1.1 Added a new admin command reassign-partition. Right now, it handles one
partition at a time, since I thought the failure/exit conditions and error
messages would be simpler to handle. But if people think we should support
multiple partitions in the same command invocation, that is fine too.
1.2 Added a new reassignPartition(topic, partition, RAR) API that registers a
data change listener on the /admin/reassign_partitions path and then creates
the /admin/reassign_partitions={} path in zookeeper. It waits until that path
is deleted from zookeeper. Once it is deleted, it checks whether AR == RAR; if
yes, it reports success, otherwise failure. A sketch of this flow appears after
this list.
1.3 Added a shutdown hook to handle command cancellation by the admin. In this
case, it checks if reassignment was completed or not and logs the output
accordingly.
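To make the flow in 1.2 concrete, here is a minimal sketch of the admin side.
The ZkStore trait and its methods are stand-ins I made up for illustration;
they are not the actual zookeeper client calls used in the patch.

    // Hypothetical sketch; ZkStore is a stand-in for the real zookeeper client.
    trait ZkStore {
      def createPersistent(path: String, data: String): Unit
      def exists(path: String): Boolean
      def readAssignedReplicas(topic: String, partition: Int): Seq[Int]
    }

    class ReassignPartitionCommand(zk: ZkStore) {
      val adminPath = "/admin/reassign_partitions"

      def reassignPartition(topic: String, partition: Int, rar: Seq[Int]): Boolean = {
        // Publish the requested reassignment under the admin path.
        zk.createPersistent(adminPath, s"$topic,$partition -> ${rar.mkString(",")}")
        // Block until the controller deletes the admin path (polling here as a
        // simple stand-in for the data change listener described above).
        while (zk.exists(adminPath)) Thread.sleep(1000)
        // The reassignment succeeded only if AR now equals RAR.
        zk.readAssignedReplicas(topic, partition) == rar
      }
    }

The shutdown hook in 1.3 would run the same final AR == RAR check to decide
whether to report completion or cancellation.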
2. Controller changes
Reassigning replicas for a partition goes through a few stages -
RAR = Reassigned replicas
AR = Original list of replicas for partition
1. Register listener for ISR changes to detect when the RAR is a subset of the
ISR
2. Start new replicas RAR - AR.
3. Wait until new replicas are in sync with the leader
4. If the leader is not in RAR, elect a new leader from RAR
5. Stop old replicas AR - RAR
6. Write the new AR (i.e., RAR) in zookeeper
7. Remove partition from the /admin/reassign_partitions path
The above state change steps are implemented inside the
onPartitionReassignment() callback in KafkaController.scala.
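To make the ordering concrete, here is a condensed sketch of how that callback
could be structured; the helper method names are illustrative, not the exact
ones in the patch.

    // Illustrative only: the helper methods are assumed, not the patch's API.
    class PartitionReassignment(topic: String, partition: Int,
                                ar: Set[Int], rar: Set[Int]) {

      def onPartitionReassignment(isr: Set[Int]): Unit = {
        if (!rar.subsetOf(isr)) {
          // Steps 1-3: start the new replicas and wait (via the ISR change
          // listener) until every reassigned replica has caught up with the leader.
          registerIsrChangeListener(topic, partition)
          startReplicas(rar -- ar)
        } else {
          // Steps 4-7: RAR is fully in the ISR, so finish the reassignment.
          if (!rar.contains(currentLeader(topic, partition)))
            electLeaderFrom(rar)                         // step 4
          stopReplicas(ar -- rar)                        // step 5
          writeAssignedReplicas(topic, partition, rar)   // step 6
          removeFromAdminPath(topic, partition)          // step 7
        }
      }

      // Stubs standing in for controller/zookeeper interactions.
      private def registerIsrChangeListener(t: String, p: Int): Unit = ()
      private def startReplicas(replicas: Set[Int]): Unit = ()
      private def currentLeader(t: String, p: Int): Int = -1
      private def electLeaderFrom(candidates: Set[Int]): Unit = ()
      private def stopReplicas(replicas: Set[Int]): Unit = ()
      private def writeAssignedReplicas(t: String, p: Int, r: Set[Int]): Unit = ()
      private def removeFromAdminPath(t: String, p: Int): Unit = ()
    }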
3. Partition Reassignment failure cases
Broadly there are 2 types of failures we need to worry about -
1. Controller failover
2. Runtime error at the broker hosting the replica
Let's go through the failure cases and recovery -
1. If the controller fails over between steps 1 and 2, the new controller on
startup will read the non-empty admin path and just restart the partition
reassignment process from scratch
2a. If the controller fails over between steps 2 and 3 above, the new
controller will check whether the new replicas are in sync with the leader.
In either case, it will resume partition reassignment for the partitions listed
in the admin path
2b. If, for some reason, the broker is not able to start the replicas, the ISR
listener for reassigned partitions will not trigger. So, the controller will
not resume the partition reassignment process for that partition. After some
time, the admin command can be killed; it will report failure and delete the
admin path so the reassignment can be retried.
3. If the controller fails over between steps 4 and 5, the new controller will
realize that the new replicas are already in sync. If the new leader is part of
the new replicas and is alive, it will not trigger leader re-election.
Otherwise, it will re-elect the leader from among the live reassigned replicas.
4a. If the controller fails over between steps 5 and 6, the new controller
resumes partition reassignment and repeats steps 4 onwards
4b. If, for some reason, the broker does not complete the leader state change,
the partition will be offline after reassignment. This is a problem we have
today even for leader election of newly created partitions: the controller
doesn't wait for an acknowledgement from the broker for the make-leader state
change. In any case, the broker can fail even after sending a successful ack,
so there isn't much value in waiting for one. However, I think the leader
broker should expose an mbean to signify the availability of a partition. If
people think this is a good idea, I can file a bug to fix this.
5. If the controller fails over between steps 6 and 7, it deletes the partition
from the admin path, marking the completion of this partition's reassignment.
The partition reassignment zookeeper listener should record a partition as
needing reassignment only if RAR is not equal to AR.
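The common thread across these cases is that every step is idempotent and is
driven entirely by the non-empty admin path, so a new controller can always
resume by re-reading that path on startup. A hypothetical sketch of that
resumption (types and helpers below are assumed for illustration, not the
patch's actual code):

    // Hypothetical sketch of controller failover handling.
    case class Reassignment(topic: String, partition: Int, rar: Set[Int])

    class ControllerStartup(readAdminPath: () => Seq[Reassignment],
                            isrOf: (String, Int) => Set[Int],
                            resumeReassignment: (Reassignment, Set[Int]) => Unit) {

      def onControllerFailover(): Unit = {
        // Any partition still listed under /admin/reassign_partitions has an
        // unfinished reassignment. Resume it regardless of where the previous
        // controller stopped: onPartitionReassignment decides, based on whether
        // RAR is already a subset of the ISR, whether to (re)start replicas or
        // to finish leader election, old-replica shutdown and cleanup.
        readAdminPath().foreach { r =>
          resumeReassignment(r, isrOf(r.topic, r.partition))
        }
      }
    }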
4. PartitionReassignedListener
Starts the partition reassignment process unless -
1. The partition does not exist
2. The new replicas are the same as the existing replicas
3. Any replica in the new set of replicas is dead
If any of the above conditions is satisfied, it logs an error and removes the
partition from the list of reassigned partitions, notifying the admin command
about the failure/completion.
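These precondition checks could be expressed roughly as follows; the function
and parameter names are illustrative, not the listener's actual code.

    // Illustrative precondition check for the reassignment listener.
    def shouldStartReassignment(partitionExists: Boolean,
                                ar: Set[Int],
                                rar: Set[Int],
                                liveBrokers: Set[Int]): Either[String, Unit] = {
      if (!partitionExists)
        Left("partition does not exist")
      else if (rar == ar)
        Left("reassigned replicas are the same as the existing replicas")
      else if (!rar.subsetOf(liveBrokers))
        Left(s"dead replicas in the reassigned replica list: ${rar -- liveBrokers}")
      else
        Right(())  // safe to start the reassignment
    }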
5. PartitionLeaderSelector
Added a self transition on the OnlinePartition state change. This is because,
with cluster expansion and preferred replica leader election features, we need
to move the leader for online partitions as well.
Added a partition leader selector module, since we have 3 different ways of
selecting the leader for a partition -
1. Offline leader selector - Pick an alive in sync replica as the leader.
Otherwise, pick an alive assigned replica
2. Reassigned partition leader selector - Pick one of the alive in-sync
reassigned replicas as the new leader
3. Preferred replica leader selector - Pick the preferred replica as the new
leader
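A sketch of what that selector abstraction could look like; the trait and
object names here are illustrative and may not match the patch exactly.

    // Illustrative selector hierarchy, one implementation per case above.
    trait PartitionLeaderSelector {
      // Returns the chosen leader, or None if no eligible replica is alive.
      def selectLeader(assigned: Seq[Int], isr: Seq[Int],
                       live: Set[Int]): Option[Int]
    }

    // 1. Offline: prefer an alive in-sync replica, else any alive assigned replica.
    object OfflinePartitionLeaderSelector extends PartitionLeaderSelector {
      def selectLeader(assigned: Seq[Int], isr: Seq[Int], live: Set[Int]) =
        isr.find(live.contains).orElse(assigned.find(live.contains))
    }

    // 2. Reassigned: pick one of the alive, in-sync reassigned replicas (RAR).
    class ReassignedPartitionLeaderSelector(rar: Seq[Int]) extends PartitionLeaderSelector {
      def selectLeader(assigned: Seq[Int], isr: Seq[Int], live: Set[Int]) =
        rar.find(r => isr.contains(r) && live.contains(r))
    }

    // 3. Preferred replica: the first assigned replica, assumed here to be
    //    eligible only if it is alive and in sync.
    object PreferredReplicaLeaderSelector extends PartitionLeaderSelector {
      def selectLeader(assigned: Seq[Int], isr: Seq[Int], live: Set[Int]) =
        assigned.headOption.filter(r => isr.contains(r) && live.contains(r))
    }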
6. Replica state machine changes
Added 2 new states to the replica state machine; the full set of replica
states is now -
1. NewReplica : The controller can create new replicas during partition
reassignment. In this state, a replica can only receive a become-follower state
change request. Valid previous state: NonExistentReplica
2. OnlineReplica : Once a replica is started and is part of the assigned
replicas for its partition, it is in this state. In this state, it can receive
either become-leader or become-follower state change requests. Valid previous
states: NewReplica, OnlineReplica or OfflineReplica
3. OfflineReplica : If a replica dies, it moves to this state. This happens
when the broker hosting the replica is down. Valid previous states: NewReplica,
OnlineReplica
4. NonExistentReplica : If a replica is deleted, it is moved to this state.
Valid previous state: OfflineReplica
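These states and their allowed transitions can be summarized roughly as below;
this is a sketch, and the patch's actual class names may differ.

    // Sketch of the replica states and their valid previous states.
    sealed trait ReplicaState { def validPreviousStates: Set[ReplicaState] }

    case object NewReplica extends ReplicaState {
      // Created by the controller during partition reassignment; may only
      // receive a become-follower state change request.
      def validPreviousStates: Set[ReplicaState] = Set(NonExistentReplica)
    }
    case object OnlineReplica extends ReplicaState {
      // Started and part of the assigned replicas; may receive become-leader
      // or become-follower state change requests.
      def validPreviousStates: Set[ReplicaState] =
        Set(NewReplica, OnlineReplica, OfflineReplica)
    }
    case object OfflineReplica extends ReplicaState {
      // The broker hosting the replica is down.
      def validPreviousStates: Set[ReplicaState] = Set(NewReplica, OnlineReplica)
    }
    case object NonExistentReplica extends ReplicaState {
      // The replica has been deleted.
      def validPreviousStates: Set[ReplicaState] = Set(OfflineReplica)
    }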
7. Added 6 unit test cases to test -
1. Partition reassignment with the leader of the partition in the new list of
replicas
2. Partition reassignment with the leader of the partition NOT in the new list
of replicas
3. Partition reassignment with the existing assigned replicas NOT overlapping
with the new list of replicas
4. Partition reassignment for a non-existent partition. This is a negative test
case
5. Partition reassignment for a partition that was completed up to step 6 by
the previous controller. This tests whether, after controller failover, the new
controller marks that partition's reassignment as completed.
6. Partition reassignment for a partition that was completed up to step 3 by
the previous controller. This tests whether, after controller failover, the new
controller handles leader re-election correctly and completes the rest of the
partition reassignment process.
> Support rebalancing the partitions with replication
> ---------------------------------------------------
>
> Key: KAFKA-42
> URL: https://issues.apache.org/jira/browse/KAFKA-42
> Project: Kafka
> Issue Type: Bug
> Components: core
> Reporter: Jun Rao
> Assignee: Neha Narkhede
> Priority: Blocker
> Labels: features
> Fix For: 0.8
>
> Attachments: kafka-42-v1.patch
>
> Original Estimate: 240h
> Remaining Estimate: 240h
>
> As new brokers are added, we need to support moving partition replicas from
> one set of brokers to another, online.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira