[ 
https://issues.apache.org/jira/browse/KAFKA-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102589#comment-15102589
 ] 

Jun Rao commented on KAFKA-3083:
--------------------------------

[~fpj], thanks for the clarification. If broker A shrinks the partition ISR in 
ZK before step 2, then broker A's ZK session expires, then, broker sends the 
shrunk ISR to broker C, two things can happen. (1) C has already received 
requests from the new controller B. In this case, A's request will be rejected. 
However, since the new controller B is re-elected after broker A shrinks the 
ISR in ZK and the new controller read the latest ISR from ZK on initialization, 
B will send the latest ISR to broker C. (2) C hasn't received any request from 
the new controller B. In this case, A's request will be accepted. The new 
controller B will later send the same ISR to broker B, but that's fine. So, in 
either case, we are covered.

The problem in the description is really caused by broker A changing ZK after 
its session expires. So, it seems the fix would be the following. If the 
controller (say A) hits a ZK ConnectionLoss event while reading/writing to ZK, 
it will pause the operation. Two possibilities can follow. In the case when the 
controller A's ZK session expires, it will just ignore all the outstanding ZK 
events. This guarantees that controller A can't touch ZK any more after a new 
controller is elected (which has to happen after controller A's 
SessionExpiration event). So, the new controller is guaranteed to read the 
latest ZK data, act on this, and send the latest info to the broker. This would 
avoid the issue in the description. 

In the second case, controller A will get a SyncConnected event. In this case, 
does controller A just resume from where it's left off? Or does it ignore all 
outstanding events and re-read all subscribed ZK paths (since there could be 
missing events between the connection loss event and the SyncConnected event)?

Finally, ZkClient actually hides the ZK ConnectionLoss event and only informs 
the application when the ZK session expires. To pursue this, we will have to 
access ZK directly.

> a soft failure in controller may leader a topic partition in an inconsistent 
> state
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-3083
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3083
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0
>            Reporter: Jun Rao
>            Assignee: Mayuresh Gharat
>
> The following sequence can happen.
> 1. Broker A is the controller and is in the middle of processing a broker 
> change event. As part of this process, let's say it's about to shrink the isr 
> of a partition.
> 2. Then broker A's session expires and broker B takes over as the new 
> controller. Broker B sends the initial leaderAndIsr request to all brokers.
> 3. Broker A continues by shrinking the isr of the partition in ZK and sends 
> the new leaderAndIsr request to the broker (say C) that leads the partition. 
> Broker C will reject this leaderAndIsr since the request comes from a 
> controller with an older epoch. Now we could be in a situation that Broker C 
> thinks the isr has all replicas, but the isr stored in ZK is different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to