[ 
https://issues.apache.org/jira/browse/KAFKA-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede updated KAFKA-532:
--------------------------------

    Attachment: kafka-532-v5.patch

A few more changes in this patch -

1. Changed the leader and isr request to send the controller epoch that made 
the last change to the leader/isr of each partition. The broker uses this to 
update the leader and isr path with the correct controller epoch. Each 
Partition object on a Kafka server maintains the epoch of the controller that 
made the last leader/isr decision, and if/when the broker changes the isr, it 
records that epoch instead of the currently active controller's epoch (see the 
sketch after this list). Functionally, nothing bad happens if it uses the 
active controller's epoch (which is sent on every state change request), but 
semantically that would not be quite right: a previous controller may have 
made the leader/isr decisions for some partitions, while newer controllers 
have merely re-published those decisions upon controller failover.
2. Changed the become-controller procedure to resign as the current controller 
if it runs into any unexpected error/exception while making the state change to 
become controller. This ensures that the currently elected controller is 
actually serving as the controller.
3. LogRecoveryTest and LogTest occasionally fail, but I believe they fail on 
our nightly build as well. I didn't attempt to fix those tests in this patch.
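
To make 1. concrete, here is a minimal sketch, with illustrative names rather 
than the actual classes in the patch, of a Partition remembering the epoch of 
the controller that made the last leader/isr decision and reusing it when the 
broker itself shrinks or expands the isr:

{code}
// Minimal sketch (illustrative names, not the actual patch). Each partition
// remembers the epoch of the controller that made the last leader/isr
// decision, and reuses it when the broker itself changes the isr.
case class LeaderIsrAndControllerEpoch(leader: Int, isr: List[Int], controllerEpoch: Int)

class Partition(val topic: String, val partitionId: Int) {
  // epoch of the controller that last changed the leader or isr for this partition
  @volatile private var controllerEpoch: Int = -1

  def makeLeaderOrFollower(state: LeaderIsrAndControllerEpoch): Unit = {
    controllerEpoch = state.controllerEpoch
    // ... apply the leader/isr decision itself ...
  }

  def updateIsr(newIsr: List[Int]): Unit = {
    // Use the epoch recorded above, not the epoch of whichever controller
    // happens to be active right now, when writing the leader/isr path.
    writeLeaderAndIsrPath(topic, partitionId, newIsr, controllerEpoch)
  }

  // stand-in for the conditional zookeeper write of the leader/isr path
  private def writeLeaderAndIsrPath(topic: String, partition: Int,
                                    isr: List[Int], epoch: Int): Unit = ()
}
{code}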

Regarding Jun's review -

>> 40. PartitionStateInfo: It seems that we need to send the controllerEpoch 
>> associated with this partition. Note that this epoch is different from the 
>> controllerEpoch in LeaderAndIsrRequest. The former is the epoch of the 
>> controller that last changed the leader or isr and will be used when the 
>> broker updates the isr. The latter is the epoch of the controller that sends 
>> the request and will be used in ReplicaManager to decide which controller's 
>> decision to follow. We will need to change the controllerEpoch passed to 
>> makeLeader and makeFollower in ReplicaManager accordingly.

You raise a good point here. What I missed is initializing the controller 
epoch for each partition. There are two ways to initialize it: (1) a zookeeper 
read on startup, or (2) the active controller sending the epoch of the 
controller that last made a leader/isr decision for that partition. I'm 
guessing (2) is better from a performance perspective, since it avoids an 
extra zookeeper read per partition on broker startup.
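
For reference, a rough sketch of what option (2) could look like, with 
illustrative field names rather than the actual 0.8 wire format:

{code}
// The per-partition state carries the epoch of the controller that last
// changed the leader/isr, while the request itself carries the epoch of the
// controller sending it (field names illustrative).
case class PartitionStateInfo(leader: Int,
                              isr: Seq[Int],
                              controllerEpoch: Int)   // last leader/isr decision maker

case class LeaderAndIsrRequest(controllerEpoch: Int,  // controller sending this request
                               partitionStates: Map[(String, Int), PartitionStateInfo])

// On the broker, the request-level epoch decides whether to obey the request
// at all, while the per-partition epoch initializes each Partition's
// controllerEpoch.
{code}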

>> 41. ReplicaManager: In stopReplicas() and becomeLeaderOrFollower(), it would 
>> be better to only update controllerEpoch when it's truly necessary, i.e., 
>> when the new controllerEpoch is larger than the cached one (not equal). This 
>> is because updating a volatile variable is a bit more expensive than updating 
>> a local variable, since the update has to be made visible to other threads.

Not sure this is a performance win. Volatile variables are never cached in 
registers, so a read has to reload the value from memory and a write has to be 
flushed back to memory. The if statement would need to read the same volatile 
variable, so it goes to memory anyway. :)
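
To illustrate the point, a small sketch of the two variants being discussed 
(field and method names are only illustrative):

{code}
class ReplicaManager {
  @volatile var controllerEpoch: Int = 0

  // Unconditional update: the staleness check already does a volatile read,
  // followed by a volatile write.
  def updateEpochUnconditionally(requestEpoch: Int): Unit = {
    if (requestEpoch >= controllerEpoch)   // reject requests from a stale controller
      controllerEpoch = requestEpoch
  }

  // Guarded update: the guard itself needs the same volatile read, so the only
  // thing saved is the write in the equal-epoch case.
  def updateEpochIfLarger(requestEpoch: Int): Unit = {
    if (requestEpoch > controllerEpoch)
      controllerEpoch = requestEpoch
  }
}
{code}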

>> 42.1. incrementControllerEpoch(): If the controllerEpoch path doesn't exist, 
>> we create the path with the initial epoch value without using a conditional 
>> update. It is possible for 2 controllers to execute this logic simultaneously 
>> and both get the initial epoch value. One solution is to make sure the 
>> controller epoch path exists during context initialization. Then we can 
>> always use a conditional update here.

It is not possible for 2 clients to create the same zookeeper path; that is the 
simplest guarantee zookeeper provides. One of the creates will fail and that 
controller will abort its controller startup procedure. The larger issue is not 
so much that one of the writes should fail, but that we need to ensure that if 
the failed zk operation happens to belong to the latest active controller, it 
will abort its controller startup procedure, while the old controller will lose 
its zookeeper session anyway.
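
For what it's worth, a rough sketch of the create-then-conditional-update flow 
using the zkclient API Kafka depends on (the path, initial value, and error 
handling here are assumptions, not necessarily what the patch does):

{code}
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.exception.{ZkBadVersionException, ZkNodeExistsException}

object ControllerEpochSketch {
  val ControllerEpochPath = "/controller_epoch"            // assumed path

  // Returns the new (epoch, zkVersion). Only one of two concurrent creates of
  // the same path can succeed; the loser must go through the conditional
  // update, and if that also fails it aborts its controller startup.
  def incrementControllerEpoch(zkClient: ZkClient,
                               cachedEpoch: Int,
                               cachedZkVersion: Int): (Int, Int) = {
    try {
      zkClient.createPersistent(ControllerEpochPath, "1")
      (1, 0)                                               // fresh znode starts at version 0
    } catch {
      case _: ZkNodeExistsException =>
        try {
          val newEpoch = cachedEpoch + 1
          // conditional update: fails if the znode version has moved on
          zkClient.writeData(ControllerEpochPath, newEpoch.toString, cachedZkVersion)
          (newEpoch, cachedZkVersion + 1)
        } catch {
          case _: ZkBadVersionException =>
            throw new IllegalStateException(
              "Another controller moved the epoch forward; aborting controller startup")
        }
    }
  }
}
{code}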

>> 42.2. ControllerContext: We need to initialize controllerEpoch by reading 
>> from ZK. We also need to make sure that we subscribe to the controllerEpoch 
>> path first and then read its value from ZK for initialization.

The controller constructor is modified to initialize the controller epoch and 
its zk version by reading from zk, and it then subscribes to the controller 
epoch's zk path.
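
A minimal sketch of that initialization, assuming the epoch path already exists 
and using illustrative names:

{code}
import org.I0Itec.zkclient.{IZkDataListener, ZkClient}
import org.apache.zookeeper.data.Stat

class ControllerContextSketch(zkClient: ZkClient,
                              epochPath: String,
                              listener: IZkDataListener) {
  @volatile var epoch: Int = 0
  @volatile var epochZkVersion: Int = 0

  // Constructor body: read the epoch and its znode version in a single call,
  // then register the data-change listener on the same path.
  private val stat = new Stat()
  epoch = zkClient.readData[String](epochPath, stat).toInt
  epochZkVersion = stat.getVersion
  zkClient.subscribeDataChanges(epochPath, listener)
}
{code}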

>> 42.3. ControllerEpochListener: It's safer to set both the epoch and the ZK 
>> version using the value from ZkUtils.readData.

You're right, though there is no perfect solution to this. Ideally, the 
zkclient API would expose the version, since the underlying zookeeper API does. 
The problem is that there will always be a window, after the listener has fired 
and before the read returns, in which the controller's epoch could change 
again. That change fires another listener, but the same window exists during 
each listener invocation. The right fix would be to rely on the data the 
listener itself hands to the controller. Still, with this change, at least the 
epoch and its version come from the same read and correspond to the same epoch 
change, so it is an improvement.
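
For illustration, a sketch of a listener doing that re-read with a Stat so the 
epoch and its version at least come from the same read (names are illustrative, 
not the actual ControllerEpochListener):

{code}
import org.I0Itec.zkclient.{IZkDataListener, ZkClient}
import org.apache.zookeeper.data.Stat

class ControllerEpochListenerSketch(zkClient: ZkClient,
                                    update: (Int, Int) => Unit) extends IZkDataListener {

  // zkclient only hands us the path and the new data, not the znode version,
  // so re-read the path with a Stat and take both values from that one read.
  override def handleDataChange(dataPath: String, data: Object): Unit = {
    val stat = new Stat()
    val value = zkClient.readData[String](dataPath, stat)
    update(value.toInt, stat.getVersion)
  }

  override def handleDataDeleted(dataPath: String): Unit = ()
}
{code}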

                
> Multiple controllers can co-exist during soft failures
> ------------------------------------------------------
>
>                 Key: KAFKA-532
>                 URL: https://issues.apache.org/jira/browse/KAFKA-532
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>            Priority: Blocker
>              Labels: bugs
>         Attachments: kafka-532-v1.patch, kafka-532-v2.patch, 
> kafka-532-v3.patch, kafka-532-v4.patch, kafka-532-v5.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> If the current controller experiences an intermittent soft failure (GC pause) 
> in the middle of leader election or partition reassignment, a new controller 
> might get elected and start communicating new state change decisions to the 
> brokers. After recovering from the soft failure, the old controller might 
> continue sending some stale state change decisions to the brokers, resulting 
> in unexpected failures. We need to introduce a controller generation id that 
> increments with controller election. The brokers should reject any state 
> change requests by a controller with an older generation id.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
