[ 
https://issues.apache.org/jira/browse/KAFKA-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4442:
----------------------------
    Description: 
Currently the controller registers the broker change listener before sending 
LeaderAndIsrRequests to live replicas. The call path looks like this:

- onControllerFailover()
  - partitionStateMachine.startup()
    - triggerOnlinePartitionStateChange()
      - handleStateChange(partition, OnlinePartition)
        - electLeaderForPartition(partition)
          - determines live replicas for this partition (step a)
          - add partition to controllerContext.partitionLeadershipInfo. (step b)
          - send LeaderAndIsrRequest to those live replicas for this partition

However, if a broker registers itself in ZooKeeper between step (a) and step 
(b), onBrokerStartup() will not send a LeaderAndIsrRequest to this broker for 
this partition, because the partition is not yet in 
controllerContext.partitionLeadershipInfo. onControllerFailover() will not 
send a LeaderAndIsrRequest to this broker for this partition either, because 
the broker was not considered live in step (a).
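
The interleaving described above can be replayed in a single thread. This is 
only a minimal sketch: the class MissedBrokerDemo, its fields and the 
partition name "topic-0" are hypothetical stand-ins, not Kafka's actual 
controller code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the controller state involved in the race.
final class MissedBrokerDemo {
    private final Set<Integer> liveBrokers = new TreeSet<>(List.of(1, 2));
    private final Set<String> partitionLeadershipInfo = new HashSet<>();
    private final Map<Integer, List<String>> sentRequests = new TreeMap<>();

    private void sendLeaderAndIsr(int broker, String partition) {
        sentRequests.computeIfAbsent(broker, b -> new ArrayList<>()).add(partition);
    }

    // Broker change callback: it only covers partitions that are already tracked.
    private void onBrokerStartup(int broker) {
        liveBrokers.add(broker);
        for (String partition : partitionLeadershipInfo) {
            sendLeaderAndIsr(broker, partition);
        }
    }

    // Replays the problematic ordering and returns the brokers that got a request.
    static Set<Integer> simulate() {
        MissedBrokerDemo controller = new MissedBrokerDemo();
        String partition = "topic-0";

        // step (a): snapshot the live replicas for the partition
        Set<Integer> liveReplicas = new TreeSet<>(controller.liveBrokers);

        // broker 3 registers between step (a) and step (b)
        controller.onBrokerStartup(3);

        // step (b): record the partition, then notify the snapshot from step (a)
        controller.partitionLeadershipInfo.add(partition);
        for (int broker : liveReplicas) {
            controller.sendLeaderAndIsr(broker, partition);
        }
        return new TreeSet<>(controller.sentRequests.keySet());
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // prints [1, 2]; broker 3 was skipped
    }
}
```

Broker 3 is live by the time the requests go out, yet neither code path 
notifies it about the partition.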

The root cause is that onBrokerStartup() should only be executed after the 
controller has finished onControllerFailover() and initialized its state. 
Therefore the controller should hold the lock controllerContext.controllerLock 
for the duration of onControllerFailover().
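
A rough sketch of the proposed fix, under the assumption that the broker 
change callback acquires the same lock before touching controller state. The 
class LockedFailoverDemo and the betweenSteps hook are hypothetical and exist 
only to inject a broker registration between step (a) and step (b); this is 
not the actual patch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model: failover and the broker callback share one lock.
final class LockedFailoverDemo {
    private final ReentrantLock controllerLock = new ReentrantLock();
    private final Set<Integer> liveBrokers = new TreeSet<>(List.of(1, 2));
    private final Set<String> partitionLeadershipInfo = new HashSet<>();
    private final Map<Integer, List<String>> sentRequests = new TreeMap<>();

    private void sendLeaderAndIsr(int broker, String partition) {
        sentRequests.computeIfAbsent(broker, b -> new ArrayList<>()).add(partition);
    }

    // Assumed to take the controller lock, so it waits for failover to finish.
    void onBrokerStartup(int broker) {
        controllerLock.lock();
        try {
            liveBrokers.add(broker);
            for (String partition : partitionLeadershipInfo) {
                sendLeaderAndIsr(broker, partition);
            }
        } finally {
            controllerLock.unlock();
        }
    }

    // The whole initialization runs under the lock, as the issue proposes.
    void onControllerFailover(String partition, Runnable betweenSteps) {
        controllerLock.lock();
        try {
            Set<Integer> liveReplicas = new TreeSet<>(liveBrokers); // step (a)
            betweenSteps.run(); // test hook: a broker registers here
            partitionLeadershipInfo.add(partition);                 // step (b)
            for (int broker : liveReplicas) {
                sendLeaderAndIsr(broker, partition);
            }
        } finally {
            controllerLock.unlock();
        }
    }

    static Set<Integer> run() {
        LockedFailoverDemo controller = new LockedFailoverDemo();
        Thread[] listener = new Thread[1];
        controller.onControllerFailover("topic-0", () -> {
            // The listener thread cannot enter onBrokerStartup() until failover unlocks.
            listener[0] = new Thread(() -> controller.onBrokerStartup(3));
            listener[0].start();
        });
        try {
            listener[0].join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new TreeSet<>(controller.sentRequests.keySet());
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [1, 2, 3]
    }
}
```

Because the callback can only run after step (b) has completed, it finds the 
partition in partitionLeadershipInfo and sends the request to broker 3.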




  was:
Currently controller will register broker change listener before sending send 
LeaderAndIsrRequests to live replicas. The call path looks like this:

- onControllerFailover()
  - partitionStateMachine.startup()
    - triggerOnlinePartitionStateChange()
      - handleStateChange(partition, OnlinePartition)
        - electLeaderForPartition(partition)
          - determines live replicas for this partition (step a)
          - add partition to controllerContext.partitionLeadershipInfo. (step b)
          - send LeaderAndIsrRequest to those live replics for this partition

However, if a broker registers itself in zookeeper in between step (a) and step 
(b), the onBrokerStartup() will not send LeaderAndIsrRequest to this broker for 
this partition because the partition is not found in 
controllerContext.partitionLeadershipInfo. Yet onControllerFailover() will not 
send LeaderAndIsrRequest to this broker for this partition either before the 
broker is not considered live in step (a).

The root cause is that onBrokerStartup() should only be executed after 
controller has finished onControllerFailover() and initialized its state. 
Therefore controller should grab the lock controllerContext.controllerLock 
during onControllerFailover().





> Controller should grab lock when it is being initialized to avoid race 
> condition
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-4442
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4442
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Dong Lin
>            Assignee: Dong Lin
>
> Currently the controller registers the broker change listener before sending 
> LeaderAndIsrRequests to live replicas. The call path looks like this:
> - onControllerFailover()
>   - partitionStateMachine.startup()
>     - triggerOnlinePartitionStateChange()
>       - handleStateChange(partition, OnlinePartition)
>         - electLeaderForPartition(partition)
>           - determines live replicas for this partition (step a)
>           - add partition to controllerContext.partitionLeadershipInfo. (step b)
>           - send LeaderAndIsrRequest to those live replicas for this partition
> However, if a broker registers itself in zookeeper in between step (a) and 
> step (b), the onBrokerStartup() will not send LeaderAndIsrRequest to this 
> broker for this partition because the partition is not found in 
> controllerContext.partitionLeadershipInfo. Yet onControllerFailover() will 
> not send LeaderAndIsrRequest to this broker for this partition either because 
> the broker is not considered live in step (a).
> The root cause is that onBrokerStartup() should only be executed after 
> controller has finished onControllerFailover() and initialized its state. 
> Therefore controller should grab the lock controllerContext.controllerLock 
> during onControllerFailover().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
