[
https://issues.apache.org/jira/browse/KAFKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840449#comment-13840449
]
Guozhang Wang commented on KAFKA-1134:
--------------------------------------
After checking the stack trace again, I now think the problem is the following:
1) In KafkaController.handleNewSession:
    controllerContext.controllerLock synchronized {
      Utils.unregisterMBean(KafkaController.MBeanName)
      partitionStateMachine.shutdown()
      replicaStateMachine.shutdown()
      if(controllerContext.controllerChannelManager != null) {
        controllerContext.controllerChannelManager.shutdown()
        controllerContext.controllerChannelManager = null
      }
      controllerElector.elect
    }
The elect function is called directly after controllerChannelManager.shutdown,
and both are covered by controllerContext.controllerLock. However, the logs
show that elect is not called immediately, since the AddPartitionsListener gets
triggered by the ZK session expiration (a known issue, similar to KAFKA-1143),
and its callbacks are covered by the same lock:
2013/11/14 00:00:24.596 [RequestSendThread]
[Controller-583-to-broker-587-send-thread], Stopped
2013/11/14 00:00:24.596 [RequestSendThread]
[Controller-583-to-broker-587-send-thread], Shutdown completed
2013/11/14 00:00:24.596 [RequestSendThread]
[Controller-583-to-broker-579-send-thread], Shutting down
2013/11/14 00:00:24.596 [RequestSendThread]
[Controller-583-to-broker-579-send-thread], Stopped
2013/11/14 00:00:24.596 [RequestSendThread]
[Controller-583-to-broker-579-send-thread], Shutdown completed
2013/11/14 00:00:24.603 [ReplicaStateMachine$BrokerChangeListener]
[BrokerChangeListener on Controller 583]: Broker change listener fired for path
/brokers/ids with children 583,575,585,587,579,589
2013/11/14 00:00:24.605 [ReplicaStateMachine$BrokerChangeListener]
[BrokerChangeListener on Controller 583]: Broker change listener fired for path
/brokers/ids with children 583,575,585,587,579,589
2013/11/14 00:00:24.614 [PartitionStateMachine$AddPartitionsListener]
[AddPartitionsListener on 583]: Add Partition triggered { "partitions":{ "0":[
577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }, "version":1
} for path /brokers/topics/databus2-relay-log_event
2013/11/14 00:00:24.616 [PartitionStateMachine$AddPartitionsListener]
[AddPartitionsListener on 583]: New partitions to be added [Map()]
2013/11/14 00:00:24.616 [KafkaController] [Controller 583]: New partition
creation callback for
2013/11/14 00:00:24.618 [PartitionStateMachine$AddPartitionsListener]
[AddPartitionsListener on 583]: Add Partition triggered { "partitions":{ "0":[
577, 589 ], "1":[ 579, 575 ], "2":[ 581, 577 ], "3":[ 583, 579 ] }, "version":1
} for path /brokers/topics/databus2-relay-log_event
----------------
Without additional logging I cannot deduce anything further, so I propose that
in this JIRA we just improve the logging, so that the issue is easier to debug
if it comes up again.
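A minimal sketch of what such logging could look like, assuming the goal is to
see when each caller starts waiting for, acquires, and releases
controllerContext.controllerLock. The LockLogging object and the
loggedSynchronized helper are hypothetical names used for illustration only;
they are not existing Kafka code and not the contents of the attached patch.
Events are collected in memory here, whereas a broker would write them to its
normal logger.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical helper, not part of Kafka: runs `body` under `lock` and
// records when the caller waits for, acquires, and releases the lock.
object LockLogging {
  // In-memory event list for this sketch; a real broker would use its logger.
  val events: ListBuffer[String] = ListBuffer.empty[String]

  private def log(msg: String): Unit = events.synchronized {
    events += s"[${Thread.currentThread().getName}] $msg"
  }

  def loggedSynchronized[T](lock: AnyRef, caller: String)(body: => T): T = {
    log(s"$caller waiting for controller lock")
    lock.synchronized {
      log(s"$caller acquired controller lock")
      // The release message is logged even if `body` throws.
      try body
      finally log(s"$caller releasing controller lock")
    }
  }
}
```

With something along these lines wrapped around handleNewSession and the
listener callbacks, the thread name and the waiting/acquired/releasing messages
would show in which order the callers actually obtained the lock.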
> onControllerFailover function should be synchronized with other functions
> -------------------------------------------------------------------------
>
> Key: KAFKA-1134
> URL: https://issues.apache.org/jira/browse/KAFKA-1134
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.0, 0.8.1
> Reporter: Guozhang Wang
> Attachments: KAFKA-1134.patch, KAFKA-1134_2013-12-05_11:13:33.patch
>
>
> Otherwise race conditions could happen. For example, handleNewSession will
> close all sockets with brokers while the handleStateChange in
> onControllerFailover tries to send requests to them.