[ https://issues.apache.org/jira/browse/KAFKA-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272070#comment-15272070 ]
Flavio Junqueira commented on KAFKA-3173:
-----------------------------------------
[~junrao] True, {{onControllerFailover}} has the lock when it runs, and it is
the only place where we call {{PartitionStateMachine.startup()}}. The confusing
part is that the lock is acquired a few hops up the call path, but it does look
like the additional lock isn't necessary. I'm also wondering whether we even
need that controller lock at all: all the zk events are processed on the
ZkClient event thread, and there is only one such thread. The runs I was trying
to put together had concurrent zk events being triggered, which was causing the
potential problems I raised above. If there is any chance of internal threads
other than the ZkClient event thread racing, then the lock is needed; otherwise
it isn't.
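To make the locking pattern concrete, here is a minimal sketch of how the zk
listener ends up holding the lock a few hops above the state machine. The names
are simplified stand-ins, not the actual controller code:

{code:scala}
import java.util.concurrent.locks.ReentrantLock

object ControllerLockSketch {
  // Analogous to kafka.utils.CoreUtils.inLock: run the body while holding the lock.
  def inLock[T](lock: ReentrantLock)(body: => T): T = {
    lock.lock()
    try body finally lock.unlock()
  }

  val controllerLock = new ReentrantLock()

  // The topic-change callback runs on the single ZkClient event thread and takes
  // the controller lock before it ever reaches handleStateChanges, so the state
  // machine already runs with the lock held. The lock only buys us something if
  // threads other than the ZkClient event thread can race here.
  def handleChildChange(children: Seq[String]): Unit =
    inLock(controllerLock) {
      // ... figure out new topics/partitions and drive the state machine
    }
}
{code}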
I don't think we need the change I proposed, so I'll go ahead and close the PR,
but we can't resolve this issue until we determine the cases in which we can
get a dirty batch, which prevents the controller from sending further requests.
We need more info on this. One possibility, given what I've seen in other logs,
is simply a transient error while sending a message to a broker in
{{ControllerBrokerRequestBatch.sendRequestsToBrokers}}, but we are currently
not logging the exception. I was hoping that the originator of the call would
log it, but that isn't happening. Perhaps one thing we can do for the upcoming
release is to log the exception in case we observe the problem again.
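For reference, this is roughly the kind of logging I have in mind; the send
function and request representation below are hypothetical stand-ins, not the
actual {{ControllerBrokerRequestBatch}} internals:

{code:scala}
import org.slf4j.LoggerFactory

object BatchSendLoggingSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // Hypothetical shape: one serialized request per broker id.
  def sendRequestsToBrokers(requests: Map[Int, Array[Byte]],
                            send: (Int, Array[Byte]) => Unit): Unit = {
    try {
      requests.foreach { case (brokerId, request) => send(brokerId, request) }
    } catch {
      case e: Throwable =>
        // Log before rethrowing so a transient send failure shows up in the controller
        // log instead of only surfacing later as "batch is not empty while creating a
        // new one".
        log.error("Error while sending controller requests; the batch may be left dirty", e)
        throw e
    }
  }
}
{code}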
> Error while moving some partitions to OnlinePartition state
> ------------------------------------------------------------
>
> Key: KAFKA-3173
> URL: https://issues.apache.org/jira/browse/KAFKA-3173
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.9.0.0
> Reporter: Flavio Junqueira
> Assignee: Flavio Junqueira
> Priority: Critical
> Fix For: 0.10.0.0
>
> Attachments: KAFKA-3173-race-repro.patch
>
>
> We observed another instance of the problem reported in KAFKA-2300, but this
> time the error appeared in the partition state machine. In KAFKA-2300, we did
> not clean up the state in {{PartitionStateMachine}} and {{ReplicaStateMachine}}
> the way we do in {{KafkaController}}.
> Here is the stack trace:
> {noformat}
> [2016-01-29 15:26:51,393] ERROR [Partition state machine on Controller 0]: Error while moving some partitions to OnlinePartition state (kafka.controller.PartitionStateMachine)
> java.lang.IllegalStateException: Controller to broker state change requests batch is not empty while creating a new one. Some LeaderAndIsr state changes Map(0 -> Map(foo-0 -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:0))) might be lost
>         at kafka.controller.ControllerBrokerRequestBatch.newBatch(ControllerChannelManager.scala:254)
>         at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:144)
>         at kafka.controller.KafkaController.onNewPartitionCreation(KafkaController.scala:517)
>         at kafka.controller.KafkaController.onNewTopicCreation(KafkaController.scala:504)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(PartitionStateMachine.scala:437)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener$$anonfun$handleChildChange$1.apply(PartitionStateMachine.scala:419)
>         at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
>         at kafka.controller.PartitionStateMachine$TopicChangeListener.handleChildChange(PartitionStateMachine.scala:418)
>         at org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> {noformat}
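The invariant that fires in the trace above is that {{newBatch}} refuses to
start while the previous batch still holds unsent state changes. A minimal
sketch of that check, simplified rather than the actual
{{ControllerBrokerRequestBatch}} code:

{code:scala}
import scala.collection.mutable

class RequestBatchSketch {
  // Hypothetical shape: pending LeaderAndIsr state changes keyed by broker id.
  private val leaderAndIsrRequestMap = mutable.Map.empty[Int, String]

  def newBatch(): Unit = {
    // If an earlier send failed and left entries behind, every later batch creation
    // fails here with the IllegalStateException seen in the stack trace.
    if (leaderAndIsrRequestMap.nonEmpty)
      throw new IllegalStateException(
        "Controller to broker state change requests batch is not empty while creating " +
          s"a new one. Some LeaderAndIsr state changes $leaderAndIsrRequestMap might be lost")
  }
}
{code}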