[jira] [Commented] (KAFKA-1310) Zookeeper timeout causes deadlock in Controller
[ https://issues.apache.org/jira/browse/KAFKA-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15407483#comment-15407483 ]

.D. commented on KAFKA-1310:

Excuse me, how can this error be resolved when it appears? What is the underlying cause? Could anyone take a look?

> Zookeeper timeout causes deadlock in Controller
> ---
>
> Key: KAFKA-1310
> URL: https://issues.apache.org/jira/browse/KAFKA-1310
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.1
> Reporter: Fedor Korotkiy
> Assignee: Neha Narkhede
> Priority: Blocker
> Fix For: 0.8.1.1
>
>
> Steps to reproduce:
> 1. Check out and build the 0.8.1 branch from GitHub:
> git clone g...@github.com:apache/kafka.git && cd kafka && git checkout origin/0.8.1 && ./gradlew jar
> 2. Start the ZooKeeper server:
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> 3. Start the Kafka server:
> ./bin/kafka-server-start.sh config/server.properties
> 4. Suspend the ZooKeeper process for 10 seconds (ctrl-Z, then %1).
> 5. Kafka has not re-registered itself in ZooKeeper:
> ./bin/zookeeper-shell.sh
> ls /brokers/ids
> >> []
> The root cause of the problem seems to be a deadlock between DeleteTopicsThread
> and SessionExpirationListener in KafkaController.
> 1. DeleteTopicsThread acquires controllerLock and await()-s on
> deleteTopicsCond in awaitTopicDeletionNotification().
> 2. SessionExpirationListener fires. It acquires controllerLock and tries to
> shut down deleteTopicManager (in onControllerResignation()). This interrupts
> DeleteTopicsThread.
> 3. DeleteTopicsThread can't return from deleteTopicsCond.await() because
> controllerLock is taken (await() must re-acquire the lock before it returns,
> even when interrupted). The result is a deadlock.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
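The three-step root cause in the report can be illustrated with a minimal Java sketch. This is not Kafka's code: the class name ControllerDeadlockSketch, the method lockHolderSeesStuckThread, the latch, and the 500 ms bounded join are assumptions made for illustration; only the names controllerLock and deleteTopicsCond are taken from the report. The bounded join stands in for the shutdown wait so that the sketch terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class ControllerDeadlockSketch {
    // Stand-ins for the lock and condition named in the report.
    private static final ReentrantLock controllerLock = new ReentrantLock();
    private static final Condition deleteTopicsCond = controllerLock.newCondition();

    /**
     * Returns true if the interrupted waiter was still alive while the
     * interrupting thread held the lock and waited for it to stop.
     */
    static boolean lockHolderSeesStuckThread() {
        CountDownLatch holdsLock = new CountDownLatch(1);

        // Step 1: take the lock, then park in await() (which releases the lock).
        Thread deleteTopicsThread = new Thread(() -> {
            controllerLock.lock();
            try {
                holdsLock.countDown();
                deleteTopicsCond.await();
            } catch (InterruptedException e) {
                // Reached only after await() has re-acquired controllerLock.
            } finally {
                controllerLock.unlock();
            }
        }, "delete-topics-thread");
        deleteTopicsThread.start();

        try {
            holdsLock.await();
            boolean stuck;
            // Step 2: this lock() succeeds only once the waiter is parked in await().
            controllerLock.lock();
            try {
                deleteTopicsThread.interrupt();
                // Assumption: a bounded join replaces an unbounded shutdown wait.
                deleteTopicsThread.join(500);
                // Step 3: the waiter cannot leave await() without the lock.
                stuck = deleteTopicsThread.isAlive();
            } finally {
                controllerLock.unlock();
            }
            // With the lock released, the waiter can re-acquire it and exit.
            deleteTopicsThread.join();
            return stuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("sketch was interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter stuck while lock was held: " + lockHolderSeesStuckThread());
    }
}
```

If the join inside the locked section were unbounded, neither thread could make progress, which matches the behaviour described in step 3 of the report. Waiting for the other thread only after releasing the lock is one general way to avoid this pattern; whether KAFKA-1317 took that approach is not stated in this thread.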
[jira] [Commented] (KAFKA-1310) Zookeeper timeout causes deadlock in Controller
[ https://issues.apache.org/jira/browse/KAFKA-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964521#comment-13964521 ]

Joel Koshy commented on KAFKA-1310:

Fixed by KAFKA-1317
[jira] [Commented] (KAFKA-1310) Zookeeper timeout causes deadlock in Controller
[ https://issues.apache.org/jira/browse/KAFKA-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951474#comment-13951474 ]

Timothy Chen commented on KAFKA-1310:

I tried the repro scenario described in the issue on the latest 0.8.1 branch. With the latest commit, 39a5607, the broker successfully re-registers itself after ZooKeeper has been paused for 10 seconds.
[jira] [Commented] (KAFKA-1310) Zookeeper timeout causes deadlock in Controller
[ https://issues.apache.org/jira/browse/KAFKA-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951678#comment-13951678 ]

Neha Narkhede commented on KAFKA-1310:

Very cool. Thanks for verifying that [~tnachen]!
[jira] [Commented] (KAFKA-1310) Zookeeper timeout causes deadlock in Controller
[ https://issues.apache.org/jira/browse/KAFKA-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942939#comment-13942939 ]

Michael Noll commented on KAFKA-1310:

I can confirm this issue, using Kafka 0.8.1.

Here are the error messages when trying to create a topic:

{code}
$ bin/kafka-topics.sh --create --zookeeper zookeeper1:2181 --topic testing --partitions 1 --replication-factor 1
Error while executing topic command org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
        at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:413)
        at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:409)
        at kafka.utils.ZkUtils$.getChildren(ZkUtils.scala:480)
        at kafka.utils.ZkUtils$.getSortedBrokerList(ZkUtils.scala:81)
        at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:154)
        at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:88)
        at kafka.admin.TopicCommand$.main(TopicCommand.scala:50)
        at kafka.admin.TopicCommand.main(TopicCommand.scala)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1249)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1277)
        at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:99)
        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:416)
        at org.I0Itec.zkclient.ZkClient$2.call(ZkClient.java:413)
        at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
        ... 8 more
{code}

If you use the ZK CLI you will sometimes see a znode under {{/brokers/ids}}, sometimes not.

In my limited testing I could, for instance, create a topic (partitions=1, replicas=1) and then list/describe it. But once I tried to send messages to it, it would fail. See the next example.

When trying to use the console producer to send a test message foo (String) to the topic/broker:

{code}
$ bin/kafka-console-producer.sh --topic testing --broker-list localhost:9092
foo    # This is the test message, manually entered in the console/terminal
[2014-03-20 09:45:32,223] WARN Error while fetching metadata [{TopicMetadata for topic testing - No partition metadata for topic testing due to kafka.common.LeaderNotAvailableException}] for topic [testing]: class kafka.common.LeaderNotAvailableException (kafka.producer.BrokerPartitionInfo)
[2014-03-20 09:45:32,233] WARN Error while fetching metadata [{TopicMetadata for topic testing - No partition metadata for topic testing due to kafka.common.LeaderNotAvailableException}] for topic [testing]: class kafka.common.LeaderNotAvailableException (kafka.producer.BrokerPartitionInfo)
[2014-03-20 09:45:32,234] ERROR Failed to collate messages by topic, partition due to: Failed to fetch topic metadata for topic: testing (kafka.producer.async.DefaultEventHandler)
{code}

*How to reproduce*

Using Wirbelsturm you can reproduce this error as follows. This assumes you have Vagrant 1.4.x and VirtualBox already installed on your host machine.

{code}
$ git clone https://github.com/miguno/wirbelsturm.git
$ cd wirbelsturm
$ ./bootstrap  # May take a while depending on how fast your Internet connection is.

# Then uncomment the `kafka_broker` section in `wirbelsturm.yaml`.
# Only remove the leading `#` character in each line -- the remaining leading whitespace is significant.

$ vagrant up zookeeper1 kafka1  # May take a while (boots VMs, downloads RPMs from the Internet to provision the VMs, etc.)
{code}

Now you can ssh into the VM {{kafka1}} via {{vagrant ssh kafka1}} and run the commands above from within the {{/opt/kafka}} directory.
[jira] [Commented] (KAFKA-1310) Zookeeper timeout causes deadlock in Controller
[ https://issues.apache.org/jira/browse/KAFKA-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13942947#comment-13942947 ]

Michael Noll commented on KAFKA-1310:

Also, I can confirm the errors above do not occur with Kafka 0.8.0, using the following test commands:

{code}
$ bin/kafka-create-topic.sh --topic testing --zookeeper zookeeper1:2181 --partition 1 --replica 1
creation succeeded!

$ bin/kafka-list-topic.sh --zookeeper zookeeper1:2181
topic: testing  partition: 0  leader: 0  replicas: 0  isr: 0

# Trying to produce data works!
$ bin/kafka-console-producer.sh --topic testing --broker-list localhost:9092
foo
^C

$ bin/kafka-console-consumer.sh --topic testing --zookeeper zookeeper1:2181 --from-beginning
foo
{code}