Neha, Ewen (and others), my initial attempt at solving this is uploaded at https://reviews.apache.org/r/30477/. It fixes the shutdown problem: the server now shuts down even when Zookeeper has gone down before the Kafka server.

I went with the approach of introducing a custom (enhanced) ZkClient which, for now, allows timeouts to be optionally specified for certain operations. I intentionally haven't forced the use of this new KafkaZkClient throughout the code; for now I have only used it in the KafkaServer.
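
To make the idea of an optional timeout concrete, here is a minimal sketch. It is not the KafkaZkClient from the review request: the class name `TimedCalls`, the method `call`, and `OperationTimedOutException` are all hypothetical, and the blocking operation is passed in as a plain `Callable` rather than a real ZkClient call. With an empty timeout the call blocks as it does today; with a timeout it gives up and throws.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; this is not the KafkaZkClient from the patch.
final class TimedCalls {
    // Unchecked, so callers on the shutdown path can choose whether to catch it.
    static final class OperationTimedOutException extends RuntimeException {
        OperationTimedOutException(String message) {
            super(message);
        }
    }

    // Daemon threads, so an abandoned blocking call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-call");
        t.setDaemon(true);
        return t;
    });

    // Runs op on a worker thread. An empty timeout blocks until op returns;
    // otherwise the caller gives up once the timeout has elapsed.
    static <T> T call(Callable<T> op, Optional<Duration> timeout) {
        Future<T> future = POOL.submit(op);
        try {
            if (timeout.isPresent()) {
                return future.get(timeout.get().toMillis(), TimeUnit.MILLISECONDS);
            }
            return future.get();
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker that is still blocked
            throw new OperationTimedOutException("operation exceeded " + timeout.get());
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("operation failed", e.getCause());
        }
    }
}
```

A shutdown path could then wrap only the calls that must not hang, for example `TimedCalls.call(() -> readController(), Optional.of(Duration.ofSeconds(5)))`, where `readController` is a placeholder for whatever blocking read is involved; other call sites keep passing `Optional.empty()` and behave as before.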

Does this patch look like something worth using?

-Jaikiran

On Thursday 29 January 2015 10:41 PM, Neha Narkhede wrote:
Ewen is right. ZkClient APIs are blocking and the right fix for this seems
to be patching ZkClient. At some point, if we find ourselves fiddling too
much with ZkClient, it wouldn't hurt to write our own little zookeeper
client wrapper.

On Thu, Jan 29, 2015 at 12:57 AM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

Looks like a bug to me -- the underlying ZK library wraps a lot of blocking
method implementations with waitUntilConnected() calls without any
timeouts. Ideally we could just add a version of ZkUtils.getController()
with a timeout, but I don't see an easy way to accomplish that with
ZkClient.
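
For illustration, a bounded version of a "wait until connected" loop could look roughly like the sketch below. This is not ZkClient's code: `ConnectionGate`, `setConnected` and `awaitConnected` are hypothetical names, and the connection state is reduced to a single boolean guarded by a lock and condition, similar in spirit to the condition wait visible in the stack trace.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a client's connection state; not ZkClient's code.
final class ConnectionGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();
    private boolean connected;

    // Called from the event thread whenever the connection state changes.
    void setConnected(boolean value) {
        lock.lock();
        try {
            connected = value;
            stateChanged.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Waits until connected or until the timeout elapses.
    // Returns true if connected, false if the deadline passed first.
    boolean awaitConnected(long timeout, TimeUnit unit) throws InterruptedException {
        long remainingNanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (!connected) {
                if (remainingNanos <= 0L) {
                    return false; // deadline passed; the caller decides what to do
                }
                // awaitNanos returns an estimate of the time still left to wait
                remainingNanos = stateChanged.awaitNanos(remainingNanos);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

A timeout-aware read would check the boolean result and fail fast (or skip the controlled-shutdown step) instead of retrying forever.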

There's at least one other call to ZkUtils besides the one in the
stacktrace you gave that would cause the same issue, possibly more that
aren't directly called in that method. One ugly solution would be to use an
extra thread during shutdown to trigger timeouts, but I'd imagine we
probably have other threads that could end up blocking in similar ways.
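
As a sketch of the "extra thread" idea only: a watchdog thread can interrupt the thread running a shutdown step once a deadline passes. `ShutdownWatchdog`, `runWithDeadline` and `InterruptibleStep` are hypothetical names, not Kafka code, and the approach assumes the blocked call actually responds to `Thread.interrupt()`, which would need to be verified for the real ZkClient calls.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Kafka.
final class ShutdownWatchdog {
    interface InterruptibleStep {
        void run() throws InterruptedException;
    }

    // Runs step on the calling thread while a helper thread interrupts it
    // after the timeout. Returns true if the step finished, false if it was
    // interrupted.
    static boolean runWithDeadline(InterruptibleStep step, long timeout, TimeUnit unit) {
        final Thread caller = Thread.currentThread();
        final Object lock = new Object();
        final boolean[] finished = {false};
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "shutdown-watchdog");
            t.setDaemon(true);
            return t;
        });
        timer.schedule(() -> {
            synchronized (lock) {
                if (!finished[0]) {
                    caller.interrupt(); // wake the blocked shutdown step
                }
            }
        }, timeout, unit);
        try {
            step.run();
            return true;
        } catch (InterruptedException e) {
            return false;
        } finally {
            synchronized (lock) {
                finished[0] = true; // no interrupt can be issued after this point
            }
            timer.shutdownNow();
            // Clear a late interrupt so later shutdown steps are unaffected.
            // Note that this also clears interrupts from other sources.
            Thread.interrupted();
        }
    }
}
```

This only covers the one thread being watched; other threads that block in similar ways would each need the same treatment, which is the drawback already noted above.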

I filed https://issues.apache.org/jira/browse/KAFKA-1907 to track the
issue.


On Mon, Jan 26, 2015 at 6:35 AM, Jaikiran Pai <jai.forums2...@gmail.com>
wrote:

The main culprit is this thread, which goes into "forever retry connection
to a closed zookeeper" when I shut down Kafka (via Ctrl + C) after
zookeeper has already been shut down. I have attached the complete thread
dump, but I don't know if it will be delivered to the mailing list.

"Thread-2" prio=10 tid=0xb3305000 nid=0x4758 waiting on condition [0x6ad69000]
    java.lang.Thread.State: TIMED_WAITING (parking)
     at sun.misc.Unsafe.park(Native Method)
     - parking to wait for  <0x70a93368> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
     at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:267)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2130)
     at org.I0Itec.zkclient.ZkClient.waitForKeeperState(ZkClient.java:636)
     at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:619)
     at org.I0Itec.zkclient.ZkClient.waitUntilConnected(ZkClient.java:615)
     at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:679)
     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766)
     at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761)
     at kafka.utils.ZkUtils$.readDataMaybeNull(ZkUtils.scala:456)
     at kafka.utils.ZkUtils$.getController(ZkUtils.scala:65)
     at kafka.server.KafkaServer.kafka$server$KafkaServer$$controlledShutdown(KafkaServer.scala:194)
     at kafka.server.KafkaServer$$anonfun$shutdown$1.apply$mcV$sp(KafkaServer.scala:269)
     at kafka.utils.Utils$.swallow(Utils.scala:172)
     at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
     at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
     at kafka.utils.Logging$class.swallow(Logging.scala:94)
     at kafka.utils.Utils$.swallow(Utils.scala:45)
     at kafka.server.KafkaServer.shutdown(KafkaServer.scala:269)
     at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:42)
     at kafka.Kafka$$anon$1.run(Kafka.scala:42)

-Jaikiran


On Monday 26 January 2015 05:46 AM, Neha Narkhede wrote:

For a clean shutdown, the broker tries to talk to the controller and also
issues reads to zookeeper. Possibly that is where it tries to reconnect to
zk. It will help to look at the thread dump.

Thanks
Neha

On Fri, Jan 23, 2015 at 8:53 PM, Jaikiran Pai <jai.forums2...@gmail.com>
wrote:

I was just playing around with the RC2 of 0.8.2 and noticed that if I
shut down zookeeper first I can't shut down the Kafka server at all, since
it goes into a never-ending attempt to reconnect with zookeeper. I had to
kill the Kafka process to stop it. I tried it against trunk too and I see
the same issue there. Should I file a JIRA for this and see if I can come
up with a patch?

FWIW, here are the unending (and IMO too frequent) attempts at trying to
reconnect. I have a thread dump too, which shows that the other thread,
which is trying to complete a controlled shutdown of Kafka, is blocked
forever waiting for zookeeper to be up. I can attach it to the JIRA.

[2015-01-24 10:15:46,278] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[2015-01-24 10:15:47,437] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-01-24 10:15:47,438] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[2015-01-24 10:15:49,056] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-01-24 10:15:49,057] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[2015-01-24 10:15:50,801] INFO Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2015-01-24 10:15:50,802] WARN Session 0x14b1a4136800000 for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)




-Jaikiran




--
Thanks,
Ewen



