Submitted a pull request: https://github.com/sgroschupf/zkclient/pull/24.


On Tue, Sep 24, 2013 at 1:46 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Ya, it is not very active, but you can submit patches to master on
> https://github.com/sgroschupf/zkclient
>
> Thanks,
> Neha
>
>
> On Tue, Sep 24, 2013 at 9:58 AM, Anatoly Fayngelerin <fanat...@gmail.com> wrote:
>
> > That does sound like a saner solution. Which github repo do you submit
> > patches to? It looks like the repo I posted on originally
> > (https://github.com/sgroschupf/zkclient/issues/23) might be a little
> > stale.
> >
> >
> > On Tue, Sep 24, 2013 at 11:34 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> >
> > > Thanks for explaining the bug. This is a serious issue that we should
> > > fix at the zkclient level. We have submitted patches to them before and
> > > they were pretty helpful in releasing a new version with the patch. I
> > > think that will lead to a cleaner solution than trying to get around it
> > > in Kafka code, since zkclient usage is pretty widespread across the
> > > server and consumer code today.
> > >
> > > Thanks,
> > > Neha
> > >
> > >
> > > On Tue, Sep 24, 2013 at 8:28 AM, Anatoly Fayngelerin <fanat...@gmail.com> wrote:
> > >
> > > > Joel - that is exactly right. ZkClient has no way to notify consumers
> > > > of this situation. The session end event gets fired; however, the
> > > > session begin event never occurs.
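> > > >
> > > > For illustration, the pattern looks roughly like this (a rough,
> > > > untested sketch against the older two-method IZkStateListener; the
> > > > class name is made up):
> > > >
> > > >     import java.util.concurrent.atomic.AtomicBoolean;
> > > >     import org.I0Itec.zkclient.IZkStateListener;
> > > >     import org.apache.zookeeper.Watcher.Event.KeeperState;
> > > >
> > > >     // Records when the "session end" event fires so callers can notice
> > > >     // that the matching "session begin" (handleNewSession) never arrives.
> > > >     public class SessionLossDetector implements IZkStateListener {
> > > >         private final AtomicBoolean sessionLost = new AtomicBoolean(false);
> > > >
> > > >         @Override
> > > >         public void handleStateChanged(KeeperState state) {
> > > >             if (state == KeeperState.Expired) {
> > > >                 // Session ended; zkclient should now reconnect and call
> > > >                 // handleNewSession(), but the bug can leave us stuck here.
> > > >                 sessionLost.set(true);
> > > >             }
> > > >         }
> > > >
> > > >         @Override
> > > >         public void handleNewSession() {
> > > >             // The "session begin" event that never occurs when the
> > > >             // reconnect dies on an IOException.
> > > >             sessionLost.set(false);
> > > >         }
> > > >
> > > >         public boolean isSessionLost() {
> > > >             return sessionLost.get();
> > > >         }
> > > >     }
> > > >
> > > > A health check could poll isSessionLost() after registering the
> > > > listener via zkClient.subscribeStateChanges(...).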
> > > >
> > > > Neha - The issue manifested itself when producers were attempting to
> > > > discover topics/brokers. The Kafka brokers had lost their ZK sessions
> > > > during a network outage. The outage was long enough for ZooKeeper to
> > > > expire the sessions corresponding to the ephemeral nodes in /broker/.
> > > > The zkclient bug prevented the broker from ever re-establishing the ZK
> > > > session. Subsequently, no ZooKeeper-based producer was able to discover
> > > > topic->broker mappings. The resulting exceptions looked like:
> > > >
> > > > Caused by: kafka.common.NoBrokersForPartitionException: Partition = null
> > > >   at kafka.producer.Producer.kafka$producer$Producer$getPartitionListForTopic(Producer.scala:167)
> > > >   at kafka.producer.Producer$anonfun$3.apply(Producer.scala:116)
> > > >   at kafka.producer.Producer$anonfun$3.apply(Producer.scala:105)
> > > >   at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:233)
> > > >   at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:233)
> > > >   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
> > > >   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:33)
> > > >   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
> > > >   at scala.collection.mutable.WrappedArray.map(WrappedArray.scala:33)
> > > >   at kafka.producer.Producer.zkSend(Producer.scala:105)
> > > >   at kafka.producer.Producer.send(Producer.scala:99)
> > > >   at com.yieldmo.common.protobuf.ProtoKafkaWriter$class.write(ProtoKafka.scala:20)
> > > >   at com.yieldmo.common.protobuf.ProtoWriter.write(ProtoKafka.scala:40)
> > > >   at com.yieldmo.storm.bolt.KafkaProtoWriterBolt.execute(KafkaProtoWriterBolt.scala:48)
> > > >
> > > > As far as I can see, the only way to deal with this without patching
> > > > zkclient is to periodically check the status of the ZK connection and
> > > > try to detect this kind of situation. I would love to hear better
> > > > ideas for how to handle this.
> > > >
> > > >
> > > > On Tue, Sep 24, 2013 at 3:31 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > > >
> > > > > > node loss. Did the Kafka consumer not respond to rebalance events,
> > > > > > or did the server not respond to state change events? Also,
> > > > > > ephemeral nodes are lost only when sessions are expired on the
> > > > > > ZooKeeper server or if clients close the session actively; how does
> > > > > > losing the connection lead to ephemeral node loss?
> > > > >
> > > > > My understanding of Anatoly's observation is that on session
> > > > > expiration, zkclient will reconnect
> > > > > (https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L458),
> > > > > but if the connect causes an IOException, that would effectively mean
> > > > > that the session will not get re-established. Anatoly, can you
> > > > > confirm?
> > > > >
> > > > > > On Mon, Sep 23, 2013 at 7:02 AM, Anatoly Fayngelerin <fanat...@gmail.com> wrote:
> > > > > >
> > > > > >> Hi Everyone,
> > > > > >>
> > > > > >> I've run into the following issue with the Kafka server. The
> > > > > >> zkclient lib seems to die silently if there is an
> > > > > >> UnknownHostException (or any IOException) while reconnecting the
> > > > > >> ZK session. I've filed a bug about this with the zkclient lib
> > > > > >> (https://github.com/sgroschupf/zkclient/issues/23). The
> > > > > >> ramifications for Kafka were the silent loss of all ephemeral
> > > > > >> nodes associated with the affected process.
> > > > > >>
> > > > > >> Has anyone faced this issue? If so, what is the recommended way
> > > > > >> of dealing with this?
> > > > > >>
> > > > > >> If there is no good solution available, would the community be
> > > > > >> open to a patch that periodically verifies ZK connectivity?
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Anatoly
> > > > > >>
> > > > >
> > > >
> > >
> >
>
