Agree with Stephane that it's worth at least taking a shot at trying to get ZOOKEEPER-2184 fixed rather than adding a config that will be deprecated in the not-too-distant future.
I know Zookeeper development feels more like the turtle than the hare these days, but Kafka is a high-visibility project, so there's a decent chance you'll be able to get the attention of the zookeeper maintainers to get a patch merged and possibly even a new release cut incorporating this fix.

On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:

> Hi Jun,
>
> Thanks for the reply.
>
> 1) The reason I'm asking about it is I wonder if it's not worth focusing the development efforts on taking ownership of the existing PR (https://github.com/apache/zookeeper/pull/150) to fix ZOOKEEPER-2184, rebase it and have it merged into the ZK codebase shortly. I feel this KIP might introduce a setting that could be deprecated shortly and confuse the end user a bit further with one more knob to turn.
>
> 3) I'm not sure if I fully understand, sorry for the beginner's question: if the default timeout is infinite, then it won't change anything to how Kafka works from today, does it? (unless I'm missing something sorry). If not set to infinite, then we introduce the risk of a whole cluster shutting down at once?
>
> Thanks,
> Stephane
>
> On 31/10/17, 1:00 pm, "Jun Rao" <j...@confluent.io> wrote:
>
> > Hi, Stephane,
> >
> > Thanks for the reply.
> >
> > 1) Fixing the issue in ZK will be ideal. Not sure when it will happen though. Once it's fixed, we can probably deprecate this config.
> >
> > 2) That could be useful. Is there a java api to do that at runtime? Also, invalidating DNS cache doesn't always fix the issue of unresolved host. In some of the cases, human intervention is needed.
> >
> > 3) The default timeout is infinite though.
> >
> > Jun
> >
> > On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:
> >
> > > Hi Jun,
> > >
> > > I think this is very helpful. Restarting Kafka brokers in case of zookeeper host change is not a well known operation.
> > >
> > > Few questions:
> > > 1) would it not be worth fixing the problem at the source ? This has been stuck for a while though, maybe a little push would help : https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2184
> > >
> > > 2) upon recreating the zookeeper object , is it not possible to invalidate the DNS cache so that it resolves the new hostname?
> > >
> > > 3) could the cluster be down in this situation: one migrates an entire zookeeper cluster to new machines (one by one). The quorum is still alive without downtime, but now every broker in a cluster can't resolve zookeeper at the same time. They all shut down at the same time after the new time-out setting.
> > >
> > > Thanks !
> > > Stéphane
> > >
> > > On 28 Oct. 2017 9:42 am, "Jun Rao" <j...@confluent.io> wrote:
> > >
> > > > Hi, Everyone,
> > > >
> > > > We created "KIP-217: Expose a timeout to allow an expired ZK session to be re-created".
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+to+be+re-created
> > > >
> > > > Please take a look and provide your feedback.
> > > >
> > > > Thanks,
> > > >
> > > > Jun

--
*Jeff Widman*
jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)
<><