The following JIRA provides some background on why upgrading immediately after a new release may not be prudent (though I expect this to be rare):
ZOOKEEPER-2347

On Thu, Nov 2, 2017 at 3:00 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Stephane:
> bq. hasn't acted in over a year
>
> The above fact implies some reluctance from the zookeeper community to
> fully solve the issue (maybe due to technical issues).
> Anyway, we should plan on not relying on the fix to go through in the near
> future.
>
> As for Jun's latest suggestion, I think we should add periodic logging
> indicating the retry.
>
> A KIP is not needed if we go that route.
>
> Cheers
>
> On Thu, Nov 2, 2017 at 2:54 PM, Stephane Maarek <
> steph...@simplemachines.com.au> wrote:
>
>> Hi Jun
>>
>> I think this is a better option. Would that change require a kip then as
>> it's not a change in public API ?
>>
>> @ted it was marked as a blocked for 3.4.11 but they pushed it. It seems
>> that the owner of the pr hasn't acted in over a year and I think someone
>> needs to take ownership of that. Additionally, this would be a change in
>> Kafka zookeeper client dependency, so no need to update your zookeeper
>> quorum to benefit from the change
>>
>> Thanks
>> Stéphane
>>
>>
>> On 3 Nov. 2017 8:45 am, "Jun Rao" <j...@confluent.io> wrote:
>>
>> Stephane, Jeff,
>>
>> Another option is to not expose the reconnect timeout config and just
>> retry the creation of Zookeeper forever. This is an improvement from the
>> current situation and if zookeeper-2184 is fixed in the future, we don't
>> need to deprecate the config.
>>
>> Thanks,
>>
>> Jun
>>
>> On Thu, Nov 2, 2017 at 9:02 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> > ZOOKEEPER-2184 is scheduled for 3.4.12 whose release is unknown.
>> >
>> > I think adding the session recreation on Kafka side should benefit Kafka
>> > users, especially those who don't plan to move to 3.4.12+ in the near
>> > future.
>> >
>> > On Wed, Nov 1, 2017 at 6:34 PM, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > Hi, Stephane,
>> > >
>> > > 3) The difference is that currently, there is no retry when
>> > > re-creating the Zookeeper object when a ZK session expires. So, if
>> > > the re-creation of Zookeeper fails, the broker just logs the error
>> > > and the Zookeeper object will never be created again. With this KIP,
>> > > we will keep retrying the creation of Zookeeper until success.
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
>> > > steph...@simplemachines.com.au> wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > Thanks for the reply.
>> > > >
>> > > > 1) The reason I'm asking about it is I wonder if it's not worth
>> > > > focusing the development efforts on taking ownership of the
>> > > > existing PR (https://github.com/apache/zookeeper/pull/150) to fix
>> > > > ZOOKEEPER-2184, rebase it and have it merged into the ZK codebase
>> > > > shortly. I feel this KIP might introduce a setting that could be
>> > > > deprecated shortly and confuse the end user a bit further with one
>> > > > more knob to turn.
>> > > >
>> > > > 3) I'm not sure if I fully understand, sorry for the beginner's
>> > > > question: if the default timeout is infinite, then it won't change
>> > > > anything to how Kafka works from today, does it? (unless I'm
>> > > > missing something sorry). If not set to infinite, then we introduce
>> > > > the risk of a whole cluster shutting down at once?
>> > > >
>> > > > Thanks,
>> > > > Stephane
>> > > >
>> > > > On 31/10/17, 1:00 pm, "Jun Rao" <j...@confluent.io> wrote:
>> > > >
>> > > > Hi, Stephane,
>> > > >
>> > > > Thanks for the reply.
>> > > >
>> > > > 1) Fixing the issue in ZK will be ideal. Not sure when it will
>> > > > happen though. Once it's fixed, we can probably deprecate this
>> > > > config.
>> > > >
>> > > > 2) That could be useful. Is there a java api to do that at runtime?
>> > > > Also, invalidating DNS cache doesn't always fix the issue of
>> > > > unresolved host. In some of the cases, human intervention is
>> > > > needed.
>> > > >
>> > > > 3) The default timeout is infinite though.
>> > > >
>> > > > Jun
>> > > >
>> > > >
>> > > > On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <
>> > > > steph...@simplemachines.com.au> wrote:
>> > > >
>> > > > > Hi Jun,
>> > > > >
>> > > > > I think this is very helpful. Restarting Kafka brokers in case
>> > > > > of zookeeper host change is not a well known operation.
>> > > > >
>> > > > > Few questions:
>> > > > > 1) would it not be worth fixing the problem at the source ? This
>> > > > > has been stuck for a while though, maybe a little push would
>> > > > > help :
>> > > > > https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2184
>> > > > >
>> > > > > 2) upon recreating the zookeeper object , is it not possible to
>> > > > > invalidate the DNS cache so that it resolves the new hostname?
>> > > > >
>> > > > > 3) could the cluster be down in this situation: one migrates an
>> > > > > entire zookeeper cluster to new machines (one by one). The
>> > > > > quorum is still alive without downtime, but now every broker in
>> > > > > a cluster can't resolve zookeeper at the same time. They all
>> > > > > shut down at the same time after the new time-out setting.
>> > > > >
>> > > > > Thanks !
>> > > > > Stéphane
>> > > > >
>> > > > > On 28 Oct. 2017 9:42 am, "Jun Rao" <j...@confluent.io> wrote:
>> > > > >
>> > > > > > Hi, Everyone,
>> > > > > >
>> > > > > > We created "KIP-217: Expose a timeout to allow an expired ZK
>> > > > > > session to be re-created".
>> > > > > >
>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+to+be+re-created
>> > > > > >
>> > > > > > Please take a look and provide your feedback.
>> > > > > >
>> > > > > > Thanks,
>> > > > > >
>> > > > > > Jun