Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-12 Thread Michael K. Edwards
Well, I think it's fair to say that they're effectively untested prior to rc2. But it's a reasonable posture to take that the features get baked first and the field upgrade procedure gets tested late in the release cycle. Not what I would have expected personally, though, as a former developer of

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-12 Thread Szalay-Bekő Máté
FYI: PR just submitted, see https://github.com/apache/zookeeper/pull/1251 any comments welcomed! :) Kind regards, Mate On Wed, Feb 12, 2020 at 1:16 PM Andor Molnar wrote: > Hi Michael, > > "if we can get to rc2 without noticing a showstopper…” > > 200% disagree with this. > > The whole point o

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-12 Thread Andor Molnar
Hi Michael, "if we can get to rc2 without noticing a showstopper…” 200% disagree with this. The whole point of the release voting system is to identify problems no matter how big they are. The message of finding a showstopper for me is that people are paying attention and accurately testing the relea

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Enrico Olivelli
Michael, your points are valid. I would like to see the proposal from Mate. Up to ZOOKEEPER-3188 no other patch in 3.6 (from my limited point of view) introduced changes in the quorum peer protocol that make it incompatible with 3.5. Enrico On Tue, 11 Feb 2020 at 23:35, Michael K. Edwards

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Michael K. Edwards
I think it would be prudent to emphasize in the release notes that rolling upgrades (and mixed ensembles generally) are effectively untested. That this was, in practice, a non-goal of this release cycle. Because if we can get to rc2 without noticing a showstopper, clearly it's not something that

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Szalay-Bekő Máté
Hi Andor, this is almost exactly what I proposed. More precisely: 1) First we make multi-address feature disabled by default. 2) If disabled, quorum protocol automatically uses the old protocol version which lets 3.5 and 3.6 communicate smoothly. The code in 3.6.0 will be able to understand both

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Andor Molnar
Mate, Let me reiterate to see if I understand you correctly: 1) First we make multi-address feature disabled by default. 2) If disabled, quorum protocol automatically uses the old protocol version which lets 3.5 and 3.6 communicate smoothly. 3) Once the user finished the first rolling restart

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Michael Han
>> but it didn't solve the problem. Yes, the constraint is 3.6.0 has to default to old protocol version so the outgoing message is backward compatible. If we do this, then it's essentially the "the simplest solution" proposed. >> disable the new MultiAddress feature and stick to the old protocol

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Szalay-Bekő Máté
I see the main problem here in the fact that we are missing proper versioning in the leader election / quorum protocols. I tried to simply implement backward compatibility in 3.6, but it didn't solve the problem. The new code understands the old protocol, but it cannot decide when to use the new o

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Michael K. Edwards
I hate to say it, but I think 3.6.0 should release as is. It is impossible to *reliably* retrofit backwards compatibility / interoperability onto a release that was engineered from the beginning without that goal. Learn the lesson, set goals differently in the future. On Tue, Feb 11, 2020 at 9:4

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Szalay-Bekő Máté
FYI: I created these scripts for my local tests: https://github.com/symat/zk-rolling-upgrade-test For the long term I would also add some script that actually monitors the state of the quorum and also runs continuous traffic, not just 1-2 smoketests after each restart. But I don't know how importa

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Enrico Olivelli
On Tue, 11 Feb 2020 at 17:17, Andor Molnar wrote: > > The most obvious one which crosses my mind is that I previously worked on: > > 1) run old version cluster, > 2) connect to each node and run smoke tests, > 3) restart one node with new code, > 4) goto 2) until all nodes are upgr

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Andor Molnar
The most obvious one that comes to mind is one I previously worked on: 1) run old version cluster, 2) connect to each node and run smoke tests, 3) restart one node with new code, 4) goto 2) until all nodes are upgraded I think this wouldn’t work in a “unit test”, we probably need a separate
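
The four steps listed above can be sketched as a small driver loop. This is a minimal Python sketch, not code from ZooKeeper or from any script mentioned in this thread: the names `rolling_upgrade`, `smoke_test` and `restart_with_new_version` are hypothetical, and how a node is actually restarted or tested is left to the caller.

```python
from typing import Callable, List, Sequence


def rolling_upgrade(
    nodes: Sequence[str],
    smoke_test: Callable[[str], bool],
    restart_with_new_version: Callable[[str], None],
) -> List[str]:
    """Upgrade nodes one at a time, smoke-testing every node around each restart.

    Returns the nodes in the order they were upgraded and raises
    RuntimeError as soon as any node fails its smoke test.
    """
    upgraded: List[str] = []

    def check_all() -> None:
        # Step 2: connect to each node and run the smoke tests.
        for node in nodes:
            if not smoke_test(node):
                raise RuntimeError(
                    f"smoke test failed on {node} after upgrading {upgraded}"
                )

    # Step 1 (a cluster already running the old version) is assumed
    # to have been done by the caller.
    check_all()
    for node in nodes:
        restart_with_new_version(node)  # step 3: restart one node with new code
        upgraded.append(node)
        check_all()                     # step 4: go back to step 2
    return upgraded
```

Passing the smoke test and the restart action in as callables keeps the loop independent of how the ensemble is deployed (local processes, containers, remote hosts), so the same driver could sit in a separate integration-test module rather than in a unit test.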

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Patrick Hunt
Anyone have ideas how we could add testing for upgrade? Obviously something we're missing, especially given its importance. Patrick On Tue, Feb 11, 2020 at 12:40 AM Enrico Olivelli wrote: > On Tue, 11 Feb 2020 at 09:12, Szalay-Bekő Máté > wrote: > > > > Hi All, > > > > about the question

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Enrico Olivelli
On Tue, 11 Feb 2020 at 09:12, Szalay-Bekő Máté wrote: > > Hi All, > > about the question from Michael: > > Regarding the fix, can we just make 3.6.0 aware of the old protocol and > > speak old message format when it's talking to old server? > > In this particular case, it might be

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-11 Thread Szalay-Bekő Máté
Hi All, about the question from Michael: > Regarding the fix, can we just make 3.6.0 aware of the old protocol and > speak old message format when it's talking to old server? In this particular case, it might be enough. The protocol change happened now in the 'initial message' sent by the QuorumC

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Enrico Olivelli
Good. I will cancel the vote for 3.6.0rc2. I would appreciate it very much if Mate and his colleagues have time to work on a fix. Otherwise I will have cycles next week. I would also like to spend my time setting up a few minimal integration tests for the upgrade story. Enrico On Tue 11 Feb 2020, 07

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Michael Han
Kudos Enrico, very thorough work as the final gatekeeper of the release! Now with this, I'd like to *vote a -1* on the 3.6.0 RC2. I'd recommend we fix this issue for 3.6.0. ZooKeeper is one of the rare pieces of software that puts so much emphasis on compatibility, thus it just works when upgrade

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Enrico Olivelli
I suggest this plan: - release 3.6.0 now - improve the migration story; the flow outlined by Mate is interesting, but it will take time. 3.6.0rc2 got enough binding votes, so I am going to finalize the release this evening (within 8-10 hours) if no one comes out in the VOTE thread with a -1. Enrico

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Patrick Hunt
On Mon, Feb 10, 2020 at 3:38 AM Andor Molnar wrote: > Hi, > > Answers inline. > > > > In my experience when you are close to a release it is better not to > > make big changes. (I am among the approvers of that patch, so I am > > responsible for this change) > > > > Although this statement is acce

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Andor Molnar
Hi, Answers inline. > In my experience when you are close to a release it is better not to > make big changes. (I am among the approvers of that patch, so I am > responsible for this change) Although this statement is acceptable for me, I don’t feel this patch should not have been merged into

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Enrico Olivelli
Thank you Mate for checking and explaining this story. I find it very interesting that the cause is ZOOKEEPER-3188 as: - it is the last "big patch" committed to 3.6 before starting the release process - it is the cause of the failure of the first RC In my experience when you are close to a releas

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Szalay-Bekő Máté
Actually, we have another option: we can follow the way rolling restart support for QuorumSSL was implemented. - we can make 3.6.0 able to read both protocol versions - we can add a parameter that tells 3.6.0 which protocol version to use (using the old one breaks / disables

Re: Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Szalay-Bekő Máté
Hi Enrico! This is caused by the different PROTOCOL_VERSION in the QuorumCnxManager. The protocol version was last changed in ZOOKEEPER-2186, first released in 3.4.7 and 3.5.1, to avoid some crashes / fix some bugs. Later I also changed the protocol version when the format of the initial mess

Rolling upgrade from 3.5 to 3.6 - expected behaviour

2020-02-10 Thread Enrico Olivelli
Hi, even though we had enough binding +1s on 3.6.0rc2, before closing the VOTE on 3.6.0 I wanted to finish my tests, and I have come across an apparent blocker. I am trying to upgrade a 3.5.6 cluster to 3.6.0, but it looks like peers are not able to talk to each other. I have a cluster of 3, server1, server2