Thank you, Mate, for checking and explaining this story.

I find it very interesting that the cause is ZOOKEEPER-3188, as:
- it is the last "big patch" committed to 3.6 before starting the release process
- it is the cause of the failure of the first RC
In my experience, when you are close to a release it is better not to make big changes.
(I am among the approvers of that patch, so I am responsible for this change.)

This is a pointer to the change, for those who want to better understand the context:
https://github.com/apache/zookeeper/pull/1048/files#diff-7a209d890686bcba351d758b64b22a7dR11

IIUC the story was the same even for the upgrade from 3.4 to 3.5, and if this statement holds then I feel we can continue with this release.

- Reverting ZOOKEEPER-3188 is not an option for me, it is too complex.
- Making 3.5 and 3.6 "compatible" can be very tricky and we do not have tools to certify this compatibility (at least not in the short term).

I would like to hear from people who have been in the community for a long time; then I am ready to complete the release process for 3.6.0rc2.

I will update the website and the release notes with a specific warning about the upgrade; we should also update the Wiki.

Enrico

On Mon, Feb 10, 2020 at 11:17 AM Szalay-Bekő Máté <szalay.beko.m...@gmail.com> wrote:
>
> Hi Enrico!
>
> This is caused by the different PROTOCOL_VERSION in the QuorumCnxManager.
> The protocol version was last changed in ZOOKEEPER-2186, released
> first in 3.4.7 and 3.5.1, to avoid some crashes / fix some bugs. Later I
> also changed the protocol version when the format of the initial message
> changed in ZOOKEEPER-3188. So the quorum protocol is actually not
> compatible in this case, and this is the 'expected' behavior if you upgrade e.g.
> from 3.4.6 to 3.4.7, or 3.4.6 to 3.5.5, or from 3.5.6 to 3.6.0.
>
> We had some discussion in the PR of ZOOKEEPER-3188 back then and came to the
> conclusion that it is not that bad, as there will be no data loss, as you
> wrote. The tricky thing is that during a rolling upgrade we should ensure
> both backward and forward compatibility to make sure that the old and the
> new part of the quorum can still speak to each other.
> The current solution
> (simply failing if the protocol versions mismatch) is simpler and still
> works just fine: as the servers are restarted one by one, the nodes with
> the old protocol version and the nodes with the new protocol version will
> form two partitions, but at any given time only one partition will have the
> quorum.
>
> Still, thinking it through, as a side effect in these cases there will be a
> short time when neither of the partitions will have a quorum (when we have N
> servers with the old protocol version, N servers with the new protocol
> version, and there is one server just being restarted). I am not sure if we
> can accept this.
>
> For ZOOKEEPER-3188 we can add a small patch to make it possible to parse
> the initial message of the old protocol version with the new code. But I am
> not sure if it would be enough (as the old code will not be able to parse
> the new initial message).
>
> One option can be to also make a patch for 3.5 to have a version which
> supports both protocol versions (let's say in 3.5.8). Then we can write in
> the release notes that if you need a rolling upgrade from any version since
> 3.4.7, then you have to first upgrade to 3.5.8 before upgrading to 3.6.0.
> We can even do the same thing on the 3.4 branch.
>
> But I am also new to the community... It would be great to hear the opinion
> of more experienced people.
> Whatever the decision will be, I am happy to make the changes.
>
> And sorry for breaking the RC (if we decide that this needs to be
> changed...). ZOOKEEPER-3188 was a complex patch.
>
> Kind regards,
> Mate
>
> On Mon, Feb 10, 2020 at 9:47 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>
> > Hi,
> > even if we had enough binding +1 on 3.6.0rc2, before closing the VOTE
> > of 3.6.0 I wanted to finish my tests and I am coming to an apparent
> > blocker.
> >
> > I am trying to upgrade a 3.5.6 cluster to 3.6.0, but it looks like
> > peers are not able to talk to each other.
> > I have a cluster of 3: server1, server2 and server3.
> > When I upgrade server1 to 3.6.0rc2 I see this kind of error on the 3.5 nodes:
> >
> > 2020-02-10 09:35:07,745 [myid:3] - INFO
> > [localhost/127.0.0.1:3334:QuorumCnxManager$Listener@918] - Received
> > connection request 127.0.0.1:62591
> > 2020-02-10 09:35:07,746 [myid:3] - ERROR
> > [localhost/127.0.0.1:3334:QuorumCnxManager@527] -
> >
> > org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage$InitialMessageException:
> > Got unrecognized protocol version -65535
> >
> > Once I upgrade all of the peers the system is up and running,
> > apparently without data loss.
> >
> > During the upgrade, as soon as I upgrade the first node, say server1,
> > server1 is not able to accept connections from clients (error "Close of
> > session 0x0 java.io.IOException: ZooKeeperServer not running"). This
> > is expected, because as long as it cannot talk with the other peers it
> > is practically partitioned away from the cluster.
> >
> > My questions are:
> > 1) Is this expected? I can't remember protocol changes from 3.5 to
> > 3.6, but 3.6 diverged from the 3.5 branch so long ago, and I was
> > not in the community as a dev, so I cannot tell.
> > 2) Is this a viable option for users, to have some temporary glitch
> > during the upgrade and hope that the upgrade completes without
> > trouble?
> >
> > In theory, as long as two servers are running the same major version
> > (3.5 or 3.6) we have a quorum and the system is able to make progress
> > and to serve clients.
> > I feel that this is quite dangerous, but I don't have enough context
> > to understand how this problem is possible and when we decided to
> > break compatibility.
> >
> > The other option is that I am wrong in my test and I am messing up :-)
> >
> > The other upgrade path I would like to see working like a charm is the
> > upgrade from 3.4 to 3.6, as I see that as soon as we release 3.6 we
> > should encourage users to move to 3.6 and not to 3.5.
> >
> > Regards
> > Enrico
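
The quorum argument made in this thread (N servers on the old protocol version, N on the new one, and one server being restarted) can be checked with a small simulation. The sketch below is only an illustration and is not ZooKeeper code: the class and method names are hypothetical. It assumes a simple majority quorum, servers restarted strictly one at a time, and peers with different protocol versions that cannot talk to each other at all. It does not model the time needed for leader election when the quorum moves from the old partition to the new one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ZooKeeper: models a rolling upgrade in
// which old-protocol and new-protocol servers form two separate partitions.
final class RollingUpgradeQuorum {

    /** Smallest number of servers that forms a majority of the ensemble. */
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Returns "old", "new" or "none" depending on which partition has a majority. */
    static String quorumHolder(int oldUp, int newUp, int ensembleSize) {
        int quorum = quorumSize(ensembleSize);
        if (oldUp >= quorum) {
            return "old";
        }
        if (newUp >= quorum) {
            return "new";
        }
        return "none";
    }

    /** Lists every phase of the rolling upgrade with the partition holding the quorum. */
    static List<String> simulate(int ensembleSize) {
        List<String> phases = new ArrayList<>();
        for (int upgraded = 0; upgraded < ensembleSize; upgraded++) {
            // One old server is stopped for the upgrade.
            int oldUp = ensembleSize - upgraded - 1;
            int newUp = upgraded;
            phases.add(describe("server down", oldUp, newUp, ensembleSize));
            // The same server comes back with the new protocol version.
            newUp++;
            phases.add(describe("server back", oldUp, newUp, ensembleSize));
        }
        return phases;
    }

    /** Counts the phases in which neither partition has a majority. */
    static int countPhasesWithoutQuorum(int ensembleSize) {
        int count = 0;
        for (String phase : simulate(ensembleSize)) {
            if (phase.endsWith("quorum=none")) {
                count++;
            }
        }
        return count;
    }

    private static String describe(String event, int oldUp, int newUp, int ensembleSize) {
        return event + ": old=" + oldUp + " new=" + newUp
                + " quorum=" + quorumHolder(oldUp, newUp, ensembleSize);
    }

    public static void main(String[] args) {
        // Print every phase for the 3-node cluster used in the test above.
        for (String phase : simulate(3)) {
            System.out.println(phase);
        }
    }
}
```

Under these assumptions, an ensemble of 2N+1 servers has exactly one phase without a quorum: the moment the (N+1)-th server is down for its restart. For the 3-node cluster in the test this is the restart of the second server.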