On Mon, Feb 10, 2020 at 3:38 AM Andor Molnar <an...@apache.org> wrote:

> Hi,
>
> Answers inline.
>
>
> > In my experience when you are close to a release it is better not to
> > make big changes. (I am among the approvers of that patch, so I am
> > responsible for this change)
>
>
>
> Although this statement is acceptable to me, I don’t feel it was wrong to
> merge this patch into 3.6.0. Its submission was preceded by a long
> argument with the MAPR folks, who originally wanted it merged into the 3.4
> branch (considering the pace at which the ZooKeeper community is moving
> forward), and we reached an agreement to release it with 3.6.0.
>
> To make a long story short, this patch had been outstanding for ages without
> much attention from the community, and the contributors made a lot of effort
> to get it done before the release.
>
>
> > I would like to hear from people that have been in the community for
> > a long time, then I am ready to complete the release process for
> > 3.6.0rc2.
>
>
> Me too.
>
> I tend to accept the way rolling restart works now - as you described,
> Enrico - and given that the situation was pretty much the same between 3.4
> and 3.5, I don’t feel we have to make additional changes.
>
> On the other hand, the fix that Mate suggested sounds quite cool, and I’m
> also happy to work on getting it in.
>
> Fyi, Release Management page says the following:
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
>
> "major.minor release of ZooKeeper must be backwards compatible with the
> previous minor release, major.(minor-1)"
>
>
Our users, direct and indirect, value the ability to migrate to newer
versions - esp as we drop support for older ones. Frictions such as this can
be a reason to go elsewhere. I'm "pro" b/w compat - esp given our published
guidelines.

Patrick


> Andor
>
>
>
>
> > On 2020. Feb 10., at 11:32, Enrico Olivelli <eolive...@gmail.com> wrote:
> >
> > Thank you Mate for checking and explaining this story.
> >
> > I find it very interesting that the cause is ZOOKEEPER-3188 as:
> > - it is the last "big patch" committed to 3.6 before starting the
> > release process
> > - it is the cause of the failure of the first RC
> >
> > In my experience when you are close to a release it is better not to
> > make big changes. (I am among the approvers of that patch, so I am
> > responsible for this change)
> >
> > This is a pointer to the change for anyone who wants to better
> > understand the context
> >
> > https://github.com/apache/zookeeper/pull/1048/files#diff-7a209d890686bcba351d758b64b22a7dR11
> >
> > IIUC even for the upgrade from 3.4 to 3.5 the story was the same and
> > if this statement holds then I feel we can continue
> > with this release.
> >
> > - Reverting ZOOKEEPER-3188 is not an option for me, it is too complex.
> > - Making 3.5 and 3.6 "compatible" can be very tricky and we do not
> > have tools to certify this compatibility (at least not in the short
> > term)
> >
> > I would like to hear from people that have been in the community for
> > a long time, then I am ready to complete the release process for
> > 3.6.0rc2.
> >
> > I will update the website and the release notes with a specific
> > warning about the upgrade; we should also update the Wiki.
> >
> > Enrico
> >
> >
> > Il giorno lun 10 feb 2020 alle ore 11:17 Szalay-Bekő Máté
> > <szalay.beko.m...@gmail.com> ha scritto:
> >>
> >> Hi Enrico!
> >>
> >> This is caused by the different PROTOCOL_VERSION in the QuorumCnxManager.
> >> The protocol version was previously changed in ZOOKEEPER-2186, first
> >> released in 3.4.7 and 3.5.1, to avoid some crashes and fix some bugs.
> >> Later I also changed the protocol version when the format of the initial
> >> message changed in ZOOKEEPER-3188. So the quorum protocol is actually not
> >> compatible in this case, and this is the 'expected' behavior if you
> >> upgrade e.g. from 3.4.6 to 3.4.7, from 3.4.6 to 3.5.5, or from 3.5.6 to
> >> 3.6.0.
> >>
> >> We had some discussion in the PR of ZOOKEEPER-3188 back then and came to
> >> the conclusion that it is not that bad, as there will be no data loss, as
> >> you wrote. The tricky thing is that during a rolling upgrade we should
> >> ensure both backward and forward compatibility to make sure that the old
> >> and the new parts of the quorum can still speak to each other. The
> >> current solution (simply failing if the protocol versions mismatch) is
> >> simpler and still works just fine: as the servers are restarted one by
> >> one, the nodes with the old protocol version and the nodes with the new
> >> protocol version will form two partitions, but at any given time only one
> >> partition will have the quorum.
> >>
> >> Still, thinking it through, as a side effect in these cases there will be
> >> a short time when neither partition has a quorum (when we have N servers
> >> with the old protocol version, N servers with the new protocol version,
> >> and one server just being restarted). I am not sure if we can accept
> >> this.
> >>
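
The no-quorum window described above can be checked with a small simulation. This is only a sketch, assuming that servers on different protocol versions cannot talk to each other at all and that servers are restarted strictly one at a time; `rolling_upgrade_quorum` is a hypothetical helper written for illustration, not ZooKeeper code.

```python
def rolling_upgrade_quorum(ensemble_size):
    """Yield (step, old_up, new_up, has_quorum) for a one-by-one upgrade.

    Assumption: old and new servers form two separate partitions, and a
    partition has quorum only with a strict majority of the whole ensemble.
    """
    majority = ensemble_size // 2 + 1
    for upgraded in range(ensemble_size):
        # One old server is down while it restarts with the new version.
        old_up = ensemble_size - upgraded - 1
        new_up = upgraded
        yield (f"restarting #{upgraded + 1}", old_up, new_up,
               max(old_up, new_up) >= majority)
        # The restarted server has rejoined on the new protocol version.
        old_up = ensemble_size - upgraded - 1
        new_up = upgraded + 1
        yield (f"upgraded {upgraded + 1}", old_up, new_up,
               max(old_up, new_up) >= majority)


for step in rolling_upgrade_quorum(3):
    print(step)
```

Under these assumptions, an ensemble of 2N+1 servers loses quorum in exactly one step: the one where N servers are still old, N are already new, and the remaining server is restarting.
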
> >> For ZOOKEEPER-3188 we can add a small patch to make it possible to parse
> >> the initial message of the old protocol version with the new code. But I
> >> am not sure if it would be enough (as the old code will not be able to
> >> parse the new initial message).
> >>
> >> One option is to also make a patch for 3.5, so that we have a version
> >> (let's say 3.5.8) which supports both protocol versions. Then we can write
> >> in the release notes that if you need a rolling upgrade from any version
> >> since 3.4.7, you first have to upgrade to 3.5.8 before upgrading to
> >> 3.6.0. We can even do the same thing on the 3.4 branch.
> >>
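
A receiver that supports both protocol versions, as proposed above, could look roughly like the sketch below. The value -65535 comes from the error log in this thread; -65536 as the older value, the payload layouts, and the names `parse_initial_message` and `build_initial_message` are all assumptions made for illustration and are not ZooKeeper's actual wire format or API.

```python
import struct

# -65535 is the version reported in the error log in this thread.
# -65536 as the older version is an assumption of this sketch.
PROTOCOL_V1 = -65536
PROTOCOL_V2 = -65535


def build_initial_message(version, sid, addresses):
    """Build a hypothetical handshake: version (int64), sid (int64), payload."""
    payload = "|".join(addresses).encode("utf-8")
    return struct.pack(">qq", version, sid) + payload


def parse_initial_message(data):
    """Parse the hypothetical handshake, accepting both protocol versions.

    Illustrative layouts only: V1 carries a single "host:port" string,
    V2 carries several addresses joined by "|".
    """
    version, sid = struct.unpack_from(">qq", data, 0)
    payload = data[16:].decode("utf-8")
    if version == PROTOCOL_V1:
        addresses = [payload]
    elif version == PROTOCOL_V2:
        addresses = payload.split("|")
    else:
        raise ValueError(f"Got unrecognized protocol version {version}")
    return version, sid, addresses


old_msg = build_initial_message(PROTOCOL_V1, 1, ["127.0.0.1:3334"])
print(parse_initial_message(old_msg))
```

Accepting both versions on the receiving side is only half of the story: a sender would also have to emit a message the older peer can parse, which is why the proposal needs a patched 3.5 release as well.
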
> >> But I am also new to the community... It would be great to hear the
> >> opinion of more experienced people.
> >> Whatever the decision will be, I am happy to make the changes.
> >>
> >> And sorry for breaking the RC (if we decide that this needs to be
> >> changed...).  ZOOKEEPER-3188 was a complex patch.
> >>
> >> Kind regards,
> >> Mate
> >>
> >> On Mon, Feb 10, 2020 at 9:47 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> >>
> >>> Hi,
> >>> even though we had enough binding +1s on 3.6.0rc2, before closing the
> >>> VOTE on 3.6.0 I wanted to finish my tests, and I have run into an
> >>> apparent blocker.
> >>>
> >>> I am trying to upgrade a 3.5.6 cluster to 3.6.0, but it looks like
> >>> peers are not able to talk to each other.
> >>> I have a cluster of 3, server1, server2 and server3.
> >>> When I upgrade server1 to 3.6.0rc2 I see this kind of error on the 3.5
> >>> nodes:
> >>>
> >>> 2020-02-10 09:35:07,745 [myid:3] - INFO
> >>> [localhost/127.0.0.1:3334:QuorumCnxManager$Listener@918] - Received
> >>> connection request 127.0.0.1:62591
> >>> 2020-02-10 09:35:07,746 [myid:3] - ERROR
> >>> [localhost/127.0.0.1:3334:QuorumCnxManager@527] -
> >>> org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage$InitialMessageException:
> >>> Got unrecognized protocol version -65535
> >>>
> >>> Once I upgrade all of the peers the system is up and running, apparently
> >>> without any data loss.
> >>>
> >>> During the upgrade, as soon as I upgrade the first node, say server1,
> >>> server1 is not able to accept connections from clients (error "Close of
> >>> session 0x0 java.io.IOException: ZooKeeperServer not running"). This
> >>> is expected, because as long as it cannot talk with the other peers it
> >>> is practically partitioned away from the cluster.
> >>>
> >>> My questions are:
> >>> 1) is this expected ? I can't remember protocol changes from 3.5 to
> >>> 3.6, but actually 3.6 diverged from 3.5 branch so long ago, and I was
> >>> not in the community as dev so I cannot tell
> >>> 2) is this a viable option for users ? to have some temporary glitch
> >>> during the upgrade and hope that the upgrade completes without
> >>> troubles ?
> >>>
> >>> In theory, as long as two servers are running the same minor version
> >>> (3.5 or 3.6) we have a quorum and the system is able to make progress
> >>> and to serve clients.
> >>> I feel that this is quite dangerous, but I don't have enough context
> >>> to understand how this problem is possible and when we decided to
> >>> break compatibility.
> >>>
> >>> The other option is that I am wrong in my test and I am messing up :-)
> >>>
> >>> The other upgrade path I would like to see working like a charm is the
> >>> upgrade from 3.4 to 3.6, as I see that as soon as we release 3.6 we
> >>> should encourage users to move to 3.6 and not to 3.5.
> >>>
> >>> Regards
> >>> Enrico
> >>>
>
>
