Michael,
your points are valid.
I would like to see the proposal from Mate.
Up to ZOOKEEPER-3188, no other patch in 3.6 (from my limited point of
view) changed the quorum peer protocol in a way that makes it
incompatible with 3.5.

Enrico

On Tue, Feb 11, 2020 at 23:35, Michael K. Edwards
<m.k.edwa...@gmail.com> wrote:
>
> I think it would be prudent to emphasize in the release notes that rolling
> upgrades (and mixed ensembles generally) are effectively untested.  That
> this was, in practice, a non-goal of this release cycle.  Because if we can
> get to rc2 without noticing a showstopper, clearly it's not something that
> anyone has gotten around to attempting; and there have to be a hundred
> corner cases beyond the MultiAddress issue.
>
> On Tue, Feb 11, 2020 at 12:27 PM Szalay-Bekő Máté <
> szalay.beko.m...@gmail.com> wrote:
>
> > I see the main problem here in the fact that we are missing proper
> > versioning in the leader election / quorum protocols. I tried to simply
> > implement backward compatibility in 3.6, but it didn't solve the problem.
> > The new code understands the old protocol, but it can not decide when to
> > use the new or the old protocol during connection initiation. So the old
> > servers can not read the new init messages and we still temporarily end up
> > having two partitions during rolling restart.
> >
> > I already suggested two ways to handle this later, but I think for 3.6.0
> > now the simplest solution is to disable the new MultiAddress feature and
> > stick to the old protocol version by default. Plus extend the
> > documentation with the note, that enabling the MultiAddress feature is not
> > possible during a rolling upgrade, but it needs to be done with a separate
> > rolling restart. With this approach, the rolling restart should "just work"
> > with the 3.4 / 3.5 configs and we don't require any extra step /
> > configuration from the users, unless they want to use the new feature. I
> > plan to submit a PR with these changes tomorrow to ZOOKEEPER-3720, if there
> > isn't any different opinion.
> >
> > P.S. For 4.0 we might need to put some extra thinking into backward
> > compatibility / versioning for the quorum and client protocols.
> >
> >
> > On Tue, Feb 11, 2020, 20:44 Michael K. Edwards <m.k.edwa...@gmail.com>
> > wrote:
> >
> >> I hate to say it, but I think 3.6.0 should release as is.  It is
> >> impossible
> >> to *reliably* retrofit backwards compatibility / interoperability onto a
> >> release that was engineered from the beginning without that goal.  Learn
> >> the lesson, set goals differently in the future.
> >>
> >> On Tue, Feb 11, 2020 at 9:41 AM Szalay-Bekő Máté <
> >> szalay.beko.m...@gmail.com>
> >> wrote:
> >>
> >> > FYI: I created these scripts for my local tests:
> >> > https://github.com/symat/zk-rolling-upgrade-test
> >> >
> >> > For the long term I would also add some script that actually monitors
> >> the
> >> > state of the quorum and also runs continuous traffic, not just 1-2
> >> > smoketests after each restart. But I don't know how important this would
> >> > be.
> >> >
> >> > On Tue, Feb 11, 2020 at 5:25 PM Enrico Olivelli <eolive...@gmail.com>
> >> > wrote:
> >> >
> >> > > On Tue, Feb 11, 2020 at 17:17, Andor Molnar
> >> > > <an...@apache.org> wrote:
> >> > > >
> >> > > > The most obvious one which crosses my mind is that I previously
> >> worked
> >> > > on:
> >> > > >
> >> > > > 1) run old version cluster,
> >> > > > 2) connect to each node and run smoke tests,
> >> > > > 3) restart one node with new code,
> >> > > > 4) goto 2) until all nodes are upgraded
> >> > > >
> >> > > > I think this wouldn’t work in a “unit test”, we probably need a
> >> > separate
> >> > > Jenkins job and a nice python script to do this.
> >> > > >
> >> > > > Andor
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > > On 2020. Feb 11., at 16:38, Patrick Hunt <ph...@apache.org>
> >> wrote:
> >> > > > >
> >> > > > > Anyone have ideas how we could add testing for upgrade? Obviously
> >> > > something
> >> > > > > we're missing, esp. given it's important.
> >> > >
> >> > > I will send an email in the next few days with a proposal.
> >> > > btw my idea is very similar to Andor's
> >> > >
> >> > > Once we have an automatic environment we can launch from Jenkins
> >> > >
> >> > > Enrico
> >> > >
> >> > >
> >> > > > >
> >> > > > > Patrick
> >> > > > >
> >> > > > > On Tue, Feb 11, 2020 at 12:40 AM Enrico Olivelli <
> >> > eolive...@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > >> On Tue, Feb 11, 2020 at 09:12, Szalay-Bekő Máté
> >> > > > >> <szalay.beko.m...@gmail.com> wrote:
> >> > > > >>>
> >> > > > >>> Hi All,
> >> > > > >>>
> >> > > > >>> about the question from Michael:
> >> > > > >>>> Regarding the fix, can we just make 3.6.0 aware of the old
> >> > protocol
> >> > > and
> >> > > > >>>> speak old message format when it's talking to old server?
> >> > > > >>>
> >> > > > >>> In this particular case, it might be enough. The protocol change
> >> > > happened
> >> > > > >>> now in the 'initial message' sent by the QuorumCnxManager.
> >> Maybe it
> >> > > is
> >> > > > >> not
> >> > > > >>> a problem if the new servers can not initiate channels to the
> >> old
> >> > > > >> servers,
> >> > > > >>> maybe it is enough if these channel gets initiated by the old
> >> > servers
> >> > > > >> only.
> >> > > > >>> I will test it quickly.
> >> > > > >>>
> >> > > > >>> Although I have no idea if any other thing changed in the quorum
> >> > > protocol
> >> > > > >>> between 3.5 and 3.6. In other cases it might not be enough if
> >> the
> >> > new
> >> > > > >>> servers can understand the old messages, as the old servers can
> >> > > break by
> >> > > > >>> not understanding the messages from the new servers. Also, in
> >> the
> >> > > code
> >> > > > >>> currently (AFAIK) there is no generic knowledge of protocol
> >> > > versions, the
> >> > > > >>> servers are not storing that which protocol versions they
> >> > can/should
> >> > > use
> >> > > > >> to
> >> > > > >>> communicate to which particular other servers. Maybe we don't
> >> even
> >> > > need
> >> > > > >>> this, but I would feel better if we would have more tests around
> >> > > these
> >> > > > >>> things.
> >> > > > >>>
> >> > > > >>> My suggestion for the long term:
> >> > > > >>> - let's fix this particular issue now with 3.6.0 quickly (I
> >> start
> >> > > doing
> >> > > > >>> this today)
> >> > > > >>> - let's do some automation (backed up with jenkins) that will
> >> test
> >> > a
> >> > > > >> whole
> >> > > > >>> combinations of different ZooKeeper upgrade paths by making
> >> rolling
> >> > > > >>> upgrades during some light traffic. Let's have a bit better
> >> > > definition
> >> > > > >>> about what we expect (e.g. the quorum is up, but some clients
> >> can
> >> > get
> >> > > > >>> disconnected? What will happen to the ephemeral nodes? Do we
> >> want
> >> > to
> >> > > > >>> gracefully close or transfer the user sessions before stopping
> >> the
> >> > > old
> >> > > > >>> server?) and let's see where this broke. Just by checking the
> >> > code, I
> >> > > > >> don't
> >> > > > >>> think the quorum will always be up (e.g. between older 3.4
> >> versions
> >> > > and
> >> > > > >>> 3.5).
> >> > > > >>
> >> > > > >>
> >> > > > >> I am happy to work on this topic
> >> > > > >>
> >> > > > >>> - we need to update the Wiki about the working rolling upgrade
> >> > paths
> >> > > and
> >> > > > >>> maybe about workarounds if needed
> >> > > > >>> - we might need to do some fixes (adding backward compatible
> >> > versions
> >> > > > >>> and/or specific parameters that enforce old protocol temporary
> >> > > during the
> >> > > > >>> rolling upgrade that can be changed later to the new protocol by
> >> > > either
> >> > > > >>> dynamic reconfig or by rolling restart)
> >> > > > >>
> >> > > > >> it would be much better on 3.6 code to have some support for
> >> > > > >> compatibility with 3.5 servers
> >> > > > >> we can't require old code to be forward compatible but we can
> >> make
> >> > new
> >> > > > >> code be compatible to a certain extent with old code.
> >> > > > >> If we can achieve this compatibility goal without a flag is
> >> better,
> >> > > > >> users won't have to care about this part and they simply "trust"
> >> on
> >> > us
> >> > > > >>
> >> > > > >> The rollback story is also important, but maybe we are still not
> >> > ready
> >> > > > >> for it, in case of local changes to store,
> >> > > > >> it is better to have a clear design and plan and work for a new
> >> > > release
> >> > > > >> (3.7?)
> >> > > > >>
> >> > > > >> Enrico
> >> > > > >>
> >> > > > >>>
> >> > > > >>> Depending on your comments, I am happy to create a few Jira
> >> tickets
> >> > > > >> around
> >> > > > >>> these topics.
> >> > > > >>>
> >> > > > >>> Kind regards,
> >> > > > >>> Mate
> >> > > > >>>
> >> > > > >>> ps. Enrico, sorry about your RC... I owe you a beer, let me
> >> know if
> >> > > you
> >> > > > >> are
> >> > > > >>> near to Budapest ;)
> >> > > > >>>
> >> > > > >>> On Tue, Feb 11, 2020 at 8:43 AM Enrico Olivelli <
> >> > eolive...@gmail.com
> >> > > >
> >> > > > >> wrote:
> >> > > > >>>
> >> > > > >>>> Good.
> >> > > > >>>>
> >> > > > >>>> I will cancel the vote for 3.6.0rc2.
> >> > > > >>>>
> >> > > > >>>> I appreciate very much If Mate and his colleagues have time to
> >> > work
> >> > > on
> >> > > > >> a
> >> > > > >>>> fix.
> >> > > > >>>> Otherwise I will have cycles next week
> >> > > > >>>>
> >> > > > >>>> I would also like to spend my time in setting up a few minimal
> >> > > > >> integration
> >> > > > >>>> tests about the upgrade story
> >> > > > >>>>
> >> > > > >>>> Enrico
> >> > > > >>>>
> >> > > > >>>> On Tue, Feb 11, 2020, 07:30 Michael Han <h...@apache.org>
> >> > > > >>>> wrote:
> >> > > > >>>>
> >> > > > >>>>> Kudos Enrico, very thorough work as the final gate keeper of
> >> the
> >> > > > >> release!
> >> > > > >>>>>
> >> > > > >>>>> Now with this, I'd like to *vote a -1* on the 3.6.0 RC2.
> >> > > > >>>>>
> >> > > > >>>>> I'd recommend we fix this issue for 3.6.0. ZooKeeper is one of
> >> > the
> >> > > > >> rare
> >> > > > >>>>> piece of software that put so much emphasis on compatibilities
> >> > thus
> >> > > > >> it
> >> > > > >>>> just
> >> > > > >>>>> works when upgrade / downgrade, which is amazing. One
> >> guarantee
> >> > we
> >> > > > >> always
> >> > > > >>>>> had is during rolling upgrade, the quorum will always be
> >> > available,
> >> > > > >>>> leading
> >> > > > >>>>> to no service interruption. It would be sad we lose such
> >> > capability
> >> > > > >> given
> >> > > > >>>>> this is still a tractable problem.
> >> > > > >>>>>
> >> > > > >>>>> Regarding the fix, can we just make 3.6.0 aware of the old
> >> > protocol
> >> > > > >> and
> >> > > > >>>>> speak old message format when it's talking to old server?
> >> > > Basically,
> >> > > > >> an
> >> > > > >>>>> ugly if else check against the protocol version should work
> >> and
> >> > > > >> there is
> >> > > > >>>> no
> >> > > > >>>>> need to have multiple pass on rolling upgrade process.
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > > >>>>> On Mon, Feb 10, 2020 at 10:23 PM Enrico Olivelli <
> >> > > > >> eolive...@gmail.com>
> >> > > > >>>>> wrote:
> >> > > > >>>>>
> >> > > > >>>>>> I suggest this plan:
> >> > > > >>>>>> - release 3.6.0 now
> >> > > > >>>>>> - improve the migration story, the flow outlined by Mate is
> >> > > > >>>>>> interesting, but it will take time
> >> > > > >>>>>>
> >> > > > >>>>>> 3.6.0rc2 got enough binding votes so I am going to finalize
> >> the
> >> > > > >>>>>> release this evening (within 8-10 hours) if no one comes out
> >> in
> >> > > the
> >> > > > >>>>>> VOTE thread with a -1
> >> > > > >>>>>>
> >> > > > >>>>>> Enrico
> >> > > > >>>>>>
> >> > > > >>>>>> On Mon, Feb 10, 2020 at 19:33, Patrick Hunt
> >> > > > >>>>>> <ph...@apache.org> wrote:
> >> > > > >>>>>>>
> >> > > > >>>>>>> On Mon, Feb 10, 2020 at 3:38 AM Andor Molnar <
> >> an...@apache.org
> >> > >
> >> > > > >>>> wrote:
> >> > > > >>>>>>>
> >> > > > >>>>>>>> Hi,
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> Answers inline.
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>> In my experience when you are close to a release it is better not to
> >> > > > >>>>>>>>> make big changes. (I am among the approvers of that patch, so I am
> >> > > > >>>>>>>>> responsible for this change)
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> Although this statement is acceptable for me, I don’t feel
> >> > this
> >> > > > >>>> patch
> >> > > > >>>>>>>> should not have been merged into 3.6.0. Submission has been
> >> > > > >>>> preceded
> >> > > > >>>>>> by a
> >> > > > >>>>>>>> long argument with MAPR folks who originally wanted to be
> >> > > > >> merged
> >> > > > >>>> into
> >> > > > >>>>>> 3.4
> >> > > > >>>>>>>> branch (considering the pace how ZooKeeper community is
> >> moving
> >> > > > >>>>>> forward) and
> >> > > > >>>>>>>> we reached an agreement that release it with 3.6.0.
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> Make a long story short, this patch has been outstanding
> >> for
> >> > > > >> ages
> >> > > > >>>>>> without
> >> > > > >>>>>>>> much attention from the community and contributors made a
> >> lot
> >> > > > >> of
> >> > > > >>>>>> effort to
> >> > > > >>>>>>>> get it done before the release.
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>> I would like to hear from people that have been in the community for
> >> > > > >>>>>>>>> a long time, then I am ready to complete the release process for
> >> > > > >>>>>>>>> 3.6.0rc2.
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> Me too.
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> I tend to accept the way rolling restart works now - as you
> >> > > > >>>> described
> >> > > > >>>>>>>> Enrico - and given that situation was pretty much the same
> >> > > > >> between
> >> > > > >>>>> 3.4
> >> > > > >>>>>> and
> >> > > > >>>>>>>> 3.5, I don’t feel we have to make additional changes.
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> On the other hand, the fix that Mate suggested sounds quite
> >> > > > >> cool,
> >> > > > >>>> I’m
> >> > > > >>>>>> also
> >> > > > >>>>>>>> happy to work on getting it in.
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> Fyi, Release Management page says the following:
> >> > > > >>>>>>>>
> >> > > > >>>>>>
> >> > > > >>>>
> >> > > > >>
> >> > >
> >> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
> >> > > > >>>>>>>>
> >> > > > >>>>>>>> "major.minor release of ZooKeeper must be backwards
> >> compatible
> >> > > > >> with
> >> > > > >>>>> the
> >> > > > >>>>>>>> previous minor release, major.(minor-1)"
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>> Our users, direct and indirect, value the ability to
> >> migrate to
> >> > > > >> newer
> >> > > > >>>>>>> versions - esp as we drop support for older. Frictions such
> >> as
> >> > > > >> this
> >> > > > >>>> can
> >> > > > >>>>>> be
> >> > > > >>>>>>> a reason to go elsewhere. I'm "pro" b/w compat - esp given
> >> our
> >> > > > >>>>> published
> >> > > > >>>>>>> guidelines.
> >> > > > >>>>>>>
> >> > > > >>>>>>> Patrick
> >> > > > >>>>>>>
> >> > > > >>>>>>>
> >> > > > >>>>>>>> Andor
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>> On 2020. Feb 10., at 11:32, Enrico Olivelli <
> >> > > > >> eolive...@gmail.com
> >> > > > >>>>>
> >> > > > >>>>>> wrote:
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> Thank you Mate for checking and explaining this story.
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> I find it very interesting that the cause is
> >> ZOOKEEPER-3188
> >> > > > >> as:
> >> > > > >>>>>>>>> - it is the last "big patch" committed to 3.6 before
> >> > > > >> starting the
> >> > > > >>>>>>>>> release process
> >> > > > >>>>>>>>> - it is the cause of the failure of the first RC
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> In my experience when you are close to a release it is better not to
> >> > > > >>>>>>>>> make big changes. (I am among the approvers of that patch, so I am
> >> > > > >>>>>>>>> responsible for this change)
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> This is a pointer to the change, for anyone who wants to understand
> >> > > > >>>>>>>>> the context better
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>
> >> > > > >>>>>
> >> > > > >>>>
> >> > > > >>
> >> > >
> >> >
> >> https://github.com/apache/zookeeper/pull/1048/files#diff-7a209d890686bcba351d758b64b22a7dR11
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> IIUC even for the upgrade from 3.4 to 3.5 the story was
> >> the
> >> > > > >> same
> >> > > > >>>>> and
> >> > > > >>>>>>>>> if this statement holds then I feel we can continue
> >> > > > >>>>>>>>> with this release.
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> - Reverting ZOOKEEPER-3188 is not an option for me, it is
> >> too
> >> > > > >>>>>> complex.
> >> > > > >>>>>>>>> - Making 3.5 and 3.6 "compatible" can be very tricky and
> >> we
> >> > > > >> do
> >> > > > >>>> not
> >> > > > >>>>>>>>> have tools to certify this compatibility (at least not in
> >> the
> >> > > > >>>> short
> >> > > > >>>>>>>>> term)
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> I would like to hear from people that have been in the community for
> >> > > > >>>>>>>>> a long time, then I am ready to complete the release process for
> >> > > > >>>>>>>>> 3.6.0rc2.
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> I will update the website and the release notes with a
> >> > > > >> specific
> >> > > > >>>>>>>>> warning about the upgrade, we should also update the Wiki
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> Enrico
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>>
> >> > > > >>>>>>>>> On Mon, Feb 10, 2020 at 11:17, Szalay-Bekő Máté
> >> > > > >>>>>>>>> <szalay.beko.m...@gmail.com> wrote:
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> Hi Enrico!
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> This is caused by the different PROTOCOL_VERSION in the
> >> > > > >>>>>>>> QuorumCnxManager.
> >> > > > >>>>>>>>>> The Protocol version  was changed last time in
> >> > > > >> ZOOKEEPER-2186
> >> > > > >>>>>> released
> >> > > > >>>>>>>>>> first in 3.4.7 and 3.5.1 to avoid some crashing / fix
> >> some
> >> > > > >> bugs.
> >> > > > >>>>>> Later I
> >> > > > >>>>>>>>>> also changed the protocol version when the format of the
> >> > > > >> initial
> >> > > > >>>>>> message
> >> > > > >>>>>>>>>> changed in ZOOKEEPER-3188. So actually the quorum
> >> protocol
> >> > > > >> is
> >> > > > >>>> not
> >> > > > >>>>>>>>>> compatible in this case and is the 'expected' behavior if
> >> > > > >> you
> >> > > > >>>>>> upgrade
> >> > > > >>>>>>>> e.g
> >> > > > >>>>>>>>>> from 3.4.6 to 3.4.7, or 3.4.6 to 3.5.5 or e.g from 3.5.6
> >> to
> >> > > > >>>> 3.6.0.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> We had some discussion in the PR of ZOOKEEPER-3188 back
> >> > > > >> then and
> >> > > > >>>>>> got to
> >> > > > >>>>>>>> the
> >> > > > >>>>>>>>>> conclusion that it is not that bad, as there will be no
> >> data
> >> > > > >>>> loss
> >> > > > >>>>>> as you
> >> > > > >>>>>>>>>> wrote. The tricky thing is that during rolling upgrade we
> >> > > > >> should
> >> > > > >>>>>> ensure
> >> > > > >>>>>>>>>> both backward and forward compatibility to make sure that
> >> > > > >> the
> >> > > > >>>> old
> >> > > > >>>>>> and
> >> > > > >>>>>>>> the
> >> > > > >>>>>>>>>> new part of the quorum can still speak to each other. The
> >> > > > >>>> current
> >> > > > >>>>>>>> solution
> >> > > > >>>>>>>>>> (simply failing if the protocol versions mismatch) is
> >> more
> >> > > > >>>> simple
> >> > > > >>>>>> and
> >> > > > >>>>>>>> still
> >> > > > >>>>>>>>>> working just fine: as the servers are restarted
> >> one-by-one,
> >> > > > >> the
> >> > > > >>>>>> nodes
> >> > > > >>>>>>>> with
> >> > > > >>>>>>>>>> the old protocol version and the nodes with the new
> >> protocol
> >> > > > >>>>> version
> >> > > > >>>>>>>> will
> >> > > > >>>>>>>>>> form two partitions, but any given time only one
> >> partition
> >> > > > >> will
> >> > > > >>>>>> have the
> >> > > > >>>>>>>>>> quorum.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> Still, thinking it through, as a side effect in these
> >> cases
> >> > > > >> there
> >> > > > >>>>>> will
> >> > > > >>>>>>>> be a
> >> > > > >>>>>>>>>> short time when none of the partitions will have quorums
> >> > > > >> (when
> >> > > > >>>> we
> >> > > > >>>>>> have N
> >> > > > >>>>>>>>>> servers with the old protocol version, N servers with the
> >> > > > >> new
> >> > > > >>>>>> protocol
> >> > > > >>>>>>>>>> version, and there is one server just being restarted). I
> >> > > > >> am not
> >> > > > >>>>>> sure
> >> > > > >>>>>>>> if we
> >> > > > >>>>>>>>>> can accept this.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> For ZOOKEEPER-3188 we can add a small patch to make it
> >> > > > >> possible
> >> > > > >>>> to
> >> > > > >>>>>> parse
> >> > > > >>>>>>>>>> the initial message of the old protocol version with the
> >> new
> >> > > > >>>> code.
> >> > > > >>>>>> But
> >> > > > >>>>>>>> I am
> >> > > > >>>>>>>>>> not sure if it would be enough (as the old code will not
> >> be
> >> > > > >> able
> >> > > > >>>>> to
> >> > > > >>>>>>>> parse
> >> > > > >>>>>>>>>> the new initial message).
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> One option can be to make a patch also for 3.5 to have a
> >> > > > >> version
> >> > > > >>>>>> which
> >> > > > >>>>>>>>>> supports both protocol versions. (let's say in 3.5.8)
> >> Then
> >> > > > >> we
> >> > > > >>>> can
> >> > > > >>>>>> write
> >> > > > >>>>>>>> to
> >> > > > >>>>>>>>>> the release note, that if you need rolling upgrade from
> >> any
> >> > > > >>>>> versions
> >> > > > >>>>>>>> since
> >> > > > >>>>>>>>>> 3.4.7, then you have to first upgrade from 3.5.8 before
> >> > > > >>>> upgrading
> >> > > > >>>>> to
> >> > > > >>>>>>>> 3.6.0.
> >> > > > >>>>>>>>>> We can even make the same thing on the 3.4 branch.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> But I am also new to the community... It would be great
> >> to
> >> > > > >> hear
> >> > > > >>>>> the
> >> > > > >>>>>>>> opinion
> >> > > > >>>>>>>>>> of more experienced people.
> >> > > > >>>>>>>>>> Whatever the decision will be, I am happy to make the
> >> > > > >> changes.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> And sorry for breaking the RC (if we decide that this
> >> needs
> >> > > > >> to
> >> > > > >>>> be
> >> > > > >>>>>>>>>> changed...).  ZOOKEEPER-3188 was a complex patch.
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> Kind regards,
> >> > > > >>>>>>>>>> Mate
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>> On Mon, Feb 10, 2020 at 9:47 AM Enrico Olivelli <
> >> > > > >>>>>> eolive...@gmail.com>
> >> > > > >>>>>>>> wrote:
> >> > > > >>>>>>>>>>
> >> > > > >>>>>>>>>>> Hi,
> >> > > > >>>>>>>>>>> even if we had enough binding +1 on 3.6.0rc2 before
> >> > > > >> closing the
> >> > > > >>>>>> VOTE
> >> > > > >>>>>>>>>>> of 3.6.0 I wanted to finish my tests and I am coming to
> >> an
> >> > > > >>>>> apparent
> >> > > > >>>>>>>>>>> blocker.
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> I am trying to upgrade a 3.5.6 cluster to 3.6.0, but it
> >> > > > >> looks
> >> > > > >>>>> like
> >> > > > >>>>>>>>>>> peers are not able to talk to each other.
> >> > > > >>>>>>>>>>> I have a cluster of 3, server1, server2 and server3.
> >> > > > >>>>>>>>>>> When I upgrade server1 to 3.6.0rc2 I see this kind of
> >> > > > >> errors on
> >> > > > >>>>> 3.5
> >> > > > >>>>>>>> nodes:
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> 2020-02-10 09:35:07,745 [myid:3] - INFO
> >> > > > >>>>>>>>>>> [localhost/127.0.0.1:3334:QuorumCnxManager$Listener@918]
> >> -
> >> > > > >>>>>> Received
> >> > > > >>>>>>>>>>> connection request 127.0.0.1:62591
> >> > > > >>>>>>>>>>> 2020-02-10 09:35:07,746 [myid:3] - ERROR
> >> > > > >>>>>>>>>>> [localhost/127.0.0.1:3334:QuorumCnxManager@527] -
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>
> >> > > > >>>>>
> >> > > > >>>>
> >> > > > >>
> >> > >
> >> >
> >> org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage$InitialMessageException:
> >> > > > >>>>>>>>>>> Got unrecognized protocol version -65535
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> Once I upgrade all of the peers the system is up and running, with
> >> > > > >>>>>>>>>>> apparently no data loss.
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> During the upgrade as soon as I upgrade the first node,
> >> > > > >> say,
> >> > > > >>>>>> server1,
> >> > > > >>>>>>>>>>> server1 is not able to accept connections (error "Close
> >> of
> >> > > > >>>>> session
> >> > > > >>>>>> 0x0
> >> > > > >>>>>>>>>>> java.io.IOException: ZooKeeperServer not running")  from
> >> > > > >>>> clients,
> >> > > > >>>>>> this
> >> > > > >>>>>>>>>>> is expected, because as far as it cannot talk with the
> >> > > > >> other
> >> > > > >>>>> peers
> >> > > > >>>>>> it
> >> > > > >>>>>>>>>>> is practically partitioned away from the cluster.
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> My questions are:
> >> > > > >>>>>>>>>>> 1) is this expected ? I can't remember protocol changes
> >> > > > >> from
> >> > > > >>>> 3.5
> >> > > > >>>>> to
> >> > > > >>>>>>>>>>> 3.6, but actually 3.6 diverged from 3.5 branch so long
> >> ago,
> >> > > > >>>> and I
> >> > > > >>>>>> was
> >> > > > >>>>>>>>>>> not in the community as dev so I cannot tell
> >> > > > >>>>>>>>>>> 2) is this a viable option for users ? to have some
> >> > > > >> temporary
> >> > > > >>>>>> glitch
> >> > > > >>>>>>>>>>> during the upgrade and hope that the upgrade completes
> >> > > > >> without
> >> > > > >>>>>>>>>>> troubles ?
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> In theory as long as two servers are running the same
> >> major
> >> > > > >>>>> version
> >> > > > >>>>>>>>>>> (3.5 or 3.6) we have a quorum and the system is able to
> >> > > > >> make
> >> > > > >>>>>> progress
> >> > > > >>>>>>>>>>> and to serve clients.
> >> > > > >>>>>>>>>>> I feel that this is quite dangerous, but I don't have
> >> > > > >> enough
> >> > > > >>>>>> context
> >> > > > >>>>>>>>>>> to understand how this problem is possible and when we
> >> > > > >> decided
> >> > > > >>>> to
> >> > > > >>>>>>>>>>> break compatibility.
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> The other option is that I am wrong in my test and I am
> >> > > > >> messing
> >> > > > >>>>> up
> >> > > > >>>>>> :-)
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> The other upgrade path I would like to see working like
> >> a
> >> > > > >> charm
> >> > > > >>>>> is
> >> > > > >>>>>> the
> >> > > > >>>>>>>>>>> upgrade from 3.4 to 3.6, as I see that as soon as we
> >> > > > >> release
> >> > > > >>>> 3.6
> >> > > > >>>>> we
> >> > > > >>>>>>>>>>> should encourage users to move to 3.6 and not to 3.5.
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>>>> Regards
> >> > > > >>>>>>>>>>> Enrico
> >> > > > >>>>>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>>>
> >> > > > >>>>>>
> >> > > > >>>>>
> >> > > > >>>>
> >> > > > >>
> >> > > >
> >> > >
> >> >
> >>
> >>

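The quorum gap Mate describes in the thread (N servers on the old protocol,
N on the new one, and one server being restarted) can be checked with a
little arithmetic. The sketch below is only an illustration under simplifying
assumptions: servers speaking different protocol versions cannot talk to each
other at all, servers are restarted strictly one at a time, and the function
name rolling_upgrade_quorum is made up for this example (it is not ZooKeeper
code).

```python
def rolling_upgrade_quorum(ensemble_size):
    """Walk through a one-by-one rolling upgrade and report, for every step,
    how many old and new servers are up and whether either group alone
    holds a majority of the ensemble.

    Assumption: old and new servers form two separate partitions because
    they cannot parse each other's initial messages.
    """
    majority = ensemble_size // 2 + 1
    timeline = []
    for upgraded in range(ensemble_size):
        # One old server is stopped for the upgrade.
        old_up = ensemble_size - upgraded - 1
        new_up = upgraded
        timeline.append(
            ("restarting", old_up, new_up, max(old_up, new_up) >= majority)
        )
        # The same server comes back running the new version.
        new_up = upgraded + 1
        timeline.append(
            ("running", old_up, new_up, max(old_up, new_up) >= majority)
        )
    return timeline


if __name__ == "__main__":
    for phase, old_up, new_up, has_quorum in rolling_upgrade_quorum(3):
        print(f"{phase:10s} old={old_up} new={new_up} quorum={has_quorum}")
```

Under these assumptions, an ensemble of 2N+1 servers loses quorum in exactly
one step: while the (N+1)-th server is being restarted, with N old and N new
servers up. Before that step the old partition still has the majority, and
after it the new partition does.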