I think it would be prudent to emphasize in the release notes that rolling
upgrades (and mixed ensembles generally) are effectively untested, and that
this was, in practice, a non-goal of this release cycle.  If we can get to
rc2 without noticing a showstopper, clearly nobody has gotten around to
attempting it; and there have to be a hundred corner cases beyond the
MultiAddress issue.

On Tue, Feb 11, 2020 at 12:27 PM Szalay-Bekő Máté <
[email protected]> wrote:

> I see the main problem here in the fact that we are missing proper
> versioning in the leader election / quorum protocols. I tried to simply
> implement backward compatibility in 3.6, but it didn't solve the problem.
> The new code understands the old protocol, but it cannot decide when to
> use the new or the old protocol during connection initiation. So the old
> servers cannot read the new init messages and we still temporarily end up
> having two partitions during rolling restart.
>
> I already suggested two ways to handle this later, but I think for 3.6.0
> now the simplest solution is to disable the new MultiAddress feature and
> stick to the old protocol version by default. Plus extend the
> documentation with a note that enabling the MultiAddress feature is not
> possible during a rolling upgrade; it needs to be done with a separate
> rolling restart. With this approach, the rolling restart should "just work"
> with the 3.4 / 3.5 configs and we don't require any extra step /
> configuration from the users, unless they want to use the new feature. I
> plan to submit a PR with these changes tomorrow to ZOOKEEPER-3720, if there
> isn't any different opinion.
>
> P.S. For 4.0 we might need to put some extra thinking into backward
> compatibility / versioning for the quorum and client protocols.
>
>
> On Tue, Feb 11, 2020, 20:44 Michael K. Edwards <[email protected]>
> wrote:
>
>> I hate to say it, but I think 3.6.0 should release as is.  It is impossible
>> to *reliably* retrofit backwards compatibility / interoperability onto a
>> release that was engineered from the beginning without that goal.  Learn
>> the lesson, set goals differently in the future.
>>
>> On Tue, Feb 11, 2020 at 9:41 AM Szalay-Bekő Máté <
>> [email protected]>
>> wrote:
>>
>> > FYI: I created these scripts for my local tests:
>> > https://github.com/symat/zk-rolling-upgrade-test
>> >
>> > For the long term I would also add some script that actually monitors the
>> > state of the quorum and also runs continuous traffic, not just 1-2
>> > smoke tests after each restart. But I don't know how important this would
>> > be.
>> >
>> > On Tue, Feb 11, 2020 at 5:25 PM Enrico Olivelli <[email protected]>
>> > wrote:
>> >
>> > > On Tue, Feb 11, 2020 at 17:17, Andor Molnar
>> > > <[email protected]> wrote:
>> > > >
>> > > > The most obvious one that comes to mind is one I previously worked on:
>> > > >
>> > > > 1) run old version cluster,
>> > > > 2) connect to each node and run smoke tests,
>> > > > 3) restart one node with new code,
>> > > > 4) goto 2) until all nodes are upgraded
>> > > >
>> > > > I think this wouldn’t work in a “unit test”; we probably need a separate
>> > > > Jenkins job and a nice Python script to do this.
>> > > >
>> > > > Andor
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > > On 2020. Feb 11., at 16:38, Patrick Hunt <[email protected]> wrote:
>> > > > >
>> > > > > Anyone have ideas how we could add testing for upgrade? Obviously
>> > > > > something we're missing, esp. given it's important.
>> > >
>> > > I will send an email in the next few days with a proposal.
>> > > By the way, my idea is very similar to Andor's.
>> > >
>> > > Once we have an automated environment we can launch it from Jenkins.
>> > >
>> > > Enrico
>> > >
>> > >
>> > > > >
>> > > > > Patrick
>> > > > >
>> > > > > On Tue, Feb 11, 2020 at 12:40 AM Enrico Olivelli <[email protected]> wrote:
>> > > > >
>> > > > >> On Tue, Feb 11, 2020 at 09:12, Szalay-Bekő Máté
>> > > > >> <[email protected]> wrote:
>> > > > >>>
>> > > > >>> Hi All,
>> > > > >>>
>> > > > >>> about the question from Michael:
>> > > > >>>> Regarding the fix, can we just make 3.6.0 aware of the old protocol and
>> > > > >>>> speak old message format when it's talking to old server?
>> > > > >>>
>> > > > >>> In this particular case, it might be enough. The protocol change happened
>> > > > >>> now in the 'initial message' sent by the QuorumCnxManager. Maybe it is not
>> > > > >>> a problem if the new servers cannot initiate channels to the old servers;
>> > > > >>> maybe it is enough if these channels get initiated by the old servers only.
>> > > > >>> I will test it quickly.
>> > > > >>>
>> > > > >>> Although I have no idea if anything else changed in the quorum protocol
>> > > > >>> between 3.5 and 3.6. In other cases it might not be enough if the new
>> > > > >>> servers can understand the old messages, as the old servers can break by
>> > > > >>> not understanding the messages from the new servers. Also, in the code
>> > > > >>> currently (AFAIK) there is no generic knowledge of protocol versions: the
>> > > > >>> servers do not store which protocol versions they can/should use to
>> > > > >>> communicate with which particular other servers. Maybe we don't even need
>> > > > >>> this, but I would feel better if we had more tests around these
>> > > > >>> things.
>> > > > >>>
>> > > > >>> My suggestion for the long term:
>> > > > >>> - let's fix this particular issue now with 3.6.0 quickly (I start doing
>> > > > >>> this today)
>> > > > >>> - let's do some automation (backed up with Jenkins) that will test a whole
>> > > > >>> set of combinations of different ZooKeeper upgrade paths by making rolling
>> > > > >>> upgrades during some light traffic. Let's have a somewhat better definition
>> > > > >>> of what we expect (e.g. the quorum is up, but some clients can get
>> > > > >>> disconnected? What will happen to the ephemeral nodes? Do we want to
>> > > > >>> gracefully close or transfer the user sessions before stopping the old
>> > > > >>> server?) and let's see where this breaks. Just by checking the code, I don't
>> > > > >>> think the quorum will always be up (e.g. between older 3.4 versions and
>> > > > >>> 3.5).
>> > > > >>
>> > > > >>
>> > > > >> I am happy to work on this topic
>> > > > >>
>> > > > >>> - we need to update the Wiki about the working rolling upgrade paths and
>> > > > >>> maybe about workarounds if needed
>> > > > >>> - we might need to do some fixes (adding backward compatible versions
>> > > > >>> and/or specific parameters that temporarily enforce the old protocol during
>> > > > >>> the rolling upgrade and can be changed later to the new protocol by either
>> > > > >>> dynamic reconfig or by rolling restart)
>> > > > >>
>> > > > >> it would be much better for the 3.6 code to have some support for
>> > > > >> compatibility with 3.5 servers.
>> > > > >> We can't require old code to be forward compatible, but we can make new
>> > > > >> code compatible to a certain extent with old code.
>> > > > >> If we can achieve this compatibility goal without a flag, so much the
>> > > > >> better: users won't have to care about this part and can simply "trust" us.
>> > > > >>
>> > > > >> The rollback story is also important, but maybe we are still not ready
>> > > > >> for it. In case of local changes to the store,
>> > > > >> it is better to have a clear design and plan and work towards a new release
>> > > > >> (3.7?)
>> > > > >>
>> > > > >> Enrico
>> > > > >>
>> > > > >>>
>> > > > >>> Depending on your comments, I am happy to create a few Jira tickets around
>> > > > >>> these topics.
>> > > > >>>
>> > > > >>> Kind regards,
>> > > > >>> Mate
>> > > > >>>
>> > > > >>> ps. Enrico, sorry about your RC... I owe you a beer, let me know if you are
>> > > > >>> near Budapest ;)
>> > > > >>>
>> > > > >>> On Tue, Feb 11, 2020 at 8:43 AM Enrico Olivelli <[email protected]> wrote:
>> > > > >>>
>> > > > >>>> Good.
>> > > > >>>>
>> > > > >>>> I will cancel the vote for 3.6.0rc2.
>> > > > >>>>
>> > > > >>>> I would appreciate it very much if Mate and his colleagues have time to
>> > > > >>>> work on a fix.
>> > > > >>>> Otherwise I will have cycles next week.
>> > > > >>>>
>> > > > >>>> I would also like to spend my time setting up a few minimal integration
>> > > > >>>> tests for the upgrade story.
>> > > > >>>>
>> > > > >>>> Enrico
>> > > > >>>>
>> > > > >>>> On Tue, Feb 11, 2020, 07:30 Michael Han <[email protected]> wrote:
>> > > > >>>>
>> > > > >>>>> Kudos Enrico, very thorough work as the final gatekeeper of the release!
>> > > > >>>>>
>> > > > >>>>> Now with this, I'd like to *vote a -1* on the 3.6.0 RC2.
>> > > > >>>>>
>> > > > >>>>> I'd recommend we fix this issue for 3.6.0. ZooKeeper is one of the rare
>> > > > >>>>> pieces of software that put so much emphasis on compatibility that it just
>> > > > >>>>> works when upgrading / downgrading, which is amazing. One guarantee we
>> > > > >>>>> always had is that during a rolling upgrade the quorum will always be
>> > > > >>>>> available, leading to no service interruption. It would be sad to lose such
>> > > > >>>>> a capability, given this is still a tractable problem.
>> > > > >>>>>
>> > > > >>>>> Regarding the fix, can we just make 3.6.0 aware of the old protocol and
>> > > > >>>>> speak old message format when it's talking to old server? Basically, an
>> > > > >>>>> ugly if else check against the protocol version should work and there is no
>> > > > >>>>> need to have multiple passes on the rolling upgrade process.
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>> On Mon, Feb 10, 2020 at 10:23 PM Enrico Olivelli <[email protected]> wrote:
>> > > > >>>>>
>> > > > >>>>>> I suggest this plan:
>> > > > >>>>>> - release 3.6.0 now
>> > > > >>>>>> - improve the migration story, the flow outlined by Mate is
>> > > > >>>>>> interesting, but it will take time
>> > > > >>>>>>
>> > > > >>>>>> 3.6.0rc2 got enough binding votes so I am going to finalize the
>> > > > >>>>>> release this evening (within 8-10 hours) if no one comes out in the
>> > > > >>>>>> VOTE thread with a -1
>> > > > >>>>>>
>> > > > >>>>>> Enrico
>> > > > >>>>>>
>> > > > >>>>>> On Mon, Feb 10, 2020 at 19:33, Patrick Hunt
>> > > > >>>>>> <[email protected]> wrote:
>> > > > >>>>>>>
>> > > > >>>>>>> On Mon, Feb 10, 2020 at 3:38 AM Andor Molnar <[email protected]> wrote:
>> > > > >>>>>>>
>> > > > >>>>>>>> Hi,
>> > > > >>>>>>>>
>> > > > >>>>>>>> Answers inline.
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>>> In my experience when you are close to a release it is better not to
>> > > > >>>>>>>>> make big changes. (I am among the approvers of that patch, so I am
>> > > > >>>>>>>>> responsible for this change)
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>> Although this statement is acceptable to me, I don’t agree that this patch
>> > > > >>>>>>>> should not have been merged into 3.6.0. Submission was preceded by a
>> > > > >>>>>>>> long argument with MAPR folks, who originally wanted it merged into the 3.4
>> > > > >>>>>>>> branch (considering the pace at which the ZooKeeper community is moving
>> > > > >>>>>>>> forward), and we reached an agreement to release it with 3.6.0.
>> > > > >>>>>>>>
>> > > > >>>>>>>> To make a long story short, this patch had been outstanding for ages without
>> > > > >>>>>>>> much attention from the community, and contributors made a lot of effort to
>> > > > >>>>>>>> get it done before the release.
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>>> I would like to hear from people that have been in the community for a
>> > > > >>>>>>>>> long time, then I am ready to complete the release process for
>> > > > >>>>>>>>> 3.6.0rc2.
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>> Me too.
>> > > > >>>>>>>>
>> > > > >>>>>>>> I tend to accept the way rolling restart works now - as you described,
>> > > > >>>>>>>> Enrico - and given that the situation was pretty much the same between 3.4
>> > > > >>>>>>>> and 3.5, I don’t feel we have to make additional changes.
>> > > > >>>>>>>>
>> > > > >>>>>>>> On the other hand, the fix that Mate suggested sounds quite cool; I’m also
>> > > > >>>>>>>> happy to work on getting it in.
>> > > > >>>>>>>>
>> > > > >>>>>>>> Fyi, the Release Management page says the following:
>> > > > >>>>>>>>
>> > > > >>>>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
>> > > > >>>>>>>>
>> > > > >>>>>>>> "major.minor release of ZooKeeper must be backwards compatible with the
>> > > > >>>>>>>> previous minor release, major.(minor-1)"
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>> Our users, direct and indirect, value the ability to migrate to newer
>> > > > >>>>>>> versions - esp. as we drop support for older ones. Frictions such as this
>> > > > >>>>>>> can be a reason to go elsewhere. I'm "pro" b/w compat - esp. given our
>> > > > >>>>>>> published guidelines.
>> > > > >>>>>>>
>> > > > >>>>>>> Patrick
>> > > > >>>>>>>
>> > > > >>>>>>>
>> > > > >>>>>>>> Andor
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>>> On 2020. Feb 10., at 11:32, Enrico Olivelli <[email protected]> wrote:
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> Thank you Mate for checking and explaining this story.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I find it very interesting that the cause is ZOOKEEPER-3188 as:
>> > > > >>>>>>>>> - it is the last "big patch" committed to 3.6 before starting the
>> > > > >>>>>>>>> release process
>> > > > >>>>>>>>> - it is the cause of the failure of the first RC
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> In my experience when you are close to a release it is better not to
>> > > > >>>>>>>>> make big changes. (I am among the approvers of that patch, so I am
>> > > > >>>>>>>>> responsible for this change)
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> This is a pointer to the change for anyone who wants to better understand
>> > > > >>>>>>>>> the context:
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> https://github.com/apache/zookeeper/pull/1048/files#diff-7a209d890686bcba351d758b64b22a7dR11
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> IIUC even for the upgrade from 3.4 to 3.5 the story was the same, and
>> > > > >>>>>>>>> if this statement holds then I feel we can continue
>> > > > >>>>>>>>> with this release.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> - Reverting ZOOKEEPER-3188 is not an option for me, it is too complex.
>> > > > >>>>>>>>> - Making 3.5 and 3.6 "compatible" can be very tricky and we do not
>> > > > >>>>>>>>> have tools to certify this compatibility (at least not in the short
>> > > > >>>>>>>>> term)
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I would like to hear from people that have been in the community for a
>> > > > >>>>>>>>> long time, then I am ready to complete the release process for
>> > > > >>>>>>>>> 3.6.0rc2.
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> I will update the website and the release notes with a
>> > > > >> specific
>> > > > >>>>>>>>> warning about the upgrade, we should also update the Wiki
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> Enrico
>> > > > >>>>>>>>>
>> > > > >>>>>>>>>
>> > > > >>>>>>>>> On Mon, Feb 10, 2020 at 11:17, Szalay-Bekő Máté
>> > > > >>>>>>>>> <[email protected]> wrote:
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> Hi Enrico!
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> This is caused by the different PROTOCOL_VERSION in the QuorumCnxManager.
>> > > > >>>>>>>>>> The protocol version was last changed in ZOOKEEPER-2186, released
>> > > > >>>>>>>>>> first in 3.4.7 and 3.5.1, to avoid some crashes / fix some bugs. Later I
>> > > > >>>>>>>>>> also changed the protocol version when the format of the initial message
>> > > > >>>>>>>>>> changed in ZOOKEEPER-3188. So the quorum protocol is actually not
>> > > > >>>>>>>>>> compatible in this case, and this is the 'expected' behavior if you
>> > > > >>>>>>>>>> upgrade, e.g., from 3.4.6 to 3.4.7, from 3.4.6 to 3.5.5, or from 3.5.6 to
>> > > > >>>>>>>>>> 3.6.0.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> We had some discussion in the PR of ZOOKEEPER-3188 back then and got to the
>> > > > >>>>>>>>>> conclusion that it is not that bad, as there will be no data loss, as you
>> > > > >>>>>>>>>> wrote. The tricky thing is that during a rolling upgrade we should ensure
>> > > > >>>>>>>>>> both backward and forward compatibility to make sure that the old and the
>> > > > >>>>>>>>>> new parts of the quorum can still speak to each other. The current solution
>> > > > >>>>>>>>>> (simply failing if the protocol versions mismatch) is simpler and still
>> > > > >>>>>>>>>> works just fine: as the servers are restarted one by one, the nodes with
>> > > > >>>>>>>>>> the old protocol version and the nodes with the new protocol version will
>> > > > >>>>>>>>>> form two partitions, but at any given time only one partition will have the
>> > > > >>>>>>>>>> quorum.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> Still, thinking it through, as a side effect in these cases there will be a
>> > > > >>>>>>>>>> short time when neither of the partitions will have a quorum (when we have N
>> > > > >>>>>>>>>> servers with the old protocol version, N servers with the new protocol
>> > > > >>>>>>>>>> version, and there is one server just being restarted). I am not sure if we
>> > > > >>>>>>>>>> can accept this.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> For ZOOKEEPER-3188 we can add a small patch to make it possible to parse
>> > > > >>>>>>>>>> the initial message of the old protocol version with the new code. But I am
>> > > > >>>>>>>>>> not sure if it would be enough (as the old code will not be able to parse
>> > > > >>>>>>>>>> the new initial message).
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> One option can be to also make a patch for 3.5 to have a version which
>> > > > >>>>>>>>>> supports both protocol versions (let's say in 3.5.8). Then we can write in
>> > > > >>>>>>>>>> the release notes that if you need a rolling upgrade from any version since
>> > > > >>>>>>>>>> 3.4.7, then you have to first upgrade to 3.5.8 before upgrading to 3.6.0.
>> > > > >>>>>>>>>> We can even do the same thing on the 3.4 branch.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> But I am also new to the community... It would be great to hear the opinion
>> > > > >>>>>>>>>> of more experienced people.
>> > > > >>>>>>>>>> Whatever the decision will be, I am happy to make the
>> > > > >> changes.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> And sorry for breaking the RC (if we decide that this needs to be
>> > > > >>>>>>>>>> changed...).  ZOOKEEPER-3188 was a complex patch.
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> Kind regards,
>> > > > >>>>>>>>>> Mate
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>> On Mon, Feb 10, 2020 at 9:47 AM Enrico Olivelli <[email protected]> wrote:
>> > > > >>>>>>>>>>
>> > > > >>>>>>>>>>> Hi,
>> > > > >>>>>>>>>>> even if we had enough binding +1 on 3.6.0rc2, before closing the VOTE
>> > > > >>>>>>>>>>> of 3.6.0 I wanted to finish my tests and I have come across an apparent
>> > > > >>>>>>>>>>> blocker.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> I am trying to upgrade a 3.5.6 cluster to 3.6.0, but it looks like
>> > > > >>>>>>>>>>> peers are not able to talk to each other.
>> > > > >>>>>>>>>>> I have a cluster of 3, server1, server2 and server3.
>> > > > >>>>>>>>>>> When I upgrade server1 to 3.6.0rc2 I see this kind of error on 3.5 nodes:
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> 2020-02-10 09:35:07,745 [myid:3] - INFO [localhost/127.0.0.1:3334:QuorumCnxManager$Listener@918] - Received connection request 127.0.0.1:62591
>> > > > >>>>>>>>>>> 2020-02-10 09:35:07,746 [myid:3] - ERROR [localhost/127.0.0.1:3334:QuorumCnxManager@527] -
>> > > > >>>>>>>>>>> org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage$InitialMessageException:
>> > > > >>>>>>>>>>> Got unrecognized protocol version -65535
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> Once I upgrade all of the peers the system is up and running,
>> > > > >>>>>>>>>>> apparently without data loss.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> During the upgrade, as soon as I upgrade the first node, say server1,
>> > > > >>>>>>>>>>> server1 is not able to accept connections from clients (error "Close of
>> > > > >>>>>>>>>>> session 0x0 java.io.IOException: ZooKeeperServer not running"). This
>> > > > >>>>>>>>>>> is expected, because as long as it cannot talk with the other peers it
>> > > > >>>>>>>>>>> is practically partitioned away from the cluster.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> My questions are:
>> > > > >>>>>>>>>>> 1) is this expected ? I can't remember protocol changes from 3.5 to
>> > > > >>>>>>>>>>> 3.6, but actually 3.6 diverged from the 3.5 branch so long ago, and I was
>> > > > >>>>>>>>>>> not in the community as a dev, so I cannot tell
>> > > > >>>>>>>>>>> 2) is this a viable option for users ? to have some temporary glitch
>> > > > >>>>>>>>>>> during the upgrade and hope that the upgrade completes without
>> > > > >>>>>>>>>>> trouble ?
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> In theory as long as two servers are running the same major version
>> > > > >>>>>>>>>>> (3.5 or 3.6) we have a quorum and the system is able to make progress
>> > > > >>>>>>>>>>> and to serve clients.
>> > > > >>>>>>>>>>> I feel that this is quite dangerous, but I don't have enough context
>> > > > >>>>>>>>>>> to understand how this problem is possible and when we decided to
>> > > > >>>>>>>>>>> break compatibility.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> The other option is that I am wrong in my test and I am messing up :-)
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> The other upgrade path I would like to see working like a charm is the
>> > > > >>>>>>>>>>> upgrade from 3.4 to 3.6, as I see that as soon as we release 3.6 we
>> > > > >>>>>>>>>>> should encourage users to move to 3.6 and not to 3.5.
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>>>> Regards
>> > > > >>>>>>>>>>> Enrico
>> > > > >>>>>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>>>
>> > > > >>>>>>
>> > > > >>>>>
>> > > > >>>>
>> > > > >>
>> > > >
>> > >
>> >
>>
>>
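
The quorum arithmetic Mate describes in the thread (N servers on the old
protocol, N on the new one, and one server being restarted) can be checked
with a short sketch. It is only an illustration and assumes that servers with
different protocol versions cannot talk to each other at all, and that exactly
one server is down at each step; the function name quorum_gaps is hypothetical
and is not part of ZooKeeper.

```python
def quorum_gaps(ensemble_size):
    """Return the rolling-upgrade steps at which neither group has a majority.

    Each entry is (already_upgraded, old_servers_up, new_servers_up) for a
    step where one old-protocol server is down for its restart.
    Assumption: old-protocol and new-protocol servers form two separate
    partitions, as described in the thread.
    """
    majority = ensemble_size // 2 + 1
    gaps = []
    for upgraded in range(ensemble_size):
        old_up = ensemble_size - upgraded - 1  # one old server is restarting
        new_up = upgraded
        if old_up < majority and new_up < majority:
            gaps.append((upgraded, old_up, new_up))
    return gaps


if __name__ == "__main__":
    for size in (3, 5, 7):
        print(size, quorum_gaps(size))
```

Under these assumptions an ensemble of 2N+1 servers has exactly one such step,
the one where N servers are already upgraded, which matches the window
described above. The gap closes once the restarting server rejoins on the new
protocol.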

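The test loop Andor outlines in the thread (run the old-version cluster,
smoke-test every node, restart one node with the new code, repeat until all
nodes are upgraded) might look roughly like the sketch below. The callbacks
smoke_test and restart_with_new_code are hypothetical hooks that a real
Jenkins job would have to supply; the in-memory cluster in the demo is a
stand-in and does not talk to ZooKeeper.

```python
def rolling_upgrade(nodes, smoke_test, restart_with_new_code):
    """Upgrade nodes one at a time, smoke-testing every node in each round.

    nodes: list of node identifiers (hypothetical, e.g. host names)
    smoke_test(node) -> bool: hypothetical check, True if the node serves requests
    restart_with_new_code(node): hypothetical hook that restarts one node
    Returns a list of (nodes_upgraded_so_far, failed_nodes) for failing rounds.
    """
    failures = []
    for upgraded_count in range(len(nodes) + 1):
        # Step 2: connect to each node and run the smoke test.
        failed = [n for n in nodes if not smoke_test(n)]
        if failed:
            failures.append((upgraded_count, failed))
        # Step 3: restart the next node with the new code, if any are left.
        if upgraded_count < len(nodes):
            restart_with_new_code(nodes[upgraded_count])
    return failures


if __name__ == "__main__":
    # Stand-in cluster: a node "works" only if its protocol group has a majority.
    versions = {"server1": "old", "server2": "old", "server3": "old"}

    def fake_smoke_test(node):
        same = sum(1 for v in versions.values() if v == versions[node])
        return same > len(versions) // 2

    def fake_restart(node):
        versions[node] = "new"

    print(rolling_upgrade(list(versions), fake_smoke_test, fake_restart))
```

With the stand-in cluster, the loop reports server1 as failing right after it
is upgraded, which is the behavior Enrico observed with 3.6.0rc2. A longer-term
version would also run continuous traffic and monitor quorum state, as Mate
suggests.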