The most obvious approach that crosses my mind is one I previously worked with:

1) run old version cluster,
2) connect to each node and run smoke tests,
3) restart one node with new code,
4) goto 2) until all nodes are upgraded

I think this wouldn’t work as a “unit test”; we probably need a separate
Jenkins job and a nice python script to do this.
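
Something like this rough, untested sketch (the host list, install path
and helper names are placeholders; it assumes the third-party kazoo
client for the smoke test):

#!/usr/bin/env python3
"""Rolling-upgrade check: smoke-test every node, upgrade one, repeat."""
import subprocess
from kazoo.client import KazooClient

NODES = [("zk1", 2181), ("zk2", 2181), ("zk3", 2181)]  # placeholder topology

def smoke_test(host, port):
    # 2) connect to one node and do a trivial write/read round trip
    zk = KazooClient(hosts="%s:%d" % (host, port), timeout=10)
    zk.start()
    try:
        path = zk.create("/upgrade-smoke-", b"ping",
                         ephemeral=True, sequence=True)
        data, _stat = zk.get(path)
        assert data == b"ping"
    finally:
        zk.stop()

def restart_with_new_code(host):
    # 3) placeholder: point the service at the new install and bounce it
    subprocess.run(["ssh", host, "/opt/zookeeper-new/bin/zkServer.sh",
                    "restart"], check=True)

def rolling_upgrade():
    for host, _port in NODES:      # 4) repeat until all nodes are upgraded
        for h, p in NODES:         # 2) smoke tests against every node
            smoke_test(h, p)
        restart_with_new_code(host)
    for h, p in NODES:             # final pass on the fully upgraded cluster
        smoke_test(h, p)

if __name__ == "__main__":
    rolling_upgrade()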

Andor




> On 2020. Feb 11., at 16:38, Patrick Hunt <ph...@apache.org> wrote:
> 
> Anyone have ideas how we could add testing for upgrade? Obviously something
> we're missing, esp given its import.
> 
> Patrick
> 
> On Tue, Feb 11, 2020 at 12:40 AM Enrico Olivelli <eolive...@gmail.com>
> wrote:
> 
>> On Tue, Feb 11, 2020 at 09:12 Szalay-Bekő Máté
>> <szalay.beko.m...@gmail.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> about the question from Michael:
>>>> Regarding the fix, can we just make 3.6.0 aware of the old protocol and
>>>> speak the old message format when it's talking to an old server?
>>> 
>>> In this particular case, it might be enough. This time the protocol change
>>> happened in the 'initial message' sent by the QuorumCnxManager. Maybe it is
>>> not a problem if the new servers cannot initiate channels to the old
>>> servers; maybe it is enough if these channels get initiated by the old
>>> servers only. I will test it quickly.
>>> 
>>> That said, I have no idea whether anything else changed in the quorum
>>> protocol between 3.5 and 3.6. In other cases it might not be enough for
>>> the new servers to understand the old messages, as the old servers can
>>> break by not understanding the messages from the new servers. Also, in the
>>> code currently (AFAIK) there is no generic knowledge of protocol versions:
>>> the servers do not store which protocol versions they can/should use to
>>> communicate with which particular other servers. Maybe we don't even need
>>> this, but I would feel better if we had more tests around these things.
>>> 
>>> My suggestion for the long term:
>>> - let's fix this particular issue now in 3.6.0 quickly (I start doing
>>> this today)
>>> - let's build some automation (backed by Jenkins) that tests a whole
>>> matrix of different ZooKeeper upgrade paths by performing rolling
>>> upgrades under some light traffic. Let's also define more precisely
>>> what we expect (e.g. the quorum stays up, but some clients can get
>>> disconnected? What will happen to the ephemeral nodes? Do we want to
>>> gracefully close or transfer the user sessions before stopping the old
>>> server?) and let's see where this breaks. Just by checking the code, I
>>> don't think the quorum will always be up (e.g. between older 3.4
>>> versions and 3.5).
>> 
>> 
>> I am happy to work on this topic
>> 
>>> - we need to update the Wiki about the working rolling upgrade paths and
>>> maybe about workarounds if needed
>>> - we might need to do some fixes (adding backward compatible versions
>>> and/or specific parameters that enforce the old protocol temporarily
>>> during the rolling upgrade and can be switched later to the new protocol
>>> by either dynamic reconfig or a rolling restart)
>> 
>> it would be much better for the 3.6 code to have some support for
>> compatibility with 3.5 servers.
>> We can't require old code to be forward compatible, but we can make new
>> code compatible to a certain extent with old code.
>> If we can achieve this compatibility goal without a flag, even better:
>> users won't have to care about this part and can simply trust us.
>> 
>> The rollback story is also important, but maybe we are not ready for it
>> yet. In the case of local changes to the store, it is better to have a
>> clear design and plan, and to target a new release (3.7?).
>> 
>> Enrico
>> 
>>> 
>>> Depending on your comments, I am happy to create a few Jira tickets
>>> around these topics.
>>> 
>>> Kind regards,
>>> Mate
>>> 
>>> ps. Enrico, sorry about your RC... I owe you a beer; let me know if you
>>> are ever near Budapest ;)
>>> 
>>> On Tue, Feb 11, 2020 at 8:43 AM Enrico Olivelli <eolive...@gmail.com>
>> wrote:
>>> 
>>>> Good.
>>>> 
>>>> I will cancel the vote for 3.6.0rc2.
>>>> 
>>>> I would very much appreciate it if Mate and his colleagues have time
>>>> to work on a fix.
>>>> Otherwise I will have cycles next week.
>>>> 
>>>> I would also like to spend some time setting up a few minimal
>>>> integration tests around the upgrade story.
>>>> 
>>>> Enrico
>>>> 
>>>> On Tue, Feb 11, 2020, 07:30 Michael Han <h...@apache.org> wrote:
>>>> 
>>>>> Kudos Enrico, very thorough work as the final gatekeeper of the
>>>>> release!
>>>>> 
>>>>> Now with this, I'd like to *vote a -1* on the 3.6.0 RC2.
>>>>> 
>>>>> I'd recommend we fix this issue for 3.6.0. ZooKeeper is one of those
>>>>> rare pieces of software that puts so much emphasis on compatibility
>>>>> that it just works across upgrades / downgrades, which is amazing.
>>>>> One guarantee we have always had is that during a rolling upgrade the
>>>>> quorum stays available, so there is no service interruption. It would
>>>>> be sad to lose such a capability given this is still a tractable
>>>>> problem.
>>>>> 
>>>>> Regarding the fix, can we just make 3.6.0 aware of the old protocol
>>>>> and speak the old message format when it's talking to an old server?
>>>>> Basically, an ugly if/else check against the protocol version should
>>>>> work, and there is no need for multiple passes over the rolling
>>>>> upgrade process.
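>>>>> 
>>>>> Roughly this shape, sketched in Python for brevity (untested; the real
>>>>> check would live in the Java quorum code, and the constants and helper
>>>>> names here are from memory / made up):
>>>>> 
>>>>> OLD_PROTOCOL_VERSION = -65536  # pre-ZOOKEEPER-3188, if I recall correctly
>>>>> NEW_PROTOCOL_VERSION = -65535  # 3.6.0
>>>>> 
>>>>> def build_initial_message(peer_protocol_version, sid, election_addrs):
>>>>>     if peer_protocol_version == OLD_PROTOCOL_VERSION:
>>>>>         # old servers only understand a single election address
>>>>>         return encode_v1(sid, election_addrs[0])
>>>>>     # new format carries the full address list
>>>>>     return encode_v2(sid, election_addrs)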
>>>>> 
>>>>> 
>>>>> On Mon, Feb 10, 2020 at 10:23 PM Enrico Olivelli <eolive...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> I suggest this plan:
>>>>>> - release 3.6.0 now
>>>>>> - improve the migration story; the flow outlined by Mate is
>>>>>> interesting, but it will take time
>>>>>> 
>>>>>> 3.6.0rc2 got enough binding votes, so I am going to finalize the
>>>>>> release this evening (within 8-10 hours) if no one comes out in the
>>>>>> VOTE thread with a -1.
>>>>>> 
>>>>>> Enrico
>>>>>> 
>>>>>> On Mon, Feb 10, 2020 at 19:33 Patrick Hunt
>>>>>> <ph...@apache.org> wrote:
>>>>>>> 
>>>>>>> On Mon, Feb 10, 2020 at 3:38 AM Andor Molnar <an...@apache.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Answers inline.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> In my experience, when you are close to a release it is better
>>>>>>>>> not to make big changes. (I am among the approvers of that patch,
>>>>>>>>> so I am responsible for this change.)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Although this statement is acceptable to me, I don’t feel this
>>>>>>>> patch shouldn’t have been merged into 3.6.0. The submission was
>>>>>>>> preceded by a long argument with the MAPR folks, who originally
>>>>>>>> wanted it merged into the 3.4 branch (considering the pace at
>>>>>>>> which the ZooKeeper community is moving forward), and we reached
>>>>>>>> an agreement to release it with 3.6.0.
>>>>>>>> 
>>>>>>>> To make a long story short, this patch had been outstanding for
>>>>>>>> ages without much attention from the community, and the
>>>>>>>> contributors made a lot of effort to get it done before the
>>>>>>>> release.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> I would like to hear from people who have been in the community
>>>>>>>>> for a long time; then I am ready to complete the release process
>>>>>>>>> for 3.6.0rc2.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Me too.
>>>>>>>> 
>>>>>>>> I tend to accept the way rolling restart works now - as you
>>>>>>>> described, Enrico - and given that the situation was pretty much
>>>>>>>> the same between 3.4 and 3.5, I don’t feel we have to make
>>>>>>>> additional changes.
>>>>>>>> 
>>>>>>>> On the other hand, the fix that Mate suggested sounds quite cool;
>>>>>>>> I’m also happy to work on getting it in.
>>>>>>>> 
>>>>>>>> Fyi, the Release Management page says the following:
>>>>>>>> 
>>>>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
>>>>>>>> 
>>>>>>>> "major.minor release of ZooKeeper must be backwards compatible with
>>>>>>>> the previous minor release, major.(minor-1)"
>>>>>>>> 
>>>>>>>> 
>>>>>>> Our users, direct and indirect, value the ability to migrate to
>>>>>>> newer versions - esp as we drop support for older ones. Frictions
>>>>>>> such as this can be a reason to go elsewhere. I'm "pro" b/w compat
>>>>>>> - esp given our published guidelines.
>>>>>>> 
>>>>>>> Patrick
>>>>>>> 
>>>>>>> 
>>>>>>>> Andor
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 2020. Feb 10., at 11:32, Enrico Olivelli <eolive...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Thank you Mate for checking and explaining this story.
>>>>>>>>> 
>>>>>>>>> I find it very interesting that the cause is ZOOKEEPER-3188, as:
>>>>>>>>> - it is the last "big patch" committed to 3.6 before starting the
>>>>>>>>> release process
>>>>>>>>> - it is the cause of the failure of the first RC
>>>>>>>>> 
>>>>>>>>> In my experience, when you are close to a release it is better
>>>>>>>>> not to make big changes. (I am among the approvers of that patch,
>>>>>>>>> so I am responsible for this change.)
>>>>>>>>> 
>>>>>>>>> Here is a pointer to the change, for whoever wants to understand
>>>>>>>>> the context better:
>>>>>>>>> 
>>>>>>>>> https://github.com/apache/zookeeper/pull/1048/files#diff-7a209d890686bcba351d758b64b22a7dR11
>>>>>>>>> 
>>>>>>>>> IIUC, even for the upgrade from 3.4 to 3.5 the story was the
>>>>>>>>> same, and if this statement holds then I feel we can continue
>>>>>>>>> with this release.
>>>>>>>>> 
>>>>>>>>> - Reverting ZOOKEEPER-3188 is not an option for me; it is too
>>>>>>>>> complex.
>>>>>>>>> - Making 3.5 and 3.6 "compatible" can be very tricky, and we do
>>>>>>>>> not have tools to certify this compatibility (at least not in
>>>>>>>>> the short term).
>>>>>>>>> 
>>>>>>>>> I would like to hear from people who have been in the community
>>>>>>>>> for a long time; then I am ready to complete the release process
>>>>>>>>> for 3.6.0rc2.
>>>>>>>>> 
>>>>>>>>> I will update the website and the release notes with a specific
>>>>>>>>> warning about the upgrade; we should also update the Wiki.
>>>>>>>>> 
>>>>>>>>> Enrico
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Feb 10, 2020 at 11:17 Szalay-Bekő Máté
>>>>>>>>> <szalay.beko.m...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Enrico!
>>>>>>>>>> 
>>>>>>>>>> This is caused by the different PROTOCOL_VERSION in the
>>>>>>>>>> QuorumCnxManager. The protocol version was last changed in
>>>>>>>>>> ZOOKEEPER-2186, first released in 3.4.7 and 3.5.1, to avoid some
>>>>>>>>>> crashes / fix some bugs. Later I also changed the protocol
>>>>>>>>>> version when the format of the initial message changed in
>>>>>>>>>> ZOOKEEPER-3188. So the quorum protocol is actually not
>>>>>>>>>> compatible in this case, and this is the 'expected' behavior if
>>>>>>>>>> you upgrade e.g. from 3.4.6 to 3.4.7, from 3.4.6 to 3.5.5, or
>>>>>>>>>> from 3.5.6 to 3.6.0.
>>>>>>>>>> 
>>>>>>>>>> We had some discussion in the PR of ZOOKEEPER-3188 back then and
>>>>>>>>>> came to the conclusion that it is not that bad, as there will be
>>>>>>>>>> no data loss, as you wrote. The tricky thing is that during a
>>>>>>>>>> rolling upgrade we would have to ensure both backward and
>>>>>>>>>> forward compatibility to make sure that the old and the new
>>>>>>>>>> parts of the quorum can still speak to each other. The current
>>>>>>>>>> solution (simply failing if the protocol versions mismatch) is
>>>>>>>>>> simpler and still works just fine: as the servers are restarted
>>>>>>>>>> one by one, the nodes with the old protocol version and the
>>>>>>>>>> nodes with the new protocol version form two partitions, but at
>>>>>>>>>> any given time only one partition will have the quorum.
>>>>>>>>>> 
>>>>>>>>>> Still, thinking it through: as a side effect, in these cases
>>>>>>>>>> there will be a short time when neither of the partitions has a
>>>>>>>>>> quorum (when we have N servers with the old protocol version, N
>>>>>>>>>> servers with the new protocol version, and one server just being
>>>>>>>>>> restarted). I am not sure if we can accept this.
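>>>>>>>>>> 
>>>>>>>>>> To make that window concrete, the counting for a five-node
>>>>>>>>>> ensemble (plain Python, just the arithmetic):
>>>>>>>>>> 
>>>>>>>>>> # during the restart of the 3rd server out of 5
>>>>>>>>>> old_nodes, new_nodes, restarting = 2, 2, 1
>>>>>>>>>> ensemble_size = old_nodes + new_nodes + restarting  # 5 voters
>>>>>>>>>> quorum = ensemble_size // 2 + 1                     # 3
>>>>>>>>>> # old and new nodes cannot connect, so they form two partitions
>>>>>>>>>> assert max(old_nodes, new_nodes) < quorum  # neither side has a quorum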
>>>>>>>>>> 
>>>>>>>>>> For ZOOKEEPER-3188 we can add a small patch to make it possible
>>>>>>>>>> to parse the initial message of the old protocol version with
>>>>>>>>>> the new code. But I am not sure it would be enough (as the old
>>>>>>>>>> code will still not be able to parse the new initial message).
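>>>>>>>>>> 
>>>>>>>>>> The patch would be roughly this shape (Python-style pseudocode,
>>>>>>>>>> untested; the real logic is in QuorumCnxManager.InitialMessage,
>>>>>>>>>> and the constants and the parse_v1/parse_v2 helpers below are
>>>>>>>>>> assumptions):
>>>>>>>>>> 
>>>>>>>>>> class InitialMessageException(Exception):
>>>>>>>>>>     pass
>>>>>>>>>> 
>>>>>>>>>> OLD_PROTOCOL_VERSION = -65536  # 3.4.7+ / 3.5.x initial message
>>>>>>>>>> NEW_PROTOCOL_VERSION = -65535  # 3.6.0, after ZOOKEEPER-3188
>>>>>>>>>> 
>>>>>>>>>> def parse_initial_message(protocol_version, din):
>>>>>>>>>>     if protocol_version == OLD_PROTOCOL_VERSION:
>>>>>>>>>>         return parse_v1(din)  # sid + a single election address
>>>>>>>>>>     if protocol_version == NEW_PROTOCOL_VERSION:
>>>>>>>>>>         return parse_v2(din)  # sid + '|'-separated address list
>>>>>>>>>>     raise InitialMessageException(
>>>>>>>>>>         "Got unrecognized protocol version %s" % protocol_version)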
>>>>>>>>>> 
>>>>>>>>>> One option could be to also patch 3.5 to have a version that
>>>>>>>>>> supports both protocol versions (let's say 3.5.8). Then we can
>>>>>>>>>> write in the release notes that if you need a rolling upgrade
>>>>>>>>>> from any version since 3.4.7, you first have to upgrade to 3.5.8
>>>>>>>>>> before upgrading to 3.6.0.
>>>>>>>>>> We can even do the same thing on the 3.4 branch.
>>>>>>>>>> 
>>>>>>>>>> But I am also new to the community... It would be great to hear
>>>>>>>>>> the opinion of more experienced people.
>>>>>>>>>> Whatever the decision is, I am happy to make the changes.
>>>>>>>>>> 
>>>>>>>>>> And sorry for breaking the RC (if we decide that this needs to
>>>>>>>>>> be changed...). ZOOKEEPER-3188 was a complex patch.
>>>>>>>>>> 
>>>>>>>>>> Kind regards,
>>>>>>>>>> Mate
>>>>>>>>>> 
>>>>>>>>>> On Mon, Feb 10, 2020 at 9:47 AM Enrico Olivelli
>>>>>>>>>> <eolive...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> Even though we had enough binding +1s on 3.6.0rc2, before
>>>>>>>>>>> closing the VOTE I wanted to finish my tests, and I have come
>>>>>>>>>>> across an apparent blocker.
>>>>>>>>>>> 
>>>>>>>>>>> I am trying to upgrade a 3.5.6 cluster to 3.6.0, but it looks
>>>>>>>>>>> like the peers are not able to talk to each other.
>>>>>>>>>>> I have a cluster of 3: server1, server2 and server3.
>>>>>>>>>>> When I upgrade server1 to 3.6.0rc2 I see this kind of error on
>>>>>>>>>>> the 3.5 nodes:
>>>>>>>>>>> 
>>>>>>>>>>> 2020-02-10 09:35:07,745 [myid:3] - INFO
>>>>>>>>>>> [localhost/127.0.0.1:3334:QuorumCnxManager$Listener@918] -
>>>>>>>>>>> Received connection request 127.0.0.1:62591
>>>>>>>>>>> 2020-02-10 09:35:07,746 [myid:3] - ERROR
>>>>>>>>>>> [localhost/127.0.0.1:3334:QuorumCnxManager@527] -
>>>>>>>>>>> org.apache.zookeeper.server.quorum.QuorumCnxManager$InitialMessage$InitialMessageException:
>>>>>>>>>>> Got unrecognized protocol version -65535
>>>>>>>>>>> 
>>>>>>>>>>> Once I upgrade all of the peers, the system is up and running,
>>>>>>>>>>> with apparently no data loss.
>>>>>>>>>>> 
>>>>>>>>>>> During the upgrade, as soon as I upgrade the first node, say
>>>>>>>>>>> server1, server1 is not able to accept connections from clients
>>>>>>>>>>> (error "Close of session 0x0 java.io.IOException:
>>>>>>>>>>> ZooKeeperServer not running"). This is expected: as long as it
>>>>>>>>>>> cannot talk to the other peers, it is practically partitioned
>>>>>>>>>>> away from the cluster.
>>>>>>>>>>> 
>>>>>>>>>>> My questions are:
>>>>>>>>>>> 1) Is this expected? I can't remember protocol changes from 3.5
>>>>>>>>>>> to 3.6, but 3.6 diverged from the 3.5 branch so long ago, and I
>>>>>>>>>>> was not in the community as a dev back then, so I cannot tell.
>>>>>>>>>>> 2) Is this a viable option for users? To have a temporary
>>>>>>>>>>> glitch during the upgrade and hope that the upgrade completes
>>>>>>>>>>> without trouble?
>>>>>>>>>>> 
>>>>>>>>>>> In theory, as long as two servers are running the same major
>>>>>>>>>>> version (3.5 or 3.6) we have a quorum, and the system is able
>>>>>>>>>>> to make progress and to serve clients.
>>>>>>>>>>> I feel that this is quite dangerous, but I don't have enough
>>>>>>>>>>> context to understand how this problem is possible and when we
>>>>>>>>>>> decided to break compatibility.
>>>>>>>>>>> 
>>>>>>>>>>> The other option is that I am wrong in my test and I am messing
>>>>>>>>>>> up :-)
>>>>>>>>>>> 
>>>>>>>>>>> The other upgrade path I would like to see working like a charm
>>>>>>>>>>> is the upgrade from 3.4 to 3.6, as I think that as soon as we
>>>>>>>>>>> release 3.6 we should encourage users to move to 3.6 and not to
>>>>>>>>>>> 3.5.
>>>>>>>>>>> 
>>>>>>>>>>> Regards
>>>>>>>>>>> Enrico
>>>>>>>>>>> 
