ptlrs commented on code in PR #9664:
URL: https://github.com/apache/ozone/pull/9664#discussion_r2739492223


##########
hadoop-hdds/docs/content/design/zdu-design.md:
##########
@@ -0,0 +1,535 @@
+---
+jira: HDDS-3331
+authors:
+- Stephen O'Donnell
+- Ethan Rose
+- Istvan Fajth
+---
+
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Zero Downtime Upgrade (ZDU)
+
+## The Goal
+
+The goal of Zero Downtime Upgrade (ZDU) is to allow the software running an 
existing Ozone cluster to be upgraded while the cluster remains operational. 
There should be no gaps in service and the upgrade should be transparent to 
applications using the cluster.
+
+Ozone is already designed to be fault tolerant, so the rolling restart of SCM, 
OM and Datanodes is already possible without impacting users of the cluster. 
The challenge with ZDU is therefore related to wire and disk compatibility, as 
different components within the cluster can be running different software 
versions concurrently. This design will focus on how we solve the wire and disk 
compatibility issues.
+
+## Component Upgrade Order
+
+To simplify reasoning about components of different types running in different 
versions, we should reduce the number of possible version combinations allowed 
as much as possible. Clients are considered external to the Ozone cluster, 
therefore we cannot control their version. However, we already have a framework 
to handle client/server cross compatibility, so rolling upgrade only needs to 
focus on compatibility of internal components. For internal Ozone components, 
we can define and enforce an order that the components must be upgraded in. 
Consider the following Ozone service diagram:
+
+![Ozone connection diagram](zdu-image1.png)
+
+Here the arrows represent client to server interactions between components, 
with the arrow pointing from the client to the server. The red arrow is 
external clients interacting with Ozone. The shield means that the client needs 
to see a consistent API surface despite leader changes in mixed version 
clusters so that APIs do not seem to disappear and reappear based on the node 
serving the request. The orange lines represent client to server interactions 
for internal Ozone components. For components connected by this internal line, 
**we can control the order that they are upgraded such that the server is 
always newer and handles all compatibility issues**. This greatly reduces the 
matrix of possible versions we may see within Ozone and mostly eliminates the 
need for internal Ozone components to be aware of each other’s versions, as 
long as servers remain backwards compatible. This order is:
+
+1. Upgrade all SCMs to the new version  
+2. Upgrade Recon to the new version  
+3. Upgrade all Datanodes to the new version  
+4. Upgrade all OMs to the new version  
+5. Upgrade all S3 gateways to the new version
+
+Note that in this ordering, Recon will still have a new client/old server 
relationship with OM for a period of time. The OM sync process in Recon is the 
only API that needs to account for this, and it is not on the main data read, 
write, delete, or recovery path. Recon should be upgraded with the SCMs because 
its container report processing from the datanodes shares SCM code, so we do 
not want Recon to handle a different version matrix among datanodes than SCM.
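
To make the ordering constraint concrete, here is a small illustrative sketch
(Python, not Ozone code; the client-to-server edge list is an assumption read
off the diagram, with the documented Recon-to-OM exception omitted) that checks
every server is upgraded no later than its clients:

```python
# Illustrative only: validate an upgrade order against the rule that for each
# internal client -> server edge, the server is upgraded first (or together).
# The edge list is an assumption based on the diagram in this document.
# Recon -> OM is the documented exception and is intentionally omitted.

INTERNAL_EDGES = [            # (client, server) pairs
    ("OM", "SCM"),
    ("Datanode", "SCM"),
    ("OM", "Datanode"),
    ("S3Gateway", "OM"),
    ("Recon", "SCM"),
]

def order_is_valid(upgrade_order, edges=INTERNAL_EDGES):
    """A server must appear in the upgrade order no later than its clients."""
    position = {component: i for i, component in enumerate(upgrade_order)}
    return all(position[server] <= position[client]
               for client, server in edges
               if client in position and server in position)
```

With this helper, the order listed above (SCM, Recon, Datanode, OM, S3 gateway)
passes, while upgrading OM before SCM would not.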
+
+## Software Version Framework
+
+The previous section defines an upgrade order to handle API compatibility 
between internal components of different types without the need for explicit 
versioning. For internal components of the same type, we need to provide 
stronger guarantees when they are in mixed versions:
+
+* Components of the same type must persist the same data  
+* Components of the same type must expose a consistent API surface
+
+To accomplish these goals, we need a versioning framework to track component 
specific versions and ensure components of the same type operate in unison. 
Note that this versioning framework will not extend beyond Ozone into lower 
level libraries like Ratis, Hadoop RPC, gRPC, and protobuf. We are dependent on 
these libraries providing their own cross compatibility guarantees for ZDU to 
function.
+
+### Versioning in the Existing Upgrade Framework
+
+Before discussing versioning in the context of ZDU, we should first review the 
versioning framework currently present which allows for upgrades and downgrades 
within Ozone, and cross compatibility between Ozone and external clients of 
various versions.
+
+Ozone components currently define their version in two classes: 
ComponentVersion and LayoutFeature. Any change to the on-disk format increments 
the Layout Feature/Version, which is internal to the component. You can see 
examples of the Layout Version in classes such as HDDSLayoutFeature, 
OMLayoutFeature and ReconLayoutFeature. Any change to the API layer which may 
affect external clients will increment the ComponentVersion. Component versions 
are defined in classes like OzoneManagerVersion and DatanodeVersion. One change 
may have an impact in both areas and need to increment both versions.
+
+The existing upgrade framework uses the following terminology:
+
+**Component version**: The logical versioning system used to track 
incompatible changes to components that affect client/server network 
compatibility. Currently it is only used in communication with clients outside 
of Ozone, not between Ozone components themselves. The component version is 
hardcoded in the software and does not change. We currently use the following 
component versions:
+- **OM Version**: Provided to external clients communicating with OM in case a 
newer external client needs to handle compatibility.
+- **Datanode Version**: Provided by Datanodes to external clients in case a 
newer external client needs to handle compatibility.
+- **Client Version**: Provided by external clients to internal Ozone 
components (OM and Datanode) in case a newer Ozone server needs to handle 
compatibility.
+
+**Layout Feature/Version:** The logical versioning system used to track 
incompatible changes to components that affect their internal disk layout. This 
is used to track downgrade compatibility. We currently use the following layout 
features:
+- **OM Layout Feature**: Used to track disk changes within the OM
+- **HDDS Layout Feature**: Used to track disk changes within SCM and 
Datanodes. One shared version is required so that SCM can orchestrate Datanode 
finalization.
+
+**Software Layout Version (SLV):** The highest layout version within the code. 
When the cluster is finalized, it will use this layout version.
+
+**Metadata Layout Version (MLV):** The layout version that is persisted to the 
disk, which indicates what format the component should use when writing 
changes. This may be less than or equal to the software layout version. 
+
+**Pre-finalized:** State a component enters when the MLV is less than the SLV 
after an upgrade. At this time existing features are fully operational. New 
features are blocked, but the cluster can be downgraded to the old software 
version. Pre-finalized status does not affect component version, which always 
reflects the version of the software currently running.
+
+**Finalized:** State a component enters when the MLV is equal to the SLV. A 
component makes this transition from pre-finalized to finalized when it 
receives a finalize command from the admin. At this time all new features are 
fully operational, but downgrade is not allowed. Finalized status does not 
affect component version, which always reflects the version of the software 
currently running.
+
+In the existing upgrade framework, OM and SCM can be finalized in any order. 
SCM will finalize before instructing datanodes to finalize. Recon currently has 
no finalization process, and S3 gateway does not need finalization because it 
is stateless.
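
As a sketch of how these definitions combine, a component's state can be
derived from its MLV and SLV (function and state names below are hypothetical,
not the actual Ozone API):

```python
# Illustrative only: derive a component's upgrade state from its
# Metadata Layout Version (MLV, persisted on disk) and Software Layout
# Version (SLV, the highest layout version in the running code).

def upgrade_state(mlv: int, slv: int) -> str:
    if mlv > slv:
        # Disk was written by newer software; this binary must refuse to start.
        return "INCOMPATIBLE"
    if mlv < slv:
        return "PRE_FINALIZED"   # new features blocked, downgrade allowed
    return "FINALIZED"           # new features available, no downgrade
```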
+
+### Versioning in the New Upgrade Framework
+
+In practice, tracking network and disk changes separately has proven difficult 
to reason about. Developers are often confused about whether one or both 
versions need to be changed for a feature, and each version’s relationship with 
finalization. Before adding complexity to the upgrade flow with ZDU, it will be 
beneficial to simplify the two versioning schemes into one version that gets 
incremented for any incompatible change. This gives us the following new 
definitions:
+
+**Component version**: The logical versioning system used to track 
incompatible changes to a component, regardless of whether they affect disk or 
network compatibility between the same or different types of components. This 
will extend the existing component version framework. We will use the following 
component versions:
+- **OM Version**: Used within the Ozone Manager ring and provided to external 
clients in case a newer external client needs to handle compatibility.
+- **HDDS Version**: Used within SCM and Datanodes and provided to external 
clients in case a newer external client needs to handle compatibility. One 
shared version is required so that SCM can orchestrate Datanode finalization.
+- **Client Version**: Provided by external clients to internal Ozone 
components in case a newer Ozone server needs to handle compatibility.
+
+**Software version:** The Component Version of the bits that are installed. 
This is always the highest component version contained in the code that is 
running.
+
+**Apparent version:** The Component Version the software is acting as, which 
is persisted to the disk. The apparent version determines the API that is 
exposed by the component and the format it uses to persist data.
+
+**Pre-finalized:** State a component enters when the apparent version on disk 
is less than the software version. At this time all other machines may or may 
not be running the new bits, new features are blocked, and downgrade is allowed.
+
+**Finalized:** State a component enters when the apparent version is equal to 
the software version. A component makes this transition from pre-finalized to 
finalized when it receives a finalize command from the admin. At this time all 
machines are running the new bits, and even though this component is finalized, 
different types of components may not be. Downgrade is not allowed after this 
point.
+
+This simplified version framework lets us enforce **three invariants** to 
reason about the upgrade process among internal components (OM, SCM, Datanode):
+
+* **Internal components of the same type will always operate at the same 
apparent version.**  
+* **At the time of finalization, all internal components must be running the 
new bits.**  
+* **For internal client/server relationships, the server will always finalize 
before the client.**
+
+This table provides a visual example of apparent and software version during a 
rolling upgrade of the Ozone Managers. Each component's version is indicated 
using the notation `<apparent version>/<software version>`, and bold versions 
are running the new software. This notation will be used throughout the 
document to refer to component versions. Note that the apparent versions of the 
OMs always match. See the 
[appendix](#appendix-step-by-step-zdu-process) for a complete 
cluster-wide example.
+
+| Status                                          | OM1         | OM2         | OM3         |
+| ----------------------------------------------- | ----------- | ----------- | ----------- |
+| All OMs are finalized in the old version        | 100/100     | 100/100     | 100/100     |
+| OM1 is stopped and started with the new version | **100/105** | 100/100     | 100/100     |
+| OM2 is stopped and started with the new version | **100/105** | **100/105** | 100/100     |
+| All OMs are running the new version             | **100/105** | **100/105** | **100/105** |
+| All OMs are finalized atomically via Ratis      | **105/105** | **105/105** | **105/105** |
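
The transitions in this table can be played through with a small illustrative
simulation (helper names are hypothetical, not Ozone code):

```python
# Illustrative only: each OM is tracked as (apparent_version, software_version).
# A rolling restart with new bits changes only the software version;
# finalization raises the apparent version on every OM atomically.

def restart_with(oms, name, new_software):
    oms = dict(oms)
    apparent, _ = oms[name]
    oms[name] = (apparent, new_software)
    return oms

def finalize_all(oms):
    # Invariant: every OM must already be running the same new bits.
    assert len({sw for _, sw in oms.values()}) == 1
    return {name: (sw, sw) for name, (_, sw) in oms.items()}

oms = {"OM1": (100, 100), "OM2": (100, 100), "OM3": (100, 100)}
oms = restart_with(oms, "OM1", 105)   # OM1 is now 100/105
oms = restart_with(oms, "OM2", 105)   # OM2 is now 100/105
oms = restart_with(oms, "OM3", 105)   # all OMs running the new bits
oms = finalize_all(oms)               # all OMs become 105/105
```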
+
+Later sections on upgrade flow and ordering will detail how these invariants 
are enforced. Note that external clients can be in any version and we will 
support full client/server version cross compatibility between internal 
components and external clients.
+
+### Usage of the Versioning Framework During Upgrades
+
+When a cluster is running, its version will be stored on disk as the apparent 
version. An upgrade is triggered when a process is started with a newer version 
than the apparent version written to disk. On startup, the process can read the 
apparent version from disk and notice that its software version is higher. 
Since it has not been finalized, it will then “act as” this earlier apparent 
version until it is later finalized. In this state, the code must be 
implemented such that the API surface, the API behaviour and the on-disk format 
of persisted data are identical to the older versions. Even though the new 
version can have new features, APIs, and persist different data to disk, they 
must all be feature gated and unavailable until the upgrade is finalized. This 
will maintain a consistent API surface for clients despite internal components 
having different versions. This will be the case for ZDU upgrades and 
non-rolling upgrades.
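
A feature gate keyed on the apparent version might look like the following
sketch (the feature name and version mapping are made up for illustration):

```python
# Illustrative only: a feature introduced in version 105 stays unavailable
# while the component still "acts as" apparent version 100, even though the
# running software already contains the new code.

FEATURE_INTRODUCED_AT = {"some-new-feature": 105}   # hypothetical mapping

def feature_allowed(feature: str, apparent_version: int) -> bool:
    return apparent_version >= FEATURE_INTRODUCED_AT[feature]
```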
+
+For external clients, the apparent version is what will be communicated from 
the server to provide their view of the server’s version. This differs from the 
current model where clients receive the static component version which is 
always defined by the latest version the software supports. While a newer 
external client could in theory attempt to use some of the new features before 
finalization, which would result in an error, this is unlikely because Ozone 
clients are version aware and are coded not to attempt calls that the server's 
reported version does not support.
+
+For internal components, the “new client old server” invariant makes version 
passing among internal components of different types mostly unnecessary. For 
example, SCM does not need to worry about whether the OM client it is 
communicating with has the new bits or whether the OM has been finalized. SCM’s 
server will always be newer and finalized before OM. Therefore it can remain 
backwards compatible and will work with the OM in either case.
+
+Recon to OM communication will be the only case where an internal client is 
newer than the server, and therefore the only case where we need to do version 
checks between components using an internal API. The newer Recon client may 
need to learn the older OM's apparent version to handle compatibility during 
the upgrade.
+
+SCM will need to know the software and apparent versions of the Datanodes, but 
not for API compatibility. By using the same HDDS component version instead of 
separate SCM and Datanode versions, SCM can accurately track the expected 
apparent and software versions of Datanodes to either instruct them to finalize 
or fence them out of the cluster. If there was a discrepancy among versions 
being reported by Datanodes and SCM was using its own separate SCM version, it 
would have no source of truth. The HDDS versions of the Datanodes should not 
need to be checked for client/server compatibility during heartbeat processing 
because SCM's server will always be newer.
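
A sketch of the per-heartbeat decision follows (the names and the exact
fencing policy are assumptions for illustration, not the Ozone implementation):

```python
# Illustrative only: once SCM is finalized at a target HDDS version, it can
# decide per heartbeat whether a datanode should be told to finalize or be
# fenced out because it never received the new bits.

def heartbeat_action(dn_apparent: int, dn_software: int,
                     scm_apparent: int) -> str:
    if dn_software < scm_apparent:
        return "FENCE"       # datanode missed the rolling restart
    if dn_apparent < scm_apparent:
        return "FINALIZE"    # has the new bits, instruct it to finalize
    return "NONE"            # already finalized
```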
+
+### Migrating to the New Unified Component Version Framework
+
+The existing component version enum will be the basis of the new unified 
versioning framework. This is because it is shared with external clients who 
can be in any version and may contact the cluster at any time. The existing 
layout feature enum is internal to components, and therefore easier to control 
during the migration.
+
+To migrate to one single layout version, we will add a new software version 
“100” to each existing component version enum. Version 100 will universally 
indicate the first version that is ZDU ready, and the point from which this 
unified version will be used to track all changes through the existing 
component version enum.
+
+Note that the version number we use for this migration must be larger than 
both the largest existing component version and largest existing layout version 
to prevent either one from appearing to go back in time before the migrated 
version is finalized. 100 was chosen as an easily identifiable number that can 
be used across all components to indicate the epoch from which they all have 
migrated to the unified framework and support rolling upgrade.
+
+This migration will be transparent in client/server interactions for network 
changes. It will simply appear as a new larger version with all the previous 
versions in the existing component version enum still intact.
+
+This migration will need some handling for disk changes. When the upgraded 
component starts up with software version 100 and sees a version less than that 
persisted to the disk, it must use the old `LayoutFeature` enum to look up that 
version until the cluster is finalized. After finalization, version 100 will be 
written to the disk and all versions from here on can be referenced from the 
`ComponentVersion` enum.
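
The dual lookup during migration can be sketched as follows (both tables below
are placeholders, not the real enum contents):

```python
# Illustrative only: a persisted version below 100 is resolved through the old
# layout-feature table; from 100 onward it is resolved through the unified
# component-version table. Both tables here are made-up placeholders.

OLD_LAYOUT_FEATURES = {1: "FEATURE_A", 2: "FEATURE_B", 3: "FEATURE_C"}
UNIFIED_VERSIONS = {100: "ZDU_BASELINE", 105: "FEATURE_D"}

def resolve_disk_version(persisted: int) -> str:
    if persisted < 100:
        return OLD_LAYOUT_FEATURES[persisted]
    return UNIFIED_VERSIONS[persisted]
```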
+
+In the current code, Datanodes use their own `DatanodeVersion` and there is 
no `ScmVersion`. However, Datanodes and SCM share the same 
`HDDSLayoutFeature` for disk versioning. We need to collapse these into a 
single `HDDSVersion` in the new versioning framework. The existing 
`DatanodeVersion` can simply be renamed to `HDDSVersion`, since there is no 
`ScmVersion` to merge it with. From there, migrating from `HDDSLayoutFeature` 
to `HDDSVersion` can be done using the same process outlined above.
+
+## Strategy To Achieve ZDU
+
+### Prerequisites
+
+Before an Ozone cluster can use ZDU in an upgrade, the initial version being 
upgraded must also support ZDU. All software versions from 100 onward will be 
ZDU ready, and any Ozone changes after version 100 have to be made in a ZDU 
compatible way. We can say that version 100 is the minimal eligible version for 
ZDU. For example, a cluster would need to be upgraded from version 5 to 
105 with the existing non-rolling upgrade process. All upgrades starting from 
version 105 could then optionally be done with ZDU or non-rolling upgrades.
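
As a trivial illustration of the eligibility rule:

```python
# Illustrative only: ZDU is possible only when the version being upgraded
# FROM already supports it; version 100 is the minimal eligible version.

ZDU_MINIMUM = 100

def can_use_zdu(from_version: int) -> bool:
    return from_version >= ZDU_MINIMUM
```

A cluster at version 5 fails this check and must use a non-rolling upgrade
first; a cluster at version 105 passes and may use either upgrade style.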
+
+### Invariants
+
+This is a summary of invariants for internal components outlined in earlier 
sections which will be maintained during the upgrade flow. These provide a 
framework for developers to reason about changes during the upgrade:
+
+* Internal components of different types will always have the server in the 
same or newer version as the client.  
+  * The only exception is Recon to OM communication.  
+* Internal components of the same type will always operate at the same 
apparent version.  
+  * This implies that they expose the same API surface and persist data in the 
same format.  
+* At the time of finalization, all internal components must be running the new 
bits.  
+* Internal components of different types will always have the server side 
finalize before the client side.
+
+### Order of Operations During the Upgrade
+
+This is a high level summary of all the steps that will happen during a 
rolling upgrade. For a more detailed view, see the 
[appendix](#appendix-step-by-step-zdu-process). Note that rolling 
upgrade has stricter requirements than non-rolling upgrade, so this process can 
also be used for a non-rolling upgrade by performing steps 1-5 at the same time.
+
+1. Deploy the new software version to SCM and rolling restart the SCMs.  
+2. Deploy the new software version to Recon and restart Recon.  
+3. Deploy the new software version to all datanodes and rolling restart the 
DNs.  
+4. Deploy the new software version to all OMs and rolling restart the OMs.  
+5. Deploy the new software and rolling restart all client processes like S3 
Gateway, HTTPFS, Prometheus etc. These processes are all Ozone clients and sit 
somewhat outside of the core Ozone cluster.
+    - At this stage, the cluster is operating with the new software version, 
but is still “acting as” the older apparent version. No data will be written to 
disk in a new format, and new features will be unavailable.
+6. The finalize command is sent to SCM by the admin - this is what is used to 
switch the cluster to act as the new version. Upon receipt of the finalize 
command:
+   1. SCM will finalize itself over Ratis, saving the new finalized version.
+   2. It will notify datanodes over the heartbeat to finalize.
+   3. After all healthy datanodes have been finalized, OM can be finalized. To 
do this, OM will have been polling SCM periodically to see if it should 
finalize. Only after SCM and all datanodes have been finalized will OM get a 
“ready to finalize” response from the poll. The OM leader will then send a 
finalize command over Ratis to all OMs.
+   4. As OM is the entry point to the cluster for external clients, 
finalizing OM unlocks any new features in the upgraded version.
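
The finalization sequence can be summarized in a short sketch (all names are
hypothetical; the real implementation uses Ratis and heartbeat responses rather
than direct calls):

```python
# Illustrative only: ordering of cluster finalization. SCM finalizes first,
# then the datanodes, and only then OM, which has been polling SCM.

def finalize_cluster(scm, datanodes, oms):
    events = []
    # SCM finalizes itself (over Ratis in the real system).
    scm["finalized"] = True
    events.append("scm-finalized")
    # Datanodes are told to finalize via heartbeat responses.
    for dn in datanodes:
        dn["finalized"] = True
        events.append("dn-finalized")
    # OM's periodic poll of SCM now returns "ready to finalize", and the
    # OM leader replicates a finalize command over Ratis to all OMs.
    if scm["finalized"] and all(dn["finalized"] for dn in datanodes):
        for om in oms:
            om["finalized"] = True
        events.append("om-finalized")
    return events
```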

Review Comment:
   These don't render correctly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

