chesnokoff commented on code in PR #13217:
URL: https://github.com/apache/ignite/pull/13217#discussion_r3380829177
##########
modules/core/src/main/java/org/apache/ignite/internal/processors/rollingupgrade/RollingUpgradeProcessor.java:
##########
@@ -85,26 +78,25 @@ public RollingUpgradeProcessor(GridKernalContext ctx) {
/** {@inheritDoc} */
@Override public void onKernalStart(boolean active) throws
IgniteCheckedException {
- DiscoverySpi spi = ctx.config().getDiscoverySpi();
-
- if (spi instanceof TcpDiscoverySpi)
- ring = ((TcpDiscoverySpi)spi).discoveryRing();
-
startLatch.countDown();
}
/** {@inheritDoc} */
@Override public void start() throws IgniteCheckedException {
- ctx.event().addLocalEventListener(new GridLocalEventListener() {
- @Override public void onEvent(Event evt) {
+ ctx.event().addLocalEventListener(
+ evt -> {
UUID nodeId = ((DiscoveryEvent)evt).eventNode().id();
synchronized (lock) {
if (lastJoiningNode != null &&
lastJoiningNode.id().equals(nodeId))
Review Comment:
```suggestion
ClusterNode eventNode = ((DiscoveryEvent)evt).eventNode();
synchronized (lock) {
if (lastJoiningNode == eventNode)
lastJoiningNode = null;
}
```
Looks like there is a race here. Suppose coordinator node **A** has version
2.18, and node **B** with version 2.19 tries to join.
On the first attempt, node **B** passes
`RollingUpgradeProcessor#validateNode`, so `lastJoiningNode` points to this
attempt. Then it fails in a later validation step, and
`EVT_NODE_VALIDATION_FAILED` is queued asynchronously. After that, node **B**
immediately retries with the same node ID and passes
`RollingUpgradeProcessor#validateNode` again, so `lastJoiningNode` points to
the second attempt.
After that, the old `EVT_NODE_VALIDATION_FAILED` from the first attempt can
be delivered. Since we compare only by node ID, this old event clears
`lastJoiningNode` for the second attempt. If `disable` is called before the
second attempt is added to the `ring`, it can succeed. As a result, rolling
upgrade can be disabled while a 2.19 node is still joining a 2.18 cluster.
We should match the exact join attempt, not only the node ID.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]