[
https://issues.apache.org/jira/browse/IGNITE-28381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mikhail Petrov updated IGNITE-28381:
------------------------------------
Description:
The problem was reproduced during
org.apache.ignite.client.ReliabilityTest#testFailover test on TC. See
attachments for the full log.
Exception:
{code:java}
java.lang.AssertionError
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture.onPrimaryResponse(GridNearAtomicUpdateFuture.java:400)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateResponse(GridDhtAtomicCache.java:3236)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:311)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:306)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1084)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:594)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:396)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1862)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1483)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1387)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.thread.pool.IgniteStripedExecutor$Stripe.body(IgniteStripedExecutor.java:638)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
[04:31:10] [org.apache.ignite:ignite-indexing] at
java.base/java.lang.Thread.run(Thread.java:829)
{code}
The root cause of the assertion error:
Consider a cluster consisting of 3 nodes - node0, node1, node2
1. node2 is gracefully stopped
2. PME is started on node0 and node 1
3. PME is completed on node 0. So its topology version is now V+1.
4. Ignite#close method is called to gracefully stop the node 1
5. PME future on node1 is cancelled with the following exception
{code:java}
org.apache.ignite.internal.NodeStoppingException: Node is stopping:
e2c0e324-70b6-45b8-925d-fe9a9fa5d877
at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.onKernalStop0(GridCachePartitionExchangeManager.java:842)
at
org.apache.ignite.internal.processors.cache.GridCacheSharedManagerAdapter.onKernalStop(GridCacheSharedManagerAdapter.java:131)
at
org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStop(GridCacheProcessor.java:819)
at org.apache.ignite.internal.IgniteKernal.stop0(IgniteKernal.java:1782)
at org.apache.ignite.internal.IgniteKernal.stop(IgniteKernal.java:1731)
at
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop0(IgnitionEx.java:2297)
at
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop(IgnitionEx.java:2117)
at org.apache.ignite.internal.IgnitionEx.stop(IgnitionEx.java:308)
at org.apache.ignite.Ignition.stop(Ignition.java:202)
at org.apache.ignite.internal.IgniteKernal.close(IgniteKernal.java:2686)
at
org.apache.ignite.client.LocalIgniteCluster.failNode(LocalIgniteCluster.java:110)
at
org.apache.ignite.client.ReliabilityTest.lambda$assertOnUnstableCluster$17(ReliabilityTest.java:714)
at
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at
java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
{code}
6. GridCachePartitionExchangeManager#onExchangeDone and
GridDhtPartitionTopology#onExchangeDone are not called on node 1 because the
PME future was completed with the exception.
7.Even when stopped, node 1 can receive and process requests from other nodes
for a short period of time. Node 1 (primary) receives
GridDhtAtomicAbstractUpdateRequest from node 0 (near). Which was mapped by node
0 to V+1 topology version.
8. PME future was completed on node 1 so nothing prevents it from handling the
request (see GridMessageListener#onMessage)
9. The cache request is handled by GridDhtAtomicCache#updateAllAsyncInternal0
10. GridDhtCacheAdapter#needRemap check returns true because cache request was
mapped on V+1 topology version but node1 local version is V (see
`cctx.topology().readyTopologyVersion()` call in
GridCacheAffinityManager#isCompatibleWithCurrentTopologyVersion method and
clause 6).
11. node1 sends response to node0 with remap topology version == V+1. See
`res.remapTopologyVersion(top.lastTopologyChangeVersion());` call.
`top.lastTopologyChangeVersion()` returns V+1 because even PME future failed,
it was started and related to V+1 topology version.
12. node0 receives node1 response and fails with the mentioned above stack
trace
Actually it may differ whether single cache put is called or putAll.
was:
The problem was reproduced during
org.apache.ignite.client.ReliabilityTest#testFailover test on TC. See
attachments for the full log.
Exception:
{code:java}
java.lang.AssertionError
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture.onPrimaryResponse(GridNearAtomicUpdateFuture.java:400)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateResponse(GridDhtAtomicCache.java:3236)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:311)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:306)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1084)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:594)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:396)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1862)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1483)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1387)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.thread.pool.IgniteStripedExecutor$Stripe.body(IgniteStripedExecutor.java:638)
[04:31:10] [org.apache.ignite:ignite-indexing] at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
[04:31:10] [org.apache.ignite:ignite-indexing] at
java.base/java.lang.Thread.run(Thread.java:829)
{code}
> Assertion Error during Atomic Cache operation topology remapping
> ----------------------------------------------------------------
>
> Key: IGNITE-28381
> URL: https://issues.apache.org/jira/browse/IGNITE-28381
> Project: Ignite
> Issue Type: Bug
> Reporter: Mikhail Petrov
> Assignee: Mikhail Petrov
> Priority: Major
> Labels: ise
> Fix For: 2.19
>
> Attachments: _Apache_Ignite_2.x_Tests_Thin_Client_Java_16197.log.zip
>
>
> The problem was reproduced during
> org.apache.ignite.client.ReliabilityTest#testFailover test on TC. See
> attachments for the full log.
> Exception:
> {code:java}
> java.lang.AssertionError
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture.onPrimaryResponse(GridNearAtomicUpdateFuture.java:400)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateResponse(GridDhtAtomicCache.java:3236)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:311)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:306)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1084)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:594)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:396)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1862)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1483)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1387)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.thread.pool.IgniteStripedExecutor$Stripe.body(IgniteStripedExecutor.java:638)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
> [04:31:10] [org.apache.ignite:ignite-indexing] at
> java.base/java.lang.Thread.run(Thread.java:829)
> {code}
> The root cause of the assertion error:
> Consider a cluster consisting of 3 nodes - node0, node1, node2
> 1. node2 is gracefully stopped
> 2. PME is started on node0 and node 1
> 3. PME is completed on node 0. So its topology version is now V+1.
> 4. Ignite#close method is called to gracefully stop the node 1
> 5. PME future on node1 is cancelled with the following exception
> {code:java}
> org.apache.ignite.internal.NodeStoppingException: Node is stopping:
> e2c0e324-70b6-45b8-925d-fe9a9fa5d877
> at
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.onKernalStop0(GridCachePartitionExchangeManager.java:842)
> at
> org.apache.ignite.internal.processors.cache.GridCacheSharedManagerAdapter.onKernalStop(GridCacheSharedManagerAdapter.java:131)
> at
> org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStop(GridCacheProcessor.java:819)
> at org.apache.ignite.internal.IgniteKernal.stop0(IgniteKernal.java:1782)
> at org.apache.ignite.internal.IgniteKernal.stop(IgniteKernal.java:1731)
> at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop0(IgnitionEx.java:2297)
> at
> org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.stop(IgnitionEx.java:2117)
> at org.apache.ignite.internal.IgnitionEx.stop(IgnitionEx.java:308)
> at org.apache.ignite.Ignition.stop(Ignition.java:202)
> at org.apache.ignite.internal.IgniteKernal.close(IgniteKernal.java:2686)
> at
> org.apache.ignite.client.LocalIgniteCluster.failNode(LocalIgniteCluster.java:110)
> at
> org.apache.ignite.client.ReliabilityTest.lambda$assertOnUnstableCluster$17(ReliabilityTest.java:714)
> at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> at
> java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
> {code}
> 6. GridCachePartitionExchangeManager#onExchangeDone and
> GridDhtPartitionTopology#onExchangeDone are not called on node 1 because the
> PME future was completed with the exception.
> 7.Even when stopped, node 1 can receive and process requests from other nodes
> for a short period of time. Node 1 (primary) receives
> GridDhtAtomicAbstractUpdateRequest from node 0 (near). Which was mapped by
> node 0 to V+1 topology version.
> 8. PME future was completed on node 1 so nothing prevents it from handling
> the request (see GridMessageListener#onMessage)
> 9. The cache request is handled by GridDhtAtomicCache#updateAllAsyncInternal0
> 10. GridDhtCacheAdapter#needRemap check returns true because cache request
> was mapped on V+1 topology version but node1 local version is V (see
> `cctx.topology().readyTopologyVersion()` call in
> GridCacheAffinityManager#isCompatibleWithCurrentTopologyVersion method and
> clause 6).
> 11. node1 sends response to node0 with remap topology version == V+1. See
> `res.remapTopologyVersion(top.lastTopologyChangeVersion());` call.
> `top.lastTopologyChangeVersion()` returns V+1 because even PME future failed,
> it was started and related to V+1 topology version.
> 12. node0 receives node1 response and fails with the mentioned above stack
> trace
> Actually it may differ whether single cache put is called or putAll.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)