[ https://issues.apache.org/jira/browse/IGNITE-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mikhail Petrov updated IGNITE-27109:
------------------------------------
Description:
An IgniteCache#putAll call may succeed even though some of the passed entries are not
stored in the cache. This may happen for ATOMIC FULL_SYNC caches when a node
leaves the cluster during IgniteCache#putAll execution. Even though putAll is
expected to be able to fail partially for atomic caches, the user should still
get a CachePartialUpdateException.
The problem is reproduced by the ReliabilityTest.testFailover test. Cache
configuration: ATOMIC, REPLICATED, FULL_SYNC.
See:
https://ci2.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8360567487297938069&tab=testDetails&branch_IgniteTests24Java8=%3Cdefault%3E
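For reference, a minimal sketch of the expected user-visible contract (assuming a default node configuration and a hypothetical cache name; illustrative only): on an ATOMIC, REPLICATED, FULL_SYNC cache a putAll that cannot store every entry should surface a CachePartialUpdateException instead of returning silently.
{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CachePartialUpdateException;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class PutAllPartialFailureSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Cache configuration matching the failing test: ATOMIC, REPLICATED, FULL_SYNC.
            CacheConfiguration<Integer, Integer> ccfg = new CacheConfiguration<Integer, Integer>("test-cache")
                .setAtomicityMode(CacheAtomicityMode.ATOMIC)
                .setCacheMode(CacheMode.REPLICATED)
                .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);

            IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache(ccfg);

            Map<Integer, Integer> batch = new HashMap<>();
            for (int i = 0; i < 1_000; i++)
                batch.put(i, i);

            try {
                // If a primary node stops mid-operation, some entries may not be stored.
                cache.putAll(batch);
            }
            catch (CachePartialUpdateException e) {
                // Expected contract for atomic caches: the keys that were not stored are reported.
                System.err.println("Entries not stored for keys: " + e.failedKeys());
            }
        }
    }
}
{code}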
Explanation:
Consider a cluster with 3 nodes: node0, node1, node2.
1. node0 accepts the putAll request, maps all keys to their corresponding primary nodes
and sends a GridNearAtomicFullUpdateRequest to node1 and node2.
2. node1 starts processing cache entries. Halfway through this process node1
receives a stop signal (Ignite#close). All remaining attempts to process cache
entries fail with an exception - see
IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#invoke and
IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#operationCancelledException.
3. node1 manages to send a GridDhtAtomicUpdateRequest with the entries it
processed before it was stopped to the backups (node2 and node0).
4. node1 fails to send a GridNearAtomicUpdateResponse with the failed keys to node0
because the node is stopping (see GridCacheIoManager#onSend). This message is what
tells the "near" node that some keys could not be processed and that the
operation should be terminated with an exception.
5. node0 and node2 process the entries from the GridDhtAtomicUpdateRequests and send
GridDhtAtomicNearResponses to node0.
6. node1 is removed from the cluster.
7. Currently node0 does not wait for node1 (the primary node for some of the keys) to
respond in FULL_SYNC mode. node0 completes the putAll operation once
GridDhtAtomicNearResponses are received from all backups. But the backups do not
inform node0 (the near node) that some putAll entries were not processed, so the
operation completes successfully. A simplified model of this completion check is
sketched after this list.
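To make step 7 concrete, below is a hypothetical, heavily simplified model of the near node's completion check. It is NOT Ignite's actual GridNearAtomicUpdateFuture code (all class and method names here are made up); it only illustrates why the putAll future can complete without ever hearing from the primary.
{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/**
 * Hypothetical, simplified model of the current near-node completion check
 * described in step 7. Not Ignite's actual code.
 */
class CurrentNearUpdateCheckModel {
    /** Backups from which a GridDhtAtomicNearResponse is still expected (step 5). */
    private final Set<UUID> pendingBackups;

    private boolean completed;

    CurrentNearUpdateCheckModel(Set<UUID> backups) {
        pendingBackups = new HashSet<>(backups);
    }

    /** A backup acknowledged its part of the update. */
    void onBackupResponse(UUID backupId) {
        pendingBackups.remove(backupId);

        // Current behaviour: only backup acknowledgements are required, so the
        // operation completes "successfully" even though the primary never sent
        // the GridNearAtomicUpdateResponse listing the keys it failed to process.
        if (pendingBackups.isEmpty())
            completed = true;
    }

    /** The primary's own GridNearAtomicUpdateResponse (step 4); never sent in this scenario. */
    void onPrimaryResponse() {
        // Deliberately a no-op here: the current check does not depend on it.
    }

    boolean isCompleted() {
        return completed;
    }
}
{code}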
Proposal:
If the primary node fails to process any entries during putAll, send the
GridDhtAtomicUpdateRequest to the backups with DHT_ATOMIC_HAS_RESULT_MASK==false.
This makes the near node wait for the primary node to respond even after all
responses from the backups have been received (see the sketch below).
As a result, if the primary node fails to process any entries, it either manages
to send a GridNearAtomicUpdateResponse to the near node and putAll completes with
a failure, or the primary node leaves without sending a GridNearAtomicUpdateResponse
and the entries that were mapped to the departed primary node are remapped onto the
new topology by the near node.
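A hypothetical sketch of the proposed check (again not actual Ignite code; everything except the DHT_ATOMIC_HAS_RESULT_MASK flag name is made up): when the primary clears the "has result" flag, the near node also has to wait for the primary's response, or for the primary to leave and trigger a remap, before completing.
{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/**
 * Hypothetical model of the proposed near-node completion check. When the
 * primary failed to process some entries it sends the GridDhtAtomicUpdateRequest
 * with DHT_ATOMIC_HAS_RESULT_MASK == false, so the near node must also wait for
 * the primary's GridNearAtomicUpdateResponse.
 */
class ProposedNearUpdateCheckModel {
    private final Set<UUID> pendingBackups;

    /** Mirrors DHT_ATOMIC_HAS_RESULT_MASK as sent by the primary to the backups. */
    private final boolean backupsCarryResult;

    private boolean primaryResponded;
    private boolean completed;

    ProposedNearUpdateCheckModel(Set<UUID> backups, boolean backupsCarryResult) {
        pendingBackups = new HashSet<>(backups);
        this.backupsCarryResult = backupsCarryResult;
    }

    /** A backup acknowledged its part of the update. */
    void onBackupResponse(UUID backupId) {
        pendingBackups.remove(backupId);
        checkComplete();
    }

    /** The primary reported the outcome, including any failed keys. */
    void onPrimaryResponse() {
        primaryResponded = true;
        checkComplete();
    }

    /** The primary left before responding: its keys are remapped on the new topology. */
    void onPrimaryLeft() {
        // Modelled as simply not completing; the near node retries after remap.
    }

    private void checkComplete() {
        // All backups acknowledged AND either the backups already carry the final
        // result (happy path) or the primary itself has reported the outcome.
        if (pendingBackups.isEmpty() && (backupsCarryResult || primaryResponded))
            completed = true;
    }

    boolean isCompleted() {
        return completed;
    }
}
{code}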
> IgniteCache#putAll may silently lose entries while any primary node is
> leaving the cluster
> ------------------------------------------------------------------------------------------
>
> Key: IGNITE-27109
> URL: https://issues.apache.org/jira/browse/IGNITE-27109
> Project: Ignite
> Issue Type: Bug
> Reporter: Mikhail Petrov
> Assignee: Mikhail Petrov
> Priority: Major
> Labels: ise
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)