[ 
https://issues.apache.org/jira/browse/IGNITE-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054651#comment-18054651
 ] 

Ignite TC Bot commented on IGNITE-27109:
----------------------------------------

{panel:title=Branch: [pull/12656/head] Base: [master] : No blockers 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/12656/head] Base: [master] : New Tests 
(2)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}
{color:#00008b}Cache 13{color} [[tests 
2|https://ci2.ignite.apache.org/viewLog.html?buildId=8823916]]
* {color:#013220}IgniteCacheTestSuite13: 
IgniteCacheAtomicFullSyncPartialUpdateAllTest.testCacheEntriesProcessingFailureCausedByNodeStop
 - PASSED{color}
* {color:#013220}IgniteCacheTestSuite13: 
IgniteCacheAtomicFullSyncPartialUpdateAllTest.testCacheEntriesProcessingFailureCausedByInterceptorException
 - PASSED{color}

{panel}
[TeamCity *--> Run :: All* 
Results|https://ci2.ignite.apache.org/viewLog.html?buildId=8824022&buildTypeId=IgniteTests24Java8_RunAll]

> IgniteCache#putAll may silently lose entries while any primary node is 
> leaving the cluster
> ------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-27109
>                 URL: https://issues.apache.org/jira/browse/IGNITE-27109
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Petrov
>            Assignee: Mikhail Petrov
>            Priority: Major
>              Labels: ise
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> IgniteCache#putAll call may succeed, but some of the passed entries will not 
> be stored in the cache. This may happen for ATOMIC FULL_SYNC caches when a 
> node leaves the cluster during IgniteCache#putAll execution. Even though it 
> is expected that putAll can partially fail for atomic caches, the user
> should still get a CachePartialUpdateException.
> The problem is reproduced by ReliabilityTest.testFailover test. Cache 
> configuration: ATOMIC, REPLICATED, FULL_SYNC
> See: 
> https://ci2.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8360567487297938069&tab=testDetails&branch_IgniteTests24Java8=%3Cdefault%3E
> Explanation :
> Consider a cluster with 3 nodes: node0, node1, node2.
> 1. node0 accepts putAll request, maps all keys to corresponding primary nodes 
> and sends GridNearAtomicFullUpdateRequest to node1 and node2.
> 2. node1 starts processing cache entries. Halfway through this process node1
> receives a stop signal (Ignite#close). All remaining attempts to process cache
> entries fail with an exception - see
> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#invoke and
> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#operationCancelledException.
> 3. node1 manages to send GridDhtAtomicUpdateRequest, containing the entries
> that it processed before it was stopped, to the backups (node2 and node0).
> 4. node1 fails to send GridNearAtomicUpdateResponse with failed keys to node0
> because the node is stopping (see GridCacheIoManager#onSend). This message
> indicates to the "near" node that some keys could not be processed and that
> the operation should be terminated with an exception.
> 5. node0 and node2 process the entries from the GridDhtAtomicUpdateRequest
> messages and send GridDhtAtomicNearResponse messages to node0.
> 6. node1 is removed from the cluster.
> 7. Currently node0 does not wait for node1 (the primary node for some keys)
> to respond in FULL_SYNC mode. node0 completes the putAll operation as soon as
> GridDhtAtomicNearResponse messages are received from all backups. But the
> backups do not inform node0 (the near node) that some putAll entries were not
> processed, so the operation completes successfully.
> Proposal:
> If the primary node fails to process any entries during putAll, it sends
> GridDhtAtomicUpdateRequest to backups with DHT_ATOMIC_HAS_RESULT_MASK==false.
> This causes the near node to wait for the primary node to respond even if all
> responses from backups have been received.
> As a result, if the primary node fails to process any entries, it either
> manages to send GridNearAtomicUpdateResponse to the near node and putAll
> completes with a failure, or the primary node leaves without sending
> GridNearAtomicUpdateResponse and the entries that were mapped to the departed
> primary node are remapped to the new topology by the near node.
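
The completion rule described above can be sketched as a small state model. This is not Ignite code: the class NearPutAllSketch, the Outcome enum, and the decide method are hypothetical names introduced only for illustration, and the boolean hasResult merely stands in for what the description calls DHT_ATOMIC_HAS_RESULT_MASK. It shows, under those assumptions, why the near node currently reports success and how clearing the flag would force it to wait for the primary or remap.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical model of the near node's decision for keys mapped to one primary. */
public class NearPutAllSketch {
    enum Outcome { WAITING, COMPLETED_OK, FAILED_PARTIAL, REMAP }

    /**
     * @param hasResult           flag relayed by backups (models DHT_ATOMIC_HAS_RESULT_MASK)
     * @param allBackupsResponded true once every backup response has arrived
     * @param primaryFailedKeys   failed keys from the primary's response, or null if none arrived
     * @param primaryLeft         true once the primary has been removed from the topology
     */
    static Outcome decide(boolean hasResult, boolean allBackupsResponded,
                          Set<String> primaryFailedKeys, boolean primaryLeft) {
        // A response from the primary is always authoritative.
        if (primaryFailedKeys != null)
            return primaryFailedKeys.isEmpty() ? Outcome.COMPLETED_OK : Outcome.FAILED_PARTIAL;

        // Backups alone may complete the operation only when the flag allows it.
        if (allBackupsResponded && hasResult)
            return Outcome.COMPLETED_OK;

        // No primary response and the primary is gone: retry on the new topology.
        if (primaryLeft)
            return Outcome.REMAP;

        return Outcome.WAITING;
    }

    public static void main(String[] args) {
        // Behavior described in step 7: flag set, primary left silently.
        System.out.println(decide(true, true, null, true));   // COMPLETED_OK (entries lost)

        // Proposed behavior: flag cleared because the primary failed some entries.
        System.out.println(decide(false, true, null, false)); // WAITING
        System.out.println(decide(false, true, null, true));  // REMAP
        System.out.println(decide(false, true, Collections.singleton("k3"), false)); // FAILED_PARTIAL
    }
}
```

In this model the only difference between the lossy case and the safe cases is the value of hasResult, which matches the proposal's claim that the change can be confined to the request the primary sends to its backups.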



