[ https://issues.apache.org/jira/browse/IGNITE-17279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563003#comment-17563003 ]

Vladislav Pyatkov commented on IGNITE-17279:
--------------------------------------------

LGTM

> Mapping of partition states to nodes can erroneously skip lost partitions on 
> the coordinator node
> -------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-17279
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17279
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vyacheslav Koptilin
>            Assignee: Vyacheslav Koptilin
>            Priority: Minor
>             Fix For: 2.14
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> It seems that the coordinator node does not correctly update the node2part 
> mapping for lost partitions. 
> {noformat}
> [test-runner-#1%distributed.CachePartitionLostAfterSupplierHasLeftTest%][root] dump partitions state for <default>:
> ----preload sync futures----
> nodeId=b57ca812-416d-40d7-bb4f-271994900000 consistentId=distributed.CachePartitionLostAfterSupplierHasLeftTest0 isDone=true
> nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 consistentId=distributed.CachePartitionLostAfterSupplierHasLeftTest1 isDone=true
> ----rebalance futures----
> nodeId=b57ca812-416d-40d7-bb4f-271994900000 isDone=true res=true topVer=null
> remaining: {}
> nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 isDone=true res=false topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0]
> remaining: {}
> ----partition state----
> localNodeId=b57ca812-416d-40d7-bb4f-271994900000 grid=distributed.CachePartitionLostAfterSupplierHasLeftTest0
> local part=0 counters=Counter [lwm=200, missed=[], maxApplied=200, hwm=200] fullSize=200 *state=LOST* reservations=0 isAffNode=true
>  nodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 part=0 *state=LOST* isAffNode=true
> ...
> localNodeId=20fdfa4a-ddf6-4229-b25e-38cd8d300001 grid=distributed.CachePartitionLostAfterSupplierHasLeftTest1
> local part=0 counters=Counter [lwm=0, missed=[], maxApplied=0, hwm=0] fullSize=100 *state=LOST* reservations=0 isAffNode=true
>  nodeId=b57ca812-416d-40d7-bb4f-271994900000 part=0 *state=OWNING* isAffNode=true
> ...
> {noformat}
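>
> The dump above shows the inconsistency directly: the second node's copy of 
> the full map still marks partition 0 on the first node as OWNING, while the 
> first node itself reports that partition as LOST. The following is a minimal 
> sketch of the kind of cross-check that exposes such a mismatch; the types 
> below are simplified stand-ins, not Ignite's actual internals:
> {code:java}
> import java.util.Map;
> import java.util.UUID;
>
> /**
>  * Sketch: compare each node's locally reported partition state with the
>  * state other nodes hold for it in their copy of the full map. A mismatch
>  * (e.g. LOST locally vs. OWNING remotely) means the coordinator propagated
>  * a stale node-to-partition mapping.
>  */
> class FullMapCrossCheck {
>     enum PartState { MOVING, OWNING, RENTING, EVICTED, LOST }
>
>     /**
>      * @param localStates Each node's own partition states: node -> (part -> state).
>      * @param fullMaps    Each node's view of the cluster: observer -> (observed -> (part -> state)).
>      */
>     static void crossCheck(
>         Map<UUID, Map<Integer, PartState>> localStates,
>         Map<UUID, Map<UUID, Map<Integer, PartState>>> fullMaps
>     ) {
>         fullMaps.forEach((observer, view) ->
>             view.forEach((observed, parts) ->
>                 parts.forEach((part, seenState) -> {
>                     PartState actual = localStates
>                         .getOrDefault(observed, Map.of())
>                         .get(part);
>
>                     // Flag entries where the observer's full map disagrees
>                     // with what the owning node reports about itself.
>                     if (actual != null && actual != seenState)
>                         System.out.printf(
>                             "Mismatch: %s sees part %d on %s as %s, but the owner reports %s%n",
>                             observer, part, observed, seenState, actual);
>                 })));
>     }
> }
> {code}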
> *Update*:
>     The root cause of the issue is that the coordinator node incorrectly 
> updates the mapping of nodes to partition states on PME (see 
> GridDhtPartitionTopologyImpl.node2part). It seems to me that the coordinator 
> node should set the partition state to LOST on all affinity nodes (if the 
> partition is considered LOST on the coordinator) before creating and sending 
> a “full map” message, as sketched below.
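>
> A minimal sketch of that approach, assuming a simplified model of node2part 
> (the types and method name below are illustrative stand-ins, not Ignite's 
> actual internals):
> {code:java}
> import java.util.Map;
> import java.util.Set;
> import java.util.UUID;
>
> /**
>  * Sketch of the proposed fix: before the coordinator builds the "full map"
>  * message on PME, every partition it considers LOST is forced to LOST on
>  * all affinity nodes in the node-to-partition mapping, so no node keeps a
>  * stale OWNING entry.
>  */
> class MarkLostSketch {
>     enum PartState { MOVING, OWNING, RENTING, EVICTED, LOST }
>
>     /**
>      * @param node2part Simplified node2part: node id -> (partition -> state).
>      * @param lostParts Partitions the coordinator considers LOST.
>      * @param affNodes  Affinity nodes per partition: partition -> node ids.
>      */
>     static void markLostOnAllAffinityNodes(
>         Map<UUID, Map<Integer, PartState>> node2part,
>         Set<Integer> lostParts,
>         Map<Integer, Set<UUID>> affNodes
>     ) {
>         for (Integer part : lostParts) {
>             for (UUID nodeId : affNodes.getOrDefault(part, Set.of())) {
>                 Map<Integer, PartState> parts = node2part.get(nodeId);
>
>                 // Overwrite whatever the node reported (e.g. a stale OWNING)
>                 // so the full map sent out on PME is consistent.
>                 if (parts != null)
>                     parts.put(part, PartState.LOST);
>             }
>         }
>     }
> }
> {code}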


