[ https://issues.apache.org/jira/browse/IGNITE-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pavel Kovalenko updated IGNITE-7871:
------------------------------------
    Description:
Using the validation implemented in IGNITE-7467, we can observe the following situation.

Suppose a partition is owned by nodes N1 (primary) and N2 (backup).

1) Exchange is started.
2) N2 finishes waiting for partitions release and starts creating its Single message (with update counters).
3) N1 is still waiting for partitions release.
4) There is a pending cache update N1 -> N2. This update completes after step 2.
5) This update increments the update counters on both N1 and N2.
6) N1 finishes waiting for partitions release, while N2 has already sent its Single message to the coordinator with an outdated update counter.
7) The coordinator sees different partition update counters for N1 and N2. Validation fails even though the data is identical.

Solution:
Every server node participating in PME should wait until all other server nodes have finished their ongoing updates (that is, have finished their own wait for partitions release).

  was:
(Same scenario, steps 1-7, as above.)

Possible solutions:
1) Cancel transactions and atomic updates on backups if the topology version on them has already changed (or waiting for partitions release has finished).
2) Each node participating in the exchange should wait for partitions release on the other nodes, not only on itself (like a distributed countdown latch right after waiting for partitions release).

> Implement 2-phase waiting for partition release
> -----------------------------------------------
>
>                 Key: IGNITE-7871
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7871
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.4
>            Reporter: Pavel Kovalenko
>            Assignee: Pavel Kovalenko
>            Priority: Major
>             Fix For: 2.5
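
The race in steps 1-7 and the "distributed countdown latch" idea can be sketched as a small single-JVM simulation. This is only an illustration under stated assumptions, not Ignite code: the class name TwoPhaseReleaseSketch, the run method, and all variable names are hypothetical, two threads stand in for N1 and N2, and a java.util.concurrent.CountDownLatch stands in for whatever distributed mechanism the actual fix uses.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-JVM simulation of the scenario; not Ignite code.
class TwoPhaseReleaseSketch {
    /** Returns {counter reported by N1, counter reported by N2}. */
    static long[] run(boolean twoPhase) throws InterruptedException {
        AtomicLong cntrN1 = new AtomicLong(); // partition update counter on N1
        AtomicLong cntrN2 = new AtomicLong(); // partition update counter on N2
        CountDownLatch n2Released = new CountDownLatch(1);  // forces step 2 before step 4
        CountDownLatch allReleased = new CountDownLatch(2); // phase 2: every node released
        long[] reported = new long[2];

        Thread n1 = new Thread(() -> {
            awaitQuietly(n2Released);      // step 3: N1 is still waiting
            cntrN1.incrementAndGet();      // steps 4-5: pending update N1 -> N2
            cntrN2.incrementAndGet();
            allReleased.countDown();       // step 6: N1 finished its local wait
            if (twoPhase)
                awaitQuietly(allReleased); // phase 2: wait for the other node
            reported[0] = cntrN1.get();
        });

        Thread n2 = new Thread(() -> {
            if (!twoPhase)
                reported[1] = cntrN2.get(); // step 2: counter captured too early
            n2Released.countDown();         // N2 finished its local wait
            allReleased.countDown();
            if (twoPhase) {
                awaitQuietly(allReleased);  // phase 2: wait until N1 is done too
                reported[1] = cntrN2.get(); // counter captured after all updates
            }
        });

        n1.start();
        n2.start();
        n1.join();
        n2.join();
        return reported;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long[] one = run(false);
        long[] two = run(true);
        System.out.println("one-phase: N1=" + one[0] + ", N2=" + one[1]); // one-phase: N1=1, N2=0
        System.out.println("two-phase: N1=" + two[0] + ", N2=" + two[1]); // two-phase: N1=1, N2=1
    }
}
```

With a single phase, N2 reports a counter of 0 while N1 reports 1, which is the mismatch the coordinator sees in step 7. With the second phase, neither node reads its counter until both have finished their local wait, so the reported counters agree.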