[ https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843791#comment-16843791 ]
Anton Vinogradov commented on IGNITE-10078: ------------------------------------------- Folks, We're currently checking fix using brandnew jepsen's test. We have inconsistent state at bank test [1] at master at node's failures. Checking PR now, will let you know about the results. [1] https://github.com/jepsen-io/jepsen/blob/master/ignite/src/jepsen/ignite/bank.clj > Node failure during concurrent partition updates may cause partition desync > between primary and backup. > ------------------------------------------------------------------------------------------------------- > > Key: IGNITE-10078 > URL: https://issues.apache.org/jira/browse/IGNITE-10078 > Project: Ignite > Issue Type: Bug > Reporter: Alexei Scherbakov > Assignee: Alexei Scherbakov > Priority: Major > Fix For: 2.8 > > Time Spent: 2.5h > Remaining Estimate: 0h > > This is possible if some updates are not written to WAL before node failure. > They will be not applied by rebalancing due to same partition counters in > certain scenario: > 1. Start grid with 3 nodes, 2 backups. > 2. Preload some data to partition P. > 3. Start two concurrent transactions writing single key to the same partition > P, keys are different > {noformat} > try(Transaction tx = client.transactions().txStart(PESSIMISTIC, > REPEATABLE_READ, 0, 1)) { > client.cache(DEFAULT_CACHE_NAME).put(k, v); > tx.commit(); > } > {noformat} > 4. Order updates on backup in the way such update with max partition counter > is written to WAL and update with lesser partition counter failed due to > triggering of FH before it's added to WAL > 5. Return failed node to grid, observe no rebalancing due to same partition > counters. > Possible solution: detect gaps in update counters on recovery and force > rebalance from a node without gaps if detected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)