[ https://issues.apache.org/jira/browse/CASSANDRA-18422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Capwell updated CASSANDRA-18422: -------------------------------------- Fix Version/s: 5.0 (was: 5.x) Since Version: NA Source Control Link: https://github.com/apache/cassandra/commit/10671745a254b0a7acf50310d7504896c9f2c584 Resolution: Fixed Status: Resolved (was: Ready to Commit) > CEP-15 (Accord) Original and recover coordinators may hit a race condition > with PreApply where reads and writes are interleaved, causing one of the > coordinators to see the writes from the other > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: CASSANDRA-18422 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18422 > Project: Cassandra > Issue Type: Bug > Components: Accord > Reporter: David Capwell > Assignee: David Capwell > Priority: Normal > Labels: pull-request-available > Fix For: 5.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > While verifying CASSANDRA-18364 I saw the following history violation in > simulator > {code} > [junit-timeout] Testcase: > simulationTest(org.apache.cassandra.simulator.test.ShortAccordSimulationTest)-.jdk1.8: > Caused an ERROR > [junit-timeout] Failed on seed 0xadaca81151490353 > [junit-timeout] org.apache.cassandra.simulator.SimulationException: Failed on > seed 0xadaca81151490353 > [junit-timeout] Caused by: java.lang.AssertionError: History violations > detected > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PaxosSimulation.logAndThrow(PaxosSimulation.java:315) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PaxosSimulation.isDone(PaxosSimulation.java:278) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PaxosSimulation$2.hasNext(PaxosSimulation.java:249) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PaxosSimulation.run(PaxosSimulation.java:224) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.AbstractPairOfSequencesPaxosSimulation.run(AbstractPairOfSequencesPaxosSimulation.java:297) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation.run(PairOfSequencesAccordSimulation.java:62) > [junit-timeout] at > org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:374) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.AccordSimulationRunner$Run.run(AccordSimulationRunner.java:39) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.AccordSimulationRunner$Run.run(AccordSimulationRunner.java:30) > [junit-timeout] at > org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:355) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.AccordSimulationRunner.main(AccordSimulationRunner.java:76) > [junit-timeout] at > org.apache.cassandra.simulator.test.ShortAccordSimulationTest.simulationTest(ShortAccordSimulationTest.java:32) > [junit-timeout] Suppressed: > org.apache.cassandra.simulator.paxos.HistoryViolation: Inconsistent sequences > on 1: [2, 0, 1, 6, 8, 9, 13, 14, 16, 19, 20, 22, 23, 25, 26, 28, 29, 31, 32, > 34, 35, 37, 40, 43, 47, 48, 49, 54, 56, 57, 58, 60, 64, 68, 70, 71, 74, 76, > 79, 80, 83, 85, 87, 87] vs [2, 0, 1, 6, 8, 9, 13, 14, 16, 19, 20, 22, 23, 25, > 26, 28, 29, 31, 32, 34, 35, 37, 40, 43, 47, 48, 49, 54, 56, 57, 58, 60, 64, > 68, 70, 71, 74, 76, 79, 80, 83, 85, 87]+90 > [junit-timeout] at > accord.verify.StrictSerializabilityVerifier$Register.updateSequence(StrictSerializabilityVerifier.java:607) > [junit-timeout] at > accord.verify.StrictSerializabilityVerifier$Register.access$100(StrictSerializabilityVerifier.java:576) > [junit-timeout] at > accord.verify.StrictSerializabilityVerifier.apply(StrictSerializabilityVerifier.java:825) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator$1.lambda$close$0(StrictSerializabilityValidator.java:66) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator.convertHistoryViolation(StrictSerializabilityValidator.java:89) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator.access$200(StrictSerializabilityValidator.java:27) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.StrictSerializabilityValidator$1.close(StrictSerializabilityValidator.java:66) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.LoggingHistoryValidator$1.close(LoggingHistoryValidator.java:63) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation$ReadWriteOperation.verify(PairOfSequencesAccordSimulation.java:218) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PaxosSimulation$Operation.accept(PaxosSimulation.java:135) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PairOfSequencesAccordSimulation$ReadWriteOperation.accept(PairOfSequencesAccordSimulation.java:171) > [junit-timeout] at > org.apache.cassandra.simulator.paxos.PaxosSimulation$Operation.accept(PaxosSimulation.java:83) > [junit-timeout] at > org.apache.cassandra.simulator.systems.SimulatedActionCallable$1.run(SimulatedActionCallable.java:47) > [junit-timeout] at > org.apache.cassandra.simulator.systems.InterceptingExecutor$InterceptingPooledExecutor$WaitingThread.lambda$new$1(InterceptingExecutor.java:317) > [junit-timeout] at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > [junit-timeout] at java.lang.Thread.run(Thread.java:750) > {code} > Adding logging to track message passing, reads, and writes, I have the > following ordering > {code} > [isolatedExecutor:3] node3 2023-04-03 12:54:30,200 send(/127.0.0.1:7012, > (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ)) > [isolatedExecutor:3] node3 2023-04-03 12:54:30,200 send(/127.0.0.2:7012, > (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ)) > [isolatedExecutor:3] node3 2023-04-03 12:54:30,200 send(/127.0.0.3:7012, > (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_PREACCEPT_REQ)) > [CommandStore[2]:1] node1 2023-04-03 12:54:30,208 CS:[2] OP:0xea64a268 > reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE > verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE > verb:ACCORD_PREACCEPT_RSP)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,209 CS:[2] OP:0x9761fb36 > reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE > verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE > verb:ACCORD_PREACCEPT_RSP)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,210 CS:[2] OP:0xe62230f2 > reply(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE > verb:ACCORD_PREACCEPT_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE > verb:ACCORD_PREACCEPT_RSP)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,211 CS:[2] OP:0xc5563e5d > send(/127.0.0.1:7012, (from:/127.0.0.3:7012, type:IMMEDIATE > verb:ACCORD_COMMIT_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,212 CS:[2] OP:0xc5563e5d > send(/127.0.0.2:7012, (from:/127.0.0.3:7012, type:IMMEDIATE > verb:ACCORD_COMMIT_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,212 CS:[2] OP:0xc5563e5d > send(/127.0.0.3:7012, (from:/127.0.0.3:7012, type:IMMEDIATE > verb:ACCORD_COMMIT_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 > send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_RECOVER_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 > send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_RECOVER_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,329 CS:[2] OP:0xa3e62850 > send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_RECOVER_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,334 CS:[2] OP:0xf8562cfb > reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_RECOVER_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE > verb:ACCORD_RECOVER_RSP)) > [CommandStore[2]:1] node1 2023-04-03 12:54:30,338 CS:[2] OP:0xcfd2540f > reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_RECOVER_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE > verb:ACCORD_RECOVER_RSP)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,338 CS:[2] OP:0xc4cf5af8 > reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_RECOVER_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE > verb:ACCORD_RECOVER_RSP)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 > send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_GET_DEPS_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 > send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_GET_DEPS_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,340 CS:[2] OP:0x9e4f00f0 > send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_GET_DEPS_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,343 CS:[2] OP:0xb60153ab > reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.3:7012, type:REQUEST_RESPONSE > verb:ACCORD_GET_DEPS_RSP)) > [CommandStore[2]:1] node1 2023-04-03 12:54:30,344 CS:[2] OP:0xac20a1d6 > reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.1:7012, type:REQUEST_RESPONSE > verb:ACCORD_GET_DEPS_RSP)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,345 CS:[2] OP:0xa73f7484 > reply(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_GET_DEPS_REQ), (from:/127.0.0.2:7012, type:REQUEST_RESPONSE > verb:ACCORD_GET_DEPS_RSP)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a > send(/127.0.0.1:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_COMMIT_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a > send(/127.0.0.2:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_COMMIT_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,347 CS:[2] OP:0xfc37fb1a > send(/127.0.0.3:7012, (from:/127.0.0.2:7012, type:IMMEDIATE > verb:ACCORD_COMMIT_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,349 CS:[2] OP:0xff574276 > Performing read > [CommandStore[2]:1] node3 2023-04-03 12:54:30,349 CS:[2] OP:0xff574276 > Performing read > [ReadStage:1] node3 2023-04-03 12:54:30,351 Performing read; post > [ReadStage:1] node3 2023-04-03 12:54:30,351 Performing read; post > [ReadStage:1] node3 2023-04-03 12:54:30,351 Performing read; pre > [ReadStage:1] node3 2023-04-03 12:54:30,351 Performing read; pre > [ReadStage:1] node3 2023-04-03 12:54:30,351 reply(/127.0.0.2:7012, > (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ), > (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_READ_RSP)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,359 Performing coordinated > write > [CommandStore[2]:1] node2 2023-04-03 12:54:30,359 send(/127.0.0.1:7012, > (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,359 send(/127.0.0.2:7012, > (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) > [CommandStore[2]:1] node2 2023-04-03 12:54:30,360 send(/127.0.0.3:7012, > (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,363 CS:[2] OP:0x8bdb6795 > Performing read > [CommandStore[2]:1] node3 2023-04-03 12:54:30,363 CS:[2] OP:0x8bdb6795 > Performing read > [CommandStore[2]:1] node3 2023-04-03 12:54:30,364 CS:[2] OP:0x92e94460 > Performing write > [MutationStage:4] node3 2023-04-03 12:54:30,364 Performing write: pre > [MutationStage:4] node3 2023-04-03 12:54:30,365 Performing write: post > [ReadStage:1] node3 2023-04-03 12:54:30,365 Performing read; post > [ReadStage:1] node3 2023-04-03 12:54:30,365 Performing read; pre > [ReadStage:1] node3 2023-04-03 12:54:30,369 Performing read; post > [ReadStage:1] node3 2023-04-03 12:54:30,369 Performing read; pre > [ReadStage:1] node3 2023-04-03 12:54:30,369 reply(/127.0.0.3:7012, > (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_COMMIT_REQ), > (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_READ_RSP)) > [CommandStore[2]:1] node1 2023-04-03 12:54:30,370 CS:[2] OP:0xa59dc286 > Performing write > [CommandStore[2]:1] node2 2023-04-03 12:54:30,374 CS:[2] OP:0xab0f3ca4 > Performing write > [MutationStage:1] node2 2023-04-03 12:54:30,374 Performing write: pre > [MutationStage:1] node2 2023-04-03 12:54:30,375 Performing write: post > [CommandStore[2]:1] node3 2023-04-03 12:54:30,376 Performing coordinated > write > [MutationStage:3] node1 2023-04-03 12:54:30,376 Performing write: pre > [CommandStore[2]:1] node3 2023-04-03 12:54:30,377 send(/127.0.0.1:7012, > (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,377 send(/127.0.0.2:7012, > (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,377 send(/127.0.0.3:7012, > (from:/127.0.0.3:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ)) > [CommandStore[2]:1] node3 2023-04-03 12:54:30,382 reply(/127.0.0.2:7012, > (from:/127.0.0.2:7012, type:IMMEDIATE verb:ACCORD_APPLY_REQ), > (from:/127.0.0.3:7012, type:REQUEST_RESPONSE verb:ACCORD_APPLY_RSP)) > [MutationStage:3] node1 2023-04-03 12:54:30,382 Performing write: post > {code} > (The transaction has a returning select and an auto-read, which is why there > are double logs for reads) > Here we see the following timing > {code} > T00 node3 starts txn > T01 node3 sends COMMIT > T02 node2 starts recover > T03 all nodes ack to the pending recover > T04 node2 sends COMMIT > T05 node3 performs reads needed for txn > T06 node3 sends read results to node2 > T07 node2 performs write locally and send APPLY > T08 node3 performs write > T09 node3 performs reads needed for txn > T10 node3 send reads to node3 > T11 node3 performs write and sends APPLY > T12 node3 ACKs APPLY to node2 > {code} > Given the fact the simulator got a response back and didn’t get a preempt, > this implies that the original coordinator was able to complete the full > transaction without issues and reply back, but the reads/writes were > interleaved between node3 and node2 causing the second write to observe the > first write -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org