[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776200#comment-17776200 ]

Henrik Ingo edited comment on CASSANDRA-18798 at 10/17/23 2:06 PM:
-------------------------------------------------------------------

Pushed new snapshot of progress: 
https://github.com/henrikingo/cassandra/commit/4b2292bfa52ed713163abbc4f72b8300bf630e8e

This commit "fixes" the issue in the sense that 
{{updateAllTimestampAndLocalDeletionTime()}} will now also update the {{path}} 
variable for elements of a ListType. However, this does not actualy fix the 
issue. In the unit test that's also part of the patch, the transactions end up 
always having the same timestamp, and hence generate the same TimeUUID().

(Separately, one might wonder what would happen if we appended two list 
elements in the same transaction.)
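
To make the collision concrete, here is a minimal illustration (hypothetical code, not the patch itself): with a deterministic micros-to-TimeUUID mapping, equal timestamps yield equal UUIDs, so both appended list elements get the same cell path and the later write replaces the earlier one.
{code:java}
import java.util.UUID;

public class DeterministicTimeUuidSketch
{
    // Illustrative stand-in for a deterministic micros -> TimeUUID mapping;
    // the real conversion in Cassandra differs, but the relevant property is
    // the same: equal inputs must produce equal UUIDs.
    static UUID timeUuidFromMicros(long micros)
    {
        return new UUID(micros, 0x8000000000000000L); // fixed clock-seq/node part
    }

    public static void main(String[] args)
    {
        long executeAtMicros = 1697546374434000L; // the value both txns got in the log below
        UUID path1 = timeUuidFromMicros(executeAtMicros);
        UUID path2 = timeUuidFromMicros(executeAtMicros);
        // Same timestamp -> same cell path -> the second append replaces the first.
        System.out.println(path1.equals(path2)); // prints: true
    }
}
{code}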

To emphasize that the above does the right thing given the original 
assumptions: if I instead use {{nextTimeUUID()}}, which generates new UUIDs 
rather than mapping the current timestamp to a deterministic UUID, then the 
test "passes", though I doubt that would be correct in a real cluster with 
multiple nodes. It works on a single node because this code executes serially 
in the Accord execution phase, so the newly generated UUIDs are ordered 
correctly, even though they are not the correct UUIDs (as in, derived from the 
Accord transaction id).

While debugging this I realized another issue, which I first thought was a 
problem with the test setup, but which might be some kind of race condition: 
the two transactions in the unit test end up executing with the exact same 
timestamp.

{noformat}
lastmicros 0
DEBUG [node2_CommandStore[1]:1] node2 2023-10-17 15:39:35,579 AccordMessageSink.java:167 - Replying ACCORD_APPLY_RSP ApplyApplied to /127.0.0.1:7012
DEBUG [node1_RequestResponseStage-1] node1 2023-10-17 15:39:35,580 AccordCallback.java:49 - Received response ApplyApplied from /127.0.0.2:7012
lastmicros 0
raw 0  (NO_LAST_EXECUTED_HLC=-9223372036854775808
lastExecutedTimestamp [0,0,0,0]
lastmicros 1697546374434000
raw 0  (NO_LAST_EXECUTED_HLC=-9223372036854775808
raw -9223372036854775808  (NO_LAST_EXECUTED_HLC=-9223372036854775808
lastExecutedTimestamp [0,0,0,0]
lastExecutedTimestamp [10,1697546374434000,10,1]
lastmicros 1697546374434000
raw -9223372036854775808  (NO_LAST_EXECUTED_HLC=-9223372036854775808
lastExecutedTimestamp [10,1697546374434000,10,1]
timestamp 1697546374434000    executeAt[10,1697546374434000,10,1]
timestamp 1697546374434000    executeAt[10,1697546374434000,10,1]
{noformat}
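
(For reference, the raw value -9223372036854775808 above is just {{Long.MIN_VALUE}}, i.e. the {{NO_LAST_EXECUTED_HLC}} sentinel.) As a hedged sketch of the suspected mechanism, and not Cassandra's actual clock code: a wall-clock micros source can hand the same value to two callers in the same tick, while an HLC-style monotonic source cannot.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class ClockSketch
{
    // Plain wall-clock reading: two callers in the same tick get identical
    // values, matching the duplicated executeAt timestamps in the log above.
    static long naiveNowMicros()
    {
        return System.currentTimeMillis() * 1000;
    }

    // HLC-style source: never repeats, because it advances to at least last + 1.
    static final AtomicLong last = new AtomicLong(Long.MIN_VALUE); // cf. NO_LAST_EXECUTED_HLC
    static long monotonicNowMicros()
    {
        return last.updateAndGet(prev -> Math.max(prev + 1, naiveNowMicros()));
    }

    public static void main(String[] args)
    {
        System.out.println(naiveNowMicros() == naiveNowMicros());         // usually true
        System.out.println(monotonicNowMicros() == monotonicNowMicros()); // always false
    }
}
{code}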

But adding a sleep to one thread resolves the issue (and actually makes the 
test pass):
{code}
                ForkJoinTask<?> add2 = ForkJoinPool.commonPool().submit(() -> {
                    try {
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                        // It's ok
                    }

                    latch.awaitThrowUncheckedOnInterrupt();
                    SHARED_CLUSTER.get(1).executeInternal("BEGIN TRANSACTION " +
                            "UPDATE " + currentTable + " SET l = l + [2] WHERE k = 1; " +
                            "COMMIT TRANSACTION");
                    completionOrder.add(2);
                });
{code}

{noformat}
lastmicros 1697544893676000
raw 1697544893676000  (NO_LAST_EXECUTED_HLC=-9223372036854775808
lastExecutedTimestamp [10,1697544893676000,10,1]
lastmicros 1697544894677000
raw -9223372036854775808  (NO_LAST_EXECUTED_HLC=-9223372036854775808
lastExecutedTimestamp [10,1697544894677000,10,1]
timestamp 1697544894677000    executeAt[10,1697544894677000,10,1]
DEBUG [node2_CommandStore[1]:1] node2 2023-10-17 15:14:54,728 AccordMessageSink.java:167 - Replying ACCORD_APPLY_RSP ApplyApplied to /127.0.0.1:7012
DEBUG [node1_RequestResponseStage-1] node1 2023-10-17 15:14:54,728 AccordCallback.java:49 - Received response ApplyApplied from /127.0.0.1:7012
DEBUG [node2_Messaging-EventLoop-3-4] node2 2023-10-17 15:14:54,728 AccordVerbHandler.java:54 - Receiving PreAccept{txnId:[10,1697544894711000,0,1], txn:{read:TxnRead{TxnNamedRead{name='RETURNING:', key=distributed_test_keyspace:DecoratedKey(-4069959284402364209, 00000001), update=Read(distributed_test_keyspace.tbl0 columns=*/[l] rowFilter= limits= key=1 filter=names(EMPTY), nowInSec=0)}}}, scope:[distributed_test_keyspace:-4069959284402364209]} from /127.0.0.1:7012
DEBUG [node1_CommandStore[1]:1] node1 2023-10-17 15:14:54,730 AbstractCell.java:144 - timestamp: 1697544894677000   buffer: 0    newPath: java.nio.HeapByteBuffer[pos=0 lim=16 cap=16]
lastmicros 1697544893676000
raw 1697544893676000  (NO_LAST_EXECUTED_HLC=-9223372036854775808
lastExecutedTimestamp [10,1697544893676000,10,1]
lastmicros 1697544894677000
raw -9223372036854775808  (NO_LAST_EXECUTED_HLC=-9223372036854775808
lastExecutedTimestamp [10,1697544894677000,10,1]
DEBUG [node1_RequestResponseStage-1] node1 2023-10-17 15:14:54,734 AccordCallback.java:49 - Received response ApplyApplied from /127.0.0.2:7012
timestamp 1697544894677000    executeAt[10,1697544894677000,10,1]
{noformat}

The fact that the sleep helps makes me suspect this is not a test setup 
issue, but an actual race condition.



> Appending to list in Accord transactions uses insertion timestamp
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-18798
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18798
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Jaroslaw Kijanowski
>            Assignee: Henrik Ingo
>            Priority: Normal
>             Fix For: 5.0-alpha2
>
>         Attachments: image-2023-09-26-20-05-25-846.png
>
>
> Given the following schema:
> {code:java}
> CREATE KEYSPACE IF NOT EXISTS accord WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
> CREATE TABLE IF NOT EXISTS accord.list_append(id int PRIMARY KEY, contents LIST<bigint>);
> TRUNCATE accord.list_append;{code}
> And the following two possible queries executed by 10 threads in parallel:
> {code:java}
> BEGIN TRANSACTION
>   LET row = (SELECT * FROM list_append WHERE id = ?);
>   SELECT row.contents;
> COMMIT TRANSACTION;
> BEGIN TRANSACTION
>   UPDATE list_append SET contents += ? WHERE id = ?;
> COMMIT TRANSACTION;
> {code}
> there seems to be an issue with transaction guarantees. Here's an excerpt, in 
> edn format, from a test.
> {code:java}
> {:type :invoke    :process 8    :value [[:append 5 352]]    :tid 3    :n 52    :time 1692607285967116627}
> {:type :invoke    :process 9    :value [[:r 5 nil]]    :tid 1    :n 54    :time 1692607286078732473}
> {:type :invoke    :process 6    :value [[:append 5 553]]    :tid 5    :n 53    :time 1692607286133833428}
> {:type :invoke    :process 7    :value [[:append 5 455]]    :tid 4    :n 55    :time 1692607286149702511}
> {:type :ok    :process 8    :value [[:append 5 352]]    :tid 3    :n 52    :time 1692607286156314099}
> {:type :invoke    :process 5    :value [[:r 5 nil]]    :tid 9    :n 52    :time 1692607286167090389}
> {:type :ok    :process 9    :value [[:r 5 [303 304 604 6 306 509 909 409 912 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 852 352]]]    :tid 1    :n 54    :time 1692607286168657534}
> {:type :invoke    :process 1    :value [[:r 5 nil]]    :tid 0    :n 51    :time 1692607286201762938}
> {:type :ok    :process 7    :value [[:append 5 455]]    :tid 4    :n 55    :time 1692607286245571513}
> {:type :invoke    :process 7    :value [[:r 5 nil]]    :tid 4    :n 56    :time 1692607286245655775}
> {:type :ok    :process 5    :value [[:r 5 [303 304 604 6 306 509 909 409 912 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 852 352 455]]]    :tid 9    :n 52    :time 1692607286253928906}
> {:type :invoke    :process 5    :value [[:r 5 nil]]    :tid 9    :n 53    :time 1692607286254095215}
> {:type :ok    :process 6    :value [[:append 5 553]]    :tid 5    :n 53    :time 1692607286266263422}
> {:type :ok    :process 1    :value [[:r 5 [303 304 604 6 306 509 909 409 912 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 852 352 553 455]]]    :tid 0    :n 51    :time 1692607286271617955}
> {:type :ok    :process 7    :value [[:r 5 [303 304 604 6 306 509 909 409 912 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 852 352 553 455]]]    :tid 4    :n 56    :time 1692607286271816933}
> {:type :ok    :process 5    :value [[:r 5 [303 304 604 6 306 509 909 409 912 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 852 352 553 455]]]    :tid 9    :n 53    :time 1692607286281483026}
> {:type :invoke    :process 9    :value [[:r 5 nil]]    :tid 1    :n 56    :time 1692607286284097561}
> {:type :ok    :process 9    :value [[:r 5 [303 304 604 6 306 509 909 409 912 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 852 352 553 455]]]    :tid 1    :n 56    :time 1692607286306445242}
> {code}
> Processes 6 and 7 append the values 553 and 455, respectively. 455 succeeded, 
> and a read by process 5 confirms that. But then 553 is also appended, and a 
> read by process 1 confirms that as well; however, it sees 553 before 455.
> Process 5 reads [... 852 352 455], whereas process 1 reads [... 852 352 553 
> 455], and the latter order is returned in subsequent reads as well.
> [~blambov] suggested that one reason for this behavior could be the way 
> unfrozen lists are updated. The backing datatype is a _kind of a map_ that 
> uses insertion timestamps as indexes; these are used to sort the list when it 
> is composed from chunks from various sources/sstables before being returned 
> to the client.
> In such a case it can indeed happen that process 5 reads [... 852 352 455] 
> but process 1 later reads [... 852 352 553 455], because 553 was _appended_ 
> with an earlier timestamp than 455 but _committed_ with a later timestamp.
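> A minimal model of that mechanism (hypothetical code, not Cassandra's actual 
> cell machinery, and with illustrative timestamps): the list behaves like a 
> map sorted by insertion timestamp, so an element stamped earlier but 
> committed later sorts in front.
> {code:java}
> import java.util.TreeMap;
> 
> public class ListAppendAnomalySketch
> {
>     public static void main(String[] args)
>     {
>         TreeMap<Long, Integer> list = new TreeMap<>(); // key = insertion micros
> 
>         long t455 = 1692607286149702L; // 455's insertion timestamp
>         long t553 = 1692607286133833L; // 553's is earlier, though it commits later
> 
>         list.put(t455, 455);
>         System.out.println(list.values()); // a reader now sees [455]
> 
>         list.put(t553, 553); // sorts before 455
>         System.out.println(list.values()); // a later reader sees [553, 455]
>     }
> }
> {code}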
> Now with Accord we have the timestamp _of the transaction_ at hand. Could 
> Accord use that for the index instead, which would lead to the correct 
> behavior? The value 553 was appended after 455, and using the transaction 
> id/timestamp as the list index would place it properly in the underlying map, 
> wouldn't it?
> Steps to reproduce:
>  
> {code:java}
> git clone https://github.com/datastax/accordclient.git
> cd accordclient
> git checkout append-to-list-index
> lein run --list-append -t 10 -r 1,2,3,4,5 -n 1000 -H <host-ips> -s `date +%s%N` > test-la.edn
>  
> curl -L -o elle-cli.zip https://github.com/ligurio/elle-cli/releases/download/0.1.6/elle-cli-bin-0.1.6.zip
> unzip -d elle-cli elle-cli.zip
> java -jar elle-cli/target/elle-cli-0.1.6-standalone.jar --model list-append --anomalies G0 --consistency-models strict-serializable --directory out-la --verbose test-la.edn
> {code}
>  


