[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785479#comment-17785479 ] Henrik Ingo commented on CASSANDRA-18798: - Status: The issue with TimeUUID() generating different values on each replica is a universal problem. Halting this work while waiting for discussion in #cassandra-accord > Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Assignee: Henrik Ingo >Priority: Normal > Fix For: 5.x > > Attachments: image-2023-09-26-20-05-25-846.png > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Given the following schema: > {code:java} > CREATE KEYSPACE IF NOT EXISTS accord WITH replication = {'class': > 'SimpleStrategy', 'replication_factor': 3}; > CREATE TABLE IF NOT EXISTS accord.list_append(id int PRIMARY KEY,contents > LIST); > TRUNCATE accord.list_append;{code} > And the following two possible queries executed by 10 threads in parallel: > {code:java} > BEGIN TRANSACTION > LET row = (SELECT * FROM list_append WHERE id = ?); > SELECT row.contents; > COMMIT TRANSACTION;" > BEGIN TRANSACTION > UPDATE list_append SET contents += ? WHERE id = ?; > COMMIT TRANSACTION;" > {code} > there seems to be an issue with transaction guarantees. Here's an excerpt in > the edn format from a test. 
> {code:java} > {:type :invoke :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607285967116627} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 54 > :time 1692607286078732473} > {:type :invoke :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286133833428} > {:type :invoke :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286149702511} > {:type :ok :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607286156314099} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 52 > :time 1692607286167090389} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352]]] :tid 1 :n 54 :time 1692607286168657534} > {:type :invoke :process 1 :value [[:r 5 nil]] :tid 0 :n 51 > :time 1692607286201762938} > {:type :ok :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286245571513} > {:type :invoke :process 7 :value [[:r 5 nil]] :tid 4 :n 56 > :time 1692607286245655775} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 455]]] :tid 9 :n 52 :time 1692607286253928906} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 53 > :time 1692607286254095215} > {:type :ok :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286266263422} > {:type :ok :process 1 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 0 :n 51 :time 1692607286271617955} > {:type :ok :process 7 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 
25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 4 :n 56 :time 1692607286271816933} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 9 :n 53 :time 1692607286281483026} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 56 > :time 1692607286284097561} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 1 :n 56 :time 1692607286306445242} > {code} > Processes process 6 and process 7 are appending the values 553 and 455 > respectively. 455
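The anomaly above is that a read observed [... 352 455], yet later reads show [... 352 553 455]: 553 was inserted before the already-visible 455, because list cells are positioned by a locally generated insertion timestamp rather than by the transaction's agreed execution order. A rough sketch of the mechanism (illustrative names only, not Cassandra internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the anomaly: CQL list cells are ordered by a timestamp key.
// If that key comes from a local insertion clock instead of the transaction's
// agreed executeAt, an element applied later can sort into the middle of the
// list. Names are illustrative, not Cassandra internals.
public class ListAppendOrder {
    // contents of the list, materialized from (timestampKey -> value) cells
    static List<Integer> contents(TreeMap<Long, Integer> cells) {
        return new ArrayList<>(cells.values());
    }

    public static void main(String[] args) {
        TreeMap<Long, Integer> cells = new TreeMap<>();
        cells.put(100L, 352);                // existing element

        // Broken: 455 is applied first with local timestamp 300; then 553 is
        // applied with an *earlier* local timestamp 200 (its coordinator's
        // clock was behind), so it lands in the middle of the list.
        cells.put(300L, 455);
        System.out.println(contents(cells)); // [352, 455] -- what the reader saw
        cells.put(200L, 553);
        System.out.println(contents(cells)); // [352, 553, 455] -- the anomaly

        // Deriving the key from Accord's executeAt instead would keep cell
        // order consistent with the serialization order on every replica.
    }
}
```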
[jira] [Updated] (CASSANDRA-18989) Accord: UX: Force transactions / automatic transactions
[ https://issues.apache.org/jira/browse/CASSANDRA-18989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18989: Summary: Accord: UX: Force transactions / automatic transactions (was: Accord: Force transactions / automatic transactions) > Accord: UX: Force transactions / automatic transactions > --- > > Key: CASSANDRA-18989 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18989 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Henrik Ingo >Priority: Normal > Fix For: 5.x > > > Note: I chose "bug" because I think this is a serious UX issue we should > consider. But strictly speaking this is a UX issue, and technically the > implementation is working as designed. The UX conclusion, rather, is that the > design needs improvement. > I'm submitting this based on observing [~antithesis-luis] creating a checker > with some accord transactions. A discussion that followed his experience is > here: https://the-asf.slack.com/archives/C0459N9R5C6/p1698352614742079 > The tl;dr is that users are likely to expect single SELECT queries, and maybe > even single UPDATE/INSERT statements, to be consistent even if they omit the > BEGIN...COMMIT around the single statement. > My proposed fix for improved UX is the ability to force or default single > statements to be wrapped in an accord transaction. > There are two ways to implement this: > 1. Add a configuration option to reject queries that are not accord > transactions. This could be a per-table or per-keyspace option. > 2. A per-session setting that enables automatic transactions, combined with a > global setting to make this behavior the default. MySQL's AUTOCOMMIT is an > example of this approach. > My preference is #2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
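Option 2 could behave like MySQL's AUTOCOMMIT from the driver's point of view. Below is a minimal sketch of the session-side rewriting, with entirely hypothetical names (AutoTxnSession and prepareForExecution are not real Cassandra APIs; this only illustrates the proposed behavior):

```java
// Sketch of option 2: a per-session "automatic transactions" flag that wraps
// any bare statement in BEGIN/COMMIT TRANSACTION before execution.
// Hypothetical names; Cassandra has no such setting today.
public class AutoTxnSession {
    private final boolean autoTransactions;

    public AutoTxnSession(boolean autoTransactions) {
        this.autoTransactions = autoTransactions;
    }

    // Rewrites a single statement into an Accord transaction when the
    // session (or a global default) requests automatic transactions.
    public String prepareForExecution(String cql) {
        String trimmed = cql.trim();
        boolean alreadyTxn = trimmed.toUpperCase().startsWith("BEGIN TRANSACTION");
        if (autoTransactions && !alreadyTxn)
            return "BEGIN TRANSACTION\n  " + trimmed + "\nCOMMIT TRANSACTION;";
        return cql;  // explicit transactions pass through untouched
    }

    public static void main(String[] args) {
        AutoTxnSession session = new AutoTxnSession(true);
        System.out.println(session.prepareForExecution(
            "UPDATE list_append SET contents += [1] WHERE id = 1;"));
    }
}
```

A per-keyspace/table reject option (option 1) would instead raise an error in the `!alreadyTxn` branch rather than rewriting the statement.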
[jira] [Updated] (CASSANDRA-18989) Accord: Force transactions / automatic transactions
[ https://issues.apache.org/jira/browse/CASSANDRA-18989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18989: Description: Note: I chose "bug" because I think this is a serious UX issue we should consider. But strictly speaking this is a UX issue, and technically the implementation is working as designed. The UX conclusion, rather, is that the design needs improvement. I'm submitting this based on observing [~antithesis-luis] creating a checker with some accord transactions. A discussion that followed his experience is here: https://the-asf.slack.com/archives/C0459N9R5C6/p1698352614742079 The tl;dr is that users are likely to expect single SELECT queries, and maybe even single UPDATE/INSERT statements, to be consistent even if they omit the BEGIN...COMMIT around the single statement. My proposed fix for improved UX is the ability to force or default single statements to be wrapped in an accord transaction. There are two ways to implement this: 1. Add a configuration option to reject queries that are not accord transactions. This could be a per-table or per-keyspace option. 2. A per-session setting that enables automatic transactions, combined with a global setting to make this behavior the default. MySQL's AUTOCOMMIT is an example of this approach. My preference is #2. was: Note: I chose "bug" because I think this is a serious UX issue we should consider. But strictly speaking this is a UX issue and technically the implementation is working as designed. The UX conclusion rather is that the design needs improvement... I'm submitting this based on observing [~alfprado] creating a checker with some accord transactions. A discussion that followed his experience is here https://the-asf.slack.com/archives/C0459N9R5C6/p1698352614742079 The tl;dr is that users are likely to expect single SELECT queries, and maybe even single UPDATE/INSERT to be consistent even if they neglect the BEGIN...COMMIT around the single statement. My proposed fix for improved UX is an ability to force or default also single statements to be wrapped in an accord transaction. There are two ways to implement this: 1. Add a configuration option to reject queries that are not accord transactions. This could be a per table or per keyspace option. 2. A per session setting that enables automatic transactions, combined with a global setting to have this behavior as default. MySQL's AUTOCOMMIT is an example of this approach. My preference is #2. > Accord: Force transactions / automatic transactions > --- > > Key: CASSANDRA-18989 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18989 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Henrik Ingo >Priority: Normal > Fix For: 5.x
[jira] [Created] (CASSANDRA-18989) Accord: Force transactions / automatic transactions
Henrik Ingo created CASSANDRA-18989: --- Summary: Accord: Force transactions / automatic transactions Key: CASSANDRA-18989 URL: https://issues.apache.org/jira/browse/CASSANDRA-18989 Project: Cassandra Issue Type: Bug Components: Accord Reporter: Henrik Ingo
[jira] [Updated] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18798: Test and Documentation Plan: Added 2 unit tests. Retest with the list-append Elle test. Does not impact documentation. Status: Patch Available (was: In Progress) Ok, the above PR is now ready for review [~jlewandowski]. Let me know if I need to squash commits or something first? > Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Assignee: Henrik Ingo >Priority: Normal > Fix For: 5.0-alpha2 > > Attachments: image-2023-09-26-20-05-25-846.png > > Time Spent: 10m > Remaining Estimate: 0h
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1896#comment-1896 ] Henrik Ingo commented on CASSANDRA-18798: - New branch btw: https://github.com/apache/cassandra/pull/2830 This seems to work now. The key was to understand what each method in the TimeUUID class really does. I'll add some source code commentary in that regard on Monday, then submit for review. [~kijanowski] to confirm whether the Elle test now passes. > Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Assignee: Henrik Ingo >Priority: Normal > Fix For: 5.0-alpha2 > > Attachments: image-2023-09-26-20-05-25-846.png > > Time Spent: 10m > Remaining Estimate: 0h
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776780#comment-17776780 ] Henrik Ingo commented on CASSANDRA-18798: - Okay, the bug was actually in my own code after all. I had lost 10 microseconds of granularity. Seems to work now; pushed a commit, taking a break. > Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Assignee: Henrik Ingo >Priority: Normal > Fix For: 5.0-alpha2 > > Attachments: image-2023-09-26-20-05-25-846.png
[jira] [Updated] (CASSANDRA-18937) Two accord transactions have the exact same transaction id
[ https://issues.apache.org/jira/browse/CASSANDRA-18937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18937: Resolution: Invalid Status: Resolved (was: Triage Needed) Closing: the bug was in code added by myself. > Two accord transactions have the exact same transaction id > -- > > Key: CASSANDRA-18937 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18937 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Henrik Ingo >Priority: Normal > > When testing solutions for CASSANDRA-18798 I noticed that two independent > transactions running at the same time in two parallel threads ended up having > the exact same transaction id: > {code} > public void testListAddition() throws Exception > { > SHARED_CLUSTER.schemaChange("CREATE TABLE " + currentTable + " (k int > PRIMARY KEY, l list)"); > SHARED_CLUSTER.forEach(node -> node.runOnInstance(() -> > AccordService.instance().setCacheSize(0))); > CountDownLatch latch = CountDownLatch.newCountDownLatch(1); > Vector completionOrder = new Vector<>(); > try > { > for (int i=0; i<100; i++) > { > ForkJoinTask add1 = ForkJoinPool.commonPool().submit(() -> > { > latch.awaitThrowUncheckedOnInterrupt(); > SHARED_CLUSTER.get(1).executeInternal("BEGIN TRANSACTION > " + > "UPDATE " + currentTable + " SET l = l + [1] > WHERE k = 1; " + > "COMMIT TRANSACTION"); > completionOrder.add(1); > }); > ForkJoinTask add2 = ForkJoinPool.commonPool().submit(() -> > { > try { > Thread.sleep(0); > {code} > Adding some logging in TxnWrite.java reveals the two threads have identical > executeAt and unix timestamps: > {noformat} > lastmicros 0 > DEBUG [node2_Messaging-EventLoop-3-4] node2 2023-10-18 18:26:08,954 > AccordVerbHandler.java:54 - Receiving Apply{kind:Minimal, > txnId:[10,1697642767659000,10,1], deps:[distributed_test_keyspace:[(-Inf,-1], > (-1,9223372036854775805], (9223372036854775805,+Inf]]]:{}, {}, > executeAt:[10,1697642767659000,10,1], > 
writes:TxnWrites{executeAt:[10,1697642767659000,10,1], > keys:[distributed_test_keyspace:DecoratedKey(-4069959284402364209, > 0001)], write:TxnWrite{}}, result:accord.api.Result$1@253c102e} from > /127.0.0.1:7012 > raw 0 (NO_LAST_EXECUTED_HLC=-9223372036854775808 > lastExecutedTimestamp [0,0,0,0] > lastmicros 1697642767659000 > raw -9223372036854775808 (NO_LAST_EXECUTED_HLC=-9223372036854775808 > lastExecutedTimestamp [10,1697642767659000,10,1] > DEBUG [node2_CommandStore[1]:1] node2 2023-10-18 18:26:09,023 > AccordMessageSink.java:167 - Replying ACCORD_APPLY_RSP ApplyApplied to > /127.0.0.1:7012 > lastmicros 0 > raw 0 (NO_LAST_EXECUTED_HLC=-9223372036854775808 > lastExecutedTimestamp [0,0,0,0] > lastmicros 1697642767659000 > raw -9223372036854775808 (NO_LAST_EXECUTED_HLC=-9223372036854775808 > lastExecutedTimestamp [10,1697642767659000,10,1] > timestamp 1697642767659000executeAt[10,1697642767659000,10,1] > timestamp 1697642767659000executeAt[10,1697642767659000,10,1] > {noformat} > Increasing the Thread.sleep() to 9 or 10 helps so that the transactions have > different IDs.
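The collision above, two concurrent transactions drawing the identical microsecond timestamp, is the classic problem of a wall clock being coarser than the request rate. One generic remedy, sketched below with illustrative names (a common pattern, similar in spirit to how Cassandra keeps client-supplied write timestamps monotonic, but not the actual Accord code), is to atomically bump past the last issued value:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: strictly monotonic microsecond timestamps. If the wall clock has
// not advanced past the last issued value, bump by one micro instead of
// reusing it, so two concurrent callers can never observe the same value.
// Illustrative only; the actual Accord fix may differ.
public class MonotonicMicros {
    private final AtomicLong last = new AtomicLong();

    public long next() {
        while (true) {
            // wall clock has only millisecond granularity, expressed in micros
            // (the same coarseness that caused the duplicate IDs above)
            long now = System.currentTimeMillis() * 1000;
            long prev = last.get();
            long candidate = Math.max(now, prev + 1);
            if (last.compareAndSet(prev, candidate))
                return candidate;  // lost races retry with a fresh prev
        }
    }

    public static void main(String[] args) {
        MonotonicMicros clock = new MonotonicMicros();
        long a = clock.next(), b = clock.next();
        System.out.println(a < b); // strictly increasing even within one millisecond
    }
}
```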
[jira] [Created] (CASSANDRA-18937) Two accord transactions have the exact same transaction id
Henrik Ingo created CASSANDRA-18937: --- Summary: Two accord transactions have the exact same transaction id Key: CASSANDRA-18937 URL: https://issues.apache.org/jira/browse/CASSANDRA-18937 Project: Cassandra Issue Type: Bug Components: Accord Reporter: Henrik Ingo
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776200#comment-17776200 ]

Henrik Ingo edited comment on CASSANDRA-18798 at 10/17/23 2:06 PM:
---

Pushed new snapshot of progress: https://github.com/henrikingo/cassandra/commit/4b2292bfa52ed713163abbc4f72b8300bf630e8e

This commit "fixes" the issue in the sense that {{updateAllTimestampAndLocalDeletionTime()}} will now also update the {{path}} variable for elements of a ListType. However, this does not actually fix the issue: in the unit test that's also part of the patch, the transactions always end up with the same timestamp, and hence generate the same TimeUUID(). (Note that separately we might wonder what would happen if we append 2 list elements in the same transaction?)

To emphasize that the above does the right thing given the original assumptions: if I just use {{nextTimeUUID()}}, which generates new UUIDs rather than mapping the current timestamp to a deterministic UUID, then the test "passes", though I doubt that would be correct in a real cluster with multiple nodes. It works on a single node because this code executes serially in the accord execution phase, so newly generated UUIDs are ordered correctly, even if they are not the correct UUIDs (as in, derived from the Accord transaction id).

But ok, debugging this I realized another issue, which I first thought was with the test setup, but might be some kind of race condition. It turns out the two transactions in the unit test end up executing with the exact same timestamps.
{noformat}
lastmicros 0
DEBUG [node2_CommandStore[1]:1] node2 2023-10-17 15:39:35,579 AccordMessageSink.java:167 - Replying ACCORD_APPLY_RSP ApplyApplied to /127.0.0.1:7012
DEBUG [node1_RequestResponseStage-1] node1 2023-10-17 15:39:35,580 AccordCallback.java:49 - Received response ApplyApplied from /127.0.0.2:7012
lastmicros 0 raw 0 (NO_LAST_EXECUTED_HLC=-9223372036854775808 lastExecutedTimestamp [0,0,0,0]
lastmicros 1697546374434000 raw 0 (NO_LAST_EXECUTED_HLC=-9223372036854775808
raw -9223372036854775808 (NO_LAST_EXECUTED_HLC=-9223372036854775808 lastExecutedTimestamp [0,0,0,0]
lastExecutedTimestamp [10,1697546374434000,10,1]
lastmicros 1697546374434000 raw -9223372036854775808 (NO_LAST_EXECUTED_HLC=-9223372036854775808 lastExecutedTimestamp [10,1697546374434000,10,1]
timestamp 1697546374434000 executeAt [10,1697546374434000,10,1]
timestamp 1697546374434000 executeAt [10,1697546374434000,10,1]
{noformat}

But adding a sleep to one thread resolves the issue (and actually also makes the test pass):

{code}
ForkJoinTask<?> add2 = ForkJoinPool.commonPool().submit(() -> {
    try
    {
        Thread.sleep(1000);
    }
    catch (InterruptedException e)
    {
        // It's ok
    }
    latch.awaitThrowUncheckedOnInterrupt();
    SHARED_CLUSTER.get(1).executeInternal("BEGIN TRANSACTION " +
                                          "UPDATE " + currentTable + " SET l = l + [2] WHERE k = 1; " +
                                          "COMMIT TRANSACTION");
    completionOrder.add(2);
});
{code}

{noformat}
lastmicros 1697544893676000 raw 1697544893676000 (NO_LAST_EXECUTED_HLC=-9223372036854775808 lastExecutedTimestamp [10,1697544893676000,10,1]
lastmicros 1697544894677000 raw -9223372036854775808 (NO_LAST_EXECUTED_HLC=-9223372036854775808 lastExecutedTimestamp [10,1697544894677000,10,1]
timestamp 1697544894677000 executeAt [10,1697544894677000,10,1]
DEBUG [node2_CommandStore[1]:1] node2 2023-10-17 15:14:54,728 AccordMessageSink.java:167 - Replying ACCORD_APPLY_RSP ApplyApplied to /127.0.0.1:7012
DEBUG [node1_RequestResponseStage-1] node1 2023-10-17 15:14:54,728 AccordCallback.java:49 - Received response ApplyApplied from /127.0.0.1:7012
DEBUG [node2_Messaging-EventLoop-3-4] node2 2023-10-17 15:14:54,728 AccordVerbHandler.java:54 - Receiving PreAccept{txnId:[10,1697544894711000,0,1], txn:{read:TxnRead{TxnNamedRead{name='RETURNING:', key=distributed_test_keyspace:DecoratedKey(-4069959284402364209, 0001), update=Read(distributed_test_keyspace.tbl0 columns=*/[l] rowFilter= limits= key=1 filter=names(EMPTY), nowInSec=0)}}}, scope:[distributed_test_keyspace:-4069959284402364209]} from /127.0.0.1:7012
DEBUG [node1_CommandStore[1]:1] node1 2023-10-17 15:14:54,730 AbstractCell.java:144 - timestamp: 1697544894677000 buffer: 0 newPath: java.nio.HeapByteBuffer[pos=0 lim=16 cap=16]
lastmicros 1697544893676000 raw 1697544893676000 (NO_LAST_EXECUTED_HLC=-9223372036854775808 lastExecutedTimestamp [10,1697544893676000,10,1]
lastmicros 1697544894677000 raw -9223372036854775808 (NO_LAST_EXECUTED_HLC=-9223372036854775808 lastExecutedTimestamp [10,1697544894677000,10,1]
DEBUG [node1_RequestResponseS
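The deterministic mapping discussed in the comment can be illustrated with a toy sketch (the real Cassandra TimeUUID bit layout differs, and `fromMicros` is a hypothetical name): any UUID derived purely from the executeAt microseconds is necessarily identical for two transactions that share that timestamp.

```java
import java.util.UUID;

// Toy illustration, not Cassandra's TimeUUID implementation: a "TimeUUID"
// derived deterministically from nothing but a microsecond timestamp.
// Equal executeAt micros, as in the log above, map to the same UUID and
// therefore to the same list-element path.
final class TimeUuidSketch {
    static UUID fromMicros(long micros) {
        long msb = (micros << 4) | 0x1000L; // stand-in for the version-1 time fields
        long lsb = 0x8000000000000000L;     // fixed variant bits; no per-call entropy
        return new UUID(msb, lsb);
    }
}
```

This is exactly the trade-off described above: a deterministic mapping makes all replicas agree, but gives no uniqueness across transactions with equal timestamps, while {{nextTimeUUID()}} gives uniqueness but replica-dependent values.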
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774673#comment-17774673 ] Henrik Ingo commented on CASSANDRA-18798: - update: Working on what Branimir suggested earlier: {quote} Could we do this by adding an updatePathTimestamps method in AbstractType that does nothing by default but is implemented by ListType to adjust all the timestamp part of its path UUIDs, and call it from ColumnData.updateAllTimestamps? {quote} Will continue and elaborate tomorrow. > Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Assignee: Henrik Ingo >Priority: Normal > Fix For: 5.0-alpha2 > > Attachments: image-2023-09-26-20-05-25-846.png > > > Given the following schema: > {code:java} > CREATE KEYSPACE IF NOT EXISTS accord WITH replication = {'class': > 'SimpleStrategy', 'replication_factor': 3}; > CREATE TABLE IF NOT EXISTS accord.list_append(id int PRIMARY KEY,contents > LIST); > TRUNCATE accord.list_append;{code} > And the following two possible queries executed by 10 threads in parallel: > {code:java} > BEGIN TRANSACTION > LET row = (SELECT * FROM list_append WHERE id = ?); > SELECT row.contents; > COMMIT TRANSACTION;" > BEGIN TRANSACTION > UPDATE list_append SET contents += ? WHERE id = ?; > COMMIT TRANSACTION;" > {code} > there seems to be an issue with transaction guarantees. Here's an excerpt in > the edn format from a test. 
> {code:java} > {:type :invoke :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607285967116627} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 54 > :time 1692607286078732473} > {:type :invoke :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286133833428} > {:type :invoke :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286149702511} > {:type :ok :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607286156314099} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 52 > :time 1692607286167090389} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352]]] :tid 1 :n 54 :time 1692607286168657534} > {:type :invoke :process 1 :value [[:r 5 nil]] :tid 0 :n 51 > :time 1692607286201762938} > {:type :ok :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286245571513} > {:type :invoke :process 7 :value [[:r 5 nil]] :tid 4 :n 56 > :time 1692607286245655775} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 455]]] :tid 9 :n 52 :time 1692607286253928906} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 53 > :time 1692607286254095215} > {:type :ok :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286266263422} > {:type :ok :process 1 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 0 :n 51 :time 1692607286271617955} > {:type :ok :process 7 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 
25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 4 :n 56 :time 1692607286271816933} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 9 :n 53 :time 1692607286281483026} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 56 > :time 1692607286284097561} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 1 :n 56 :tim
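Branimir's suggested hook could take roughly this shape; the class and method bodies below are simplified stand-ins for illustration, not Cassandra's actual AbstractType/ListType signatures:

```java
import java.nio.ByteBuffer;

// Simplified sketch of the suggested design: a no-op default hook on the
// base type, overridden by the list type to rewrite the timestamp part of
// each element's path UUID, to be called from ColumnData.updateAllTimestamps.
abstract class TypeSketch {
    // Default: most types carry no timestamp inside their cell path.
    ByteBuffer updatePathTimestamps(ByteBuffer path, long newTimestampMicros) {
        return path;
    }
}

final class ListTypeSketch extends TypeSketch {
    @Override
    ByteBuffer updatePathTimestamps(ByteBuffer path, long newTimestampMicros) {
        // A list cell's path is a 16-byte TimeUUID; overwrite its time
        // component so element order follows the new timestamp, keeping
        // the unique lower half intact.
        ByteBuffer updated = ByteBuffer.allocate(16);
        updated.putLong(newTimestampMicros << 4); // simplified "time" half
        updated.putLong(path.getLong(8));         // preserve the unique half
        updated.flip();
        return updated;
    }
}
```

The design keeps timestamp handling out of types that have no embedded timestamps, while letting ListType (and, in principle, any other path-carrying type) participate in the rewrite.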
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774632#comment-17774632 ]

Henrik Ingo commented on CASSANDRA-18798:
---

Hmm... It did pass for me, but you're right, if repeating the test multiple times it does fail quite soon, at runs 2 to 4. Btw I added:

{code}
try
{
    for (int i = 0; i < 100; i++)
    {
        ForkJoinTask<?> add1 = ForkJoinPool.commonPool().submit(() -> {
{code}

...so that the test is practically guaranteed to fail. (Otherwise it would be a flaky test if it passes 50% of the time...)

I should note that I did rerun the --list-append test that discovered this bug in the first place, and it can no longer reproduce the issue. It passes even a fairly lengthy run.

... I would say the addition of https://github.com/apache/cassandra/blame/cep-15-accord/src/java/org/apache/cassandra/service/accord/AccordKeyspace.java#L361-L363 clearly helps. The {{BufferCell[] ListType.elements}} now get their timestamps updated. But what's missing? One possibility is that the list items now have their timestamps correctly aligned with Accord, but the list is never re-sorted after this?
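The re-sort hypothesis can be modeled minimally (hypothetical classes, not Cassandra's storage internals): if cell paths are rewritten in place but the cells keep their old storage order, the order a reader iterates and the order the timestamps imply diverge, which is the shape of the anomaly in the edn log where one reader sees [... 553 455].

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal model of the hypothesis: list cells are stored in an array that
// readers iterate in storage order, while correctness requires ordering
// by the path timestamp. Updating timestamps without re-sorting lets the
// two orders diverge.
final class ListCellOrder {
    static final class Cell {
        final long pathMicros; // timestamp component of the element's path UUID
        final int value;
        Cell(long pathMicros, int value) { this.pathMicros = pathMicros; this.value = value; }
    }

    // What a reader sees if cells are iterated as stored.
    static List<Integer> storageOrder(List<Cell> cells) {
        List<Integer> out = new ArrayList<>();
        for (Cell c : cells) out.add(c.value);
        return out;
    }

    // What the reader should see: elements ordered by path timestamp.
    static List<Integer> timestampOrder(List<Cell> cells) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparingLong(c -> c.pathMicros));
        List<Integer> out = new ArrayList<>();
        for (Cell c : sorted) out.add(c.value);
        return out;
    }
}
```

With two cells whose paths got new timestamps in the opposite order of their array positions, the storage-order read and the timestamp-order read return different sequences.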
[jira] [Updated] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18798: Fix Version/s: 5.0-alpha2 > Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Assignee: Henrik Ingo >Priority: Normal > Fix For: 5.0-alpha2 > > Attachments: image-2023-09-26-20-05-25-846.png > > > Given the following schema: > {code:java} > CREATE KEYSPACE IF NOT EXISTS accord WITH replication = {'class': > 'SimpleStrategy', 'replication_factor': 3}; > CREATE TABLE IF NOT EXISTS accord.list_append(id int PRIMARY KEY,contents > LIST); > TRUNCATE accord.list_append;{code} > And the following two possible queries executed by 10 threads in parallel: > {code:java} > BEGIN TRANSACTION > LET row = (SELECT * FROM list_append WHERE id = ?); > SELECT row.contents; > COMMIT TRANSACTION;" > BEGIN TRANSACTION > UPDATE list_append SET contents += ? WHERE id = ?; > COMMIT TRANSACTION;" > {code} > there seems to be an issue with transaction guarantees. Here's an excerpt in > the edn format from a test. 
> {code:java} > {:type :invoke :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607285967116627} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 54 > :time 1692607286078732473} > {:type :invoke :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286133833428} > {:type :invoke :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286149702511} > {:type :ok :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607286156314099} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 52 > :time 1692607286167090389} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352]]] :tid 1 :n 54 :time 1692607286168657534} > {:type :invoke :process 1 :value [[:r 5 nil]] :tid 0 :n 51 > :time 1692607286201762938} > {:type :ok :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286245571513} > {:type :invoke :process 7 :value [[:r 5 nil]] :tid 4 :n 56 > :time 1692607286245655775} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 455]]] :tid 9 :n 52 :time 1692607286253928906} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 53 > :time 1692607286254095215} > {:type :ok :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286266263422} > {:type :ok :process 1 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 0 :n 51 :time 1692607286271617955} > {:type :ok :process 7 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 
25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 4 :n 56 :time 1692607286271816933} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 9 :n 53 :time 1692607286281483026} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 56 > :time 1692607286284097561} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 1 :n 56 :time 1692607286306445242} > {code} > Processes process 6 and process 7 are appending the values 553 and 455 > respectively. 455 succeeded and a read by process 5 confirms that. But then > also 553 is appended and a read by process 1 confirms that as well, however > it sees 553 before 455. > process 5 reads [... 852 352 455] where as process 1 reads [... 852 352 553 > 455
[jira] [Updated] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrik Ingo updated CASSANDRA-18798:
Mentor: Jacek Lewandowski
Resolution: Fixed
Status: Resolved (was: Open)
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774270#comment-17774270 ]

Henrik Ingo commented on CASSANDRA-18798:
---

Confirmed: First run --list-append on an old branch of cep-15-accord:

{code}
hingo@odysseus:~/Documents/github/accordclient$ java -jar elle-cli/target/elle-cli-0.1.6-standalone.jar --model list-append --anomalies G0 --consistency-models strict-serializable --directory out-la --verbose test-la.edn
java.lang.AssertionError: Assert failed: No transaction wrote 5 12 t2
 at elle.list_append$dirty_update_cases$fn__1930$fn__1935.invoke(list_append.clj:377)
 at clojure.lang.PersistentVector.reduce(PersistentVector.java:343)
 at clojure.core$reduce.invokeStatic(core.clj:6829)
 at clojure.core$reduce.invoke(core.clj:6812)
 at elle.list_append$dirty_update_cases$fn__1930.invoke(list_append.clj:372)
 at clojure.core$map$fn__5884.invoke(core.clj:2759)
 at clojure.lang.LazySeq.sval(LazySeq.java:42)
 at clojure.lang.LazySeq.seq(LazySeq.java:51)
 at clojure.lang.Cons.next(Cons.java:39)
 at clojure.lang.RT.boundedLength(RT.java:1793)
 at clojure.lang.RestFn.applyTo(RestFn.java:130)
 at clojure.core$apply.invokeStatic(core.clj:667)
 at clojure.core$mapcat.invokeStatic(core.clj:2787)
 at clojure.core$mapcat.doInvoke(core.clj:2787)
 at clojure.lang.RestFn.invoke(RestFn.java:423)
 at elle.list_append$dirty_update_cases.invokeStatic(list_append.clj:370)
 at elle.list_append$dirty_update_cases.invoke(list_append.clj:361)
 at elle.list_append$check$dirty_update_task__2257.invoke(list_append.clj:875)
 at jepsen.history.task.Task.run(task.clj:282)
 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
 at java.base/java.lang.Thread.run(Thread.java:829)
{code}

Then today's checkout of cep-15-accord:

{code}
hingo@odysseus:~/Documents/github/accordclient$ java -jar elle-cli/target/elle-cli-0.1.6-standalone.jar --model list-append --anomalies G0 --consistency-models strict-serializable --directory out-la --verbose test-la.edn
{"valid?":true}
{code}

(Full list of steps as in the description of this ticket)
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773851#comment-17773851 ] Henrik Ingo commented on CASSANDRA-18798: - After pulling the most recent cep-15-accord branch, it seems this issue is fixed: https://github.com/apache/cassandra/blob/cep-15-accord/src/java/org/apache/cassandra/service/accord/AccordKeyspace.java#L361-L363 I'll rerun Jaroslaw's original consistency test tomorrow to verify.
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772178#comment-17772178 ] Henrik Ingo edited comment on CASSANDRA-18798 at 10/5/23 11:35 AM: --- Yes. Sorry, Branimir educated me on Friday about this. I thought a fix would be trivial after that, so I didn't bother to summarize it in a comment at the time. Now, 4 days later, it's obvious I should have... Basically, the TimeUUID also needs to be (re)generated from the Accord executeAt timestamp. This way operations like appending to a list will result in a correct and consistent ordering of the list elements. Ok, so clearly I cannot brute-force my way through this with late nights. I've pushed a branch which contains my work so far. I'm stuck on the unit test; I expect the actual fix to be a one-liner. Status: Blocked by org.apache.cassandra.exceptions.WriteTimeoutException when debugging Accord transactions. Beyond that, I know what I have to do with regard to how timestamps affect the ordering of list elements. However, in the unit test I created, the ts value is still 0, and therefore all the inserted rows end up deleted. https://github.com/henrikingo/cassandra/tree/C-18798-ListType-Accord
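The point about regenerating the TimeUUID from executeAt can be illustrated with a minimal sketch. This is not Cassandra's actual TimeUUID code; `makeKey` and its arguments are hypothetical names. It only shows why a cell key derived deterministically from the consensus-agreed executeAt timestamp is identical on every replica, whereas a key generated locally at insert time is not:

```java
import java.util.UUID;

// Illustrative sketch only: a cell key derived purely from the transaction's
// executeAt timestamp (plus a logical id for uniqueness within the transaction)
// is reproducible, so every replica computes the same key and therefore the
// same timestamp-sorted list order. makeKey is a hypothetical name, not
// Cassandra's TimeUUID implementation.
public class DeterministicCellKey {
    // Same (executeAtMicros, logicalId) input -> same key on every replica.
    public static UUID makeKey(long executeAtMicros, long logicalId) {
        return new UUID(executeAtMicros, logicalId);
    }

    public static void main(String[] args) {
        long executeAt = 1_692_607_286_000_000L; // agreed via consensus
        UUID onReplica1 = makeKey(executeAt, 42);
        UUID onReplica2 = makeKey(executeAt, 42);
        // Both replicas derive the identical key. By contrast, calling
        // UUID.randomUUID() locally at insert time would yield a different
        // key per replica, and timestamp-sorted list cells would diverge.
        System.out.println(onReplica1.equals(onReplica2)); // true
    }
}
```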
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772178#comment-17772178 ] Henrik Ingo commented on CASSANDRA-18798: - Yes. Sorry, Branimir educated me on Friday about this. I thought a fix would be trivial after that, so I didn't bother to summarize it in a comment at the time. Now, 4 days later, it's obvious I should have... Ok, so clearly I cannot brute-force my way through this with late nights. I've pushed a branch which contains my work so far. I'm stuck on the unit test; I expect the actual fix to be a one-liner. Status: Blocked by org.apache.cassandra.exceptions.WriteTimeoutException when debugging Accord transactions. Beyond that, I know what I have to do with regard to how timestamps affect the ordering of list elements. However, in the unit test I created, the ts value is still 0, and therefore all the inserted rows end up deleted. https://github.com/henrikingo/cassandra/tree/C-18798-ListType-Accord
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769861#comment-17769861 ] Henrik Ingo edited comment on CASSANDRA-18798 at 9/28/23 1:05 AM: -- [~kijanowski] When you wake up, can you try this:
{code}
diff --git a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
index 1be3d54558..3b0d7b78cc 100644
--- a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
+++ b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
@@ -230,7 +230,7 @@ public class UnfilteredRowIteratorSerializer
         final SerializationHeader sHeader = header.sHeader;
         return new AbstractUnfilteredRowIterator(metadata, header.key, header.partitionDeletion, sHeader.columns(), header.staticRow, header.isReversed, sHeader.stats())
         {
-            private final Row.Builder builder = BTreeRow.sortedBuilder();
+            private final Row.Builder builder = BTreeRow.unsortedBuilder();

             protected Unfiltered computeNext()
             {
diff --git a/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java b/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
index d528a70a18..22bdbc745b 100644
--- a/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
+++ b/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
@@ -455,7 +455,7 @@ public class UnfilteredSerializer
     throws IOException
     {
         // It wouldn't be wrong per-se to use an unsorted builder, but it would be inefficient so make sure we don't do it by mistake
-        assert builder.isSorted();
+        //assert builder.isSorted();

         int flags = in.readUnsignedByte();
         if (isEndOfPartition(flags))
{code}
Note the naming convention: sortedBuilder means the data is already sorted, so the builder does not sort; unsortedBuilder means the data is not sorted, so the builder sorts it.
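The sortedBuilder/unsortedBuilder naming convention described in the comment above can be sketched as a pair of toy builders. These are illustrative only, not the real BTreeRow.Builder API; they just make the two contracts concrete:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of the two builder contracts (hypothetical names, not Cassandra's
// BTreeRow code): the "sorted" builder trusts its input is already in order
// and merely appends; the "unsorted" builder accepts any order and sorts once
// at build time.
public class RowBuilderSketch {
    public interface Builder {
        void add(int cellKey);
        List<Integer> build();
    }

    // sortedBuilder: caller promises input arrives in key order; cheap, but
    // wrong if cells actually arrive in some other order (e.g. commit order).
    public static Builder sortedBuilder() {
        return new Builder() {
            final List<Integer> cells = new ArrayList<>();
            public void add(int k) {
                if (!cells.isEmpty() && cells.get(cells.size() - 1) > k)
                    throw new AssertionError("input not sorted");
                cells.add(k);
            }
            public List<Integer> build() { return cells; }
        };
    }

    // unsortedBuilder: tolerates any input order, pays a sort at build time.
    public static Builder unsortedBuilder() {
        return new Builder() {
            final List<Integer> cells = new ArrayList<>();
            public void add(int k) { cells.add(k); }
            public List<Integer> build() { Collections.sort(cells); return cells; }
        };
    }

    public static void main(String[] args) {
        Builder b = unsortedBuilder();
        b.add(3); b.add(1); b.add(2);
        System.out.println(b.build()); // [1, 2, 3]
    }
}
```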
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769861#comment-17769861 ] Henrik Ingo edited comment on CASSANDRA-18798 at 9/28/23 1:03 AM: -- [~kijanowski] When you wake up, can you try this:
{code}
diff --git a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
index 1be3d54558..3b0d7b78cc 100644
--- a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
+++ b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
@@ -230,7 +230,7 @@ public class UnfilteredRowIteratorSerializer
         final SerializationHeader sHeader = header.sHeader;
         return new AbstractUnfilteredRowIterator(metadata, header.key, header.partitionDeletion, sHeader.columns(), header.staticRow, header.isReversed, sHeader.stats())
         {
-            private final Row.Builder builder = BTreeRow.sortedBuilder();
+            private final Row.Builder builder = BTreeRow.unsortedBuilder();

             protected Unfiltered computeNext()
             {
diff --git a/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java b/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
index d528a70a18..22bdbc745b 100644
--- a/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
+++ b/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
@@ -455,7 +455,7 @@ public class UnfilteredSerializer
     throws IOException
     {
         // It wouldn't be wrong per-se to use an unsorted builder, but it would be inefficient so make sure we don't do it by mistake
-        assert builder.isSorted();
+        //assert builder.isSorted();

         int flags = in.readUnsignedByte();
         if (isEndOfPartition(flags))
{code}
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769861#comment-17769861 ] Henrik Ingo commented on CASSANDRA-18798: - [~kijanowski] When you wake up, can you try this:
{code}
diff --git a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
index 1be3d54558..3b0d7b78cc 100644
--- a/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
+++ b/src/java/org/apache/cassandra/db/rows/UnfilteredRowIteratorSerializer.java
@@ -230,7 +230,7 @@ public class UnfilteredRowIteratorSerializer
         final SerializationHeader sHeader = header.sHeader;
         return new AbstractUnfilteredRowIterator(metadata, header.key, header.partitionDeletion, sHeader.columns(), header.staticRow, header.isReversed, sHeader.stats())
         {
-            private final Row.Builder builder = BTreeRow.sortedBuilder();
+            private final Row.Builder builder = BTreeRow.unsortedBuilder();

             protected Unfiltered computeNext()
             {
diff --git a/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java b/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
index d528a70a18..22bdbc745b 100644
--- a/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
+++ b/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java
@@ -455,7 +455,7 @@ public class UnfilteredSerializer
     throws IOException
    {
         // It wouldn't be wrong per-se to use an unsorted builder, but it would be inefficient so make sure we don't do it by mistake
-        assert builder.isSorted();
+        //assert builder.isSorted();

         int flags = in.readUnsignedByte();
         if (isEndOfPartition(flags))
{code}
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769677#comment-17769677 ] Henrik Ingo commented on CASSANDRA-18798: - Thanks! This somewhat confirms the theory then. The only exception is that this isn't about loss of precision at all. All of those timestamps are unique, and the problem is just that the ListType isn't sorting at all now.
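That observation can be reduced to a minimal model (hypothetical names, not Cassandra's ListType internals): even when every cell's write timestamp is unique, a read path that returns cells in arrival order instead of sorting them by timestamp produces the wrong list order:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal model of the failure mode: each list cell carries a unique write
// timestamp, but the read path must still sort by it. Here 455's append
// completed first on this replica, while 553 carries the earlier
// (consensus-assigned) timestamp. Cell and the method names are illustrative.
public class ListOrderingDemo {
    static final class Cell {
        final long writeMicros;
        final int value;
        Cell(long writeMicros, int value) { this.writeMicros = writeMicros; this.value = value; }
    }

    static List<Cell> cells() {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell(200L, 455)); // arrived first...
        cells.add(new Cell(100L, 553)); // ...but has the earlier timestamp
        return cells;
    }

    // Broken read path: returns cells in arrival order, no sorting.
    public static List<Integer> arrivalOrdered() { return values(cells()); }

    // Correct read path: sort by write timestamp before returning.
    public static List<Integer> timestampOrdered() {
        List<Cell> cells = cells();
        cells.sort(Comparator.comparingLong(c -> c.writeMicros));
        return values(cells);
    }

    static List<Integer> values(List<Cell> cells) {
        List<Integer> out = new ArrayList<>();
        for (Cell c : cells) out.add(c.value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(arrivalOrdered());   // [455, 553]
        System.out.println(timestampOrdered()); // [553, 455]
    }
}
```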
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769223#comment-17769223 ] Henrik Ingo edited comment on CASSANDRA-18798 at 9/27/23 3:41 PM: -- Okay, finally got to the bottom of this. Report of findings follows:

TL;DR: Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There is a loss of precision, though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Beneath this, however, the {{ListType}} itself is broken, at least on the read path. Part of the reason for writing all of the below is that I'm seeking guidance on the ListType: what is it even supposed to do?

h2. Accord

The ListType is internally like a table/partition of BTreeRows, sorted by their timestamp. This makes lists eventually consistent: the application can append entries to a list from two clients simultaneously, and the ordering of the resulting list, once all elements have "arrived", is deterministic.

The initial hypothesis for my research was that the Accord {{executeAt}} timestamp isn't correctly propagated into each list element (BTreeRow). However, this is not the case. Once an Accord transaction has determined its transaction id, called {{executeAt}} in Cassandra, and we arrive at the write portion of the execution phase, we have this:

{code:java}
cfk.updateLastExecutionTimestamps(executeAt, true);
long timestamp = cfk.current().timestampMicrosFor(executeAt, true);
// TODO (low priority - do we need to compute nowInSeconds, or can we just use executeAt?)
int nowInSeconds = cfk.current().nowInSecondsFor(executeAt, true);
{code}

_modules/accord/accord-core/src/main/java/accord/primitives/Timestamp.java_

This eventually reaches:

{code:java}
public Row updateAllTimestamp(long newTimestamp)
{
    LivenessInfo newInfo = primaryKeyLivenessInfo.isEmpty() ? primaryKeyLivenessInfo : primaryKeyLivenessInfo.withUpdatedTimestamp(newTimestamp);
    // If the deletion is shadowable and the row has a timestamp, we'll forced the deletion timestamp to be less than the row one, so we
    // should get rid of said deletion.
    Deletion newDeletion = deletion.isLive() || (deletion.isShadowable() && !primaryKeyLivenessInfo.isEmpty())
                         ? Deletion.LIVE
                         : new Deletion(new DeletionTime(newTimestamp - 1, deletion.time().localDeletionTime()), deletion.isShadowable());
    return transformAndFilter(newInfo, newDeletion, (cd) -> cd.updateAllTimestamp(newTimestamp));
}
{code}

_src/java/org/apache/cassandra/db/rows/BTreeRow.java_

The only problem I can see is a loss of precision: this call uses the hlc() part of the executeAt timestamp, and not the node id (nor the epoch). It seems possible, and even likely, that two different nodes will append to the same list during the same millisecond. After that, the ordering of those two (BTreeRow) elements is deterministic but arbitrary, and not guaranteed to match the order of the Accord transactions that wrote them.

Also note that the loss of precision is unnecessary! Cassandra legacy timestamps are microseconds, but Accord has only millisecond precision. A better implementation here would be to use the last 3 digits of the timestamp field to encode the node id, and maybe also the epoch.
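To illustrate the suggestion above, here is a minimal sketch of packing a node id into the three microsecond digits that the millisecond-precision HLC leaves unused. The class and method names are hypothetical, not part of Accord or Cassandra, and the sketch assumes the node id fits in three decimal digits:

```java
// Hypothetical sketch: pack a node id into the unused microsecond digits
// of a legacy Cassandra timestamp, so that writes from different nodes
// within the same millisecond still order deterministically.
public class HlcMicrosSketch
{
    // hlcMillis: the millisecond-precision hlc() part of executeAt.
    // nodeId: assumed to fit in 0..999, the three "spare" decimal digits.
    public static long toLegacyMicros(long hlcMillis, int nodeId)
    {
        if (nodeId < 0 || nodeId > 999)
            throw new IllegalArgumentException("node id must fit in the 3 spare digits");
        // Sorts by hlc first; ties within one millisecond break by node id.
        return hlcMillis * 1000L + nodeId;
    }
}
```

With this encoding, two appends from different nodes in the same millisecond would no longer collide on the legacy timestamp; the epoch would not fit in the same three digits, so fitting it as well would need a different scheme.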
The Accord-originated timestamps are easy to spot by their 3 trailing zeros:

{code}
$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-5-big-Data.db
[2]@241 Row[info=[ts=1695222739434337] ]: | del(names)=deletedAt=1695222739434336, localDeletion=1695222739, [names[177f79d0-57c8-11ee-b578-7dbb261b6e16]=Albert ts=1695222739434337], [names[177f79da-57c8-11ee-b578-7dbb261b6e16]=Ebba ts=1695222739434337], [names[3d4371d0-57c8-11ee-b578-7dbb261b6e16]=poppari ts=1695222802794082]

$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-26-big-Data.db
[2]@0 Row[info=[ts=-9223372036854775808] ]: | , [names[6d8b88c0-5979-11ee-9ee4-1ff7dd1e5050]=HENKKA ts=1695408855885000]
{code}

h2. ListType

My understanding is that a ListType is expected to return the elements of the list sorted by their timestamp. Some elements don't have a timestamp of their own, in which case they use the timestamp from the row header plus physical order. When a ListType is read from disk and deserialized, a good place to start observing what happens is BTreeRow.Builder.build():

{code:java}
public Row build()
{
    if (isSorted || !isSorted)
        getCells().sort();
    // we can avoid resolving if we're sorted and have no complex values
{code}
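The ordering problem described above can be demonstrated with a toy model. The {{Cell}} record and {{ordered}} helper below are illustrative stand-ins, not the actual Cassandra classes: when two cells carry equal, millisecond-truncated timestamps, sorting by timestamp alone just preserves whatever arrival order the cells happened to have:

```java
import java.util.Comparator;
import java.util.List;

public class ListOrderSketch
{
    // Toy stand-in for a list cell: the element value plus its write timestamp.
    record Cell(String value, long timestampMicros) {}

    // Order cells by timestamp, the way list elements are effectively ordered.
    // Stream.sorted is stable, so equal timestamps keep their arrival order.
    static List<String> ordered(List<Cell> cells)
    {
        return cells.stream()
                    .sorted(Comparator.comparingLong(Cell::timestampMicros))
                    .map(Cell::value)
                    .toList();
    }
}
```

Feeding the same two equal-timestamp cells in the two possible arrival orders yields two different list orders, which matches the deterministic-but-arbitrary behaviour observed in the test.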
[jira] [Updated] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18798: Bug Category: Parent values: Correctness(12982) Level 1 values: Consistency(12989) Complexity: Normal Discovered By: Adhoc Test Reviewers: Jaroslaw Kijanowski Severity: Normal Assignee: Henrik Ingo Status: Open (was: Triage Needed)
The Accord originated timestamps are easy to spot with their 3 trailing zeros: {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-5-big-Data.db }} {{[2]@241 Row[info=[ts=1695222739434337] ]: | del(names)=deletedAt=1695222739434336, localDeletion=1695222739, [names[177f79d0-57c8-11ee-b578-7dbb261b6e16]=Albert ts=1695222739434337], [names[177f79da-57c8-11ee-b578-7dbb261b6e16]=Ebba ts=1695222739434337], [names[3d4371d0-}} {{57c8-11ee-b578-7dbb261b6e16]=poppari ts=1695222802794082]}} $ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-26-big-Data.db [2]@0 Row[info=[ts=-9223372036854775808] ]: | , [names[6d8b88c0-5979-11ee-9ee4-1ff7dd1e5050]=HENKKA ts=1695408855885000] h2. ListType was (Author: henrik.ingo): Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two clients simul
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769223#comment-17769223 ] Henrik Ingo edited comment on CASSANDRA-18798 at 9/26/23 2:53 PM: -- Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two clients simultaneously, and the ordering of the resulting list, once all elements have "arrived", is deterministic. The initial hyptohesis for my research was that the Accord executeAt timestamp isn't correctly propagated into each list element (BTreeRow). However, this is not the case: Once an Accord transaction has determined its transaction id, called executeAt in Cassandra, and we arrive at the write portion of the exeuction phase, we have this: {{ cfk.updateLastExecutionTimestamps(executeAt, true);}} {{ long timestamp = cfk.current().timestampMicrosFor(executeAt, true);}} {{ // TODO (low priority - do we need to compute nowInSeconds, or can we just use executeAt?)}} {{ int nowInSeconds = cfk.current().nowInSecondsFor(executeAt, true);}} _modules/accord/accord-core/src/main/java/accord/primitives/Timestamp.java_ This eventually reaches {{ public Row updateAllTimestamp(long newTimestamp)}} {{ LivenessInfo newInfo = primaryKeyLivenessInfo.isEmpty() ? 
primaryKeyLivenessInfo : primaryKeyLivenessInfo.withUpdatedTimestamp(newTimestamp);}} {{ // If the deletion is shadowable and the row has a timestamp, we'll forced the deletion timestamp to be less than the row one, so we}} {{ // should get rid of said deletion.}} {{ Deletion newDeletion = deletion.isLive() || (deletion.isShadowable() && !primaryKeyLivenessInfo.isEmpty())}} {{ ? Deletion.LIVE}} {{ : new Deletion(new DeletionTime(newTimestamp - 1, deletion.time().localDeletionTime()), deletion.isShadowable());}} {{ return transformAndFilter(newInfo, newDeletion, (cd) -> cd.updateAllTimestamp(newTimestamp));}} _src/java/org/apache/cassandra/db/rows/BTreeRow.java_ The only problem I can see is loss of precision: This call will use the hlc() part of the executeAt timestamp, and not the node id (nor epoch). It seems possible and even likely that two different nodes will append to the same list during the same millisecond. After this, the ordering of those two (BTreeRow) elements is deterministic but arbitrary, and not guaranteed to be the same as the Accord transactions that wrote them. Also note the loss of precision is unnecessary! Cassandra legacy timestamps are microseconds, but Accord has only millisecond precision. A better implementation here would be to use the last 3 digits of the timestamp field to encode the node id, and maybe also epoch. 
The Accord originated timestamps are easy to spot with their 3 trailing zeros: {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-5-big-Data.db }} {{[2]@241 Row[info=[ts=1695222739434337] ]: | del(names)=deletedAt=1695222739434336, localDeletion=1695222739, [names[177f79d0-57c8-11ee-b578-7dbb261b6e16]=Albert ts=1695222739434337], [names[177f79da-57c8-11ee-b578-7dbb261b6e16]=Ebba ts=1695222739434337], [names[3d4371d0-}} {{57c8-11ee-b578-7dbb261b6e16]=poppari ts=1695222802794082]}} {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-26-big-Data.db [2]@0 Row[info=[ts=-9223372036854775808] ]: | , [names[6d8b88c0-5979-11ee-9ee4-1ff7dd1e5050]=HENKKA ts=1695408855885000] h2. ListType was (Author: henrik.ingo): Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two clients sim
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769223#comment-17769223 ] Henrik Ingo edited comment on CASSANDRA-18798 at 9/26/23 2:53 PM: -- Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two clients simultaneously, and the ordering of the resulting list, once all elements have "arrived", is deterministic. The initial hyptohesis for my research was that the Accord executeAt timestamp isn't correctly propagated into each list element (BTreeRow). However, this is not the case: Once an Accord transaction has determined its transaction id, called executeAt in Cassandra, and we arrive at the write portion of the exeuction phase, we have this: {{ cfk.updateLastExecutionTimestamps(executeAt, true);}} {{ long timestamp = cfk.current().timestampMicrosFor(executeAt, true);}} {{ // TODO (low priority - do we need to compute nowInSeconds, or can we just use executeAt?)}} {{ int nowInSeconds = cfk.current().nowInSecondsFor(executeAt, true);}} _modules/accord/accord-core/src/main/java/accord/primitives/Timestamp.java_ This eventually reaches {{ public Row updateAllTimestamp(long newTimestamp)}} {{ LivenessInfo newInfo = primaryKeyLivenessInfo.isEmpty() ? 
primaryKeyLivenessInfo : primaryKeyLivenessInfo.withUpdatedTimestamp(newTimestamp);}} {{ // If the deletion is shadowable and the row has a timestamp, we'll forced the deletion timestamp to be less than the row one, so we}} {{ // should get rid of said deletion.}} {{ Deletion newDeletion = deletion.isLive() || (deletion.isShadowable() && !primaryKeyLivenessInfo.isEmpty())}} {{ ? Deletion.LIVE}} {{ : new Deletion(new DeletionTime(newTimestamp - 1, deletion.time().localDeletionTime()), deletion.isShadowable());}} {{ return transformAndFilter(newInfo, newDeletion, (cd) -> cd.updateAllTimestamp(newTimestamp));}} {{ }} {{ }}_src/java/org/apache/cassandra/db/rows/BTreeRow.java_ The only problem I can see is loss of precision: This call will use the hlc() part of the executeAt timestamp, and not the node id (nor epoch). It seems possible and even likely that two different nodes will append to the same list during the same millisecond. After this, the ordering of those two (BTreeRow) elements is deterministic but arbitrary, and not guaranteed to be the same as the Accord transactions that wrote them. Also note the loss of precision is unnecessary! Cassandra legacy timestamps are microseconds, but Accord has only millisecond precision. A better implementation here would be to use the last 3 digits of the timestamp field to encode the node id, and maybe also epoch. 
The Accord originated timestamps are easy to spot with their 3 trailing zeros: {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-5-big-Data.db }} {{[2]@241 Row[info=[ts=1695222739434337] ]: | del(names)=deletedAt=1695222739434336, localDeletion=1695222739, [names[177f79d0-57c8-11ee-b578-7dbb261b6e16]=Albert ts=1695222739434337], [names[177f79da-57c8-11ee-b578-7dbb261b6e16]=Ebba ts=1695222739434337], [names[3d4371d0-}} {{57c8-11ee-b578-7dbb261b6e16]=poppari ts=1695222802794082]}} {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-26-big-Data.db [2]@0 Row[info=[ts=-9223372036854775808] ]: | , [names[6d8b88c0-5979-11ee-9ee4-1ff7dd1e5050]=HENKKA ts=1695408855885000] h2. ListType was (Author: henrik.ingo): Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two cli
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769223#comment-17769223 ] Henrik Ingo edited comment on CASSANDRA-18798 at 9/26/23 2:52 PM: -- Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two clients simultaneously, and the ordering of the resulting list, once all elements have "arrived", is deterministic. The initial hyptohesis for my research was that the Accord executeAt timestamp isn't correctly propagated into each list element (BTreeRow). However, this is not the case: Once an Accord transaction has determined its transaction id, called executeAt in Cassandra, and we arrive at the write portion of the exeuction phase, we have this: {{ cfk.updateLastExecutionTimestamps(executeAt, true);}} {{ long timestamp = cfk.current().timestampMicrosFor(executeAt, true);}} {{ // TODO (low priority - do we need to compute nowInSeconds, or can we just use executeAt?)}} {{ int nowInSeconds = cfk.current().nowInSecondsFor(executeAt, true);}} _{{}}modules/accord/accord-core/src/main/java/accord/primitives/Timestamp.java_ This eventually reaches {{ public Row updateAllTimestamp(long newTimestamp)}} {{ LivenessInfo newInfo = primaryKeyLivenessInfo.isEmpty() ? 
primaryKeyLivenessInfo : primaryKeyLivenessInfo.withUpdatedTimestamp(newTimestamp);}} {{ // If the deletion is shadowable and the row has a timestamp, we'll forced the deletion timestamp to be less than the row one, so we}} {{ // should get rid of said deletion.}} {{ Deletion newDeletion = deletion.isLive() || (deletion.isShadowable() && !primaryKeyLivenessInfo.isEmpty())}} {{ ? Deletion.LIVE}} {{ : new Deletion(new DeletionTime(newTimestamp - 1, deletion.time().localDeletionTime()), deletion.isShadowable());}} {{ return transformAndFilter(newInfo, newDeletion, (cd) -> cd.updateAllTimestamp(newTimestamp));}} {{ }} {{ }}{_}src/java/org/apache/cassandra/db/rows/BTreeRow.java{_} The only problem I can see is loss of precision: This call will use the hlc() part of the executeAt timestamp, and not the node id (nor epoch). It seems possible and even likely that two different nodes will append to the same list during the same millisecond. After this, the ordering of those two (BTreeRow) elements is deterministic but arbitrary, and not guaranteed to be the same as the Accord transactions that wrote them. Also note the loss of precision is unnecessary! Cassandra legacy timestamps are microseconds, but Accord has only millisecond precision. A better implementation here would be to use the last 3 digits of the timestamp field to encode the node id, and maybe also epoch. 
The Accord originated timestamps are easy to spot with their 3 trailing zeros: {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-5-big-Data.db }} {{[2]@241 Row[info=[ts=1695222739434337] ]: | del(names)=deletedAt=1695222739434336, localDeletion=1695222739, [names[177f79d0-57c8-11ee-b578-7dbb261b6e16]=Albert ts=1695222739434337], [names[177f79da-57c8-11ee-b578-7dbb261b6e16]=Ebba ts=1695222739434337], [names[3d4371d0-}} {{57c8-11ee-b578-7dbb261b6e16]=poppari ts=1695222802794082]}} {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-26-big-Data.db [2]@0 Row[info=[ts=-9223372036854775808] ]: | , [names[6d8b88c0-5979-11ee-9ee4-1ff7dd1e5050]=HENKKA ts=1695408855885000] h2. ListType was (Author: henrik.ingo): Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769223#comment-17769223 ] Henrik Ingo edited comment on CASSANDRA-18798 at 9/26/23 2:52 PM: -- Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two clients simultaneously, and the ordering of the resulting list, once all elements have "arrived", is deterministic. The initial hyptohesis for my research was that the Accord executeAt timestamp isn't correctly propagated into each list element (BTreeRow). However, this is not the case: Once an Accord transaction has determined its transaction id, called executeAt in Cassandra, and we arrive at the write portion of the exeuction phase, we have this: {{ cfk.updateLastExecutionTimestamps(executeAt, true);}} {{ long timestamp = cfk.current().timestampMicrosFor(executeAt, true);}} {{ // TODO (low priority - do we need to compute nowInSeconds, or can we just use executeAt?)}} {{ int nowInSeconds = cfk.current().nowInSecondsFor(executeAt, true);}} _{{}}modules/accord/accord-core/src/main/java/accord/primitives/Timestamp.java_ This eventually reaches {{ public Row updateAllTimestamp(long newTimestamp)}} {{ }} {{ LivenessInfo newInfo = primaryKeyLivenessInfo.isEmpty() ? 
primaryKeyLivenessInfo : primaryKeyLivenessInfo.withUpdatedTimestamp(newTimestamp);}} {{ // If the deletion is shadowable and the row has a timestamp, we'll forced the deletion timestamp to be less than the row one, so we}} {{ // should get rid of said deletion.}} {{ Deletion newDeletion = deletion.isLive() || (deletion.isShadowable() && !primaryKeyLivenessInfo.isEmpty())}} {{ ? Deletion.LIVE}} {{ : new Deletion(new DeletionTime(newTimestamp - 1, deletion.time().localDeletionTime()), deletion.isShadowable());}} {{ return transformAndFilter(newInfo, newDeletion, (cd) -> cd.updateAllTimestamp(newTimestamp));}} {{ }} {{ }}{_}src/java/org/apache/cassandra/db/rows/BTreeRow.java{_} The only problem I can see is loss of precision: This call will use the hlc() part of the executeAt timestamp, and not the node id (nor epoch). It seems possible and even likely that two different nodes will append to the same list during the same millisecond. After this, the ordering of those two (BTreeRow) elements is deterministic but arbitrary, and not guaranteed to be the same as the Accord transactions that wrote them. Also note the loss of precision is unnecessary! Cassandra legacy timestamps are microseconds, but Accord has only millisecond precision. A better implementation here would be to use the last 3 digits of the timestamp field to encode the node id, and maybe also epoch. 
The Accord originated timestamps are easy to spot with their 3 trailing zeros: {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-5-big-Data.db }} {{[2]@241 Row[info=[ts=1695222739434337] ]: | del(names)=deletedAt=1695222739434336, localDeletion=1695222739, [names[177f79d0-57c8-11ee-b578-7dbb261b6e16]=Albert ts=1695222739434337], [names[177f79da-57c8-11ee-b578-7dbb261b6e16]=Ebba ts=1695222739434337], [names[3d4371d0-}} {{57c8-11ee-b578-7dbb261b6e16]=poppari ts=1695222802794082]}} {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-26-big-Data.db [2]@0 Row[info=[ts=-9223372036854775808] ]: | , [names[6d8b88c0-5979-11ee-9ee4-1ff7dd1e5050]=HENKKA ts=1695408855885000] }} h2. ListType was (Author: henrik.ingo): Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a lis
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769223#comment-17769223 ] Henrik Ingo edited comment on CASSANDRA-18798 at 9/26/23 2:50 PM: -- Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two clients simultaneously, and the ordering of the resulting list, once all elements have "arrived", is deterministic. The initial hyptohesis for my research was that the Accord executeAt timestamp isn't correctly propagated into each list element (BTreeRow). However, this is not the case: Once an Accord transaction has determined its transaction id, called executeAt in Cassandra, and we arrive at the write portion of the exeuction phase, we have this: {{ cfk.updateLastExecutionTimestamps(executeAt, true);}} {{ long timestamp = cfk.current().timestampMicrosFor(executeAt, true);}} {{ // TODO (low priority - do we need to compute nowInSeconds, or can we just use executeAt?)}} {{ int nowInSeconds = cfk.current().nowInSecondsFor(executeAt, true);}} _{{}}modules/accord/accord-core/src/main/java/accord/primitives/Timestamp.java_ This eventually reaches {{ public Row updateAllTimestamp(long newTimestamp)}} {{ {}} {{ LivenessInfo newInfo = primaryKeyLivenessInfo.isEmpty() ? 
primaryKeyLivenessInfo : primaryKeyLivenessInfo.withUpdatedTimestamp(newTimestamp);}} {{ // If the deletion is shadowable and the row has a timestamp, we'll forced the deletion timestamp to be less than the row one, so we}} {{ // should get rid of said deletion.}} {{ Deletion newDeletion = deletion.isLive() || (deletion.isShadowable() && !primaryKeyLivenessInfo.isEmpty())}} {{ ? Deletion.LIVE}} {{ : new Deletion(new DeletionTime(newTimestamp - 1, deletion.time().localDeletionTime()), deletion.isShadowable());}} {{ return transformAndFilter(newInfo, newDeletion, (cd) -> cd.updateAllTimestamp(newTimestamp));}} {\{ }}} {\{ }}{_}src/java/org/apache/cassandra/db/rows/BTreeRow.java{_} The only problem I can see is loss of precision: This call will use the hlc() part of the executeAt timestamp, and not the node id (nor epoch). It seems possible and even likely that two different nodes will append to the same list during the same millisecond. After this, the ordering of those two (BTreeRow) elements is deterministic but arbitrary, and not guaranteed to be the same as the Accord transactions that wrote them. Also note the loss of precision is unnecessary! Cassandra legacy timestamps are microseconds, but Accord has only millisecond precision. A better implementation here would be to use the last 3 digits of the timestamp field to encode the node id, and maybe also epoch. 
The Accord originated timestamps are easy to spot with their 3 trailing zeros: {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-5-big-Data.db }} {{[2]@241 Row[info=[ts=1695222739434337] ]: | del(names)=deletedAt=1695222739434336, localDeletion=1695222739, [names[177f79d0-57c8-11ee-b578-7dbb261b6e16]=Albert ts=1695222739434337], [names[177f79da-57c8-11ee-b578-7dbb261b6e16]=Ebba ts=1695222739434337], [names[3d4371d0-}} {{57c8-11ee-b578-7dbb261b6e16]=poppari ts=1695222802794082]}} {{$ tools/bin/sstabledump -d -t data/data/myspace/listtest-8574ceb057c611eeb5787dbb261b6e16/nc-26-big-Data.db [2]@0 Row[info=[ts=-9223372036854775808] ]: | , [names[6d8b88c0-5979-11ee-9ee4-1ff7dd1e5050]=HENKKA ts=1695408855885000] }} h2. ListType was (Author: henrik.ingo): Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769223#comment-17769223 ] Henrik Ingo commented on CASSANDRA-18798: - Okay, finally got to the bottom of this. Report of findings follows: TL:DR; Accord is correctly propagating the {{executeAt}} timestamp into the legacy {{timestamp}} and {{executeNow}} fields. There's loss of precision though, so appending to the list twice within the same millisecond is the likely explanation of this particular symptom. Underneath this, it's however the {{ListType}} itself that is broken, at least for the read path. h2. Accord The ListType is internally like a table/partition of BTreeRow's, that are sorted by their timestamp. This makes lists ecentualy consistent: The application can append entries to a list from two clients simultaneously, and the ordering of the resulting list, once all elements have "arrived", is deterministic. The initial hyptohesis for my research was that the Accord executeAt timestamp isn't correctly propagated into each list element (BTreeRow). However, this is not the case: Once an Accord transaction has determined its transaction id, called executeAt in Cassandra, and we arrive at the write portion of the exeuction phase, we have this: {{ cfk.updateLastExecutionTimestamps(executeAt, true);}} {{ long timestamp = cfk.current().timestampMicrosFor(executeAt, true);}} {{ // TODO (low priority - do we need to compute nowInSeconds, or can we just use executeAt?)}} {{ int nowInSeconds = cfk.current().nowInSecondsFor(executeAt, true);}} _{{}}modules/accord/accord-core/src/main/java/accord/primitives/Timestamp.java_ This eventually reaches {{ public Row updateAllTimestamp(long newTimestamp)}} {{ {}} {{ LivenessInfo newInfo = primaryKeyLivenessInfo.isEmpty() ? 
primaryKeyLivenessInfo : primaryKeyLivenessInfo.withUpdatedTimestamp(newTimestamp);}} {{ // If the deletion is shadowable and the row has a timestamp, we'll forced the deletion timestamp to be less than the row one, so we}} {{ // should get rid of said deletion.}} {{ Deletion newDeletion = deletion.isLive() || (deletion.isShadowable() && !primaryKeyLivenessInfo.isEmpty())}} {{ ? Deletion.LIVE}} {{ : new Deletion(new DeletionTime(newTimestamp - 1, deletion.time().localDeletionTime()), deletion.isShadowable());}} {{ return transformAndFilter(newInfo, newDeletion, (cd) -> cd.updateAllTimestamp(newTimestamp));}} {{ }}} {{ }}_src/java/org/apache/cassandra/db/rows/BTreeRow.java_ The only problem I can see is loss of precision: This call will use the hlc() part of the executeAt timestamp, and not the node id (nor epoch). It seems possible and even likely that two different nodes will append to the same list during the same millisecond. After this, the ordering of those two (BTreeRow) elements is deterministic but arbitrary, and not guaranteed to be the same as the Accord transactions that wrote them. Also note the loss of precision is unnecessary! Cassandra legacy timestamps are microseconds, but Accord has only millisecond precision. A better implementation here would be to use the last 3 digits of the timestamp field to encode the node id, and maybe also epoch. h2. 
ListType > Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Priority: Normal > > Given the following schema: > {code:java} > CREATE KEYSPACE IF NOT EXISTS accord WITH replication = {'class': > 'SimpleStrategy', 'replication_factor': 3}; > CREATE TABLE IF NOT EXISTS accord.list_append(id int PRIMARY KEY,contents > LIST); > TRUNCATE accord.list_append;{code} > And the following two possible queries executed by 10 threads in parallel: > {code:java} > BEGIN TRANSACTION > LET row = (SELECT * FROM list_append WHERE id = ?); > SELECT row.contents; > COMMIT TRANSACTION;" > BEGIN TRANSACTION > UPDATE list_append SET contents += ? WHERE id = ?; > COMMIT TRANSACTION;" > {code} > there seems to be an issue with transaction guarantees. Here's an excerpt in > the edn format from a test. > {code:java} > {:type :invoke :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607285967116627} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 54 > :time 1692607286078732473} > {:type :invoke :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286133833428} > {:type :invoke :process 7 :v
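On the precision point in the comment above: folding a small node id into the otherwise-unused sub-millisecond digits of the legacy microsecond timestamp would keep same-millisecond appends from different nodes in a deterministic, Accord-compatible order. A minimal hedged sketch (hypothetical helper names, not the actual Accord/Cassandra code):

```java
// Hypothetical sketch: pack a node id (0-999) into the microsecond
// remainder of a millisecond-precision HLC value, so two appends that
// land in the same millisecond on different nodes still get distinct,
// deterministically ordered legacy timestamps.
public class HlcMicros {
    static long toLegacyMicros(long hlcMillis, int nodeId) {
        if (nodeId < 0 || nodeId > 999)
            throw new IllegalArgumentException("node id must fit in 3 decimal digits");
        return hlcMillis * 1000L + nodeId;
    }

    public static void main(String[] args) {
        long a = toLegacyMicros(1692607286156L, 1);
        long b = toLegacyMicros(1692607286156L, 2);
        System.out.println(a < b); // same millisecond, node ids break the tie
    }
}
```

With this encoding, two writes in the same millisecond differ in their last three digits, so a BTreeRow sort by timestamp would match the transaction order — assuming node ids fit in 0-999, which is only an illustration of the idea, not a claim about Accord's actual id width.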
[jira] [Commented] (CASSANDRA-18682) Create TLA+ spec of Accord
[ https://issues.apache.org/jira/browse/CASSANDRA-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760152#comment-17760152 ] Henrik Ingo commented on CASSANDRA-18682: - Ok, so after "sleeping on it" quite a few nights, and doing other tasks in between, I returned to this project with some success. I removed all PlusCal code and just did everything in TLA+ syntax. This greatly improves both robustness and clarity. What I pushed yesterday (https://github.com/henrikingo/cassandra-accord/commit/25c43b98ed15d5762aeeab8aa1539fa0f00b9458) is based on modeling Accord such that each "operator" (I mean function) is a coordinator or "writer" node in a Cassandra cluster, and these steps are connected (or separated) by message queues. This is an intuitive way to think about Accord, and luckily it seems to work. If you try to run it, just note that TLC is less concerned about the implementation completing an end-to-end trip for a given transaction, and more focused on running it through every possible permutation of input variables. I have also been working on an approach where the algorithm from the white paper is mapped pretty much 1:1 into TLA+ syntax. I believe this is how TLA+ is designed to be used. And it would certainly give more confidence in the results of this project if the TLA+ implementation made it crystal clear that the TLA+ code does the same thing as the algorithm in the paper. In the latter approach it wasn't at first obvious how the parts that are executed by different nodes/shards... can be modeled. But now that I've seen more, it might be possible. Even so, I'm going to focus on the message queue approach first.
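As a rough illustration of what TLC does with the message-queue model described above — exhaustively exploring every delivery order rather than a single end-to-end run — here is a toy enumerator (plain Java, not TLA+; the message names are made up for illustration):

```java
import java.util.*;

// Toy model-checker flavor: enumerate every possible order in which a set
// of in-flight messages could be delivered, the way TLC explores all
// interleavings of a message-queue model instead of one execution.
public class Interleavings {
    // Returns how many complete delivery orders exist, printing each one.
    static int explore(List<String> pending, List<String> delivered) {
        if (pending.isEmpty()) {
            System.out.println(delivered);
            return 1;
        }
        int total = 0;
        for (int i = 0; i < pending.size(); i++) {
            List<String> rest = new ArrayList<>(pending);   // copy, never mutate input
            String msg = rest.remove(i);
            List<String> next = new ArrayList<>(delivered);
            next.add(msg);
            total += explore(rest, next);
        }
        return total;
    }

    public static void main(String[] args) {
        // Three hypothetical in-flight messages between Accord roles
        int n = explore(List.of("PreAccept->A", "PreAccept->B", "Commit->A"), List.of());
        System.out.println(n + " interleavings"); // 3! = 6
    }
}
```

Real model checkers prune this explosion with state hashing and symmetry, but the exhaustive-permutation mindset is the point: a property must hold in every interleaving, not just the happy path.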
> Create TLA+ spec of Accord > -- > > Key: CASSANDRA-18682 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18682 > Project: Cassandra > Issue Type: Task > Components: Accord >Reporter: Henrik Ingo >Assignee: Henrik Ingo >Priority: Normal > Fix For: 5.x > > > Create a TLA+ Spec of Accord. > > For this ticket, goal is just to cover Algorithm 1. No significant > discoveries are expected, and to really check correctness, one will have to > implement all 4 algorithms. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-18682) Create TLA+ spec of Accord
[ https://issues.apache.org/jira/browse/CASSANDRA-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748625#comment-17748625 ] Henrik Ingo edited comment on CASSANDRA-18682 at 7/28/23 2:11 PM: -- Weekly status update: * Decided to get rid of PlusCal. It's easy: the PlusCal code is compiled into TLA+, so just remove the PlusCal and you're left with the equivalent TLA+ implementation. * The TLA Toolbox still feels like it's on shaky ground. Decided to try [PlusPy|https://github.com/tlaplus/PlusPy], a Python implementation of the TLA+ syntax. One motivation here is that by reading the Python code, it could be easier to at least understand what it is trying to do. In the extreme case one could make careful modifications to avoid some of the oddest TLA+ syntax... * However, PlusPy quite early on raised an exception. After two days I found out what was wrong (variables need to be initialized). Patch below. After fixing that issue, there's now an exception because it doesn't find the Init stage... In summary, this could work, but how confident would we be that it is checking anything correctly at all, if I have to supply N fixes just to run the thing? * Took a step back and decided to read up on alternatives to TLA+. Through [Wikipedia|https://en.wikipedia.org/wiki/List_of_model_checking_tools], found [Mazzanti, Franco; Ferrari, Alessio (2018)|https://arxiv.org/abs/1803.10324v1], who implemented the same algorithm in 10 different tools and share their results. [They later produced a 100+ page report surveying which tools are used the most.|http://www.astrail.eu/download.aspx?id=bb46b81b-a5bf-4036-9018-cc6e7d91e2c2] TLA+ isn't one of them... * Based on that report, I'm curious now to test the [Eclipse-based Rodin framework|http://www.event-b.org/] and the Event-B language it uses. That's for next week.
{noformat}
diff --git a/pluspy.py b/pluspy.py
index d1ba07a..2103fcf 100644
--- a/pluspy.py
+++ b/pluspy.py
@@ -2185,7 +2185,10 @@ class VariableExpression(Expression):
         if initializing:
             return PrimeExpression(expr=subs[self])
         else:
-            return subs[self]
+            v = subs[self]
+            if isinstance(v,bool):
+                v = ValueExpression(v)
+            return v
 
     def eval(self, containers, boundedvars):
         print("Error: variable", self.id, "not realized", containers, boundedvars)
{noformat}
[jira] [Commented] (CASSANDRA-18682) Create TLA+ spec of Accord
[ https://issues.apache.org/jira/browse/CASSANDRA-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746648#comment-17746648 ] Henrik Ingo commented on CASSANDRA-18682: - PHP was the 5th or 6th programming language I learned, and it was also the first time I could pick up a new programming language just by starting to code and reading the manual as needed. They are all variations of each other, and in 2 weeks you could easily learn a 7th and 8th language... TLA+ was... hard. I actually tried to watch the official video tutorials from Leslie Lamport, but the problem is he teaches math, and I gave up when the 3rd video still didn't show any TLA+ syntax... Hillel's learntla.com was better, and got me this far, but... It seems like a joke to think that this language is used to prove correctness and robustness of algorithms. This is the silliest, most fragile, booby-trapped language I've ever seen. Anyway... The linked branch contains the beginnings of an implementation as a TLA+ spec. Current status is that there's a key range from which a transaction will pick a set of keys that the transaction operates on. (There is no payload, and nothing is done to the keys other than checking whether a concurrent trx might be holding the same key.) Similarly, each transaction has t_0 and T timestamps. A complete transaction {{ newTrx := <>;}} is passed to all nodes and back to the coordinator. However... * Only the fast path, i.e. the beginning of Algorithm 1, is implemented so far. * Partitioning is skipped, so all nodes always hold all keys. * deps are not actually collected; it is always the empty set. * Consequently, there isn't really any checking for conflicting transactions, because the deps to check aren't there. In addition: * I chose to implement this in PlusCal, mostly because learntla/Hillel seems to recommend it. (And Leslie recommends TLA+, which is a good reason to NOT use it...)
* Looking at the above now, I'll probably do a second attempt where it's just raw TLA+. PlusCal seems to make a hard language actually whimsical. For example, it introduces two new ways to assign to a variable: := and =. Which one to use depends on the context. * I'm considering using PlusPy as the interpreter and checker. This way I could easily follow in Python what is actually happening. I'm also tempted to just redefine TLA+ to be less crazy, starting with using = for assignment. Finally, in PlusPy you have the option to use Python for a section instead of TLA+. For example, choosing a set of numbers of varying, random size should not be a hard problem, but it is in TLA+. As I'm writing this, it's unclear to me whether I will continue with this tomorrow or whether I'll be asked to context switch to another Accord-related task.
[jira] [Updated] (CASSANDRA-18682) Create TLA+ spec of Accord
[ https://issues.apache.org/jira/browse/CASSANDRA-18682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18682: Change Category: Quality Assurance Complexity: Challenging Fix Version/s: 5.x Status: Open (was: Triage Needed)
[jira] [Created] (CASSANDRA-18682) Create TLA+ spec of Accord
Henrik Ingo created CASSANDRA-18682: --- Summary: Create TLA+ spec of Accord Key: CASSANDRA-18682 URL: https://issues.apache.org/jira/browse/CASSANDRA-18682 Project: Cassandra Issue Type: Task Components: Accord Reporter: Henrik Ingo Assignee: Henrik Ingo Create a TLA+ Spec of Accord. For this ticket, the goal is just to cover Algorithm 1. No significant discoveries are expected, and to really check correctness, one will have to implement all 4 algorithms.
[jira] [Commented] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717756#comment-17717756 ] Henrik Ingo commented on CASSANDRA-18260: - Ok, I have added whitespace also to the patch against trunk. If I'm following correctly, I've addressed all comments, except for the DEBUG log message, which isn't introduced by this patch; it is merely adjacent and was copied from one branch to another when backporting. So I would suggest we merge and close this, and if there is sufficient motivation to remove the DEBUG message, that can easily be done in a separate commit later. > Add details to Error message: Not enough space for compaction > -- > > Key: CASSANDRA-18260 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18260 > Project: Cassandra > Issue Type: Improvement > Components: Local/Compaction, Observability/Logging >Reporter: Brad Schoening >Assignee: Henrik Ingo >Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.x > > Time Spent: 3h 20m > Remaining Estimate: 0h > > When compaction fails, the log message should list a) the free space > available on disk at that point in time and b) perhaps the number and/or size > of the source sstables being compacted. > Free space can change from one moment to the next, so when the below > compaction failed because it needed 161GB, upon checking the server a few > minutes later, it had 184GB free. Similarly, the error message mentions it > was writing out one SSTable on this STCS table, but it's not clear if it was > combining X -> 1 tables, or something else. > {noformat} > [INFO ] [CompactionExecutor:77758] cluster_id=87 ip_address=127.1.1.1 > CompactionTask.java:241 - Compacted (8a1cffe0-abb5-11ed-b3fc-8d2ac2c52f0d) 1 > sstables to [...] to level=0. 86.997GiB to 86.997GiB (~99% of original) in > 1,508,155ms. Read Throughput = 59.069MiB/s, Write Throughput = 59.069MiB/s, > Row Throughput = ~20,283/s. 21,375 total partitions merged to 21,375.
> Partition merge counts were \{1:21375, } > [ERROR] [CompactionExecutor:4] cluster_id=87 ip_address=127.1.1.1 > CassandraDaemon.java:581 - Exception in thread > Thread[CompactionExecutor:4,1,main] > java.lang.RuntimeException: Not enough space for compaction, estimated > sstables = 1, expected write size = 161228934093 > at > org.apache.cassandra.db.compaction.CompactionTask.buildCompactionCandidatesForAvailableDiskSpace(CompactionTask.java:386) > at > org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:126) > at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:77) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:100) > at > org.apache.cassandra.db.compaction.CompactionManager$7.execute(CompactionManager.java:613) > at > org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:377) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Thread.java:834) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
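The ticket asks that the failure message include the free space available at that moment and the shape of the compaction. A hedged sketch of what such an enriched message could look like, assuming the data directory's FileStore is queryable at failure time (hypothetical helper, not the actual patch):

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of an enriched "not enough space" message that also
// reports the usable space on the data directory's filesystem and the
// number of input sstables, so operators can compare needed vs available.
public class CompactionSpaceMessage {
    static String build(long expectedWriteSize, int sstableCount, Path dataDir) {
        String storeName;
        String space;
        try {
            FileStore store = Files.getFileStore(dataDir);
            storeName = store.name();
            space = Long.toString(store.getUsableSpace());
        } catch (IOException e) {
            // Disk probing can itself fail; degrade gracefully in the message
            storeName = dataDir.toString();
            space = "unknown";
        }
        return String.format(
            "Not enough space for compaction: expected write size = %d bytes, " +
            "input sstables = %d, usable space on %s = %s bytes",
            expectedWriteSize, sstableCount, storeName, space);
    }

    public static void main(String[] args) {
        System.out.println(build(161_228_934_093L, 1, Path.of(".")));
    }
}
```

`getUsableSpace()` is a point-in-time reading, which matches the ticket's observation that free space can change between the failure and the operator's later check — capturing it in the message itself removes that ambiguity.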
[jira] [Commented] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714610#comment-17714610 ] Henrik Ingo commented on CASSANDRA-18260: - [~maxwellguo] For reference, this is the patch against trunk: [https://github.com/apache/cassandra/pull/2244] And yes indeed, they don't have much in common.
[jira] [Updated] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18260: Mentor: Michael Semb Wever Status: Review In Progress (was: Changes Suggested)
[jira] [Commented] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714151#comment-17714151 ] Henrik Ingo commented on CASSANDRA-18260: - Okay, here are the 4.1 and 4.0 "backports". I tried to make the log messages as similar as possible, but since the code actually does something different compared to trunk, the message text reflects that. [https://github.com/apache/cassandra/pull/2285] [https://github.com/apache/cassandra/pull/2284] The above two are pretty much identical; just a simple one-line change was needed. The difference from these two to the trunk PR is huge. I spent a couple of hours pondering whether some other approach would be more correct (for trunk), but in the end they're just different. But here we have them now. Maybe my main criticism of these is the growing number of lines asserting log output. If anything changes in the surrounding code, you have 20+ assertions to update as well.
[jira] [Commented] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712263#comment-17712263 ] Henrik Ingo commented on CASSANDRA-18260: - Actually.. this isn't a bug fix. Why should it be merged to stable branches? I would understand a motivation to keep branches in sync wrt trivial ghanges but if the patch already doesn't apply, why should we merge new functionality to a stable branch? > Add details to Error message: Not enough space for compaction > -- > > Key: CASSANDRA-18260 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18260 > Project: Cassandra > Issue Type: Improvement > Components: Local/Compaction, Observability/Logging >Reporter: Brad Schoening >Assignee: Henrik Ingo >Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.x > > Time Spent: 40m > Remaining Estimate: 0h > > When compaction fails, the log message should list a) the free space > available on disk at that point in time and b) perhaps the number and/or size > of the source sstables being compacted. > Free space can change from one moment to the next, so when the below > compaction failed because it needed 161GB, upon checking the server a few > minutes later, it had 184GB free. Similarly, the error message mentions it > was writing out one SSTable on this STCS table, but its not clear if it was > combining X -> 1 tables, or something else. > {noformat} > [INFO ] [CompactionExecutor:77758] cluster_id=87 ip_address=127.1.1.1 > CompactionTask.java:241 - Compacted (8a1cffe0-abb5-11ed-b3fc-8d2ac2c52f0d) 1 > sstables to [...] to level=0. 86.997GiB to 86.997GiB (~99% of original) in > 1,508,155ms. Read Throughput = 59.069MiB/s, Write Throughput = 59.069MiB/s, > Row Throughput = ~20,283/s. 21,375 total partitions merged to 21,375. 
> Partition merge counts were \{1:21375, } > [ERROR] [CompactionExecutor:4] cluster_id=87 ip_address=127.1.1.1 > CassandraDaemon.java:581 - Exception in thread > Thread[CompactionExecutor:4,1,main] > java.lang.RuntimeException: Not enough space for compaction, estimated > sstables = 1, expected write size = 161228934093 > at > org.apache.cassandra.db.compaction.CompactionTask.buildCompactionCandidatesForAvailableDiskSpace(CompactionTask.java:386) > at > org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:126) > at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:77) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:100) > at > org.apache.cassandra.db.compaction.CompactionManager$7.execute(CompactionManager.java:613) > at > org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:377) > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Thread.java:834) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
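The enrichment the ticket asks for, free disk space and input sstable details in the failure message, can be sketched as follows. This is a minimal illustration only; the class, method, and parameter names are assumptions, not Cassandra's actual CompactionTask API.

```java
// Hypothetical helper illustrating the requested error-message detail:
// expected write size, free space at failure time, and input sstable counts.
import java.util.List;

public class CompactionSpaceCheck {
    static String notEnoughSpaceMessage(long expectedWriteSize, long freeBytes,
                                        List<Long> inputSstableSizes) {
        // Total size of the sstables being compacted, for the X -> 1 question.
        long totalInput = inputSstableSizes.stream().mapToLong(Long::longValue).sum();
        return String.format(
            "Not enough space for compaction: expected write size = %d bytes, " +
            "free space = %d bytes, compacting %d sstables totalling %d bytes",
            expectedWriteSize, freeBytes, inputSstableSizes.size(), totalInput);
    }

    public static void main(String[] args) {
        // Illustrative values loosely based on the log excerpt in the ticket.
        System.out.println(notEnoughSpaceMessage(161228934093L, 150000000000L,
                                                 List.of(161228934093L)));
    }
}
```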
[jira] [Commented] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708131#comment-17708131 ] Henrik Ingo commented on CASSANDRA-18260: - [~mck] I believe what you are seeing here is that the mock {{FakeFileStore}} class didn't implement a {{toString()}} method. CASSANDRA-18287 seems to have an example of what the same message looks like in production use.
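A fake like the one discussed above gets readable log output by overriding {{toString()}}. A minimal sketch with a simplified stand-in class (not the actual DirectoriesTest fake):

```java
// Why toString() matters on a test fake: log lines then print
// "FakeFileStore(available=30 bytes)" instead of "FakeFileStore@6ddee60f".
public class FakeFileStore {
    private final long usableSpace;

    public FakeFileStore(long usableSpace) {
        this.usableSpace = usableSpace;
    }

    public long getUsableSpace() {
        return usableSpace;
    }

    @Override
    public String toString() {
        // Without this override, Object.toString() yields ClassName@hashcode.
        return "FakeFileStore(available=" + usableSpace + " bytes)";
    }

    public static void main(String[] args) {
        System.out.println(new FakeFileStore(30)); // FakeFileStore(available=30 bytes)
    }
}
```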
[jira] [Commented] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706191#comment-17706191 ] Henrik Ingo commented on CASSANDRA-18260: - OK, that's better. Updated the PR to use it.
[jira] [Commented] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705963#comment-17705963 ] Henrik Ingo commented on CASSANDRA-18260: - [~bschoeni] sure, here you go:
{quote}FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@6ddee60f has 30 bytes available, checking if we can write 20 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@1f87607c has 30 bytes available, checking if we can write 20 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@4b862408 has 30 bytes available, checking if we can write 20 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@6ddee60f has 30 bytes available, checking if we can write 20 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@1f87607c has 19 bytes available, checking if we can write 20 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@1f87607c has only 0 MiB available, but 0 MiB is needed
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@4b862408 has 30 bytes available, checking if we can write 20 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@6ddee60f has 30 bytes available, checking if we can write 20 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@1f87607c has 19 bytes available, checking if we can write 20 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@1f87607c has only 0 MiB available, but 0 MiB is needed
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@4b862408 has 20971511 bytes available, checking if we can write 26214409 bytes
FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@4b862408 has only 20 MiB available, but 25 MiB is needed{quote}
where this is the new row:
{quote}FileStore org.apache.cassandra.db.DirectoriesTest$FakeFileStore@4b862408 has only X MiB available, but X MiB is needed{quote}
The above is from the unit test, so the error message "{_}Not enough space for compaction, estimated sstables = 1, expected write size = 161228934093{_}" is not there, but it would happen after this. This also reminds me of a question I wanted to highlight: do we want to preserve the exact format "1234567 bytes", or use "1.2 MiB"?
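On the bytes-versus-MiB question raised above, one option is to log both forms, keeping raw bytes for precision and MiB for readability. A sketch only; the helper name is made up for illustration:

```java
// Emit both the exact byte count and a human-readable MiB figure.
import java.util.Locale;

public class ByteFormat {
    static String pretty(long bytes) {
        double mib = bytes / (1024.0 * 1024.0);
        // Locale.ROOT keeps the decimal separator stable across environments.
        return String.format(Locale.ROOT, "%d bytes (%.1f MiB)", bytes, mib);
    }

    public static void main(String[] args) {
        System.out.println(pretty(161228934093L));
    }
}
```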
[jira] [Updated] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo updated CASSANDRA-18260: Test and Documentation Plan: Tested on laptop with ant testclasslist. I'll get myself a CircleCI account next to run the full test suite. Status: Patch Available (was: In Progress)
[jira] [Commented] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704557#comment-17704557 ] Henrik Ingo commented on CASSANDRA-18260: - Hi, I've addressed a) in a PR that I just sent: [https://github.com/apache/cassandra/pull/2244] This is my first Cassandra patch, and also the first time in decades that I'm writing Java code professionally, so I'm humbly and eagerly looking forward to all feedback, including trivial stuff I did wrong. I've tested locally with `ant testclasslist` but that's all. I'll get myself a CircleCI account now to run the full test suite. I elected to write a new, separate log message from the part in the code where the free space calculation is done. This way I don't have to pass those variables somewhere else only for the purpose of adding them to the error message. Asserting log messages was a surprisingly difficult experience. If there is some preferred way to deal with this, I will happily change it. For example, I believe it's possible to configure Logback into a mode where log messages can be expected in a deterministic order (at least when guarding against other threads with MDC). But I'm concerned such a configuration could impact performance and therefore test turnaround time. Also, generally tests should test whatever is the default or production config; I wouldn't want to use a different config just for tests.
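On the difficulty of asserting log messages mentioned above: Cassandra logs through SLF4J/Logback, but the capture-and-assert pattern can be illustrated dependency-free with the JDK's own java.util.logging, attaching an in-memory Handler and asserting on the captured records. Names here are illustrative, not taken from the patch.

```java
// Capture log messages in memory so a test can assert on them directly,
// instead of parsing log files or relying on appender ordering.
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogCapture extends Handler {
    final List<String> messages = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
        messages.add(record.getMessage());
    }

    @Override public void flush() {}
    @Override public void close() {}

    public static void main(String[] args) {
        Logger logger = Logger.getLogger("compaction-test");
        LogCapture capture = new LogCapture();
        logger.addHandler(capture);
        logger.setUseParentHandlers(false); // don't also print to the console

        logger.info("Not enough space for compaction");

        // A test can now assert on captured messages.
        System.out.println(capture.messages);
    }
}
```

With Logback specifically, ch.qos.logback.core.read.ListAppender plays the same role as the Handler above.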
[jira] [Assigned] (CASSANDRA-18260) Add details to Error message: Not enough space for compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrik Ingo reassigned CASSANDRA-18260: --- Assignee: Henrik Ingo
[jira] [Commented] (CASSANDRA-18193) Provide design and API documentation
[ https://issues.apache.org/jira/browse/CASSANDRA-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17681389#comment-17681389 ] Henrik Ingo commented on CASSANDRA-18193: - I spent some hours on Wednesday high-level eyeballing the diff against G* trunk, to form my own opinion of what I see. I might post something about that next week, but for now I just wanted to share a by-product I found that, according to Benedict, could be something you want to uncomment before the merge? https://github.com/apache/cassandra/blob/cep-15-accord/src/java/org/apache/cassandra/dht/Murmur3Partitioner.java#L240-L255 > Provide design and API documentation > > > Key: CASSANDRA-18193 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18193 > Project: Cassandra > Issue Type: Task > Components: Accord >Reporter: Jacek Lewandowski >Priority: Normal > > Would be great if we have at minimum: > - white paper in a form of an AsciiDoc or Markdown somewhere in the project > tree > - all interfaces and all methods in {{accord.api}} have API docs explaining > the requirements for the implementations > - enums and their values across the project are documented > - interfaces, abstract classes, or classes that do not inherit from anything > in the project have at least some class-level explanation > Eventually, it would be really awesome if concepts from the whitepaper are > somehow referenced in the code (or vice versa). It would make it much easier > to understand the implementation and I believe it would improve reuse of this > project for external applications
[jira] [Created] (CASSANDRA-18145) Run entire Cassandra Jenkins in an independent EC2/EKS account
Henrik Ingo created CASSANDRA-18145: --- Summary: Run entire Cassandra Jenkins in an independent EC2/EKS account Key: CASSANDRA-18145 URL: https://issues.apache.org/jira/browse/CASSANDRA-18145 Project: Cassandra Issue Type: Task Reporter: Henrik Ingo