[jira] [Commented] (CASSANDRA-19297) Accord: RejectBefore must be up-to-date on joining nodes before ready to coordinate
[ https://issues.apache.org/jira/browse/CASSANDRA-19297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867267#comment-17867267 ] Benedict Elliott Smith commented on CASSANDRA-19297: Thanks! +1 > Accord: RejectBefore must be up-to-date on joining nodes before ready to > coordinate > --- > > Key: CASSANDRA-19297 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19297 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Assignee: Blake Eggleston >Priority: Normal > Labels: pull-request-available > > The exclusive sync point used to join the shard will be known by a majority > of the existing replicas, but in the event the quorum changes and the new > replica has not recorded the exclusive sync point this might in principle > lead to failing to reject a TxnId that should be rejected. > Simple fix, but introduce tests to corroborate this issue, and see if can > reproduce in burn test. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19758) Accord: CommandsForKey should self-prune
Benedict Elliott Smith created CASSANDRA-19758: -- Summary: Accord: CommandsForKey should self-prune Key: CASSANDRA-19758 URL: https://issues.apache.org/jira/browse/CASSANDRA-19758 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith CommandsForKey should periodically self-prune, so as to continue functioning well in-between garbage collections. This is a bit complicated, as once we prune we are left with potentially incomplete information, and have to sometimes load per-command information from disk. But the payoff is ensuring CommandsForKey objects - which drive the majority of the state machine - are kept to a reasonable size.
[jira] [Commented] (CASSANDRA-19288) Accord: Asynchronous reads may be unsafe
[ https://issues.apache.org/jira/browse/CASSANDRA-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17854202#comment-17854202 ] Benedict Elliott Smith commented on CASSANDRA-19288: If I remember correctly, when David introduced asynchronous reads into accord-core, it threw up problems. It might however have been a validation issue rather than a correctness issue. I think I vaguely recall realising after filing this that it might be that the merge logic assumes we won't see into the future, but we _can_ safely see into the future during the read so long as it is discarded, so we might only want to run merge validation logic on the coordinator and not the replica. But I never properly investigated, so it might just be best to enable async reads in accord-core so we can begin exercising them again and see what fails? > Accord: Asynchronous reads may be unsafe > > > Key: CASSANDRA-19288 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19288 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Assignee: Blake Eggleston >Priority: Normal > > In principle we should invalidate asynchronous reads before they complete if > the data they read may be invalid, but this anyway causes faults when we > permit them to occur in accord-core. We can and perhaps should simply ensure > the reads are issued against an sstable/memtable snapshot taken by the > command store, as this is lower cost and more robust. Otherwise we should > investigate what issue asynchronous reads cause.
[jira] [Comment Edited] (CASSANDRA-19445) Cassandra 4.1.4 floods logs with "Completed 0 uncommitted paxos instances for"
[ https://issues.apache.org/jira/browse/CASSANDRA-19445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850291#comment-17850291 ] Benedict Elliott Smith edited comment on CASSANDRA-19445 at 5/29/24 8:14 AM: - I would defer to [~bdeggleston] here, but if you are facing difficulties you can immediately supply your own logback config that sets this class' logging to WARN. was (Author: benedict): I would defer to [~bdeggleston] here, but if you are facing difficulties you can immediately supply your own logback config that sets this classes' logging to WARN. > Cassandra 4.1.4 floods logs with "Completed 0 uncommitted paxos instances for" > -- > > Key: CASSANDRA-19445 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19445 > Project: Cassandra > Issue Type: Bug > Components: Feature/Lightweight Transactions >Reporter: Zbyszek Z >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > Attachments: paxos-entry.txt, paxos-multiple.txt > > > Hello, > On our cluster logs are flooded with: > {code:java} > INFO [OptionalTasks:1] 2024-02-27 14:27:51,213 > PaxosCleanupLocalCoordinator.java:185 - Completed 0 uncommitted paxos > instances for X on ranges > [(9210458530128018597,-9222146739399525061], > (-9222146739399525061,-9174246180597321488], > (-9174246180597321488,-9155837684527496840], > (-9155837684527496840,-9148981328078890812], > (-9148981328078890812,-9141853035919151700], > (-9141853035919151700,-9138872620588476741], {code} > I cannot find anything in doc regarding this longline. Also this are huge log > payloads that heavy flood system.log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19445) Cassandra 4.1.4 floods logs with "Completed 0 uncommitted paxos instances for"
[ https://issues.apache.org/jira/browse/CASSANDRA-19445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850291#comment-17850291 ] Benedict Elliott Smith commented on CASSANDRA-19445: I would defer to [~bdeggleston] here, but if you are facing difficulties you can immediately supply your own logback config that sets this classes' logging to WARN. > Cassandra 4.1.4 floods logs with "Completed 0 uncommitted paxos instances for" > -- > > Key: CASSANDRA-19445 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19445 > Project: Cassandra > Issue Type: Bug > Components: Feature/Lightweight Transactions >Reporter: Zbyszek Z >Priority: Normal > Fix For: 4.1.x, 5.0.x, 5.x > > Attachments: paxos-entry.txt, paxos-multiple.txt > > > Hello, > On our cluster logs are flooded with: > {code:java} > INFO [OptionalTasks:1] 2024-02-27 14:27:51,213 > PaxosCleanupLocalCoordinator.java:185 - Completed 0 uncommitted paxos > instances for X on ranges > [(9210458530128018597,-9222146739399525061], > (-9222146739399525061,-9174246180597321488], > (-9174246180597321488,-9155837684527496840], > (-9155837684527496840,-9148981328078890812], > (-9148981328078890812,-9141853035919151700], > (-9141853035919151700,-9138872620588476741], {code} > I cannot find anything in doc regarding this longline. Also this are huge log > payloads that heavy flood system.log. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19668) SIGSEV origininating in Paxos Scheduled Task
[ https://issues.apache.org/jira/browse/CASSANDRA-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850279#comment-17850279 ] Benedict Elliott Smith commented on CASSANDRA-19668: I suspect the {{repairIterator}} version of this isn't guarded by an {{OpOrder}} so that it doesn't prevent the memtable being flushed and reclaimed, which is a bigger problem for off heap but a problem for regular memtables too. Probably we should be either taking an in-memory copy of the relevant data or else flushing and reading from disk. [~bdeggleston]? > SIGSEV origininating in Paxos Scheduled Task > > > Key: CASSANDRA-19668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19668 > Project: Cassandra > Issue Type: Bug >Reporter: Jon Haddad >Priority: Normal > > I haven't gotten to the root cause of this yet. Several 4.1 nodes have > crashed in in production. I'm not sure if this is related to Paxos v2 or > not, but it is enabled. offheap_objects also enabled. > I'm not sure if this affects 5.0, yet. > Most of the crashes don't have a stacktrace - they only reference this > {noformat} > Stack: [0x7fabf4c34000,0x7fabf4d34000], sp=0x7fabf4d31f00, free > space=1015k > Native frames: (J=compiled Java code, A=aot compiled Java code, > j=interpreted, Vv=VM code, C=native code) > v ~StubRoutines::jint_disjoint_arraycopy > {noformat} > They all are in the {{ScheduledTasks}} thread. 
> However, one node does have this in the crash log: > {noformat} > --- T H R E A D --- > Current thread (0x78b375eac800): JavaThread "ScheduledTasks:1" daemon > [_thread_in_Java, id=151791, stack(0x78b34b78,0x78b34b88)] > Stack: [0x78b34b78,0x78b34b88], sp=0x78b34b87c350, free > space=1008k > Native frames: (J=compiled Java code, A=aot compiled Java code, > j=interpreted, Vv=VM code, C=native code) > J 29467 c2 > org.apache.cassandra.db.rows.AbstractCell.clone(Lorg/apache/cassandra/utils/memory/ByteBufferCloner;)Lorg/apache/cassandra/db/rows/Cell; > (50 bytes) @ 0x78b3dd40a42f [0x78b3dd409de0+0x064f] > J 17669 c2 > org.apache.cassandra.db.rows.Cell.clone(Lorg/apache/cassandra/utils/memory/Cloner;)Lorg/apache/cassandra/db/rows/ColumnData; > (6 bytes) @ 0x78b3dc54edc0 [0x78b3dc54ed40+0x0080] > J 17816 c2 > org.apache.cassandra.db.rows.BTreeRow$$Lambda$845.apply(Ljava/lang/Object;)Ljava/lang/Object; > (12 bytes) @ 0x78b3dbed01a4 [0x78b3dbed0120+0x0084] > J 17828 c2 > org.apache.cassandra.utils.btree.BTree.transform([Ljava/lang/Object;Ljava/util/function/Function;)[Ljava/lang/Object; > (194 bytes) @ 0x78b3dc5f35f0 [0x78b3dc5f34a0+0x0150] > J 35096 c2 > org.apache.cassandra.db.rows.BTreeRow.clone(Lorg/apache/cassandra/utils/memory/Cloner;)Lorg/apache/cassandra/db/rows/Row; > (37 bytes) @ 0x78b3dda9111c [0x78b3dda90fe0+0x013c] > J 30500 c2 > org.apache.cassandra.utils.memory.EnsureOnHeap$CloneToHeap.applyToRow(Lorg/apache/cassandra/db/rows/Row;)Lorg/apache/cassandra/db/rows/Row; > (16 bytes) @ 0x78b3dd59b91c [0x78b3dd59b8c0+0x005c] > J 26498 c2 org.apache.cassandra.db.transform.BaseRows.hasNext()Z (215 bytes) > @ 0x78b3dcf1c454 [0x78b3dcf1c180+0x02d4] > J 30775 c2 > org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext()Ljava/lang/Object; > (49 bytes) @ 0x78b3dc789020 [0x78b3dc788fc0+0x0060] > J 9082 c2 org.apache.cassandra.utils.AbstractIterator.hasNext()Z (80 bytes) @ > 0x78b3dbb3c544 [0x78b3dbb3c440+0x0104] > J 35593 c2 > 
org.apache.cassandra.service.paxos.uncommitted.PaxosRows$PaxosMemtableToKeyStateIterator.computeNext()Lorg/apache/cassandra/service/paxos/uncommitted/PaxosKeyState; > (126 bytes) @ 0x78b3dc7ceeec [0x78b3dc7cee20+0x00cc] > J 35591 c2 > org.apache.cassandra.service.paxos.uncommitted.PaxosRows$PaxosMemtableToKeyStateIterator.computeNext()Ljava/lang/Object; > (5 bytes) @ 0x78b3dc7d09e4 [0x78b3dc7d09a0+0x0044] > J 9082 c2 org.apache.cassandra.utils.AbstractIterator.hasNext()Z (80 bytes) @ > 0x78b3dbb3c544 [0x78b3dbb3c440+0x0104] > J 34146 c2 > com.google.common.collect.Iterators.addAll(Ljava/util/Collection;Ljava/util/Iterator;)Z > (41 bytes) @ 0x78b3dd9197e8 [0x78b3dd919680+0x0168] > J 38256 c1 > org.apache.cassandra.service.paxos.uncommitted.PaxosRows.toIterator(Lorg/apache/cassandra/db/partitions/UnfilteredPartitionIterator;Lorg/apache/cassandra/schema/TableId;Z)Lorg/apache/cassandra/utils/CloseableIterator; > (49 bytes) @ 0x78b3d6b677ac [0x78b3d6b672e0+0x04cc] > J 34823 c1 >
[jira] [Updated] (CASSANDRA-19617) Paxos may re-distribute stale commits that predate a collectable tombstone
[ https://issues.apache.org/jira/browse/CASSANDRA-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-19617: --- Bug Category: Parent values: Correctness(12982)Level 1 values: Recoverable Corruption / Loss(12986) Complexity: Byzantine Discovered By: Diff Testing Fix Version/s: 4.1.x 5.0-rc Severity: Critical Assignee: Benedict Elliott Smith Status: Open (was: Triage Needed) > Paxos may re-distribute stale commits that predate a collectable tombstone > -- > > Key: CASSANDRA-19617 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19617 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > Fix For: 4.1.x, 5.0-rc > > > Note: this bug only affects {{paxos_state_purging: {gc_grace, repaired}}}, > i.e. those introduced alongside Paxos v2. > There are two problems: > 1) Purging is applied only on compaction, not on load, which can lead to very > old commits being resurfaced in certain circumstances > 2) PaxosPrepare does not filter commits based on paxos repair low bound > This permits surprising situations to arise, where some replicas purge a > stale commit _and all newer commits_, but due to compaction peculiarities > some other replica may purge only the newer commits, leaving a stale commit > in some compaction "purgatory"\[1] to be returned to reads indefinitely. > So long as there are no newer commits, the paxos coordinator will see this > commit is not universally known and redistribute it - no matter how old it > is. This can permit an insert to be reapplied after GC grace has elapsed and > the tombstone has been collected. > For proposals this is not a problem, as we correctly filter proposals based > on the last paxos repair time. This also does not affect clusters with the > legacy (and default) paxos state purging using TTL. 
Problem (1) only applies > also to the new {{gc_grace}} compatibility mode for purging. > \[1] Compaction purgatory can arise for instance because paxos purging allows > whole sstables to be erased quite effectively, and if this is able to > ordinarily prevent sstables being promoted to L1, then if for some abnormal > reason sstables reach L1 (e.g. repairs being disabled for some time), those > that collect may remain uncompacted for an extended period without purging > being applied.
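The second problem described for CASSANDRA-19617 (commits not being filtered by the paxos repair low bound) can be illustrated with a minimal Java sketch. Everything here is hypothetical: the `Commit` class, the `dropBelowLowBound` method and the microsecond timestamps are stand-ins invented for illustration, not Cassandra's Paxos types, and treating a ballot equal to the bound as still valid is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; these types are not Cassandra's Paxos classes. */
final class StaleCommitFilterSketch {
    /** A commit reduced to the two fields this illustration needs. */
    static final class Commit {
        final String key;
        final long ballotMicros; // ballot timestamp, assumed comparable with the low bound

        Commit(String key, long ballotMicros) {
            this.key = key;
            this.ballotMicros = ballotMicros;
        }
    }

    /**
     * Drops commits whose ballot is older than the repair low bound, so a
     * coordinator never sees (and never redistributes) them.
     * Keeping commits exactly at the bound is an assumption of this sketch.
     */
    static List<Commit> dropBelowLowBound(List<Commit> commits, long repairLowBoundMicros) {
        List<Commit> kept = new ArrayList<>();
        for (Commit c : commits)
            if (c.ballotMicros >= repairLowBoundMicros)
                kept.add(c);
        return kept;
    }

    public static void main(String[] args) {
        List<Commit> seen = Arrays.asList(
            new Commit("stale-insert", 100L),
            new Commit("recent-update", 500L));
        // With a low bound of 300, only the recent commit survives.
        for (Commit c : dropBelowLowBound(seen, 300L))
            System.out.println(c.key);
    }
}
```

The same predicate could in principle be applied when state is loaded as well as on compaction, which is the shape of the first problem; whether the actual fix works this way is not stated in the ticket.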
[jira] [Comment Edited] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842098#comment-17842098 ] Benedict Elliott Smith edited comment on CASSANDRA-19597 at 4/29/24 8:02 PM: - Yes, exactly. If I remember correctly, this "queue" was originally intended to achieve two things: 1) ensure commit log records are invalidated correctly, as it used to only support essentially invalidations of a complete prefix; 2) serve as a kind of fsync so that when awaiting the completion of a flush on a particular table you can be certain all data written prior has made it to sstables I'm not actually sure if any of this is necessary today though. Pretty sure we invalidate explicit ranges now, so the commit log semantics do not require this. I'm not sure off the top of my head why (except for non-durable tables/writes, or things that might want to read sstables prior to commit log replay) you would ever need to know all prior flushes had completed though, since the commit log will ensure they are re-written on restart. But a low risk approach would be to just make this a per table queue. was (Author: benedict): Yes, exactly. If I remember correctly, this "queue" was originally intended to achieve two things: 1) ensure commit log records are invalidated correctly, as it used to only support essentially invalidations of a complete prefix; 2) serve as a kind of fsync so that when awaiting the completion of a flush on a particular table you can be certain all data written prior has made it to disk I'm not actually sure if any of this is necessary today though. Pretty sure we invalidate explicit ranges now, so the commit log semantics do not require this. I'm not off the top of my head sure why (except for non-durable tables/writes) you would ever need to know all prior flushes had completed though, since the commit log will ensure they are re-written on restart. But a low risk approach would be to just make this a per table queue. 
> SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction > - > > Key: CASSANDRA-19597 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19597 > Project: Cassandra > Issue Type: Bug >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg >Priority: Normal > > There is a single post flush thread and that thread processes tasks in order > and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, > and that memtable flush can be blocked by slow IntervalTree building and > racing with compactors to try and build an interval tree. > Unless there is a requirement for ordering we probably want to loosen this to > the actual ordering requirement so that problems in one keyspace can’t affect > another. > SystemKeyspace and Gossip in particular cause lots of weird problems like > nodes marking each other down because Gossip can’t process nodes being > removed (blocking flush each time in SystemKeyspace.removeNode) > A very simple fix here might be to queue the post flush task at the same time > as the flush in a per CFS queue, and then submit the task only once the flush > is completed. > If flushes complete out of order the queue will still ensure their > completions are processed in order.
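The per-table queue proposed in CASSANDRA-19597 can be sketched roughly as follows. This is an illustration under stated assumptions, not Cassandra's `ColumnFamilyStore` code: the class `PerTablePostFlushQueue`, its methods and the table names are all hypothetical, and the post-flush task is run inline under a lock for brevity where a real implementation would presumably hand it to an executor.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not Cassandra's actual post-flush implementation. */
final class PerTablePostFlushQueue {
    /** One submitted flush and the post-flush work that must follow it. */
    static final class Pending {
        final Runnable postFlush;
        boolean flushed; // set once the memtable flush itself has finished

        Pending(Runnable postFlush) {
            this.postFlush = postFlush;
        }
    }

    // One FIFO queue per table, so tables never wait on each other.
    private final Map<String, Deque<Pending>> queues = new HashMap<>();

    /** Called when a flush is submitted: reserves its place in the table's order. */
    synchronized Pending enqueue(String table, Runnable postFlush) {
        Pending p = new Pending(postFlush);
        queues.computeIfAbsent(table, t -> new ArrayDeque<>()).add(p);
        return p;
    }

    /** Called when a flush finishes: runs every post-flush task whose turn has come. */
    synchronized void onFlushComplete(String table, Pending p) {
        p.flushed = true;
        Deque<Pending> q = queues.get(table);
        // Only the head may run, so completions stay in submission order per table.
        while (q != null && !q.isEmpty() && q.peek().flushed)
            q.poll().postFlush.run();
    }

    /** Small scenario: two tables, with one table's flushes finishing out of order. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        PerTablePostFlushQueue q = new PerTablePostFlushQueue();
        Pending a1 = q.enqueue("ks.a", () -> log.add("a1"));
        Pending a2 = q.enqueue("ks.a", () -> log.add("a2"));
        Pending s1 = q.enqueue("system.t", () -> log.add("s1"));
        q.onFlushComplete("system.t", s1); // runs at once; not blocked by ks.a
        q.onFlushComplete("ks.a", a2);     // finished early, must wait for a1
        q.onFlushComplete("ks.a", a1);     // releases a1, then a2
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the scenario, the `system.t` task is processed immediately even though `ks.a` still has outstanding flushes, while the two `ks.a` tasks are processed in submission order despite finishing out of order.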
[jira] [Commented] (CASSANDRA-19597) SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842098#comment-17842098 ] Benedict Elliott Smith commented on CASSANDRA-19597: Yes, exactly. If I remember correctly, this "queue" was originally intended to achieve two things: 1) ensure commit log records are invalidated correctly, as it used to only support essentially invalidations of a complete prefix; 2) serve as a kind of fsync so that when awaiting the completion of a flush on a particular table you can be certain all data written prior has made it to disk I'm not actually sure if any of this is necessary today though. Pretty sure we invalidate explicit ranges now, so the commit log semantics do not require this. I'm not off the top of my head sure why (except for non-durable tables/writes) you would ever need to know all prior flushes had completed though, since the commit log will ensure they are re-written on restart. But a low risk approach would be to just make this a per table queue. > SystemKeyspace CFS flushing blocked by unrelated keyspace flushing/compaction > - > > Key: CASSANDRA-19597 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19597 > Project: Cassandra > Issue Type: Bug >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg >Priority: Normal > > There is a single post flush thread and that thread processes tasks in order > and one of those tasks can be a memtable flush for an unrelated keyspace/cfs, > and that memtable flush can be blocked by slow IntervalTree building and > racing with compactors to try and build an interval tree. > Unless there is a requirement for ordering we probably want to loosen this to > the actual ordering requirement so that problems in one keyspace can’t effect > another. 
> SystemKeyspace and Gossip in particular cause lots of weird problems like > nodes marking each other down because Gossip can’t process nodes being > removed (blocking flush each time in SystemKeyspace.removeNode) > A very simple fix here might be to queue the post flush task at the same time > as the flush in a per CFS queue, and then submit the task only once the flush > is completed. > If flushes complete out of order the queue will still ensure their > completions are processed in order.
[jira] [Commented] (CASSANDRA-19564) MemtablePostFlush deadlock leads to stuck nodes and crashes
[ https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838525#comment-17838525 ] Benedict Elliott Smith commented on CASSANDRA-19564: The {{isBlocking}} flag is what indicates that you can skip the memtable allocator limit checks. The earliest possible {{OpOrder.Group}} (so walking the {{prev}} links until there are no more) is the one that will be stopping progress. If you can upload / send a jstack dump while the node is locked up I can _probably_ diagnose it. > MemtablePostFlush deadlock leads to stuck nodes and crashes > --- > > Key: CASSANDRA-19564 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19564 > Project: Cassandra > Issue Type: Bug > Components: Local/Compaction, Local/Memtable >Reporter: Jon Haddad >Priority: Urgent > Fix For: 4.1.x > > Attachments: image-2024-04-16-11-55-54-750.png, > image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png, > image-2024-04-16-13-53-24-455.png, image-2024-04-17-18-46-29-474.png, > image-2024-04-17-19-13-06-769.png, image-2024-04-17-19-14-34-344.png > > > I've run into an issue on a 4.1.4 cluster where an entire node has locked up > due to what I believe is a deadlock in memtable flushing. Here's what I know > so far. I've stitched together what happened based on conversations, logs, > and some flame graphs. > *Log reports memtable flushing* > The last successful flush happens at 12:19. > {noformat} > INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 > AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', > ColumnFamily='version') to free up room. 
Used total: 0.24/0.33, live: > 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15 > INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 > - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB > (13%) on-heap, 790.606MiB (15%) off-heap > {noformat} > *MemtablePostFlush appears to be blocked* > At this point, MemtablePostFlush completed tasks stops incrementing, active > stays at 1 and pending starts to rise. > {noformat} > MemtablePostFlush 1 1 3446 0 0 > {noformat} > > The flame graph reveals that PostFlush.call is stuck. I don't have the line > number, but I know we're stuck in > {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual > below: > *!image-2024-04-16-13-43-11-064.png!* > *Memtable flushing is now blocked.* > All MemtableFlushWriter threads are Parked waiting on > {{{}OpOrder.Barrier.await{}}}. A wall clock profile of 30s reveals all time > is spent here. Presumably we're waiting on the single threaded Post Flush. > !image-2024-04-16-12-29-15-386.png! > *Memtable allocations start to block* > Eventually it looks like the NativeAllocator stops successfully allocating > memory. I assume it's waiting on memory to be freed, but since memtable > flushes are blocked, we wait indefinitely. > Looking at a wall clock flame graph, all writer threads have reached the > allocation failure path of {{MemtableAllocator.allocate()}}. I believe we're > waiting on {{signal.awaitThrowUncheckedOnInterrupt()}} > {noformat} > MutationStage 48 828425 980253369 0 0{noformat} > !image-2024-04-16-11-55-54-750.png! > > *Compaction Stops* > Since we write to the compaction history table, and that requires memtables, > compactions are now blocked as well. > > !image-2024-04-16-13-53-24-455.png! > > The node is now doing basically nothing and must be restarted. 
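The comment on this ticket about finding the earliest {{OpOrder.Group}} by walking the {{prev}} links amounts to a linked-list traversal. A minimal sketch follows, assuming a node with only an `id` and a `prev` reference; the `Group` class below is a hypothetical stand-in for inspecting objects in a heap dump, not the real `OpOrder.Group`.

```java
/** Hypothetical sketch of the traversal; Group here is a stand-in, not OpOrder.Group. */
final class OldestGroupSketch {
    /** Only the fields needed to illustrate the walk. */
    static final class Group {
        final long id;
        final Group prev; // the next-older group, or null if none remains linked

        Group(long id, Group prev) {
            this.id = id;
            this.prev = prev;
        }
    }

    /** Follows prev links until there are no more; the result is the oldest linked group. */
    static Group oldest(Group current) {
        Group g = current;
        while (g.prev != null)
            g = g.prev;
        return g;
    }

    public static void main(String[] args) {
        Group g1 = new Group(1, null);
        Group g2 = new Group(2, g1);
        Group g3 = new Group(3, g2);
        // Starting from the newest group, the walk ends at the group with id 1.
        System.out.println(oldest(g3).id);
    }
}
```

In a heap dump the same walk would be done by hand: start from any live group, follow `prev` until it is null, then look at which threads still reference that group and what they are blocked on.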
[jira] [Commented] (CASSANDRA-19564) MemtablePostFlush deadlock leads to stuck nodes and crashes
[ https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838376#comment-17838376 ] Benedict Elliott Smith commented on CASSANDRA-19564: Honestly a jstack output during the issue would probably be enough to spot a candidate issue. If you have one feel free to back channel it to me for a quick peek, in case I can easily spot something to dig into. > MemtablePostFlush deadlock leads to stuck nodes and crashes > --- > > Key: CASSANDRA-19564 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19564 > Project: Cassandra > Issue Type: Bug > Components: Local/Compaction, Local/Memtable >Reporter: Jon Haddad >Priority: Urgent > Fix For: 4.1.x > > Attachments: image-2024-04-16-11-55-54-750.png, > image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png, > image-2024-04-16-13-53-24-455.png > > > I've run into an issue on a 4.1.4 cluster where an entire node has locked up > due to what I believe is a deadlock in memtable flushing. Here's what I know > so far. I've stitched together what happened based on conversations, logs, > and some flame graphs. > *Log reports memtable flushing* > The last successful flush happens at 12:19. > {noformat} > INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 > AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', > ColumnFamily='version') to free up room. Used total: 0.24/0.33, live: > 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15 > INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 > - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB > (13%) on-heap, 790.606MiB (15%) off-heap > {noformat} > *MemtablePostFlush appears to be blocked* > At this point, MemtablePostFlush completed tasks stops incrementing, active > stays at 1 and pending starts to rise. > {noformat} > MemtablePostFlush 1 1 3446 0 0 > {noformat} > > The flame graph reveals that PostFlush.call is stuck. 
I don't have the line > number, but I know we're stuck in > {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual > below: > *!image-2024-04-16-13-43-11-064.png!* > *Memtable flushing is now blocked.* > All MemtableFlushWriter threads are Parked waiting on > {{{}OpOrder.Barrier.await{}}}. A wall clock profile of 30s reveals all time > is spent here. Presumably we're waiting on the single threaded Post Flush. > !image-2024-04-16-12-29-15-386.png! > *Memtable allocations start to block* > Eventually it looks like the NativeAllocator stops successfully allocating > memory. I assume it's waiting on memory to be freed, but since memtable > flushes are blocked, we wait indefinitely. > Looking at a wall clock flame graph, all writer threads have reached the > allocation failure path of {{MemtableAllocator.allocate()}}. I believe we're > waiting on {{signal.awaitThrowUncheckedOnInterrupt()}} > {noformat} > MutationStage 48 828425 980253369 0 0{noformat} > !image-2024-04-16-11-55-54-750.png! > > *Compaction Stops* > Since we write to the compaction history table, and that requires memtables, > compactions are now blocked as well. > > !image-2024-04-16-13-53-24-455.png! > > The node is now doing basically nothing and must be restarted. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19564) MemtablePostFlush deadlock leads to stuck nodes and crashes
[ https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838249#comment-17838249 ] Benedict Elliott Smith commented on CASSANDRA-19564: Is the Flush that is blocked the one that the postFlush is waiting on? You can check this from a heap dump. If it is, the question is why the writeBarrier it has issued doesn't complete - any write that is behind such an issued barrier should be clear to complete without blocking. In which case we have perhaps introduced some new blocking mechanism that sits behind the completion of the barrier that depends on the barrier itself finishing. This should also be apparent from a heap dump, from which you can find the OpOrder that haven't completed, and which threads are holding a reference to it and what they are blocking on. > MemtablePostFlush deadlock leads to stuck nodes and crashes > --- > > Key: CASSANDRA-19564 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19564 > Project: Cassandra > Issue Type: Bug > Components: Local/Compaction, Local/Memtable >Reporter: Jon Haddad >Priority: Urgent > Fix For: 4.1.x > > Attachments: image-2024-04-16-11-55-54-750.png, > image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png, > image-2024-04-16-13-53-24-455.png > > > I've run into an issue on a 4.1.4 cluster where an entire node has locked up > due to what I believe is a deadlock in memtable flushing. Here's what I know > so far. I've stitched together what happened based on conversations, logs, > and some flame graphs. > *Log reports memtable flushing* > The last successful flush happens at 12:19. > {noformat} > INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 > AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', > ColumnFamily='version') to free up room. 
Used total: 0.24/0.33, live: > 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15 > INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 > - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB > (13%) on-heap, 790.606MiB (15%) off-heap > {noformat} > *MemtablePostFlush appears to be blocked* > At this point, MemtablePostFlush completed tasks stops incrementing, active > stays at 1 and pending starts to rise. > {noformat} > MemtablePostFlush 1 1 3446 0 0 > {noformat} > > The flame graph reveals that PostFlush.call is stuck. I don't have the line > number, but I know we're stuck in > {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual > below: > *!image-2024-04-16-13-43-11-064.png!* > *Memtable flushing is now blocked.* > All MemtableFlushWriter threads are Parked waiting on > {{{}OpOrder.Barrier.await{}}}. A wall clock profile of 30s reveals all time > is spent here. Presumably we're waiting on the single threaded Post Flush. > !image-2024-04-16-12-29-15-386.png! > *Memtable allocations start to block* > Eventually it looks like the NativeAllocator stops successfully allocating > memory. I assume it's waiting on memory to be freed, but since memtable > flushes are blocked, we wait indefinitely. > Looking at a wall clock flame graph, all writer threads have reached the > allocation failure path of {{MemtableAllocator.allocate()}}. I believe we're > waiting on {{signal.awaitThrowUncheckedOnInterrupt()}} > {noformat} > MutationStage 48 828425 980253369 0 0{noformat} > !image-2024-04-16-11-55-54-750.png! > > *Compaction Stops* > Since we write to the compaction history table, and that requires memtables, > compactions are now blocked as well. > > !image-2024-04-16-13-53-24-455.png! > > The node is now doing basically nothing and must be restarted. 
[jira] [Commented] (CASSANDRA-19308) Accord: Avoid maintaining separate FULL history; read the system table for mapReduce over command deps
[ https://issues.apache.org/jira/browse/CASSANDRA-19308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821599#comment-17821599 ] Benedict Elliott Smith commented on CASSANDRA-19308: CASSANDRA-19310 likely makes this unnecessary at least for key transactions, as dependencies are now efficiently represented in CommandsForKey, and there is likely little to gain. > Accord: Avoid maintaining separate FULL history; read the system table for > mapReduce over command deps > -- > > Key: CASSANDRA-19308 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19308 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Priority: Normal > > The FULL deps history is costly to maintain and to read. It is only used for > transaction recovery, and we can implement it by reading the accord system > table directly to fetch the deps of each transaction we find in the basic > deps history. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-19310) Accord: More efficient CommandsForKey with transitive dependency elision
[ https://issues.apache.org/jira/browse/CASSANDRA-19310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith reassigned CASSANDRA-19310: -- Assignee: Benedict Elliott Smith > Accord: More efficient CommandsForKey with transitive dependency elision > > > Key: CASSANDRA-19310 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19310 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > We currently depend on state GC for dependency pruning, but we can prune > dependencies directly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19310) Accord: More efficient CommandsForKey with transitive dependency elision
[ https://issues.apache.org/jira/browse/CASSANDRA-19310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-19310: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > Accord: More efficient CommandsForKey with transitive dependency elision > > > Key: CASSANDRA-19310 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19310 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Priority: Normal > > We currently depend on state GC for dependency pruning, but we can prune > dependencies directly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19310) Accord: More efficient CommandsForKey and transitive dependency elision
[ https://issues.apache.org/jira/browse/CASSANDRA-19310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-19310: --- Summary: Accord: More efficient CommandsForKey and transitive dependency elision (was: Accord: Dependency pruning) > Accord: More efficient CommandsForKey and transitive dependency elision > --- > > Key: CASSANDRA-19310 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19310 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Priority: Normal > > We currently depend on state GC for dependency pruning, but we can prune > dependencies directly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19310) Accord: More efficient CommandsForKey with transitive dependency elision
[ https://issues.apache.org/jira/browse/CASSANDRA-19310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-19310: --- Summary: Accord: More efficient CommandsForKey with transitive dependency elision (was: Accord: More efficient CommandsForKey and transitive dependency elision) > Accord: More efficient CommandsForKey with transitive dependency elision > > > Key: CASSANDRA-19310 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19310 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Priority: Normal > > We currently depend on state GC for dependency pruning, but we can prune > dependencies directly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19305) Accord: Fast single-partition reads
[ https://issues.apache.org/jira/browse/CASSANDRA-19305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-19305: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > Accord: Fast single-partition reads > --- > > Key: CASSANDRA-19305 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19305 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Priority: Normal > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Introduce guaranteed 1RT single-partition reads with no transaction metadata -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19359) Accord: Never recover read-only transactions; simply invalidate
Benedict Elliott Smith created CASSANDRA-19359: -- Summary: Accord: Never recover read-only transactions; simply invalidate Key: CASSANDRA-19359 URL: https://issues.apache.org/jira/browse/CASSANDRA-19359 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith Read-only transactions do not need to be recovered to supply client responses or for other transactions to make progress. The only situation that might require a read to be recovered is the recovery of a write transaction that needs to know whether or not the read might have witnessed it at a specific `executeAt`. This can be special-cased, either to run recovery in this circumstance, or to simply compute the necessary recovery information to decide whether it is possible or not. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19358) Accord: AccordBootstrapTest hangs because topology fetching appears to stall
Benedict Elliott Smith created CASSANDRA-19358: -- Summary: Accord: AccordBootstrapTest hangs because topology fetching appears to stall Key: CASSANDRA-19358 URL: https://issues.apache.org/jira/browse/CASSANDRA-19358 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith This likely means there is some serious progress issue with topology fetching. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19357) Accord: Harden Node.Id handling: graceful restart for left nodes, and ensure don’t cause problems with IP reuse
Benedict Elliott Smith created CASSANDRA-19357: -- Summary: Accord: Harden Node.Id handling: graceful restart for left nodes, and ensure don’t cause problems with IP reuse Key: CASSANDRA-19357 URL: https://issues.apache.org/jira/browse/CASSANDRA-19357 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith We rely on TCM for mapping node.id to replicas/IPs, but TCM is not Accord-epoch-aware, so it might erase a mapping before Accord is finished with it (and so, after a reboot Accord may not be able to find it again), but also might permit an IP to be re-used for a new Node.Id when Accord is still using it for an older epoch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19356) Accord: Range transaction state indexing / caching
Benedict Elliott Smith created CASSANDRA-19356: -- Summary: Accord: Range transaction state indexing / caching Key: CASSANDRA-19356 URL: https://issues.apache.org/jira/browse/CASSANDRA-19356 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith Range transactions are kept entirely in-memory at present. This is fine so long as we only use them for book-keeping and they do not exist too long, but runs the risk of OOM if cleanup doesn't excise them for some reason. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19355) Accord: PreLoadContext must properly and consistently support ranges
Benedict Elliott Smith created CASSANDRA-19355: -- Summary: Accord: PreLoadContext must properly and consistently support ranges Key: CASSANDRA-19355 URL: https://issues.apache.org/jira/browse/CASSANDRA-19355 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith There are some mechanisms for ensuring range transactions are loaded for range transactions, but they do not currently work properly (having several race conditions), are potentially costly in terms of memory consumption, and are inconsistent with how they work for key transactions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19354) Accord: Integrate speculative retry with Accord’s slow read mechanism
Benedict Elliott Smith created CASSANDRA-19354: -- Summary: Accord: Integrate speculative retry with Accord’s slow read mechanism Key: CASSANDRA-19354 URL: https://issues.apache.org/jira/browse/CASSANDRA-19354 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-19349) Timeuuid compare is broken
[ https://issues.apache.org/jira/browse/CASSANDRA-19349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812787#comment-17812787 ] Benedict Elliott Smith edited comment on CASSANDRA-19349 at 1/31/24 4:27 PM: - There is something weird going on here but it has been this way for a very long time - since UUIDs were introduced in fact (at the CQL layer, I mean), I think. Essentially the storage layer's {{compareCustom}} is not consistent with plain object comparison. This doesn't appear to be documented, but I don't think in practice this is a problem. was (Author: benedict): There is something weird going on here but it has been this way for a very long time - since the TimeUUID type (at the CQL layer, I mean) was introduced in fact, I think. Essentially the storage layer's {{compareCustom}} is not consistent with plain object comparison. This doesn't appear to be documented, but I don't think in practice this is a problem. > Timeuuid compare is broken > -- > > Key: CASSANDRA-19349 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19349 > Project: Cassandra > Issue Type: Bug >Reporter: Andreas Mager >Priority: Normal > > {{I have stumbled over a wired problem on my pc.}} > {{When i turn on my wifi interface, then some of my integration test are > failing.}} > {{The mac part(lsb) of the timeuuids become changed in our Uuid > implementation.}} > {{These uuids are used for the cassandra insertions and queries.}} > > {{TestSetup with "broken" Uuids:}} > {code:java} > CREATE TABLE object_comment ( > object timeuuid, > comment timeuuid, > value blob, > PRIMARY KEY (object, comment) > ) > INSERT INTO object_comment (object, comment , value) VALUES > (95278adc-c03f-11ee-ab43-bb35e932d536, cf9e6440-c01e-11ee-847b-34cff6b1be80, > 0x01); > INSERT INTO object_comment (object, comment , value) VALUES > (95278adc-c03f-11ee-ab43-bb35e932d536, cf9f75b0-c01e-11ee-847b-34cff6b1be80, > 0x02); > // cf9f75b0-c01e-11ee-847b-34cff6b1be7f is lsb-1 and the same 
timestamp > SELECT * FROM object_comment where object = > 95278adc-c03f-11ee-ab43-bb35e932d536 AND comment <= > cf9f75b0-c01e-11ee-847b-34cff6b1be7f; object | > comment | value > --+--+--- > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9e6440-c01e-11ee-847b-34cff6b1be80 > | 0x01 > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9f75b0-c01e-11ee-847b-34cff6b1be80 > | 0x02(2 rows) > {code} > > > The second row must not be present. The Only row expected is : > {code:java} > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9e6440-c01e-11ee-847b-34cff6b1be80 | > 0x01{code} > > I think i have found the cause of the issue. > The Methods `org.apache.cassandra.utils.TimeUUID#compareTo` and > `org.apache.cassandra.db.marshal.TimeUUIDType#compareCustom` return different > results. > Test pseudocode: > {code:java} > var id = UUID.fromString("cf9f75b0-c01e-11ee-847b-34cff6b1be80"); > var idDecrementInLsb = > UUID.fromString("cf9f75b0-c01e-11ee-847b-34cff6b1be7f"); > // java.util.UUID#compareTo > assertThat(idDecrementInLsb.compareTo(id)).isEqualTo(-1); > var timeUuidDec = > org.apache.cassandra.utils.TimeUUID.fromUuid(idDecrementInLsb); > var timeUuidId = org.apache.cassandra.utils.TimeUUID.fromUuid(id); > // org.apache.cassandra.utils.TimeUUID#compareTo > assertThat(timeUuidDec.compareTo(timeUuidId)).isEqualTo(-1); > // org.apache.cassandra.db.marshal.TimeUUIDType.compareCustom > assertThat(org.apache.cassandra.db.marshal.TimeUUIDType.compareCustom(idDecrementInLsb, > id1)).isEqualTo(-1); // This fails > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
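The inconsistency described in CASSANDRA-19349 can be illustrated without Cassandra on the classpath. The sketch below is an assumption about the kind of ordering involved, not a copy of TimeUUIDType#compareCustom: it compares the low 64 bits byte by byte, treating each byte as signed, and contrasts that with the signed-long comparison that java.util.UUID#compareTo applies. The class and method names are hypothetical.

```java
import java.util.UUID;

class TimeUuidOrderingSketch {
    // Hypothetical helper: compare two 64-bit values byte by byte, most
    // significant byte first, treating every byte as a SIGNED value.
    // Assumption: this models a legacy storage-layer ordering.
    static int compareSignedBytes(long a, long b) {
        for (int shift = 56; shift >= 0; shift -= 8) {
            byte x = (byte) (a >>> shift);
            byte y = (byte) (b >>> shift);
            if (x != y)
                return x < y ? -1 : 1;
        }
        return 0;
    }

    public static void main(String[] args) {
        // The two UUIDs from the report: same timestamp, lsb differs by one.
        UUID id = UUID.fromString("cf9f75b0-c01e-11ee-847b-34cff6b1be80");
        UUID dec = UUID.fromString("cf9f75b0-c01e-11ee-847b-34cff6b1be7f");
        long idLsb = id.getLeastSignificantBits();
        long decLsb = dec.getLeastSignificantBits();

        // Whole-long comparison, as java.util.UUID#compareTo does for the lsb.
        System.out.println(Long.compare(decLsb, idLsb));        // prints -1
        // Signed per-byte comparison: last byte 0x7f (127) vs 0x80 (-128).
        System.out.println(compareSignedBytes(decLsb, idLsb));  // prints 1
    }
}
```

With the two UUIDs from the report, the whole-long comparison orders ...be7f before ...be80, while the signed per-byte comparison orders it after. If the storage layer uses an ordering of the second kind, that would be consistent with the second row matching `comment <= ...be7f`.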
[jira] [Comment Edited] (CASSANDRA-19349) Timeuuid compare is broken
[ https://issues.apache.org/jira/browse/CASSANDRA-19349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812787#comment-17812787 ] Benedict Elliott Smith edited comment on CASSANDRA-19349 at 1/31/24 4:27 PM: - There is something weird going on here but it has been this way for a very long time - since the TimeUUID type (at the CQL layer, I mean) was introduced in fact, I think. Essentially the storage layer's {{compareCustom}} is not consistent with plain object comparison. This doesn't appear to be documented, but I don't think in practice this is a problem. was (Author: benedict): There is something weird going on here but it has been this way for a very long time - since the TimeUUID type was introduced in fact, I think. Essentially the storage layer's {{compareCustom}} is not consistent with plain object comparison. This doesn't appear to be documented, but I don't think in practice this is a problem. > Timeuuid compare is broken > -- > > Key: CASSANDRA-19349 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19349 > Project: Cassandra > Issue Type: Bug >Reporter: Andreas Mager >Priority: Normal > > {{I have stumbled over a wired problem on my pc.}} > {{When i turn on my wifi interface, then some of my integration test are > failing.}} > {{The mac part(lsb) of the timeuuids become changed in our Uuid > implementation.}} > {{These uuids are used for the cassandra insertions and queries.}} > > {{TestSetup with "broken" Uuids:}} > {code:java} > CREATE TABLE object_comment ( > object timeuuid, > comment timeuuid, > value blob, > PRIMARY KEY (object, comment) > ) > INSERT INTO object_comment (object, comment , value) VALUES > (95278adc-c03f-11ee-ab43-bb35e932d536, cf9e6440-c01e-11ee-847b-34cff6b1be80, > 0x01); > INSERT INTO object_comment (object, comment , value) VALUES > (95278adc-c03f-11ee-ab43-bb35e932d536, cf9f75b0-c01e-11ee-847b-34cff6b1be80, > 0x02); > // cf9f75b0-c01e-11ee-847b-34cff6b1be7f is lsb-1 and the same timestamp > SELECT 
* FROM object_comment where object = > 95278adc-c03f-11ee-ab43-bb35e932d536 AND comment <= > cf9f75b0-c01e-11ee-847b-34cff6b1be7f; object | > comment | value > --+--+--- > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9e6440-c01e-11ee-847b-34cff6b1be80 > | 0x01 > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9f75b0-c01e-11ee-847b-34cff6b1be80 > | 0x02(2 rows) > {code} > > > The second row must not be present. The Only row expected is : > {code:java} > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9e6440-c01e-11ee-847b-34cff6b1be80 | > 0x01{code} > > I think i have found the cause of the issue. > The Methods `org.apache.cassandra.utils.TimeUUID#compareTo` and > `org.apache.cassandra.db.marshal.TimeUUIDType#compareCustom` return different > results. > Test pseudocode: > {code:java} > var id = UUID.fromString("cf9f75b0-c01e-11ee-847b-34cff6b1be80"); > var idDecrementInLsb = > UUID.fromString("cf9f75b0-c01e-11ee-847b-34cff6b1be7f"); > // java.util.UUID#compareTo > assertThat(idDecrementInLsb.compareTo(id)).isEqualTo(-1); > var timeUuidDec = > org.apache.cassandra.utils.TimeUUID.fromUuid(idDecrementInLsb); > var timeUuidId = org.apache.cassandra.utils.TimeUUID.fromUuid(id); > // org.apache.cassandra.utils.TimeUUID#compareTo > assertThat(timeUuidDec.compareTo(timeUuidId)).isEqualTo(-1); > // org.apache.cassandra.db.marshal.TimeUUIDType.compareCustom > assertThat(org.apache.cassandra.db.marshal.TimeUUIDType.compareCustom(idDecrementInLsb, > id1)).isEqualTo(-1); // This fails > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19349) Timeuuid compare is broken
[ https://issues.apache.org/jira/browse/CASSANDRA-19349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812787#comment-17812787 ] Benedict Elliott Smith commented on CASSANDRA-19349: There is something weird going on here but it has been this way for a very long time - since the TimeUUID type was introduced in fact, I think. Essentially the storage layer's {{compareCustom}} is not consistent with plain object comparison. This doesn't appear to be documented, but I don't think in practice this is a problem. > Timeuuid compare is broken > -- > > Key: CASSANDRA-19349 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19349 > Project: Cassandra > Issue Type: Bug >Reporter: Andreas Mager >Priority: Normal > > {{I have stumbled over a wired problem on my pc.}} > {{When i turn on my wifi interface, then some of my integration test are > failing.}} > {{The mac part(lsb) of the timeuuids become changed in our Uuid > implementation.}} > {{These uuids are used for the cassandra insertions and queries.}} > > {{TestSetup with "broken" Uuids:}} > {code:java} > CREATE TABLE object_comment ( > object timeuuid, > comment timeuuid, > value blob, > PRIMARY KEY (object, comment) > ) > INSERT INTO object_comment (object, comment , value) VALUES > (95278adc-c03f-11ee-ab43-bb35e932d536, cf9e6440-c01e-11ee-847b-34cff6b1be80, > 0x01); > INSERT INTO object_comment (object, comment , value) VALUES > (95278adc-c03f-11ee-ab43-bb35e932d536, cf9f75b0-c01e-11ee-847b-34cff6b1be80, > 0x02); > // cf9f75b0-c01e-11ee-847b-34cff6b1be7f is lsb-1 and the same timestamp > SELECT * FROM object_comment where object = > 95278adc-c03f-11ee-ab43-bb35e932d536 AND comment <= > cf9f75b0-c01e-11ee-847b-34cff6b1be7f; object | > comment | value > --+--+--- > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9e6440-c01e-11ee-847b-34cff6b1be80 > | 0x01 > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9f75b0-c01e-11ee-847b-34cff6b1be80 > | 0x02(2 rows) > {code} > > > The second row must not be present. 
The Only row expected is : > {code:java} > 95278adc-c03f-11ee-ab43-bb35e932d536 | cf9e6440-c01e-11ee-847b-34cff6b1be80 | > 0x01{code} > > I think i have found the cause of the issue. > The Methods `org.apache.cassandra.utils.TimeUUID#compareTo` and > `org.apache.cassandra.db.marshal.TimeUUIDType#compareCustom` return different > results. > Test pseudocode: > {code:java} > var id = UUID.fromString("cf9f75b0-c01e-11ee-847b-34cff6b1be80"); > var idDecrementInLsb = > UUID.fromString("cf9f75b0-c01e-11ee-847b-34cff6b1be7f"); > // java.util.UUID#compareTo > assertThat(idDecrementInLsb.compareTo(id)).isEqualTo(-1); > var timeUuidDec = > org.apache.cassandra.utils.TimeUUID.fromUuid(idDecrementInLsb); > var timeUuidId = org.apache.cassandra.utils.TimeUUID.fromUuid(id); > // org.apache.cassandra.utils.TimeUUID#compareTo > assertThat(timeUuidDec.compareTo(timeUuidId)).isEqualTo(-1); > // org.apache.cassandra.db.marshal.TimeUUIDType.compareCustom > assertThat(org.apache.cassandra.db.marshal.TimeUUIDType.compareCustom(idDecrementInLsb, > id1)).isEqualTo(-1); // This fails > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19323) Accord: table configuration
Benedict Elliott Smith created CASSANDRA-19323: -- Summary: Accord: table configuration Key: CASSANDRA-19323 URL: https://issues.apache.org/jira/browse/CASSANDRA-19323 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith We must be able to enable/disable Accord and specify various Accord settings at the table level via schema changes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19322) Accord: Fast path reconfiguration
Benedict Elliott Smith created CASSANDRA-19322: -- Summary: Accord: Fast path reconfiguration Key: CASSANDRA-19322 URL: https://issues.apache.org/jira/browse/CASSANDRA-19322 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith We must be able to provide configuration that decides the fast path based on the topology, and reconfigure the fast path in the event of outages -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19320) Accord: Metrics to detect stalled transactions or other problems
Benedict Elliott Smith created CASSANDRA-19320: -- Summary: Accord: Metrics to detect stalled transactions or other problems Key: CASSANDRA-19320 URL: https://issues.apache.org/jira/browse/CASSANDRA-19320 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith In order to detect faults in the transaction system or other issues, we must introduce metrics that expose potential issues promptly, such as stalled or failed transactions, failure to coordinate durability and cleanup state, etc -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19321) Accord: Command to mark replicas as “stale" for decommission
Benedict Elliott Smith created CASSANDRA-19321: -- Summary: Accord: Command to mark replicas as “stale" for decommission Key: CASSANDRA-19321 URL: https://issues.apache.org/jira/browse/CASSANDRA-19321 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith We must have an operator command for marking replicas as stale, so that the remaining replicas do not wait for them to coordinate their durability status and can continue to clean up their state. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19319) Accord: Developer journal replay debug feature
Benedict Elliott Smith created CASSANDRA-19319: -- Summary: Accord: Developer journal replay debug feature Key: CASSANDRA-19319 URL: https://issues.apache.org/jira/browse/CASSANDRA-19319 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith In order to assist debugging of faults in the transaction system, we must have a mechanism for replaying journals locally to understand how a CommandStore reached a given state. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19318) Accord: Virtual table functionality to modify current state of transactions, trigger various cleanup operations etc
Benedict Elliott Smith created CASSANDRA-19318: -- Summary: Accord: Virtual table functionality to modify current state of transactions, trigger various cleanup operations etc Key: CASSANDRA-19318 URL: https://issues.apache.org/jira/browse/CASSANDRA-19318 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith To assist operators in resolving issues with the transaction system, we must offer facilities for injecting state modifications, triggering various internal book-keeping operations, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19317) Accord: Virtual table to expose current state of transactions
Benedict Elliott Smith created CASSANDRA-19317: -- Summary: Accord: Virtual table to expose current state of transactions Key: CASSANDRA-19317 URL: https://issues.apache.org/jira/browse/CASSANDRA-19317 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith To assist operators and debugging of any faults in the transaction system we must expose as much internal information as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19316) Accord: De-duplicate and timeout reads/WaitingToApply
Benedict Elliott Smith created CASSANDRA-19316: -- Summary: Accord: De-duplicate and timeout reads/WaitingToApply Key: CASSANDRA-19316 URL: https://issues.apache.org/jira/browse/CASSANDRA-19316 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith Currently we can have infinitely many copies of read callbacks for the same transaction to the same recipient replica. This work can perhaps be merged with that to optimise FetchData callbacks, introducing an efficient global read callback. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19315) Accord: CommandStore rebalancing
Benedict Elliott Smith created CASSANDRA-19315: -- Summary: Accord: CommandStore rebalancing Key: CASSANDRA-19315 URL: https://issues.apache.org/jira/browse/CASSANDRA-19315 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith Currently we cannot internally re-shard gracefully within a node, and topology changes increase the number of internal shards. We may want to settle for some less-than-optimal approach that is easy to implement for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19314) Accord: SimpleProgressLog: make sure not too simple; probably at least page to system table as necessary
Benedict Elliott Smith created CASSANDRA-19314: -- Summary: Accord: SimpleProgressLog: make sure not too simple; probably at least page to system table as necessary Key: CASSANDRA-19314 URL: https://issues.apache.org/jira/browse/CASSANDRA-19314 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith The SPL is a core system for progress, and was only originally intended to be relied on as a reference implementation. However, we can modify it a little to make it satisfactory for the intended purpose. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19313) Accord: Reduce overhead of NotifyWaitingOn with a WaitingToExecute SaveStatus
Benedict Elliott Smith created CASSANDRA-19313: -- Summary: Accord: Reduce overhead of NotifyWaitingOn with a WaitingToExecute SaveStatus Key: CASSANDRA-19313 URL: https://issues.apache.org/jira/browse/CASSANDRA-19313 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith Long execution graphs can do a lot of duplicated work invoking {{NotifyWaitingOn}} repeatedly on a transaction that is already waiting on a dependent transaction to execute. This can easily be avoided by introducing a {{SaveStatus}} that indicates the transaction is actively managing its dependencies for execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19312) Accord: Introduce long-lived callbacks for progress to reduce overhead of repeated FetchData calls
Benedict Elliott Smith created CASSANDRA-19312: -- Summary: Accord: Introduce long-lived callbacks for progress to reduce overhead of repeated FetchData calls Key: CASSANDRA-19312 URL: https://issues.apache.org/jira/browse/CASSANDRA-19312 Project: Cassandra Issue Type: Improvement Components: Accord Reporter: Benedict Elliott Smith We currently poll actively on replicas waiting to hear news of a transaction their execution depends upon. We should instead register a long-lived callback at most once per peer, and periodically confirm, in batches, that the callbacks for our transactions are still registered. We can simultaneously make our callback management much less costly, by having a global callback manager that just tracks TxnId->Replica. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
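The global callback manager proposed in CASSANDRA-19312 could look roughly like the following. This is a minimal sketch under stated assumptions, not Accord's implementation: the class name, the use of String for a TxnId and int for a replica id, and all method names are hypothetical. It only shows the two properties the ticket asks for: at most one registration per transaction and peer, and a per-peer batch to re-confirm periodically.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CallbackRegistrySketch {
    // Hypothetical global map: txnId -> replicas holding a long-lived callback.
    private final Map<String, Set<Integer>> registered = new HashMap<>();

    // Returns true only when a new registration message would need to be sent,
    // so a callback is registered at most once per (txnId, replica) pair.
    boolean register(String txnId, int replica) {
        return registered.computeIfAbsent(txnId, k -> new HashSet<>()).add(replica);
    }

    // The batch of transactions whose callbacks should be re-confirmed with
    // one replica in a single periodic message (sorted for determinism).
    List<String> batchFor(int replica) {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : registered.entrySet())
            if (e.getValue().contains(replica))
                batch.add(e.getKey());
        Collections.sort(batch);
        return batch;
    }

    // Drop all callbacks for a transaction once news of it has arrived.
    void complete(String txnId) {
        registered.remove(txnId);
    }

    public static void main(String[] args) {
        CallbackRegistrySketch registry = new CallbackRegistrySketch();
        System.out.println(registry.register("txn-1", 2)); // first registration
        System.out.println(registry.register("txn-1", 2)); // duplicate, no new message
        registry.register("txn-2", 2);
        System.out.println(registry.batchFor(2));
    }
}
```

Keeping a single map keyed by transaction, rather than one callback object per request, is the design choice the ticket hints at with "just tracks TxnId->Replica".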
[jira] [Created] (CASSANDRA-19311) Accord: (Resource Consumption) DurableBefore should be shared between shards
Benedict Elliott Smith created CASSANDRA-19311:

Summary: Accord: (Resource Consumption) DurableBefore should be shared between shards
Key: CASSANDRA-19311
URL: https://issues.apache.org/jira/browse/CASSANDRA-19311
Project: Cassandra
Issue Type: Improvement
Reporter: Benedict Elliott Smith

{{DurableBefore}} is a fairly large structure, and is a cluster-universal concept. So a given node can share it between all of its {{CommandStore}} instances.
[jira] [Created] (CASSANDRA-19309) Accord: General performance investigation/improvement
Benedict Elliott Smith created CASSANDRA-19309:

Summary: Accord: General performance investigation/improvement
Key: CASSANDRA-19309
URL: https://issues.apache.org/jira/browse/CASSANDRA-19309
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith
[jira] [Created] (CASSANDRA-19310) Accord: Dependency pruning
Benedict Elliott Smith created CASSANDRA-19310:

Summary: Accord: Dependency pruning
Key: CASSANDRA-19310
URL: https://issues.apache.org/jira/browse/CASSANDRA-19310
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

We currently depend on state GC for dependency pruning, but we can prune dependencies directly.
[jira] [Created] (CASSANDRA-19308) Accord: Avoid maintaining separate FULL history; read the system table for mapReduce over command deps
Benedict Elliott Smith created CASSANDRA-19308:

Summary: Accord: Avoid maintaining separate FULL history; read the system table for mapReduce over command deps
Key: CASSANDRA-19308
URL: https://issues.apache.org/jira/browse/CASSANDRA-19308
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

The FULL deps history is costly to maintain and to read. It is only used for transaction recovery, and we can implement it by reading the accord system table directly to fetch the deps of each transaction we find in the basic deps history.
[jira] [Created] (CASSANDRA-19306) Accord: Introduce a "Medium path"
Benedict Elliott Smith created CASSANDRA-19306:

Summary: Accord: Introduce a "Medium path"
Key: CASSANDRA-19306
URL: https://issues.apache.org/jira/browse/CASSANDRA-19306
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

Accord transactions are currently either one or three round-trips. There is a _relatively_ simple modification to the protocol that permits two round-trip transactions if the coordinator's proposed timestamp is agreed on the slow path.
[jira] [Created] (CASSANDRA-19305) Accord: Fast single-partition reads
Benedict Elliott Smith created CASSANDRA-19305:

Summary: Accord: Fast single-partition reads
Key: CASSANDRA-19305
URL: https://issues.apache.org/jira/browse/CASSANDRA-19305
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

Introduce guaranteed 1RT single-partition reads with no transaction metadata.
[jira] [Created] (CASSANDRA-19304) Accord: General invariant improvements/validation/investigation
Benedict Elliott Smith created CASSANDRA-19304:

Summary: Accord: General invariant improvements/validation/investigation
Key: CASSANDRA-19304
URL: https://issues.apache.org/jira/browse/CASSANDRA-19304
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith
[jira] [Created] (CASSANDRA-19303) Accord: Address or triage all TODOs with priority >= ‘expected’ in cassandra-accord
Benedict Elliott Smith created CASSANDRA-19303:

Summary: Accord: Address or triage all TODOs with priority >= ‘expected’ in cassandra-accord
Key: CASSANDRA-19303
URL: https://issues.apache.org/jira/browse/CASSANDRA-19303
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith
[jira] [Created] (CASSANDRA-19302) Accord: Support for dropping keyspaces and table
Benedict Elliott Smith created CASSANDRA-19302:

Summary: Accord: Support for dropping keyspaces and table
Key: CASSANDRA-19302
URL: https://issues.apache.org/jira/browse/CASSANDRA-19302
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith
[jira] [Created] (CASSANDRA-19301) Accord: Support routing standard and range reads through Accord
Benedict Elliott Smith created CASSANDRA-19301:

Summary: Accord: Support routing standard and range reads through Accord
Key: CASSANDRA-19301
URL: https://issues.apache.org/jira/browse/CASSANDRA-19301
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

For compatibility with non-Accord transactions we should be able to transparently upgrade normal reads to reads serviced by Accord. Range reads can safely employ the fast 1RT read optimisation since they do not expect serializable consistency.
[jira] [Created] (CASSANDRA-19298) Accord: Deps.isEqualOrFuller is incorrect
Benedict Elliott Smith created CASSANDRA-19298:

Summary: Accord: Deps.isEqualOrFuller is incorrect
Key: CASSANDRA-19298
URL: https://issues.apache.org/jira/browse/CASSANDRA-19298
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

Deps may be considered equal or fuller if a Deps only has the same TxnId and Keys, when in fact some TxnId may cover different keys. However, any Deps associated with a given Commit Ballot that has been sliced correctly would satisfy this property safely with only the above checks.
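The difference between the two checks can be shown with a small model. This is a sketch under stated assumptions, not the real `Deps` class: dependencies are modelled as a map from a TxnId string to the set of keys it covers, and both method names below are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of Deps as TxnId -> covered keys; not the Accord data structure.
final class DepsSketch {
    // The weaker check described above: same TxnIds and same overall key set.
    static boolean sameTxnIdsAndKeys(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        return a.keySet().equals(b.keySet()) && allKeys(a).equals(allKeys(b));
    }

    // A stricter check: every TxnId in the baseline must cover at least the same keys.
    static boolean isEqualOrFuller(Map<String, Set<String>> candidate, Map<String, Set<String>> baseline) {
        for (Map.Entry<String, Set<String>> e : baseline.entrySet()) {
            Set<String> keys = candidate.get(e.getKey());
            if (keys == null || !keys.containsAll(e.getValue()))
                return false;
        }
        return true;
    }

    private static Set<String> allKeys(Map<String, Set<String>> deps) {
        Set<String> all = new HashSet<>();
        for (Set<String> keys : deps.values())
            all.addAll(keys);
        return all;
    }
}
```

For example, `{t1:[k1], t2:[k2]}` and `{t1:[k2], t2:[k1]}` pass the weaker check but fail the stricter one, because each TxnId covers a different key.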
[jira] [Created] (CASSANDRA-19297) Accord: RejectBefore must be up-to-date on joining nodes before ready to coordinate
Benedict Elliott Smith created CASSANDRA-19297:

Summary: Accord: RejectBefore must be up-to-date on joining nodes before ready to coordinate
Key: CASSANDRA-19297
URL: https://issues.apache.org/jira/browse/CASSANDRA-19297
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

The exclusive sync point used to join the shard will be known by a majority of the existing replicas, but in the event the quorum changes and the new replica has not recorded the exclusive sync point, this might in principle lead to failing to reject a TxnId that should be rejected.

Simple fix, but introduce tests to corroborate this issue, and see if we can reproduce it in the burn test.
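The failure mode can be sketched with a toy model. The assumptions are loud ones: `RejectBeforeSketch` is hypothetical, TxnIds are plain `long` values, and a single bound stands in for whatever per-range state the real RejectBefore keeps.

```java
// Hypothetical single-bound model; the real structure is not shown in this ticket.
final class RejectBeforeSketch {
    private long rejectBefore = Long.MIN_VALUE;   // no sync point recorded yet

    // Record an exclusive sync point; the bound only ever moves forward.
    void recordSyncPoint(long syncPointTxnId) {
        rejectBefore = Math.max(rejectBefore, syncPointTxnId);
    }

    // A TxnId ordered before the recorded sync point must be rejected.
    boolean shouldReject(long txnId) {
        return txnId < rejectBefore;
    }
}
```

A replica that has recorded sync point 100 rejects TxnId 50, while a freshly joined replica with no recorded bound would accept it, which is the gap the ticket describes.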
[jira] [Created] (CASSANDRA-19296) Accord: Improve and Document CoordinateShardDurable semantics
Benedict Elliott Smith created CASSANDRA-19296:

Summary: Accord: Improve and Document CoordinateShardDurable semantics
Key: CASSANDRA-19296
URL: https://issues.apache.org/jira/browse/CASSANDRA-19296
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

Firstly, CoordinateShardDurable should retry in future epochs if necessary. In principle this isn't a problem; the next CoordinateShardDurable should pick up where this one left off. But we should consider the logic very carefully, and anyway not leave dangling waits.

We should also carefully consider the special case where replicas are bootstrapping in the future and we are coordinating the shard durability. This replica should safely participate in the sync point, waiting for only the transactions it requires to be replicated to it. So this should also function as expected, but this should be tested and documented carefully.
[jira] [Updated] (CASSANDRA-19294) Accord: Remove concept of non-participating home keys
[ https://issues.apache.org/jira/browse/CASSANDRA-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict Elliott Smith updated CASSANDRA-19294:

Component/s: Accord

> Accord: Remove concept of non-participating home keys
>
> Key: CASSANDRA-19294
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19294
> Project: Cassandra
> Issue Type: Improvement
> Components: Accord
> Reporter: Benedict Elliott Smith
> Priority: Normal
>
> This concept causes a lot more trouble than it is worth, complicating a lot
> of logic particularly around state GC, and forbids coordinator-only members
> of the cluster.
[jira] [Created] (CASSANDRA-19295) Accord: Remove concept of covering() for PartialX; assume access to FullRoute for most behaviours
Benedict Elliott Smith created CASSANDRA-19295:

Summary: Accord: Remove concept of covering() for PartialX; assume access to FullRoute for most behaviours
Key: CASSANDRA-19295
URL: https://issues.apache.org/jira/browse/CASSANDRA-19295
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

This is a costly abstraction to compute, particularly as topologies grow, and only complicates the internal logic.
[jira] [Created] (CASSANDRA-19294) Accord: Remove concept of non-participating home keys
Benedict Elliott Smith created CASSANDRA-19294:

Summary: Accord: Remove concept of non-participating home keys
Key: CASSANDRA-19294
URL: https://issues.apache.org/jira/browse/CASSANDRA-19294
Project: Cassandra
Issue Type: Improvement
Reporter: Benedict Elliott Smith

This concept causes a lot more trouble than it is worth, complicating a lot of logic particularly around state GC, and forbids coordinator-only members of the cluster.
[jira] [Created] (CASSANDRA-19288) Accord: Asynchronous reads may be unsafe
Benedict Elliott Smith created CASSANDRA-19288:

Summary: Accord: Asynchronous reads may be unsafe
Key: CASSANDRA-19288
URL: https://issues.apache.org/jira/browse/CASSANDRA-19288
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

In principle we should invalidate asynchronous reads before they complete if the data they read may be invalid, but this anyway causes faults when we permit them to occur in accord-core. We can and perhaps should simply ensure the reads are issued against an sstable/memtable snapshot taken by the command store, as this is lower cost and more robust. Otherwise we should investigate what issue asynchronous reads cause.
[jira] [Updated] (CASSANDRA-19287) Accord: Ensure no storage timestamp clashes across Accord bootstrap
[ https://issues.apache.org/jira/browse/CASSANDRA-19287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict Elliott Smith updated CASSANDRA-19287:

Description:
At present bootstrap does not propagate the local metadata associated with the shard that is being bootstrapped. However, due to the many-to-one relation between Accord timestamps and sstable/C* timestamps it is possible for transactions with the same sstable timestamp to occur either side of a bootstrap for a single key.

We can resolve this by one of:
# Propagating the timestamp state from Accord system tables alongside bootstrap
# Making the relationship between timestamps 1:1, by
** assigning each replica in the cluster a range of timestamps to allocate for Accord transactions
** permitting timestamps larger than 8 bytes
# Preventing timestamp clashes across a SyncPoint

was:
At present bootstrap does not propagate the local metadata associated with the shard that is being bootstrapped. However, due to the many-to-one relation between Accord timestamps and sstable/C* timestamps it is possible for transactions with the same sstable timestamp to occur either side of a bootstrap for a single key.

We can resolve this by either

> Accord: Ensure no storage timestamp clashes across Accord bootstrap
>
> Key: CASSANDRA-19287
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19287
> Project: Cassandra
> Issue Type: Improvement
> Components: Accord
> Reporter: Benedict Elliott Smith
> Priority: Normal
>
> At present bootstrap does not propagate the local metadata associated with
> the shard that is being bootstrapped. However, due to the many-to-one
> relation between Accord timestamps and sstable/C* timestamps it is possible
> for transactions with the same sstable timestamp to occur either side of a
> bootstrap for a single key.
> We can resolve this by one of:
> # Propagating the timestamp state from Accord system tables alongside
> bootstrap
> # Making the relationship between timestamps 1:1, by
> ** assigning each replica in the cluster a range of timestamps to allocate
> for Accord transactions
> ** permitting timestamps larger than 8 bytes
> # Preventing timestamp clashes across a SyncPoint
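The many-to-one relation, and one way a per-replica allocation could make it 1:1, can be sketched as follows. Everything here is an assumption for illustration: an Accord timestamp is reduced to a non-negative clock value plus a small node index, the storage timestamp is an 8-byte `long`, and reserving the low `NODE_BITS` bits per replica is just one possible allocation scheme, not the one the ticket proposes.

```java
// Hypothetical encoding sketch; not how Accord or Cassandra derive timestamps.
final class TimestampSketch {
    static final int NODE_BITS = 10;   // assumed: at most 1024 replicas

    // Many-to-one: the node component is discarded, so two replicas clash.
    static long manyToOne(long clock, int node) {
        return clock;
    }

    // One-to-one: each replica owns a disjoint slot inside every clock tick.
    static long oneToOne(long clock, int node) {
        if (clock < 0)
            throw new IllegalArgumentException("clock must be non-negative");
        if (node < 0 || node >= (1 << NODE_BITS))
            throw new IllegalArgumentException("node out of range");
        // multiplyExact throws on overflow instead of silently wrapping
        return Math.multiplyExact(clock, 1L << NODE_BITS) | node;
    }
}
```

The trade-off in this sketch is that the reserved bits shrink the usable clock range, which is presumably why the description also lists timestamps larger than 8 bytes as an alternative.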
[jira] [Created] (CASSANDRA-19287) Accord: Ensure no storage timestamp clashes across Accord bootstrap
Benedict Elliott Smith created CASSANDRA-19287:

Summary: Accord: Ensure no storage timestamp clashes across Accord bootstrap
Key: CASSANDRA-19287
URL: https://issues.apache.org/jira/browse/CASSANDRA-19287
Project: Cassandra
Issue Type: Improvement
Components: Accord
Reporter: Benedict Elliott Smith

At present bootstrap does not propagate the local metadata associated with the shard that is being bootstrapped. However, due to the many-to-one relation between Accord timestamps and sstable/C* timestamps it is possible for transactions with the same sstable timestamp to occur either side of a bootstrap for a single key.

We can resolve this by either
[jira] [Comment Edited] (CASSANDRA-15464) Inserts to set slow due to AtomicBTreePartition for ComplexColumnData.dataSize
[ https://issues.apache.org/jira/browse/CASSANDRA-15464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798900#comment-17798900 ]

Benedict Elliott Smith edited comment on CASSANDRA-15464 at 12/20/23 10:18 AM:

I think it is likely to have been fixed by CASSANDRA-15511, although CASSANDRA-18125 did fix up accounting in this area in follow-up.

was (Author: benedict):
I think it is likely to have been fixed by CASSANDRA-15511

> Inserts to set slow due to AtomicBTreePartition for ComplexColumnData.dataSize
>
> Key: CASSANDRA-15464
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15464
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Core
> Reporter: Eric Jacobsen
> Priority: Normal
>
> Concurrent inserts to set can cause client timeouts and excessive CPU
> due to compare and swap in AtomicBTreePartition for
> ComplexColumnData.dataSize. As the length of the set gets longer, the
> probability of doing the compare decreases.
> The problem we saw in production was with insertions into a set with
> len(set) hundreds to thousands. Because of the semantics of what we
> store in the set, we had not anticipated the length being more than about 10.
> (Almost all rows have length <= 6, the largest observed was 7032. Total
> number of rows < 4000. 3 machines were used.)
> The bad behavior we saw was all machines went to 100% cpu on all cores, and
> clients were timing out. Our immediate solution in production was adding more
> machines (went from 3 machines to 6 machines). The stack included
> partitions.AtomicBTreePartition.addAllWithSizeDelta …
> ComplexColumnData.dataSize.
> The AtomicBTreePartition code uses a Compare And Swap approach, yet the time
> between compares is dependent on the length of the set. When the length of
> the set is long, with concurrent updates, each loop is unlikely to make
> forward progress and can be delayed looping.
> Here is one example call stack:
> {noformat}
> "SharedPool-Worker-40" #167 daemon prio=10 os_prio=0 tid=0x7f9bb4032800 nid=0x2ee5 runnable [0x7f9b067f4000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.cassandra.db.rows.ComplexColumnData.dataSize(ComplexColumnData.java:114)
>     at org.apache.cassandra.db.rows.BTreeRow.dataSize(BTreeRow.java:373)
>     at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:292)
>     at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:235)
>     at org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:159)
>     at org.apache.cassandra.utils.btree.TreeBuilder.update(TreeBuilder.java:73)
>     at org.apache.cassandra.utils.btree.BTree.update(BTree.java:181)
>     at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:155)
>     at org.apache.cassandra.db.Memtable.put(Memtable.java:254)
>     at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1204)
>     at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573)
>     at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:384)
>     at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205)
>     at org.apache.cassandra.hints.Hint.applyFuture(Hint.java:99)
>     at org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:95)
>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
>     at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
>     at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> In a test program to repro the problem, we raise the number of concurrent
> users and lower the think time between queries. Updating elements of
> low-length sets can occur without errors, and with long-length sets, clients
> time out with errors and there are periods with all cores 99.x% CPU and with
> jstack shows time going to ComplexColumnData.dataSize.
> Here is the schema. Our long term application solution was to just have the
> set elements be part of the primary key and avoid using set, thus
> guaranteeing the code does not go through ComplexColumnData.dataSize
> {noformat}
> CREATE TABLE x.x (
> x int PRIMARY KEY,
> y set ) ...
> {noformat}
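The contention pattern the reporter describes can be reproduced in miniature. The sketch below is not Cassandra code: `CasSetSketch` is a hypothetical copy-on-write container in which each compare-and-swap attempt does work proportional to the current size (a copy plus a size computation), so a lost race wastes more work as the collection grows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of an O(n)-per-attempt CAS loop; not AtomicBTreePartition.
final class CasSetSketch {
    private final AtomicReference<List<Integer>> ref =
            new AtomicReference<List<Integer>>(new ArrayList<Integer>());
    final AtomicLong attempts = new AtomicLong();   // successful plus wasted attempts
    final AtomicLong accountedBytes = new AtomicLong();

    void add(int value) {
        while (true) {
            List<Integer> current = ref.get();
            List<Integer> next = new ArrayList<Integer>(current);  // O(n) copy
            next.add(value);
            long size = dataSize(next);                            // O(n) accounting
            attempts.incrementAndGet();
            if (ref.compareAndSet(current, next)) {
                accountedBytes.set(size);
                return;
            }
            // Lost the race: the copy and the size computation are thrown away.
        }
    }

    // Walks every element on purpose, standing in for a per-cell size sum.
    static long dataSize(List<Integer> values) {
        long total = 0;
        for (int i = 0; i < values.size(); i++)
            total += Integer.BYTES;
        return total;
    }

    int size() {
        return ref.get().size();
    }

    // Runs several writers against one instance and returns the final size.
    static int runConcurrently(int threads, int perThread) {
        CasSetSketch set = new CasSetSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++)
                    set.add(i);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return set.size();
    }
}
```

Every element is eventually added, but under contention `attempts` can exceed the number of successful inserts, and each wasted attempt costs more as the list gets longer.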
[jira] [Created] (CASSANDRA-19045) Various Accord protocol fixes and improvements to validation
Benedict Elliott Smith created CASSANDRA-19045:

Summary: Various Accord protocol fixes and improvements to validation
Key: CASSANDRA-19045
URL: https://issues.apache.org/jira/browse/CASSANDRA-19045
Project: Cassandra
Issue Type: Improvement
Reporter: Benedict Elliott Smith

Improve validation, and address various faults discovered by the improved validation.
[jira] [Updated] (CASSANDRA-19045) Various Accord protocol fixes and improvements to validation
[ https://issues.apache.org/jira/browse/CASSANDRA-19045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict Elliott Smith updated CASSANDRA-19045:

Component/s: Accord

> Various Accord protocol fixes and improvements to validation
>
> Key: CASSANDRA-19045
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19045
> Project: Cassandra
> Issue Type: Improvement
> Components: Accord
> Reporter: Benedict Elliott Smith
> Priority: Normal
>
> Improve validation, and address various faults discovered by the improved
> validation.
[jira] [Comment Edited] (CASSANDRA-18988) Updating the column of a non-existent row in an Accord transaction results in Atomicity violation
[ https://issues.apache.org/jira/browse/CASSANDRA-18988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781661#comment-17781661 ]

Benedict Elliott Smith edited comment on CASSANDRA-18988 at 11/1/23 9:12 AM:

Thanks. I'll let [~maedhroz] figure out the shape of what we think should happen and how that relates to what happens today, and then perhaps this discussion can be taken on list. It is API impacting, and what we do today is correct - but the specifics of how this impacts e.g. row markers perhaps warrants discussion.

For instance, I might expect the result of the first operation to look like this:

 partition | account_id | balance
-----------+------------+---------
 default   |          0 |     100
 default   |          1 |      90
 default   |          3 |    null

was (Author: benedict):
Thanks. I'll let [~maedhroz] figure out the shape of what we think should happen, and perhaps this discussion can be taken on list, since it is API impacting and what we do today is correct - but the specifics of how this impacts e.g. row markers perhaps warrants discussion.

For instance, I might expect the result of the first operation to look like this:

 partition | account_id | balance
-----------+------------+---------
 default   |          0 |     100
 default   |          1 |      90
 default   |          3 |    null

> Updating the column of a non-existent row in an Accord transaction results in
> Atomicity violation
>
> Key: CASSANDRA-18988
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18988
> Project: Cassandra
> Issue Type: Bug
> Components: Accord
> Reporter: Luis E Fernandez
> Assignee: Caleb Rackliffe
> Priority: Normal
> Fix For: 5.x
>
> *System configuration and information:*
> Single node Cassandra with Accord transactions enabled running on docker
> Built from commit:
> [a7cd114435704b988c81f47ef53d0bfd6441f38b|https://github.com/apache/cassandra/commit/a7cd114435704b988c81f47ef53d0bfd6441f38b]
> CQLSH: [cqlsh 6.2.0 | Cassandra 5.0-alpha2-SNAPSHOT | CQL spec 3.4.7 | Native
> protocol v5]
>
> *Steps to reproduce in CQLSH:*
> {code:java}
> CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '1'} AND durable_writes = true;{code}
> {code:java}
> CREATE TABLE accord.accounts (
> partition text,
> account_id int,
> balance int,
> PRIMARY KEY (partition, account_id)
> ) WITH CLUSTERING ORDER BY (account_id ASC);
> {code}
> {code:java}
> BEGIN TRANSACTION
> INSERT INTO accord.accounts (partition, account_id, balance) VALUES
> ('default', 0, 100);
> INSERT INTO accord.accounts (partition, account_id, balance) VALUES
> ('default', 1, 100);
> COMMIT TRANSACTION;{code}
> atomicity bug happens after executing the following statement:
> Based on [Cassandra
> documentation|https://cassandra.apache.org/doc/4.1/cassandra/cql/dml.html#update-statement]
> regarding the use of UPDATE statements, I expect the result of this
> transaction to be the insertion of a new account ({ account_id: 3, balance:
> 10 }). The total balance across the three (3) accounts should be maintained
> (200). After executing the below transaction, the total number of accounts
> remains at two (2) and the total balance drops to 190. Basically, it appears
> as if only one half of the transaction proceeds.
> {code:java}
> BEGIN TRANSACTION
> UPDATE accord.accounts
> SET balance -= 10
> WHERE
> partition = 'default'
> AND account_id = 1;
> UPDATE accord.accounts
> SET balance += 10
> WHERE
> partition = 'default'
> AND account_id = 3;
> COMMIT TRANSACTION;{code}
> Bug / Error:
> ==
> The result of performing a table read after executing the buggy transaction
> is:
> {code:java}
>  partition | account_id | balance
> -----------+------------+---------
>  default   |          0 |     100
>  default   |          1 |      90
> {code}
> Note that the above transactions are not possible without a
> transaction block because only counter type columns can be updated with += or
> -= syntax in normal (non-transactional) cql statements. Using counter type
> columns also results in a separate, related bug:
> [CASSANDRA-18987|https://issues.apache.org/jira/browse/CASSANDRA-18987]
> This was found while testing Accord transactions with
> [~henrik.ingo] and team.
[jira] [Commented] (CASSANDRA-18988) Updating the column of a non-existent row in an Accord transaction results in Atomicity violation
[ https://issues.apache.org/jira/browse/CASSANDRA-18988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781534#comment-17781534 ] Benedict Elliott Smith commented on CASSANDRA-18988: Thanks for the report [~antithesis-luis]. [~maedhroz] can you take a look? I think that technically this outcome is correct: {{null}} + 10 == {{null}}. Whether a partition should be inserted for this implicit delete I don't know, but the result of this should certainly be {{null}}. It's worth taking a closer look at the semantics either way. [~antithesis-luis] can you confirm if you see the behaviour with {{UPDATE set balance = 10}}, rather than {{+= 10}}? This would be a more serious problem. > Updating the column of a non-existent row in an Accord transaction results in > Atomicity violation > - > > Key: CASSANDRA-18988 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18988 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Luis E Fernandez >Priority: Normal > Fix For: 5.x > > > *System configuration and information:* > Single node Cassandra with Accord transactions enabled running on docker > Built from commit: > [a7cd114435704b988c81f47ef53d0bfd6441f38b|https://github.com/apache/cassandra/commit/a7cd114435704b988c81f47ef53d0bfd6441f38b] > CQLSH: [cqlsh 6.2.0 | Cassandra 5.0-alpha2-SNAPSHOT | CQL spec 3.4.7 | Native > protocol v5] > > *Steps to reproduce in CQLSH:* > {code:java} > CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true;{code} > {code:java} > CREATE TABLE accord.accounts ( > partition text, > account_id int, > balance int, > PRIMARY KEY (partition, account_id) > ) WITH CLUSTERING ORDER BY (account_id ASC); > {code} > {code:java} > BEGIN TRANSACTION > INSERT INTO accord.accounts (partition, account_id, balance) VALUES > ('default', 0, 100); > INSERT INTO accord.accounts (partition, account_id, balance) VALUES > ('default', 1, 100); > COMMIT 
TRANSACTION;{code} > atomicity bug happens after executing the following statement: > Based on [Cassandra > documentation|https://cassandra.apache.org/doc/4.1/cassandra/cql/dml.html#update-statement] > regarding the use of UPDATE statements, I expect the result of this > transaction to be the insertion of a new account (\{ account_id: 3, balance: > 10 }). The total balance across the three (3) accounts should be maintained > (200). After executing the below transaction, the total number of accounts > remains at two (2) and the total balance drops to 190. Basically, it appears > as if only one half of the transaction proceeds. > {code:java} > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 3; > COMMIT TRANSACTION;{code} > Bug / Error: > == > The result of performing a table read after executing the buggy transaction > is: > {code:java} > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 90 > {code} > {color:#172b4d}Note that the above transactions are not possible without a > transaction block because only counter type columns can be updated with += or > -= syntax in normal (non-transactional) cql statements. Using counter type > columns also results in a separate, related bug: > [CASSANDRA-18987|https://issues.apache.org/jira/browse/CASSANDRA-18987]{color} > {color:#172b4d}This was found while testing Accord transactions with > [~henrik.ingo] and team.{color} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
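A note on the {{null}} + 10 == {{null}} argument in the comment above. The following is a minimal Python sketch of the reported transaction, assuming that arithmetic on a missing balance yields null and that writing a null balance leaves no visible row. The names null_add and apply_transaction and the dict-based table are hypothetical, for illustration only; this is a toy model, not how Cassandra or Accord evaluate updates.

```python
from typing import Dict, List, Optional, Tuple


def null_add(current: Optional[int], delta: int) -> Optional[int]:
    """Null-propagating addition: a missing operand makes the result missing."""
    return None if current is None else current + delta


def apply_transaction(table: Dict[int, int],
                      updates: List[Tuple[int, int]]) -> Dict[int, int]:
    """Apply every (account_id, delta) update to a copy of the table.

    Assumption: a null result is modelled as "no visible row". Whether a row
    marker should be written instead is the open question in the ticket.
    """
    result = dict(table)
    for account_id, delta in updates:
        new_balance = null_add(result.get(account_id), delta)
        if new_balance is None:
            result.pop(account_id, None)
        else:
            result[account_id] = new_balance
    return result


accounts = {0: 100, 1: 100}
# Mirrors the reported transaction: -10 on account 1, +10 on missing account 3.
after = apply_transaction(accounts, [(1, -10), (3, +10)])
total = sum(after.values())
```

Under this model both statements are applied, yet the visible result matches the table read in the report: two rows and a total of 190. That is consistent with the outcome being a question of null semantics rather than half of the transaction being skipped.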
[jira] [Comment Edited] (CASSANDRA-18987) Using counter column type in Accord transactions leads to Atomicity / Consistency violations
[ https://issues.apache.org/jira/browse/CASSANDRA-18987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781533#comment-17781533 ] Benedict Elliott Smith edited comment on CASSANDRA-18987 at 10/31/23 9:52 PM: -- Thanks for the report. Counter columns are inherently not transactional, and I don't know why they are permitted to be included in transactions. I assume it's an oversight. [~maedhroz] can you take a look? was (Author: benedict): Thanks for the report. Counter columns are inherently not transactional, and I don't know why they are permitted to be included in transactions. [~maedhroz] can you take a look? > Using counter column type in Accord transactions leads to Atomicity / > Consistency violations > > > Key: CASSANDRA-18987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18987 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Luis E Fernandez >Priority: Normal > Fix For: 5.x > > > *System configuration and information:* > Single node Cassandra with Accord transactions enabled running on docker > Built from commit: > [a7cd114435704b988c81f47ef53d0bfd6441f38b|https://github.com/apache/cassandra/commit/a7cd114435704b988c81f47ef53d0bfd6441f38b] > CQLSH: [cqlsh 6.2.0 | Cassandra 5.0-alpha2-SNAPSHOT | CQL spec 3.4.7 | Native > protocol v5] > > *Steps to reproduce in CQLSH:* > {code:java} > CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true;{code} > {code:java} > CREATE TABLE accord.accounts ( > partition text, > account_id int, > balance counter, > PRIMARY KEY (partition, account_id) > ) WITH CLUSTERING ORDER BY (account_id ASC); > {code} > {code:java} > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance += 100 > WHERE > partition = 'default' > AND account_id = 0; > UPDATE accord.accounts > SET balance += 100 > WHERE > partition = 'default' > AND account_id =1; > COMMIT TRANSACTION;{code} > bug happens after executing the following 
statement: > Based on [Cassandra > documentation|https://cassandra.apache.org/doc/trunk/cassandra/developing/cql/types.html#counters] > regarding the use of counters, I expect the following results: > Transaction A: subtract 10 from the balance of account 1 (total ending > balance of 90) and add 10 to the balance of account 0 (total ending balance > of 110) > {*}Bug A{*}: Neither account's balance is updated and the state of the rows > is left unchanged > {code:java} > /* Transaction A */ > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 0; > COMMIT TRANSACTION;{code} > Transaction B: subtract 10 from the balance of account 1 (total ending > balance of 90) and add 10 to the balance of a new account 2 (total ending > balance of 10) > {*}Bug B{*}: Only the new account 2 is created. The balance of account 1 is > left unchanged > {code:java} > /* Transaction B */ > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 2; > COMMIT TRANSACTION;{code} > Bug / Error: > == > The result of performing a table read after executing each buggy transaction > is: > {code:java} > /* Transaction / Bug A */ > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 100{code} > {code:java} > /* Transaction / Bug B */ > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 100 > default | 2 | 10 {code} > Note that performing the above statements without transaction blocks works as > expected. 
> {color:#172b4d}This was found while testing Accord transactions with > [~henrik.ingo] and team.{color} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18987) Using counter column type in Accord transactions leads to Atomicity / Consistency violations
[ https://issues.apache.org/jira/browse/CASSANDRA-18987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781533#comment-17781533 ] Benedict Elliott Smith commented on CASSANDRA-18987: Thanks for the report. Counter columns are inherently not transactional, and I don't know why they are permitted to be included in transactions. [~maedhroz] can you take a look? > Using counter column type in Accord transactions leads to Atomicity / > Consistency violations > > > Key: CASSANDRA-18987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18987 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Luis E Fernandez >Priority: Normal > Fix For: 5.x > > > *System configuration and information:* > Single node Cassandra with Accord transactions enabled running on docker > Built from commit: > [a7cd114435704b988c81f47ef53d0bfd6441f38b|https://github.com/apache/cassandra/commit/a7cd114435704b988c81f47ef53d0bfd6441f38b] > CQLSH: [cqlsh 6.2.0 | Cassandra 5.0-alpha2-SNAPSHOT | CQL spec 3.4.7 | Native > protocol v5] > > *Steps to reproduce in CQLSH:* > {code:java} > CREATE KEYSPACE accord WITH replication = {'class': 'SimpleStrategy', > 'replication_factor': '1'} AND durable_writes = true;{code} > {code:java} > CREATE TABLE accord.accounts ( > partition text, > account_id int, > balance counter, > PRIMARY KEY (partition, account_id) > ) WITH CLUSTERING ORDER BY (account_id ASC); > {code} > {code:java} > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance += 100 > WHERE > partition = 'default' > AND account_id = 0; > UPDATE accord.accounts > SET balance += 100 > WHERE > partition = 'default' > AND account_id =1; > COMMIT TRANSACTION;{code} > bug happens after executing the following statement: > Based on [Cassandra > documentation|https://cassandra.apache.org/doc/trunk/cassandra/developing/cql/types.html#counters] > regarding the use of counters, I expect the following results: > Transaction A: subtract 10 from the balance of account 1 
(total ending > balance of 90) and add 10 to the balance of account 0 (total ending balance > of 110) > {*}Bug A{*}: Neither account's balance is updated and the state of the rows > is left unchanged > {code:java} > /* Transaction A */ > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 0; > COMMIT TRANSACTION;{code} > Transaction B: subtract 10 from the balance of account 1 (total ending > balance of 90) and add 10 to the balance of a new account 2 (total ending > balance of 10) > {*}Bug B{*}: Only the new account 2 is created. The balance of account 1 is > left unchanged > {code:java} > /* Transaction B */ > BEGIN TRANSACTION > UPDATE accord.accounts > SET balance -= 10 > WHERE > partition = 'default' > AND account_id = 1; > UPDATE accord.accounts > SET balance += 10 > WHERE > partition = 'default' > AND account_id = 2; > COMMIT TRANSACTION;{code} > Bug / Error: > == > The result of performing a table read after executing each buggy transaction > is: > {code:java} > /* Transaction / Bug A */ > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 100{code} > {code:java} > /* Transaction / Bug B */ > partition | account_id | balance > ---++- > default | 0 | 100 > default | 1 | 100 > default | 2 | 10 {code} > Note that performing the above statements without transaction blocks works as > expected. > {color:#172b4d}This was found while testing Accord transactions with > [~henrik.ingo] and team.{color} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772546#comment-17772546 ]
Benedict Elliott Smith commented on CASSANDRA-18798:
Ok, so I have taken a quick look at the code, and I can see the problem. We have implemented an {{AccordUpdateParameters}} that:
1) sets the ClientState timestamp and nowInSec to 42, on the assumption that all updates will be computed on the replica side;
2) does not copy over the logic from CASUpdateParameters for ensuring list appends are performed correctly.
What I can say is that the time used for the cell path's TimeUUID definitely needs to be set deterministically. This could be set on the replicas using CommandsForKey's timestamp bounds, but it must handle the additional complexity of List appends a la CASUpdateParameters. If we are currently deriving these on the coordinator, we're going to be having a very bad time, as the coordinator seems to always use a timestamp of {{42}}.
This is another spot where I suspect we really want to update Accord to generate unique HLCs, as it would simplify this a great deal.
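To illustrate why the time behind a list cell's path needs to be deterministic, here is a rough Python sketch. It assumes a toy model in which each list element is stored under a (path_time, index) key; apply_append, read_list and the timestamp values are hypothetical and do not correspond to Cassandra or Accord code.

```python
from typing import Dict, List, Tuple

CellPath = Tuple[int, int]  # (time used for the cell path, index within the append)


def apply_append(cells: Dict[CellPath, int], path_time: int,
                 values: List[int]) -> Dict[CellPath, int]:
    """Return a copy of the list cells with `values` appended under `path_time`."""
    updated = dict(cells)
    for index, value in enumerate(values):
        updated[(path_time, index)] = value
    return updated


def read_list(cells: Dict[CellPath, int]) -> List[int]:
    """Elements are read back in cell-path order."""
    return [cells[path] for path in sorted(cells)]


# Non-deterministic: each replica derives the path time from its own clock,
# so the same append produces different cells on different replicas.
local_a = apply_append({}, 1000, [553])
local_b = apply_append({}, 1007, [553])
replicas_diverge = local_a != local_b

# Deterministic: both replicas use a time agreed for the transaction
# (hypothetical values), so they store identical cells and read the same order.
agreed_a = apply_append(apply_append({}, 2000, [553]), 2001, [455])
agreed_b = apply_append(apply_append({}, 2000, [553]), 2001, [455])
replicas_agree = agreed_a == agreed_b
```

In this model, only the variant that derives the path time from a value every replica agrees on leaves the replicas with identical list contents.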
> Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Assignee: Henrik Ingo >Priority: Normal > Attachments: image-2023-09-26-20-05-25-846.png > > > Given the following schema: > {code:java} > CREATE KEYSPACE IF NOT EXISTS accord WITH replication = {'class': > 'SimpleStrategy', 'replication_factor': 3}; > CREATE TABLE IF NOT EXISTS accord.list_append(id int PRIMARY KEY,contents > LIST); > TRUNCATE accord.list_append;{code} > And the following two possible queries executed by 10 threads in parallel: > {code:java} > BEGIN TRANSACTION > LET row = (SELECT * FROM list_append WHERE id = ?); > SELECT row.contents; > COMMIT TRANSACTION;" > BEGIN TRANSACTION > UPDATE list_append SET contents += ? WHERE id = ?; > COMMIT TRANSACTION;" > {code} > there seems to be an issue with transaction guarantees. Here's an excerpt in > the edn format from a test. 
> {code:java} > {:type :invoke :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607285967116627} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 54 > :time 1692607286078732473} > {:type :invoke :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286133833428} > {:type :invoke :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286149702511} > {:type :ok :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607286156314099} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 52 > :time 1692607286167090389} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352]]] :tid 1 :n 54 :time 1692607286168657534} > {:type :invoke :process 1 :value [[:r 5 nil]] :tid 0 :n 51 > :time 1692607286201762938} > {:type :ok :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286245571513} > {:type :invoke :process 7 :value [[:r 5 nil]] :tid 4 :n 56 > :time 1692607286245655775} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 455]]] :tid 9 :n 52 :time 1692607286253928906} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 53 > :time 1692607286254095215} > {:type :ok :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286266263422} > {:type :ok :process 1 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 0 :n 51 :time 1692607286271617955} > {:type :ok :process 7 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 
25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 4 :n 56 :time 1692607286271816933} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733
[jira] [Comment Edited] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759505#comment-17759505 ] Benedict Elliott Smith edited comment on CASSANDRA-18798 at 8/28/23 8:42 AM: - Either way this is a protocol bug, as if the insert by process 6 has a lower timestamp than the insert by process 7 then it should occur before, and so the read by process 5 should be deferred until the insert has completed. I won't spend time debugging this as a result, as we have several known protocol bugs that could cause this, that we have been deferring fixing until now (I plan to address over the next 2-3 weeks). If you have a simulator seed that produces this we could perhaps confirm which protocol bug if any might have caused this, as it is always nice to know which protocol bugs we have reproductions for via which routes. It's great to have some further external validation that these bugs can be found via this form of testing. was (Author: benedict): Either way this is a protocol bug, as if the insert by process 6 has a lower timestamp than the insert by process 7 then it should occur before, and so the read by process 5 should be deferred until the insert has completed. I won't spend time debugging this as a result, as we have several known protocol bugs that could cause this, that we have been deferring fixing until now (I plan to address over the next 2-3 weeks). If you have a simulator seed that produces this we could perhaps confirm which protocol bug if any might have caused this, as it is always nice to know which protocol bugs we have reproductions for via which routes. 
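The anomaly in the quoted excerpt can be checked mechanically. In an append-only list, any two successful reads of the same key should be prefix-compatible: the shorter one must be a prefix of the longer one. The Python sketch below applies that check to the tails of the reads of key 5 from the excerpt (one read ends in 352 455, later reads end in 352 553 455). The function names are hypothetical, and this is only a simplified illustration, not the checker used in the test.

```python
from typing import List, Tuple


def is_prefix(shorter: List[int], longer: List[int]) -> bool:
    """True if `shorter` equals the leading elements of `longer`."""
    return len(shorter) <= len(longer) and longer[:len(shorter)] == shorter


def find_prefix_violations(reads: List[List[int]]) -> List[Tuple[int, int]]:
    """Return index pairs of reads that cannot both be views of one append-only list."""
    violations = []
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            a, b = reads[i], reads[j]
            shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
            if not is_prefix(shorter, longer):
                violations.append((i, j))
    return violations


# Tails of the successful reads of key 5, in completion order:
# process 9, process 5, process 1, process 7.
reads_of_key_5 = [
    [352],
    [352, 455],
    [352, 553, 455],
    [352, 553, 455],
]
violations = find_prefix_violations(reads_of_key_5)
```

The read by process 5 (index 1) conflicts with both later reads, which matches the observation that it returned before the append of 553 was visible even though 553 ends up ordered before 455.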
> Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Priority: Normal > > Given the following schema: > {code:java} > CREATE KEYSPACE IF NOT EXISTS accord WITH replication = {'class': > 'SimpleStrategy', 'replication_factor': 3}; > CREATE TABLE IF NOT EXISTS accord.list_append(id int PRIMARY KEY,contents > LIST); > TRUNCATE accord.list_append;{code} > And the following two possible queries executed by 10 threads in parallel: > {code:java} > BEGIN TRANSACTION > LET row = (SELECT * FROM list_append WHERE id = ?); > SELECT row.contents; > COMMIT TRANSACTION;" > BEGIN TRANSACTION > UPDATE list_append SET contents += ? WHERE id = ?; > COMMIT TRANSACTION;" > {code} > there seems to be an issue with transaction guarantees. Here's an excerpt in > the edn format from a test. > {code:java} > {:type :invoke :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607285967116627} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 54 > :time 1692607286078732473} > {:type :invoke :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286133833428} > {:type :invoke :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286149702511} > {:type :ok :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607286156314099} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 52 > :time 1692607286167090389} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352]]] :tid 1 :n 54 :time 1692607286168657534} > {:type :invoke :process 1 :value [[:r 5 nil]] :tid 0 :n 51 > :time 1692607286201762938} > {:type :ok :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 
1692607286245571513} > {:type :invoke :process 7 :value [[:r 5 nil]] :tid 4 :n 56 > :time 1692607286245655775} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 455]]] :tid 9 :n 52 :time 1692607286253928906} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 53 > :time 1692607286254095215} > {:type :ok :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286266263422} > {:type :ok :process 1 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 0 :n 51 :time
[jira] [Commented] (CASSANDRA-18798) Appending to list in Accord transactions uses insertion timestamp
[ https://issues.apache.org/jira/browse/CASSANDRA-18798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759505#comment-17759505 ] Benedict Elliott Smith commented on CASSANDRA-18798: Either way this is a protocol bug, as if the insert by process 6 has a lower timestamp than the insert by process 7 then it should occur before, and so the read by process 5 should be deferred until the insert has completed. I won't spend time debugging this as a result, as we have several known protocol bugs that could cause this, that we have been deferring fixing until now (I plan to address over the next 2-3 weeks). If you have a simulator seed that produces this we could perhaps confirm which protocol bug if any might have caused this, as it is always nice to know which protocol bugs we have reproductions for via which routes. > Appending to list in Accord transactions uses insertion timestamp > - > > Key: CASSANDRA-18798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18798 > Project: Cassandra > Issue Type: Bug > Components: Accord >Reporter: Jaroslaw Kijanowski >Priority: Normal > > Given the following schema: > {code:java} > CREATE KEYSPACE IF NOT EXISTS accord WITH replication = {'class': > 'SimpleStrategy', 'replication_factor': 3}; > CREATE TABLE IF NOT EXISTS accord.list_append(id int PRIMARY KEY,contents > LIST); > TRUNCATE accord.list_append;{code} > And the following two possible queries executed by 10 threads in parallel: > {code:java} > BEGIN TRANSACTION > LET row = (SELECT * FROM list_append WHERE id = ?); > SELECT row.contents; > COMMIT TRANSACTION;" > BEGIN TRANSACTION > UPDATE list_append SET contents += ? WHERE id = ?; > COMMIT TRANSACTION;" > {code} > there seems to be an issue with transaction guarantees. Here's an excerpt in > the edn format from a test. 
> {code:java} > {:type :invoke :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607285967116627} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 54 > :time 1692607286078732473} > {:type :invoke :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286133833428} > {:type :invoke :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286149702511} > {:type :ok :process 8 :value [[:append 5 352]] :tid 3 :n 52 > :time 1692607286156314099} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 52 > :time 1692607286167090389} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352]]] :tid 1 :n 54 :time 1692607286168657534} > {:type :invoke :process 1 :value [[:r 5 nil]] :tid 0 :n 51 > :time 1692607286201762938} > {:type :ok :process 7 :value [[:append 5 455]] :tid 4 :n 55 > :time 1692607286245571513} > {:type :invoke :process 7 :value [[:r 5 nil]] :tid 4 :n 56 > :time 1692607286245655775} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 455]]] :tid 9 :n 52 :time 1692607286253928906} > {:type :invoke :process 5 :value [[:r 5 nil]] :tid 9 :n 53 > :time 1692607286254095215} > {:type :ok :process 6 :value [[:append 5 553]] :tid 5 :n 53 > :time 1692607286266263422} > {:type :ok :process 1 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 0 :n 51 :time 1692607286271617955} > {:type :ok :process 7 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 
25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 4 :n 56 :time 1692607286271816933} > {:type :ok :process 5 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19 623 22 425 24 926 25 832 130 733 430 533 29 933 333 > 537 934 538 740 139 744 938 544 42 646 749 242 546 547 548 753 450 150 349 48 > 852 352 553 455]]] :tid 9 :n 53 :time 1692607286281483026} > {:type :invoke :process 9 :value [[:r 5 nil]] :tid 1 :n 56 > :time 1692607286284097561} > {:type :ok :process 9 :value [[:r 5 [303 304 604 6 306 509 909 409 912 > 411 514 415 719 419 19
[jira] [Commented] (CASSANDRA-18355) CEP-15: Transaction Result Serialization Efficiency
[ https://issues.apache.org/jira/browse/CASSANDRA-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754713#comment-17754713 ]
Benedict Elliott Smith commented on CASSANDRA-18355:
So, it would also be nice in this patch to ensure we aren't double writing the transaction contents. We already persist any constant write values in the transaction, and don't need them to reconstruct their portion of the `Writes` - which for most cases will be the vast majority of a `Writes`.
So, really, instead of `Writes` we should be persisting only what we read from replicas that is necessary for computing the `Writes` from the local `PartialTxn`. Does that make sense?
> CEP-15: Transaction Result Serialization Efficiency
> ---
>
> Key: CASSANDRA-18355
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18355
> Project: Cassandra
> Issue Type: Improvement
> Components: Accord
> Reporter: Caleb Rackliffe
> Assignee: Caleb Rackliffe
> Priority: Normal
> Fix For: NA
>
> Time Spent: 2h
> Remaining Estimate: 0h
>
> There are two things we probably don’t need to serialize and write to the
> Accord state tables:
>
> 1.) Internal/external read responses
> 2.) The full result of the transaction
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14227) Extend maximum expiration date
[ https://issues.apache.org/jira/browse/CASSANDRA-14227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724840#comment-17724840 ] Benedict Elliott Smith commented on CASSANDRA-14227: Sorry, the downside of lots of Jira traffic (incl from GitHub comments) is that I don't check the email notifications for a high traffic ticket. I won't have time to look at the code soon, but I trust you to have addressed my concerns given what you describe above. Feel free to proceed. > Extend maximum expiration date > -- > > Key: CASSANDRA-14227 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14227 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Paulo Motta (Deprecated) >Assignee: Berenguer Blasi >Priority: Urgent > Fix For: 5.x > > Attachments: C14227 Perf check 2023.03.21.pdf, screenshot-1.png, > screenshot-2.png, screenshot-3.png, screenshot-4.png, unnamed-1.png > > > The maximum expiration timestamp that can be represented by the storage > engine is > 2038-01-19T03:14:06+00:00 due to the encoding of {{localExpirationTime}} as > an int32. > On CASSANDRA-14092 we added an overflow policy which rejects requests with > expiration above the maximum date as a temporary measure, but we should > remove this limitation by updating the storage engine to support at least the > maximum allowed TTL of 20 years. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
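For reference, the 2038 limit quoted in the description follows directly from storing the expiration time as int32 seconds since the Unix epoch. The Python sketch below reproduces the date, assuming that the largest int32 value is reserved as a sentinel (which would explain why the limit is one second before the usual 2^31 - 1 rollover) and that "20 years" means 20 * 365 days. Both are assumptions made for illustration, and the names are hypothetical.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1
# Assumption: INT32_MAX itself is reserved, so the largest usable value is one less.
MAX_EXPIRATION_SECONDS = INT32_MAX - 1
# Assumption: the maximum TTL of 20 years is counted as 20 * 365 days.
MAX_TTL_SECONDS = 20 * 365 * 24 * 60 * 60


def to_utc(epoch_seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def overflows(write_time_seconds: int, ttl_seconds: int) -> bool:
    """True if write time plus TTL cannot be represented as an expiration time."""
    return write_time_seconds + ttl_seconds > MAX_EXPIRATION_SECONDS


max_expiration = to_utc(MAX_EXPIRATION_SECONDS)
# Latest write time at which the maximum TTL still fits in the int32 encoding.
last_safe_write = to_utc(MAX_EXPIRATION_SECONDS - MAX_TTL_SECONDS)
```

Under these assumptions, any write with the maximum TTL issued after last_safe_write would exceed the representable range, which is the case the overflow policy from CASSANDRA-14092 rejects.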
[jira] [Commented] (CASSANDRA-18204) CEP-15: (C*) Add git submodule for Accord
[ https://issues.apache.org/jira/browse/CASSANDRA-18204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723270#comment-17723270 ] Benedict Elliott Smith commented on CASSANDRA-18204: We have this discussion roughly once per major. If you look back through dev@ you'll find the last one a few years back. I don't recall NA ever being the approved approach, though. ".x" lines are target versions, whereas concrete versions are the ones a fix landed in. There's always ambiguity over the next release, as it's sort of both. But since there is no 5.0 version, only 5.0-alphaN, 5.0-betaN and 5.0.0, perhaps 5.0 is the correct label. I forget what we landed upon last time. > CEP-15: (C*) Add git submodule for Accord > - > > Key: CASSANDRA-18204 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18204 > Project: Cassandra > Issue Type: Task > Components: Accord >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 5.0 > > Time Spent: 9h 50m > Remaining Estimate: 0h > > As talked about in dev@ thread "Intra-project dependencies”, we talked about > adding git submodules but before doing this had to work out a few issues > first; this ticket is to track this work. > Goals > * when checking out an older commit, or pulling in newer commits, the > submodule should also be updated automatically > * release artifact must include the submodule and must be able to build > without issue > * build.xml must be updated to build the submodule > * build.xml must be updated to release the submodule jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-18204) CEP-15: (C*) Add git submodule for Accord
[ https://issues.apache.org/jira/browse/CASSANDRA-18204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723270#comment-17723270 ] Benedict Elliott Smith edited comment on CASSANDRA-18204 at 5/16/23 7:59 PM: - We have this discussion roughly once per major. If you look back through dev@ you'll find the last one a few years back. I don't recall NA ever being the approved approach, though. ".x" lines are target versions, whereas concrete versions are the ones a fix landed in. There's always ambiguity over the next release, as it's sort of both. But since there is no 5.0 version, only 5.0-alphaN, 5.0-betaN and 5.0.0, perhaps 5.0 is the correct label (and makes sense to me). I forget what we landed upon last time. Work that has actually landed should probably be labelled as 5.0-alpha1 was (Author: benedict): We have this discussion roughly once per major. If you look back through dev@ you'll find the last one a few years back. I don't recall NA ever being the approved approach, though. ".x" lines are target versions, whereas concrete versions are the ones a fix landed in. There's always ambiguity over the next release, as it's sort of both. But since there is no 5.0 version, only 5.0-alphaN, 5.0-betaN and 5.0.0, perhaps 5.0 is the correct label. I forget what we landed upon last time. > CEP-15: (C*) Add git submodule for Accord > - > > Key: CASSANDRA-18204 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18204 > Project: Cassandra > Issue Type: Task > Components: Accord >Reporter: David Capwell >Assignee: David Capwell >Priority: Normal > Fix For: 5.0 > > Time Spent: 9h 50m > Remaining Estimate: 0h > > As talked about in dev@ thread "Intra-project dependencies”, we talked about > adding git submodules but before doing this had to work out a few issues > first; this ticket is to track this work. 
> Goals > * when checking out an older commit, or pulling in newer commits, the > submodule should also be updated automatically > * release artifact must include the submodule and must be able to build > without issue > * build.xml must be updated to build the submodule > * build.xml must be updated to release the submodule jar -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18523) CEP-15: (Accord) Join cluster without full transaction log
[ https://issues.apache.org/jira/browse/CASSANDRA-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-18523: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > CEP-15: (Accord) Join cluster without full transaction log > -- > > Key: CASSANDRA-18523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18523 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > Joining replicas should not require the full transaction history to > successfully start serving queries. This ticket introduces mechanisms for a > replica to join (or catch up) with a data snapshot and all transactions that > execute after that snapshot. This is a precursor for transaction state GC. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18523) CEP-15: (Accord) Join cluster without full transaction log
[ https://issues.apache.org/jira/browse/CASSANDRA-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith reassigned CASSANDRA-18523: -- Assignee: Benedict Elliott Smith > CEP-15: (Accord) Join cluster without full transaction log > -- > > Key: CASSANDRA-18523 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18523 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > Joining replicas should not require the full transaction history to > successfully start serving queries. This ticket introduces mechanisms for a > replica to join (or catch up) with a data snapshot and all transactions that > execute after that snapshot. This is a precursor for transaction state GC. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18175) CEP-15: (Accord) Introduce ExclusiveSyncPoint transactions
[ https://issues.apache.org/jira/browse/CASSANDRA-18175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith reassigned CASSANDRA-18175: -- Assignee: Benedict Elliott Smith > CEP-15: (Accord) Introduce ExclusiveSyncPoint transactions > -- > > Key: CASSANDRA-18175 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18175 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > Introduce a mechanism for invalidating older {{TxnId}}, so that a newly > bootstrapped node may have a complete log as of a point in time {{TxnId}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18175) CEP-15: (Accord) Introduce ExclusiveSyncPoint transactions
[ https://issues.apache.org/jira/browse/CASSANDRA-18175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-18175: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > CEP-15: (Accord) Introduce ExclusiveSyncPoint transactions > -- > > Key: CASSANDRA-18175 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18175 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > Introduce a mechanism for invalidating older {{TxnId}}, so that a newly > bootstrapped node may have a complete log as of a point in time {{TxnId}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18524) CEP-15: (Accord) Separate durable and transient listeners
[ https://issues.apache.org/jira/browse/CASSANDRA-18524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith reassigned CASSANDRA-18524: -- Assignee: Benedict Elliott Smith > CEP-15: (Accord) Separate durable and transient listeners > - > > Key: CASSANDRA-18524 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18524 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > Transient listeners should be handled differently and, ironically, should be > more "persistent" in that they should not disappear when we evict state from > cache. This patch separates listeners into `DurableAndIdempotent` and > `Transient` with the latter being saved in a shared global register that also > more easily permits us to ensure we do not invoke listeners redundantly (and > for listeners themselves to know if we have done so). This is also a stepping > stone to ensuring listeners survive cache eviction. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18524) CEP-15: (Accord) Separate durable and transient listeners
[ https://issues.apache.org/jira/browse/CASSANDRA-18524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-18524: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > CEP-15: (Accord) Separate durable and transient listeners > - > > Key: CASSANDRA-18524 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18524 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Priority: Normal > > Transient listeners should be handled differently and, ironically, should be > more "persistent" in that they should not disappear when we evict state from > cache. This patch separates listeners into `DurableAndIdempotent` and > `Transient` with the latter being saved in a shared global register that also > more easily permits us to ensure we do not invoke listeners redundantly (and > for listeners themselves to know if we have done so). This is also a stepping > stone to ensuring listeners survive cache eviction. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-18524) CEP-15: (Accord) Separate durable and transient listeners
Benedict Elliott Smith created CASSANDRA-18524: -- Summary: CEP-15: (Accord) Separate durable and transient listeners Key: CASSANDRA-18524 URL: https://issues.apache.org/jira/browse/CASSANDRA-18524 Project: Cassandra Issue Type: Improvement Reporter: Benedict Elliott Smith Transient listeners should be handled differently and, ironically, should be more "persistent" in that they should not disappear when we evict state from cache. This patch separates listeners into `DurableAndIdempotent` and `Transient` with the latter being saved in a shared global register that also more easily permits us to ensure we do not invoke listeners redundantly (and for listeners themselves to know if we have done so). This is also a stepping stone to ensuring listeners survive cache eviction. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-18523) CEP-15: (Accord) Join cluster without full transaction log
Benedict Elliott Smith created CASSANDRA-18523: -- Summary: CEP-15: (Accord) Join cluster without full transaction log Key: CASSANDRA-18523 URL: https://issues.apache.org/jira/browse/CASSANDRA-18523 Project: Cassandra Issue Type: Improvement Reporter: Benedict Elliott Smith Joining replicas should not require the full transaction history to successfully start serving queries. This ticket introduces mechanisms for a replica to join (or catch up) with a data snapshot and all transactions that execute after that snapshot. This is a precursor for transaction state GC. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18171) CEP-15: (Accord) Faster SimpleProgressLog and BurnTest
[ https://issues.apache.org/jira/browse/CASSANDRA-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith reassigned CASSANDRA-18171: -- Assignee: Benedict Elliott Smith > CEP-15: (Accord) Faster SimpleProgressLog and BurnTest > -- > > Key: CASSANDRA-18171 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18171 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > Some general efficiency improvements, most notably affecting > `SimpleProgressLog`, to manage the list of transactions we expect progress on > rather than polling all transactions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18174) CEP-15: (Accord/C*) Introduce range transactions
[ https://issues.apache.org/jira/browse/CASSANDRA-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith reassigned CASSANDRA-18174: -- Assignee: Benedict Elliott Smith > CEP-15: (Accord/C*) Introduce range transactions > > > Key: CASSANDRA-18174 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18174 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > Support range transactions in Accord, to facilitate bootstrap. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18172) CEP-15: (Accord/C*) Refactor Timestamp/TxnId
[ https://issues.apache.org/jira/browse/CASSANDRA-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-18172: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > CEP-15: (Accord/C*) Refactor Timestamp/TxnId > > > Key: CASSANDRA-18172 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18172 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Priority: Normal > > Reduce the amount of storage required for Timestamp and TxnId by compressing > epoch to 48 bits, and real/logical to a single 64-bit HLC, while also > supporting flag carrier bits for communicating protocol state information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18172) CEP-15: (Accord/C*) Refactor Timestamp/TxnId
[ https://issues.apache.org/jira/browse/CASSANDRA-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith reassigned CASSANDRA-18172: -- Assignee: Benedict Elliott Smith > CEP-15: (Accord/C*) Refactor Timestamp/TxnId > > > Key: CASSANDRA-18172 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18172 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > Reduce the amount of storage required for Timestamp and TxnId by compressing > epoch to 48 bits, and real/logical to a single 64-bit HLC, while also > supporting flag carrier bits for communicating protocol state information. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
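To make the sizes in the description concrete: a 48-bit epoch, a 64-bit HLC and a handful of flag bits fit into two 64-bit words if 16 bits are left over for flags. The sketch below shows one possible layout. The 16-bit flag width, the field order and all names are assumptions made for illustration; the ticket does not specify them.

```java
public final class PackedTimestamp
{
    public static final long MAX_EPOCH = (1L << 48) - 1;
    public static final int MAX_FLAGS = (1 << 16) - 1;

    // Assumed layout: msb = epoch (48 bits) | top 16 bits of hlc
    //                 lsb = low 48 bits of hlc | flags (16 bits)
    public final long msb;
    public final long lsb;

    public PackedTimestamp(long epoch, long hlc, int flags)
    {
        if (epoch < 0 || epoch > MAX_EPOCH)
            throw new IllegalArgumentException("epoch does not fit in 48 bits");
        if (flags < 0 || flags > MAX_FLAGS)
            throw new IllegalArgumentException("flags do not fit in 16 bits");
        this.msb = (epoch << 16) | (hlc >>> 48);
        this.lsb = (hlc << 16) | flags;
    }

    public long epoch() { return msb >>> 16; }

    // Reassemble the hlc from its top 16 bits (in msb) and low 48 bits (in lsb).
    public long hlc() { return (msb << 48) | (lsb >>> 16); }

    public int flags() { return (int) (lsb & MAX_FLAGS); }
}
```

Under these assumptions a timestamp occupies 128 bits, and the flag bits travel with the value without a separate field.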
[jira] [Updated] (CASSANDRA-18171) CEP-15: (Accord) Faster SimpleProgressLog and BurnTest
[ https://issues.apache.org/jira/browse/CASSANDRA-18171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-18171: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > CEP-15: (Accord) Faster SimpleProgressLog and BurnTest > -- > > Key: CASSANDRA-18171 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18171 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Priority: Normal > > Some general efficiency improvements, most notably affecting > `SimpleProgressLog`, to manage the list of transactions we expect progress on > rather than polling all transactions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18174) CEP-15: (Accord/C*) Introduce range transactions
[ https://issues.apache.org/jira/browse/CASSANDRA-18174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-18174: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > CEP-15: (Accord/C*) Introduce range transactions > > > Key: CASSANDRA-18174 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18174 > Project: Cassandra > Issue Type: Improvement > Components: Accord >Reporter: Benedict Elliott Smith >Priority: Normal > > Support range transactions in Accord, to facilitate bootstrap. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18173) CEP-15: (Accord/C*) Introduce RangeDeps
[ https://issues.apache.org/jira/browse/CASSANDRA-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-18173: --- Resolution: Fixed Status: Resolved (was: Triage Needed) > CEP-15: (Accord/C*) Introduce RangeDeps > --- > > Key: CASSANDRA-18173 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18173 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Priority: Normal > > In order to support range transactions, we must be able to separately manage > dependencies that cover ranges rather than specific keys. This patch splits > {{Deps}} into {{KeyDeps}} and {{RangeDeps}}, while introducing a new > {{SearchableRangeList}} structure for efficiently looking up range > intersections. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18173) CEP-15: (Accord/C*) Introduce RangeDeps
[ https://issues.apache.org/jira/browse/CASSANDRA-18173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith reassigned CASSANDRA-18173: -- Assignee: Benedict Elliott Smith > CEP-15: (Accord/C*) Introduce RangeDeps > --- > > Key: CASSANDRA-18173 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18173 > Project: Cassandra > Issue Type: Improvement >Reporter: Benedict Elliott Smith >Assignee: Benedict Elliott Smith >Priority: Normal > > In order to support range transactions, we must be able to separately manage > dependencies that cover ranges rather than specific keys. This patch splits > {{Deps}} into {{KeyDeps}} and {{RangeDeps}}, while introducing a new > {{SearchableRangeList}} structure for efficiently looking up range > intersections. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18484) FunctionCall can throw more specific exceptions
[ https://issues.apache.org/jira/browse/CASSANDRA-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717149#comment-17717149 ]

Benedict Elliott Smith commented on CASSANDRA-18484:

{{InvalidRequestException}} isn't a checked exception - it's a special case of {{RuntimeException}}

> FunctionCall can throw more specific exceptions
> ---
>
> Key: CASSANDRA-18484
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18484
> Project: Cassandra
> Issue Type: Bug
> Reporter: Hao Zhong
> Priority: Normal
>
> FunctionCall has the following code:
> {code:java}
> private static ByteBuffer executeInternal(ProtocolVersion protocolVersion, ScalarFunction fun, List<ByteBuffer> params) throws InvalidRequestException
> {
>     ByteBuffer result = fun.execute(protocolVersion, params);
>     try
>     {
>         // Check the method didn't lied on it's declared return type
>         if (result != null)
>             fun.returnType().validate(result);
>         return result;
>     }
>     catch (MarshalException e)
>     {
>         throw new RuntimeException(String.format("Return of function %s (%s) is not a valid value for its declared return type %s",
>                                                  fun, ByteBufferUtil.bytesToHex(result), fun.returnType().asCQL3Type()), e);
>     }
> }
> {code}
> When validate throws MarshalException, it rethrows RuntimeException. Other
> methods throw more specific exceptions. For example, BytesConversionFcts
> throws InvalidRequestException:
> {code:java}
> public ByteBuffer execute(ProtocolVersion protocolVersion, List<ByteBuffer> parameters)
> {
>     ByteBuffer val = parameters.get(0);
>     if (val != null)
>     {
>         try
>         {
>             toType.getType().validate(val);
>         }
>         catch (MarshalException e)
>         {
>             throw new InvalidRequestException(String.format("In call to function %s, value 0x%s is not a " +
>                                                             "valid binary representation for type %s",
>                                                             name, ByteBufferUtil.bytesToHex(val), toType));
>         }
>     }
>     return val;
> }
> {code}
> As another example, Validation also rethrows this exception:
> {code:java}
> public static void validateKey(TableMetadata metadata, ByteBuffer key)
> {
>     ...
>     try
>     {
>         metadata.partitionKeyType.validate(key);
>     }
>     catch (MarshalException e)
>     {
>         throw new InvalidRequestException(e.getMessage());
>     }
> }
> {code}
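The point made in the comment, that {{InvalidRequestException}} is itself an unchecked exception, can be illustrated with a small self-contained sketch. The two exception classes below are hypothetical stand-ins declared locally so the example compiles on its own; they are not the Cassandra classes, and the validator is invented for illustration.

```java
public class ValidationRethrow
{
    // Hypothetical stand-in for the validation failure named in the ticket.
    public static class MarshalException extends RuntimeException
    {
        public MarshalException(String message) { super(message); }
    }

    // Hypothetical stand-in: more specific than RuntimeException, but still unchecked.
    public static class InvalidRequestException extends RuntimeException
    {
        public InvalidRequestException(String message, Throwable cause) { super(message, cause); }
    }

    // Invented validator: accepts only 4-byte values.
    public static void validate(byte[] value)
    {
        if (value.length != 4)
            throw new MarshalException("expected 4 bytes, got " + value.length);
    }

    // Rethrow the validation failure as the more specific type, keeping the cause.
    // No "throws" clause is needed because the type is unchecked.
    public static byte[] checked(byte[] value)
    {
        try
        {
            validate(value);
            return value;
        }
        catch (MarshalException e)
        {
            throw new InvalidRequestException("invalid value: " + e.getMessage(), e);
        }
    }
}
```

A caller that catches {{RuntimeException}} still catches the more specific type, so switching to it changes what can be distinguished, not what must be declared.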
[jira] [Commented] (CASSANDRA-18470) Average of "decimal" values rounds the average if all inputs are integers
[ https://issues.apache.org/jira/browse/CASSANDRA-18470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714905#comment-17714905 ] Benedict Elliott Smith commented on CASSANDRA-18470: Oof, that is a pretty serious bug IMO, and probably deserves its own ticket. [~ifesdjeen], [~blambov], [~blerer]: this appears to have been introduced by CASSANDRA-12417, would any of you like to have a look? > Average of "decimal" values rounds the average if all inputs are integers > - > > Key: CASSANDRA-18470 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18470 > Project: Cassandra > Issue Type: Bug >Reporter: Nadav Har'El >Priority: Normal > > When running the AVG aggregator on "decimal" values, each value is an > arbitrary-precision number which may be an integer or fractional, but it is > expected that the average would be, in general, fractional. But it turns out > that if all the values are integer *without* a ".0", the aggregator sums them > up as integers and the final division returns an integer too instead of the > fractional response expected from a "decimal" value. > For example: > # AVG of {{decimal}} values 1.0 and 2.0 returns 1.5, as expected. > # AVG of 1.0 and 2 or 1 and 2.0 also return 1.5. > # But AVG of 1 and 2 returns... 1. This is wrong. The user asked for the > average to be a "decimal", not a "varint", so there is no reason why it > should be rounded up to be an integer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-18470) Average of "decimal" values rounds the average if all inputs are integers
[ https://issues.apache.org/jira/browse/CASSANDRA-18470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714573#comment-17714573 ] Benedict Elliott Smith edited comment on CASSANDRA-18470 at 4/20/23 12:35 PM: -- I think this is ambiguous to be honest. In general we have very inadequately both _considered_ and _documented_ our behaviour for these kinds of features and data types. However, it is not immediately obvious this behaviour is _incorrect_ since we do not ask the user to specify a level of precision of the output, and since we support arbitrary precision we have to make some decision based on the inputs, and in this case neither parameter has any fractional component, so the result is rounded to the same. There's an argument to be made that this is really inappropriate for an aggregation, as the order in which values occur in the aggregation affects the result. But I think the correct solution is probably to permit a precision to be provided with the operator. We could plausibly also pick a default precision that is non-zero, though this might constrain the precision below an acceptable level for some workloads. We could permit the user to configure a default precision for this operator, and/or use the default precision as a lower bound only. Probably our implementation is wrong, though, given this behaviour. It seems that we assume we have good precision and therefore recompute the average on each new datum, as opposed to maintaining a running sum and count. This would also solve the problem of the order of provision modifying the output. was (Author: benedict): I think this is ambiguous to be honest. In general we have very inadequately both _considered_ and _documented_ our behaviour for these kinds of features and data types. 
However, it is not immediately obvious this behaviour is _incorrect_ since we do not ask the user to specify a level of precision of the output, and since we support arbitrary precision we have to make some decision based on the inputs, and in this case neither parameter has any fractional component, so the result is rounded to the same. There's an argument to be made that this is really inappropriate for an aggregation, as the order in which values occur in the aggregation affects the result. But I think the correct solution is probably to permit a precision to be provided with the operator. We could plausibly also pick a default precision that is non-zero, though this might constrain the precision below an acceptable level for some workloads. We could permit the user to configure a default precision for this operator, and/or use the default precision as a lower bound only. > Average of "decimal" values rounds the average if all inputs are integers > - > > Key: CASSANDRA-18470 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18470 > Project: Cassandra > Issue Type: Bug >Reporter: Nadav Har'El >Priority: Normal > > When running the AVG aggregator on "decimal" values, each value is an > arbitrary-precision number which may be an integer or fractional, but it is > expected that the average would be, in general, fractional. But it turns out > that if all the values are integer *without* a ".0", the aggregator sums them > up as integers and the final division returns an integer too instead of the > fractional response expected from a "decimal" value. > For example: > # AVG of {{decimal}} values 1.0 and 2.0 returns 1.5, as expected. > # AVG of 1.0 and 2 or 1 and 2.0 also return 1.5. > # But AVG of 1 and 2 returns... 1. This is wrong. The user asked for the > average to be a "decimal", not a "varint", so there is no reason why it > should be rounded up to be an integer. 
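The alternative raised in the edited comment, keeping a running sum and count and dividing once with a caller-supplied precision, might look roughly like the following. This is a sketch under stated assumptions, not the Cassandra aggregate: the class name, the {{MathContext}} parameter and the zero result for empty input are all choices made here for illustration.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.Arrays;
import java.util.List;

public class DecimalAverage
{
    // Accumulate an exact sum and a count, then divide a single time at the end.
    public static BigDecimal average(List<BigDecimal> values, MathContext precision)
    {
        BigDecimal sum = BigDecimal.ZERO;
        long count = 0;
        for (BigDecimal value : values)
        {
            sum = sum.add(value); // BigDecimal addition is exact, so input order cannot matter
            count++;
        }
        if (count == 0)
            return BigDecimal.ZERO; // assumption: empty input averages to zero
        return sum.divide(BigDecimal.valueOf(count), precision);
    }

    public static void main(String[] args)
    {
        List<BigDecimal> values = Arrays.asList(new BigDecimal("1"), new BigDecimal("2"));
        System.out.println(average(values, MathContext.DECIMAL128));
    }
}
```

Because the only rounding happens in the final division, the scale of the inputs no longer decides whether the result is fractional, and reordering the inputs gives the same answer.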
[jira] [Commented] (CASSANDRA-18470) Average of "decimal" values rounds the average if all inputs are integers
[ https://issues.apache.org/jira/browse/CASSANDRA-18470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714573#comment-17714573 ] Benedict Elliott Smith commented on CASSANDRA-18470: I think this is ambiguous to be honest. In general we have very inadequately both _considered_ and _documented_ our behaviour for these kinds of features and data types. However, it is not immediately obvious this behaviour is _incorrect_ since we do not ask the user to specify a level of precision of the output, and since we support arbitrary precision we have to make some decision based on the inputs, and in this case neither parameter has any fractional component, so the result is rounded to the same. There's an argument to be made that this is really inappropriate for an aggregation, as the order in which values occur in the aggregation affects the result. But I think the correct solution is probably to permit a precision to be provided with the operator. We could plausibly also pick a default precision that is non-zero, though this might constrain the precision below an acceptable level for some workloads. We could permit the user to configure a default precision for this operator, and/or use the default precision as a lower bound only. > Average of "decimal" values rounds the average if all inputs are integers > - > > Key: CASSANDRA-18470 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18470 > Project: Cassandra > Issue Type: Bug >Reporter: Nadav Har'El >Priority: Normal > > When running the AVG aggregator on "decimal" values, each value is an > arbitrary-precision number which may be an integer or fractional, but it is > expected that the average would be, in general, fractional. But it turns out > that if all the values are integer *without* a ".0", the aggregator sums them > up as integers and the final division returns an integer too instead of the > fractional response expected from a "decimal" value. 
> For example: > # AVG of {{decimal}} values 1.0 and 2.0 returns 1.5, as expected. > # AVG of 1.0 and 2 or 1 and 2.0 also return 1.5. > # But AVG of 1 and 2 returns... 1. This is wrong. The user asked for the > average to be a "decimal", not a "varint", so there is no reason why it > should be rounded up to be an integer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18466) Paxos only repair is treated as an incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict Elliott Smith updated CASSANDRA-18466: --- Complexity: Low Hanging Fruit (was: Normal) > Paxos only repair is treated as an incremental repair > - > > Key: CASSANDRA-18466 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18466 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair >Reporter: Andrew >Priority: Normal > Labels: lhf > Fix For: 4.1.x, 5.x > > > Paxos only repair tries to continue or is treated as an incremental repair. > This happened on 4.1.0 and 4.1.1 when trying to run repair in preparation for > enabling paxos_state_purging. The repair was in preparation mode and triggered > multiple anti-compactions on the nodes. Running the command with --full > behaves in the expected way, i.e. only the paxos data is repaired and it's > finished within a few seconds. > {code:java} > nodetool repair --paxos-only // This does not behave as expected, does not > complete quickly and seems to be waiting on anticompactions > {code} > {code:java} > nodetool repair --full --paxos-only // Completes within a few seconds as > expected > {code}
[jira] [Updated] (CASSANDRA-18466) Paxos only repair is treated as an incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict Elliott Smith updated CASSANDRA-18466:
-----------------------------------------------
    Labels: lhf  (was: )

> Paxos only repair is treated as an incremental repair
> -----------------------------------------------------
>
>                 Key: CASSANDRA-18466
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18466
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Andrew
>            Priority: Normal
>              Labels: lhf
>             Fix For: 4.1.x, 5.x
>
>
> Paxos only repair tries to continue or is treated as an incremental repair.
> This happened on 4.1.0 and 4.1.1 when trying to run repair in preparation for
> enabling paxos_state_purging. The repair was in preparation mode and triggered
> multiple anti-compactions on the nodes. Running the command with --full
> behaves in the expected way, i.e. only the paxos data is repaired and it
> finishes within a few seconds.
> {code:java}
> nodetool repair --paxos-only // Does not behave as expected: it does not complete quickly and seems to be waiting on anticompactions
> {code}
> {code:java}
> nodetool repair --full --paxos-only // Completes within a few seconds as expected
> {code}
[jira] [Commented] (CASSANDRA-18466) Paxos only repair is treated as an incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-18466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17714491#comment-17714491 ]

Benedict Elliott Smith commented on CASSANDRA-18466:
----------------------------------------------------

[~maxwellguo] yes, and for paxos-only repairs this should not really happen - since it's not really doing a regular repair at all, and incremental repairs bring in a lot of baggage for clusters that haven't run them yet.

> Paxos only repair is treated as an incremental repair
> -----------------------------------------------------
>
>                 Key: CASSANDRA-18466
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18466
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Andrew
>            Priority: Normal
>             Fix For: 4.1.x, 5.x
>
>
> Paxos only repair tries to continue or is treated as an incremental repair.
> This happened on 4.1.0 and 4.1.1 when trying to run repair in preparation for
> enabling paxos_state_purging. The repair was in preparation mode and triggered
> multiple anti-compactions on the nodes. Running the command with --full
> behaves in the expected way, i.e. only the paxos data is repaired and it
> finishes within a few seconds.
> {code:java}
> nodetool repair --paxos-only // Does not behave as expected: it does not complete quickly and seems to be waiting on anticompactions
> {code}
> {code:java}
> nodetool repair --full --paxos-only // Completes within a few seconds as expected
> {code}
[jira] [Commented] (CASSANDRA-18465) Add support for multiple condition branches and results in Accord transaction
[ https://issues.apache.org/jira/browse/CASSANDRA-18465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713988#comment-17713988 ]

Benedict Elliott Smith commented on CASSANDRA-18465:
----------------------------------------------------

This was always intended to be the natural evolution of the syntax, so I fully support this, of course.

> Add support for multiple condition branches and results in Accord transaction
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18465
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18465
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Accord, CQL/Syntax
>            Reporter: Jacek Lewandowski
>            Priority: Normal
>
>
> I'd like to propose adding support for multiple branches and result sets for
> Accord transactions. It could look like this:
> {code:sql}
> BEGIN TRANSACTION
>   LET a = ...
>   LET b = ...
>   IF condition THEN
>     SELECT 'one', a.value
>     UPDATE ...
>   ELSE IF condition2 THEN
>     SELECT 'two', b.value
>     UPDATE ...
>   ELSE
>     SELECT 'three', NULL
>   END IF
> COMMIT TRANSACTION
> {code}
> The existing syntax would remain valid when a single SELECT is defined, in
> which case the conditional SELECTs would not be valid.
> SELECTs would be validated to return columns of the same type. They would be
> able to return literals as well.
> This would make the result of the transaction more intuitive, as the client
> would know explicitly whether the updates were applied or not.
[jira] [Comment Edited] (CASSANDRA-18433) Row cache inconsistency issue: A read can put stale data into row cache in a race condition
[ https://issues.apache.org/jira/browse/CASSANDRA-18433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17709903#comment-17709903 ]

Benedict Elliott Smith edited comment on CASSANDRA-18433 at 4/8/23 4:41 PM:
----------------------------------------------------------------------------

Hmm, the semantics of {{replace}} _should_ only update if the sentinel is still present. However, the {{OHCacheAdapter}} appears to invoke {{addOrReplace}}. It appears this bug has existed since it was introduced, and there does not appear to be an equivalent {{replace}} method in the underlying implementation.

Unfortunately, the row cache is not a widely used facility anymore, at least amongst the contributor-base, so it has not benefitted from the push for improved quality in the project.

I would suggest trying to swap the underlying cache implementation by setting {{row_cache_class_name}} in your yaml to "org.apache.cassandra.cache.CaffeineCache" - though this will have very different heap behaviour, the cache implementation itself is very good. Or, I would consider disabling the row cache.

Fixing the existing implementation may take some time, as I don't know if OHCache is actively maintained any longer.


was (Author: benedict):
Hmm, the semantics of {{replace}} _should_ only update if the sentinel is still present. However, the {{OHCacheAdapter}} appears to invoke {{addOrReplace}}. It appears this bug has existed since it was introduced, and there does not appear to be an equivalent {{replace}} method in the underlying implementation.

Unfortunately, the row cache is not a widely used facility anymore, so it has not benefitted from the push for improved quality in the project.

I would suggest trying to swap the underlying cache implementation by setting {{row_cache_class_name}} in your yaml to "org.apache.cassandra.cache.CaffeineCache" - though this will have very different heap behaviour, the cache implementation itself is very good. Or, I would consider disabling the row cache.

Fixing the existing implementation may take some time, as I don't know if OHCache is actively maintained any longer.

> Row cache inconsistency issue: A read can put stale data into row cache in a
> race condition
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Caching
>            Reporter: Huapeng Yuan
>            Priority: Normal
>             Fix For: 3.11.x
>
>
> We found the issue in our production system, which runs version 3.11.6.
> When we perform an update and then read immediately after the update
> succeeds, we sometimes read stale data. The same issue occurs with
> writeAll + readOne consistency and with writeQuorum + readQuorum. The issue
> went away once we disabled the row cache.
> The config for row cache:
> caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
>
> After some investigation, we think there is a race condition in the
> read/write path. Problem:
> When two threads are reading and writing the same partition (for example, two
> rows with the same partition key) at the same time, the read thread may load
> stale data into the row cache for the row that is being updated.
> {panel:title=The steps of write-thread inserting a row to partition p}
> W-Step 1: Inserts the value v1 into the memtable.
> W-Step 2: Invalidates the row cache using the partition key.
> {panel}
> {panel:title=The steps of read-thread reading a row from partition p}
> R-Step 1: Checks the row cache to see whether the row is present. If it is
> not, goes to 'R-Step 2'.
> R-Step 2: Inserts a sentinel (timestamp) as the row value into the row cache
> to tell other read threads that they should skip the row cache.
> R-Step 3: Reads from the storage layer and gets value v0, which can be older
> than v1.
> R-Step 4: Inserts v0 into the row cache for the row, after checking whether
> the row doesn't exist or still has the same sentinel. *The inconsistency is
> caused by this step. It should not insert the stale value if the sentinel no
> longer exists in the row cache.*
> {panel}
> {panel:title=The sequence to reproduce the issue}
> R-Step 1
> R-Step 2
> R-Step 3
> W-Step 1
> W-Step 2
> R-Step 4
> {panel}
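
As a rough illustration of the interleaving in the panels above, here is a minimal Java sketch. It does not use Cassandra's row cache or OHC classes: {{RowCacheRaceSketch}}, {{replay}} and the string sentinel are hypothetical names, and a plain {{ConcurrentHashMap}} stands in for the cache. It replays the steps sequentially in the problematic order and contrasts an unconditional put (add-or-replace semantics) with a conditional replace that only succeeds while the reader's sentinel is still present.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class RowCacheRaceSketch {
    // Hypothetical stand-in for the reader's sentinel; not Cassandra's actual type.
    static final String SENTINEL = "sentinel-42";

    /**
     * Replays R-Step 2, R-Step 3, W-Step 1, W-Step 2, R-Step 4 against a fresh map
     * and returns what is left in the cache for the partition afterwards.
     */
    static String replay(boolean conditionalReplace) {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        String key = "p";

        cache.putIfAbsent(key, SENTINEL); // R-Step 2: reader installs its sentinel
        String stale = "v0";              // R-Step 3: reader loads v0 from storage
        // W-Step 1: writer stores v1 in the memtable (not modelled here)
        cache.remove(key);                // W-Step 2: writer invalidates the cache entry

        // R-Step 4: reader publishes the value it read earlier
        if (conditionalReplace) {
            // Succeeds only if the key is still mapped to the reader's sentinel
            cache.replace(key, SENTINEL, stale);
        } else {
            // Add-or-replace semantics: succeeds even though the sentinel is gone
            cache.put(key, stale);
        }
        return cache.get(key);
    }

    public static void main(String[] args) {
        System.out.println("add-or-replace:      " + replay(false)); // stale v0 is cached
        System.out.println("conditional replace: " + replay(true));  // nothing is cached
    }
}
```

With the unconditional put the stale v0 ends up in the cache after the writer's invalidation; with the conditional replace the cache stays empty, so the next read goes back to storage and sees v1.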
[jira] [Commented] (CASSANDRA-18433) Row cache inconsistency issue: A read can put stale data into row cache in a race condition
[ https://issues.apache.org/jira/browse/CASSANDRA-18433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17709903#comment-17709903 ]

Benedict Elliott Smith commented on CASSANDRA-18433:
----------------------------------------------------

Hmm, the semantics of {{replace}} _should_ only update if the sentinel is still present. However, the {{OHCacheAdapter}} appears to invoke {{addOrReplace}}. It appears this bug has existed since it was introduced, and there does not appear to be an equivalent {{replace}} method in the underlying implementation.

Unfortunately, the row cache is not a widely used facility anymore, so it has not benefitted from the push for improved quality in the project.

I would suggest trying to swap the underlying cache implementation by setting {{row_cache_class_name}} in your yaml to "org.apache.cassandra.cache.CaffeineCache" - though this will have very different heap behaviour, the cache implementation itself is very good. Or, I would consider disabling the row cache.

Fixing the existing implementation may take some time, as I don't know if OHCache is actively maintained any longer.

> Row cache inconsistency issue: A read can put stale data into row cache in a
> race condition
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Caching
>            Reporter: Huapeng Yuan
>            Priority: Normal
>             Fix For: 3.11.x
>
>
> We found the issue in our production system, which runs version 3.11.6.
> When we perform an update and then read immediately after the update
> succeeds, we sometimes read stale data. The same issue occurs with
> writeAll + readOne consistency and with writeQuorum + readQuorum. The issue
> went away once we disabled the row cache.
> The config for row cache:
> caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
>
> After some investigation, we think there is a race condition in the
> read/write path. Problem:
> When two threads are reading and writing the same partition (for example, two
> rows with the same partition key) at the same time, the read thread may load
> stale data into the row cache for the row that is being updated.
> {panel:title=The steps of write-thread inserting a row to partition p}
> W-Step 1: Inserts the value v1 into the memtable.
> W-Step 2: Invalidates the row cache using the partition key.
> {panel}
> {panel:title=The steps of read-thread reading a row from partition p}
> R-Step 1: Checks the row cache to see whether the row is present. If it is
> not, goes to 'R-Step 2'.
> R-Step 2: Inserts a sentinel (timestamp) as the row value into the row cache
> to tell other read threads that they should skip the row cache.
> R-Step 3: Reads from the storage layer and gets value v0, which can be older
> than v1.
> R-Step 4: Inserts v0 into the row cache for the row, after checking whether
> the row doesn't exist or still has the same sentinel. *The inconsistency is
> caused by this step. It should not insert the stale value if the sentinel no
> longer exists in the row cache.*
> {panel}
> {panel:title=The sequence to reproduce the issue}
> R-Step 1
> R-Step 2
> R-Step 3
> W-Step 1
> W-Step 2
> R-Step 4
> {panel}