Eric Jacobsen created CASSANDRA-15464:
-----------------------------------------

             Summary: Inserts to set<text> slow due to AtomicBTreePartition for ComplexColumnData.dataSize
                 Key: CASSANDRA-15464
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15464
             Project: Cassandra
          Issue Type: Bug
          Components: Legacy/Core
            Reporter: Eric Jacobsen


Concurrent inserts to a set<text> column can cause client timeouts and 
excessive CPU usage because of the compare-and-swap loop in 
AtomicBTreePartition, which calls ComplexColumnData.dataSize. As the set grows 
longer, the probability that a compare-and-swap attempt succeeds decreases.

The problem we saw in production involved insertions into set<text> columns 
holding hundreds to thousands of elements. Because of the semantics of what we 
store in the set, we had not anticipated more than about 10 elements. (Almost 
all rows have length <= 6; the largest observed was 7032. The total number of 
rows is < 4000, and 3 machines were used.)



The bad behavior we saw was that all machines went to 100% CPU on all cores 
and clients timed out. Our immediate fix in production was to add more 
machines (from 3 to 6). The stacks included 
partitions.AtomicBTreePartition.addAllWithSizeDelta … 
ComplexColumnData.dataSize.
The AtomicBTreePartition code uses a compare-and-swap approach, but the time 
each attempt takes depends on the length of the set. When the set is long and 
updates are concurrent, each loop iteration is unlikely to make forward 
progress, so threads can spend a long time looping.
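
To illustrate the pattern described above, here is a minimal Java sketch. It 
is not Cassandra code: the class CasContentionSketch and its methods are 
hypothetical stand-ins, and a plain list replaces the B-tree. It only shows 
the general shape of the problem, assuming each attempt does work proportional 
to the collection size before trying the swap, so a longer collection widens 
the window in which another thread can win the race and force a retry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; none of these names exist in Cassandra.
class CasContentionSketch {
    // Immutable snapshot of the collection, swapped atomically on each update.
    private final AtomicReference<List<String>> current =
            new AtomicReference<>(Collections.emptyList());
    // Counts every attempt, including those that lose the race and retry.
    private final AtomicLong attempts = new AtomicLong();

    // O(n) walk over every element, analogous in spirit to summing cell sizes.
    static long dataSize(List<String> cells) {
        long size = 0;
        for (String cell : cells) size += cell.length();
        return size;
    }

    // Adds one element and returns the size delta, retrying until the swap wins.
    long add(String element) {
        while (true) {
            attempts.incrementAndGet();
            List<String> before = current.get();
            List<String> after = new ArrayList<>(before);
            after.add(element);
            // Work between the read and the swap grows with the collection.
            long delta = dataSize(after) - dataSize(before);
            if (current.compareAndSet(before, Collections.unmodifiableList(after)))
                return delta;
            // Another thread swapped first: all of the work above is discarded.
        }
    }

    int size() { return current.get().size(); }

    long attempts() { return attempts.get(); }

    // Starts the given number of threads, each adding distinct elements.
    void runConcurrently(int threads, int insertsPerThread) {
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < insertsPerThread; i++)
                    add("element-" + id + "-" + i);
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        CasContentionSketch sketch = new CasContentionSketch();
        sketch.runConcurrently(8, 2000);
        // Attempts beyond the element count are retries caused by lost races.
        System.out.println("elements=" + sketch.size()
                + " attempts=" + sketch.attempts());
    }
}
```

The gap between attempts and elements in the printed output varies from run 
to run and from machine to machine, so no particular numbers should be 
expected; the point is only that every lost race repeats the full size 
computation.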


Here is one example call stack:
```
"SharedPool-Worker-40" #167 daemon prio=10 os_prio=0 tid=0x00007f9bb4032800 nid=0x2ee5 runnable [0x00007f9b067f4000]
java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.db.rows.ComplexColumnData.dataSize(ComplexColumnData.java:114)
at org.apache.cassandra.db.rows.BTreeRow.dataSize(BTreeRow.java:373)
at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:292)
at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:235)
at org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:159)
at org.apache.cassandra.utils.btree.TreeBuilder.update(TreeBuilder.java:73)
at org.apache.cassandra.utils.btree.BTree.update(BTree.java:181)
at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:155)
at org.apache.cassandra.db.Memtable.put(Memtable.java:254)
at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1204)
at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573)
at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:384)
at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205)
at org.apache.cassandra.hints.Hint.applyFuture(Hint.java:99)
at org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:95)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
at java.lang.Thread.run(Thread.java:748)
```

In a test program written to reproduce the problem, we raised the number of 
concurrent users and lowered the think time between queries. Updates to short 
sets complete without errors. With long sets, clients time out with errors, 
there are periods with all cores at 99.x% CPU, and jstack shows the time going 
to ComplexColumnData.dataSize.


The schema is shown below. Our long-term application solution was to make the 
set elements part of the primary key and avoid set<text> altogether, which 
guarantees the code does not go through ComplexColumnData.dataSize.
```
CREATE TABLE x.x (
 x int PRIMARY KEY,
 y set<text> ) ...
```


