[ https://issues.apache.org/jira/browse/CASSANDRA-15464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Ramirez updated CASSANDRA-15464:
--------------------------------------
    Description: 
Concurrent inserts to a set<text> column can cause client timeouts and excessive CPU, due to the compare-and-swap loop in AtomicBTreePartition recomputing ComplexColumnData.dataSize on every attempt. As the set gets longer, the probability of the compare-and-swap succeeding decreases.

The problem we saw in production was with insertions into a set<text> whose length ran from hundreds to thousands of elements. Because of the semantics of what we store in the set, we had not anticipated the length exceeding about 10. (Almost all rows have length <= 6; the largest observed was 7032. The total number of rows was < 4000, and 3 machines were used.)

The bad behavior we saw was that all machines went to 100% CPU on all cores, and clients were timing out. Our immediate fix in production was adding more machines (we went from 3 to 6). The stack included 
partitions.AtomicBTreePartition.addAllWithSizeDelta … 
ComplexColumnData.dataSize.
The AtomicBTreePartition code uses a compare-and-swap approach, but the time between compares depends on the length of the set. When the set is long and updates are concurrent, each loop iteration is unlikely to make forward progress and can keep retrying.
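For intuition, here is a standalone sketch of that shape (plain Java, not Cassandra's actual code): a copy-on-write set installed via compare-and-swap, where every attempt, successful or not, pays an O(n) walk over the whole collection, analogous to the dataSize computation.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class CasSketch {
    // Standalone sketch, not Cassandra code: one shared "partition" holding a
    // set, updated copy-on-write via compare-and-swap, mimicking the shape of
    // AtomicBTreePartition's update loop.
    static final AtomicReference<Set<String>> partition =
            new AtomicReference<>(new HashSet<>());

    // Returns the number of attempts it took to install the update.
    static int addElement(String element) {
        int attempts = 0;
        while (true) {
            attempts++;
            Set<String> current = partition.get();
            Set<String> updated = new HashSet<>(current);
            updated.add(element);
            // Analogue of ComplexColumnData.dataSize: an O(n) walk over the
            // whole collection on every attempt, successful or not.
            long dataSize = 0;
            for (String s : updated) dataSize += s.length();
            if (partition.compareAndSet(current, updated)) return attempts;
            // Lost the race to a concurrent writer: loop and redo the O(n) work.
        }
    }
}
```

With a single writer every call succeeds on the first attempt; with many writers and a long set, each attempt is slower and more likely to lose the race, so the loop burns CPU without making forward progress.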

Here is one example call stack:

{noformat}
"SharedPool-Worker-40" #167 daemon prio=10 os_prio=0 tid=0x00007f9bb4032800 nid=0x2ee5 runnable [0x00007f9b067f4000]
java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.db.rows.ComplexColumnData.dataSize(ComplexColumnData.java:114)
at org.apache.cassandra.db.rows.BTreeRow.dataSize(BTreeRow.java:373)
at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:292)
at org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:235)
at org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:159)
at org.apache.cassandra.utils.btree.TreeBuilder.update(TreeBuilder.java:73)
at org.apache.cassandra.utils.btree.BTree.update(BTree.java:181)
at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:155)
at org.apache.cassandra.db.Memtable.put(Memtable.java:254)
at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1204)
at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573)
at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:384)
at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:205)
at org.apache.cassandra.hints.Hint.applyFuture(Hint.java:99)
at org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:95)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
at java.lang.Thread.run(Thread.java:748)
{noformat}

In a test program to reproduce the problem, we raise the number of concurrent users and lower the think time between queries. Updates to short sets complete without errors; with long sets, clients time out with errors, there are periods where all cores sit at 99.x% CPU, and jstack shows the time going to ComplexColumnData.dataSize.
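The contention itself can be imitated in isolation (plain Java, no Cassandra involved; thread and insert counts are arbitrary): several threads hammer one shared copy-on-write set, and we count how many swap attempts lose the race and have to redo the O(n) copy.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class SetContention {
    // Minimal stress sketch (not Cassandra code): N writer threads each add
    // distinct elements to one shared set via copy-modify-CAS, the same shape
    // as AtomicBTreePartition's update loop. The return value is the number
    // of attempts that lost the race and had to redo the O(set size) copy.
    public static long run(int threads, int insertsPerThread) {
        AtomicReference<Set<String>> ref = new AtomicReference<>(new HashSet<>());
        AtomicLong failedSwaps = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < insertsPerThread; i++) {
                    String element = id + ":" + i; // distinct per thread
                    while (true) {
                        Set<String> current = ref.get();
                        Set<String> updated = new HashSet<>(current); // O(n) copy per attempt
                        updated.add(element);
                        if (ref.compareAndSet(current, updated)) break;
                        failedSwaps.incrementAndGet(); // lost the race; redo the O(n) work
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        // CAS keeps the result correct; the cost is the wasted per-attempt work.
        if (ref.get().size() != threads * insertsPerThread)
            throw new IllegalStateException("lost updates");
        return failedSwaps.get();
    }
}
```

With one thread there is no contention and no retries; as threads go up and the set grows, each failed swap throws away an increasingly expensive copy, which is the wasted CPU we observed.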

Here is the schema. Our long-term application solution was to make the set elements part of the primary key and avoid set<text> entirely, guaranteeing the code does not go through ComplexColumnData.dataSize.

{noformat}
CREATE TABLE x.x (
 x int PRIMARY KEY,
 y set<text> ) ...
{noformat}
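For illustration, that workaround can be expressed as a table where each former set element becomes a clustering column value (table and column names here are hypothetical):

{noformat}
CREATE TABLE x.x2 (
 x int,
 y_elem text,
 PRIMARY KEY (x, y_elem) );
{noformat}

Each element is then its own row, so concurrent inserts write independent cells instead of repeatedly rewriting, and re-measuring, one large collection.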



> Inserts to set<text> slow due to AtomicBTreePartition for 
> ComplexColumnData.dataSize
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15464
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15464
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>            Reporter: Eric Jacobsen
>            Priority: Normal
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
