[jira] [Updated] (CASSANDRA-13756) StreamingHistogram is not thread safe

2017-08-15 Thread xiangzhou xia (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangzhou xia updated CASSANDRA-13756:
--
Description: 
When we tested C* 3 in a shadow cluster, we noticed that after a period of time several 
data nodes suddenly ran into 100% CPU and stopped processing queries.

After investigation, we found that the threads were stuck in sum() in the 
StreamingHistogram class. Those are JMX threads exposing the getTombStoneRatio metric 
(since JMX polls every 3 seconds, there is a chance that multiple JMX threads access the 
StreamingHistogram at the same time).

After further investigation, we found that the optimization in CASSANDRA-13038 causes a 
spool flush every time sum() is called. Since TreeMap is not thread safe, threads can get 
stuck when multiple threads call sum() at the same time.

There are two approaches to solve this issue.

The first is to add a lock around the flush in sum(), which introduces some extra 
overhead to StreamingHistogram (see the sketch below).

The second is to avoid having StreamingHistogram accessed by multiple threads; in our 
specific case, that means removing the metric we added.
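For illustration, a minimal sketch of the first approach, assuming a simplified histogram 
whose spool map gets flushed into the bin map on read (hypothetical class and field 
names; not the actual StreamingHistogram code):

{code:java}
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified sketch -- not the real StreamingHistogram.
// Illustrates guarding the spool flush (triggered by sum()) with a lock
// so concurrent JMX readers cannot corrupt the underlying TreeMap.
public class GuardedHistogram
{
    private final Object lock = new Object();
    private final TreeMap<Double, Long> bin = new TreeMap<>();
    private final TreeMap<Double, Long> spool = new TreeMap<>();

    public void update(double point, long count)
    {
        synchronized (lock)
        {
            spool.merge(point, count, Long::sum);
        }
    }

    public double sum(double b)
    {
        synchronized (lock)
        {
            // flush the spool into the bins before reading, as in CASSANDRA-13038
            for (Map.Entry<Double, Long> e : spool.entrySet())
                bin.merge(e.getKey(), e.getValue(), Long::sum);
            spool.clear();

            // simplified: real sum() interpolates within the last bin
            double total = 0;
            for (Map.Entry<Double, Long> e : bin.headMap(b, true).entrySet())
                total += e.getValue();
            return total;
        }
    }
}
{code}

Every method that touches the bin or spool maps (update(), merge(), and the flush inside 
sum()) would need to take the same lock, which is where the extra overhead comes from.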

  was:
The optimization in CASSANDRA-13038 causes a spool flush every time sum() is called. 
Since TreeMap is not thread safe, threads can get stuck when multiple threads call sum() 
at the same time, ending up with 100% CPU spent in that function.

I think this issue is not limited to sum(); update() and merge() have the same issue, 
since they all need to update the TreeMap.

Adding a lock around the bin map solved this issue, but it also introduced extra overhead.


> StreamingHistogram is not thread safe
> -
>
> Key: CASSANDRA-13756
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13756
> Project: Cassandra
>  Issue Type: Bug
>Reporter: xiangzhou xia
>
> When we tested C* 3 in a shadow cluster, we noticed that after a period of time several 
> data nodes suddenly ran into 100% CPU and stopped processing queries.
> After investigation, we found that the threads were stuck in sum() in the 
> StreamingHistogram class. Those are JMX threads exposing the getTombStoneRatio metric 
> (since JMX polls every 3 seconds, there is a chance that multiple JMX threads access 
> the StreamingHistogram at the same time).
> After further investigation, we found that the optimization in CASSANDRA-13038 causes 
> a spool flush every time sum() is called. Since TreeMap is not thread safe, threads can 
> get stuck when multiple threads call sum() at the same time.
> There are two approaches to solve this issue.
> The first is to add a lock around the flush in sum(), which introduces some extra 
> overhead to StreamingHistogram.
> The second is to avoid having StreamingHistogram accessed by multiple threads; in our 
> specific case, that means removing the metric we added.






[jira] [Created] (CASSANDRA-13756) StreamingHistogram is not thread safe

2017-08-10 Thread xiangzhou xia (JIRA)
xiangzhou xia created CASSANDRA-13756:
-

 Summary: StreamingHistogram is not thread safe
 Key: CASSANDRA-13756
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13756
 Project: Cassandra
  Issue Type: Bug
Reporter: xiangzhou xia


The optimization in CASSANDRA-13038 causes a spool flush every time sum() is called. 
Since TreeMap is not thread safe, threads can get stuck when multiple threads call sum() 
at the same time, ending up with 100% CPU spent in that function.

I think this issue is not limited to sum(); update() and merge() have the same issue, 
since they all need to update the TreeMap (see the sketch below).

Adding a lock around the bin map solved this issue, but it also introduced extra overhead.
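A standalone sketch of the underlying hazard (generic code, not taken from 
StreamingHistogram): two threads each mutate and then iterate a shared, unsynchronized 
TreeMap, much like concurrent sum() calls that flush the spool.

{code:java}
import java.util.Map;
import java.util.TreeMap;

// Generic demonstration (assumption: simplified, not StreamingHistogram code) of why a
// plain TreeMap must not be mutated from multiple threads without synchronization.
// Each thread behaves like a concurrent sum() call: it mutates the map (the spool flush)
// and then iterates it without holding any lock.
public class TreeMapRace
{
    public static void main(String[] args) throws Exception
    {
        Map<Double, Long> bin = new TreeMap<>();

        Runnable flushAndSum = () -> {
            for (int i = 0; i < 500_000; i++)
            {
                double key = i % 128;                  // bounded key space keeps the map small
                if ((i & 1) == 0)
                    bin.merge(key, 1L, Long::sum);     // insert/update, like the spool flush
                else
                    bin.remove(key);                   // removal forces tree rebalancing
                long total = 0;
                for (long v : bin.values())            // unsynchronized read, like sum()
                    total += v;
            }
        };

        Thread t1 = new Thread(flushAndSum);
        Thread t2 = new Thread(flushAndSum);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("threads finished; map size = " + bin.size());
    }
}
{code}

Depending on timing this throws ConcurrentModificationException, silently loses updates, 
or leaves the tree in a corrupted state on which lookups and iteration can spin, which 
matches the 100% CPU symptom reported above.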






[jira] [Updated] (CASSANDRA-13736) CASSANDRA-9673 cause atomic batch p99 increase 3x

2017-07-31 Thread xiangzhou xia (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangzhou xia updated CASSANDRA-13736:
--
Description: 
When we tested atomic batches with production traffic, we found that the p99 latency of 
atomic batch writes is 2x-3x worse than in 2.2.

After debugging, we found that the regression is caused by CASSANDRA-9673. That patch 
changed the consistency level of the batchlog store from ONE to TWO. [~iamaleksey] 
considered blocking for only one batchlog response a bug in the batchlog and changed it 
to block for two in CASSANDRA-9673; I think blocking for one is actually a very good 
optimization to reduce latency.

Setting the consistency to ONE decreases the chance that a slow data node (GC, long 
message queue, etc.) affects the latency of an atomic batch (see the calculation below). 
In our shadow cluster, when we changed the consistency from two to one, we saw a 2x-3x 
p99 latency drop for atomic batches.
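A rough, illustrative calculation of why blocking for two batchlog responses instead of 
one inflates the tail (hypothetical per-replica slow probability p, not a measurement 
from our cluster): the batchlog phase has to wait for the slower of the two replicas 
instead of the faster one.

\[
P(\text{slow} \mid \text{block for 1 of 2}) = p^{2},
\qquad
P(\text{slow} \mid \text{block for 2 of 2}) = 1 - (1 - p)^{2} = 2p - p^{2} \approx 2p
\]
\[
\text{e.g. } p = 0.01:\quad p^{2} = 10^{-4} \ \text{vs.}\ 2p - p^{2} \approx 0.02
\]

So any single-replica hiccup (GC pause, backed-up message queue) is roughly twice as 
likely to land on the request's critical path, and the per-request batchlog latency 
becomes the max of two samples instead of the min. That is the direction of the 2x-3x 
p99 regression described above, though the exact factor depends on the latency 
distribution.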

  was:We noticed that changing the consistency level from ONE to TWO dramatically 
increased p99 latency for 3.0 atomic batches. [~iamaleksey] considered blocking for only 
one batchlog response a bug in the batchlog and changed it to block for two in 
CASSANDRA-9673; I think blocking for one is actually a very good optimization to reduce 
latency. Setting the consistency to ONE decreases the chance that a slow data node (GC, 
long message queue, etc.) affects the latency of an atomic batch. In our shadow cluster, 
when we changed the consistency from two to one, we saw a 2x-3x p99 latency drop for 
atomic batches.

Summary: CASSANDRA-9673 cause atomic batch p99 increase 3x  (was: 
consistency level change in batchlog send from CASSANDRA-9673 cause atomic 
batch p99 increase 3x)

> CASSANDRA-9673 cause atomic batch p99 increase 3x
> -
>
> Key: CASSANDRA-13736
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13736
> Project: Cassandra
>  Issue Type: Bug
>Reporter: xiangzhou xia
>Assignee: xiangzhou xia
>
> When we tested atomic batches with production traffic, we found that the p99 latency of 
> atomic batch writes is 2x-3x worse than in 2.2.
> After debugging, we found that the regression is caused by CASSANDRA-9673. That patch 
> changed the consistency level of the batchlog store from ONE to TWO. [~iamaleksey] 
> considered blocking for only one batchlog response a bug in the batchlog and changed it 
> to block for two in CASSANDRA-9673; I think blocking for one is actually a very good 
> optimization to reduce latency.
> Setting the consistency to ONE decreases the chance that a slow data node (GC, long 
> message queue, etc.) affects the latency of an atomic batch. In our shadow cluster, 
> when we changed the consistency from two to one, we saw a 2x-3x p99 latency drop for 
> atomic batches.






[jira] [Updated] (CASSANDRA-13736) consistency level change in batchlog send from CASSANDRA-9673 cause atomic batch p99 increase 3x

2017-07-31 Thread xiangzhou xia (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xiangzhou xia updated CASSANDRA-13736:
--
Description: We noticed that changing the consistency level from ONE to TWO dramatically 
increased p99 latency for 3.0 atomic batches. [~iamaleksey] considered blocking for only 
one batchlog response a bug in the batchlog and changed it to block for two in 
CASSANDRA-9673; I think blocking for one is actually a very good optimization to reduce 
latency. Setting the consistency to ONE decreases the chance that a slow data node (GC, 
long message queue, etc.) affects the latency of an atomic batch. In our shadow cluster, 
when we changed the consistency from two to one, we saw a 2x-3x p99 latency drop for 
atomic batches.  (was: We noticed that changing the consistency level from ONE to TWO 
dramatically increased p99 latency for 3.0 atomic batches. [~iamaleksey] considered 
blocking for only one batchlog message a bug in the batchlog; I think blocking for one is 
actually a very good optimization to reduce latency. Setting the consistency to ONE 
decreases the chance that a slow data node (GC, long message queue, etc.) affects the 
latency of an atomic batch. In our shadow cluster, when we changed the consistency from 
two to one, we saw a 2x-3x p99 latency drop for atomic batches.)

> consistency level change in batchlog send from CASSANDRA-9673 cause atomic 
> batch p99 increase 3x
> 
>
> Key: CASSANDRA-13736
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13736
> Project: Cassandra
>  Issue Type: Bug
>Reporter: xiangzhou xia
>
> We noticed that changing the consistency level from ONE to TWO dramatically increased 
> p99 latency for 3.0 atomic batches. [~iamaleksey] considered blocking for only one 
> batchlog response a bug in the batchlog and changed it to block for two in 
> CASSANDRA-9673; I think blocking for one is actually a very good optimization to reduce 
> latency. Setting the consistency to ONE decreases the chance that a slow data node (GC, 
> long message queue, etc.) affects the latency of an atomic batch. In our shadow 
> cluster, when we changed the consistency from two to one, we saw a 2x-3x p99 latency 
> drop for atomic batches.






[jira] [Created] (CASSANDRA-13736) consistency level change in batchlog send from CASSANDRA-9673 cause atomic batch p99 increase 3x

2017-07-31 Thread xiangzhou xia (JIRA)
xiangzhou xia created CASSANDRA-13736:
-

 Summary: consistency level change in batchlog send from 
CASSANDRA-9673 cause atomic batch p99 increase 3x
 Key: CASSANDRA-13736
 URL: https://issues.apache.org/jira/browse/CASSANDRA-13736
 Project: Cassandra
  Issue Type: Bug
Reporter: xiangzhou xia


We noticed that changing the consistency level from ONE to TWO dramatically increased 
p99 latency for 3.0 atomic batches. [~iamaleksey] considered blocking for only one 
batchlog message a bug in the batchlog; I think blocking for one is actually a very good 
optimization to reduce latency. Setting the consistency to ONE decreases the chance that 
a slow data node (GC, long message queue, etc.) affects the latency of an atomic batch. 
In our shadow cluster, when we changed the consistency from two to one, we saw a 2x-3x 
p99 latency drop for atomic batches.


