[jira] [Updated] (CASSANDRA-14764) Test Messaging Refactor with: 12 Node Breaking Point, compression=none, encryption=none, coalescing=off

2021-03-19 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-14764:
---
Resolution: Done
Status: Resolved  (was: Open)

> Test Messaging Refactor with: 12 Node Breaking Point, compression=none, 
> encryption=none, coalescing=off
> ---
>
> Key: CASSANDRA-14764
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14764
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Legacy/Streaming and Messaging
>Reporter: Joey Lynch
>Assignee: Vinay Chella
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: i-03341e1c52de6ea3e-after-queue-change.svg, 
> i-07cd92e844d66d801-after-queue-bound.svg, i-07cd92e844d66d801-hint-play.svg, 
> i-07cd92e844d66d801-uninlined-with-jvm-methods.svg, ttop.txt
>
>
> *Setup:*
>  * Cassandra: 12 (2*6) node i3.xlarge AWS instance (4 cpu cores, 30GB ram) 
> running cassandra trunk off of jasobrown/14503 jdd7ec5a2 (Jasons patched 
> internode messaging branch) vs the same footprint running 3.0.17
>  * Two datacenters with 100ms latency between them
>  * No compression, encryption, or coalescing turned on
> *Test #1:*
> ndbench sent 1.5k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so 
> 3k global replica QPS) of 4kb single partition BATCH mutations at LOCAL_ONE. 
> This represents about 250 QPS per coordinator in the first datacenter or 60 
> QPS per core. The goal was to observe P99 write and read latencies under 
> various QPS.
> *Result:*
> The good news is since the CASSANDRA-14503 changes, instead of keeping the 
> mutations on heap we put the message into hints instead and don't run out of 
> memory. The bad news is that the {{MessagingService-NettyOutbound-Thread's}} 
> would occasionally enter a degraded state where they would just spin on a 
> core. I've attached flame graphs showing the CPU state as [~jasobrown] 
> applied fixes to the {{OutboundMessagingConnection}} class.
>  *Follow Ups:*
> [~jasobrown] has committed a number of fixes onto his 
> {{jasobrown/14503-collab}} branch including:
> 1. Limiting the amount of time spent dequeuing messages if they are expired 
> (previously if messages entered the queue faster than we could dequeue them 
> we'd just inifinte loop on the consumer side)
> 2. Don't call {{dequeueMessages}} from within {{dequeueMessages}} created 
> callbacks.
> We're continuing to use CPU flamegraphs to figure out where we're looping and 
> fixing bugs as we find them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14764) Test Messaging Refactor with: 12 Node Breaking Point, compression=none, encryption=none, coalescing=off

2020-10-09 Thread Joey Lynch (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joey Lynch updated CASSANDRA-14764:
---
Fix Version/s: (was: 4.0.x)
   4.0-rc

> Test Messaging Refactor with: 12 Node Breaking Point, compression=none, 
> encryption=none, coalescing=off
> ---
>
> Key: CASSANDRA-14764
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14764
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Legacy/Streaming and Messaging
>Reporter: Joey Lynch
>Assignee: Vinay Chella
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: i-03341e1c52de6ea3e-after-queue-change.svg, 
> i-07cd92e844d66d801-after-queue-bound.svg, i-07cd92e844d66d801-hint-play.svg, 
> i-07cd92e844d66d801-uninlined-with-jvm-methods.svg, ttop.txt
>
>
> *Setup:*
>  * Cassandra: 12 (2*6) node i3.xlarge AWS instance (4 cpu cores, 30GB ram) 
> running cassandra trunk off of jasobrown/14503 jdd7ec5a2 (Jasons patched 
> internode messaging branch) vs the same footprint running 3.0.17
>  * Two datacenters with 100ms latency between them
>  * No compression, encryption, or coalescing turned on
> *Test #1:*
> ndbench sent 1.5k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so 
> 3k global replica QPS) of 4kb single partition BATCH mutations at LOCAL_ONE. 
> This represents about 250 QPS per coordinator in the first datacenter or 60 
> QPS per core. The goal was to observe P99 write and read latencies under 
> various QPS.
> *Result:*
> The good news is since the CASSANDRA-14503 changes, instead of keeping the 
> mutations on heap we put the message into hints instead and don't run out of 
> memory. The bad news is that the {{MessagingService-NettyOutbound-Thread's}} 
> would occasionally enter a degraded state where they would just spin on a 
> core. I've attached flame graphs showing the CPU state as [~jasobrown] 
> applied fixes to the {{OutboundMessagingConnection}} class.
>  *Follow Ups:*
> [~jasobrown] has committed a number of fixes onto his 
> {{jasobrown/14503-collab}} branch including:
> 1. Limiting the amount of time spent dequeuing messages if they are expired 
> (previously if messages entered the queue faster than we could dequeue them 
> we'd just inifinte loop on the consumer side)
> 2. Don't call {{dequeueMessages}} from within {{dequeueMessages}} created 
> callbacks.
> We're continuing to use CPU flamegraphs to figure out where we're looping and 
> fixing bugs as we find them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14764) Test Messaging Refactor with: 12 Node Breaking Point, compression=none, encryption=none, coalescing=off

2020-10-09 Thread Adam Holmberg (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Holmberg updated CASSANDRA-14764:
--
Complexity: Normal

> Test Messaging Refactor with: 12 Node Breaking Point, compression=none, 
> encryption=none, coalescing=off
> ---
>
> Key: CASSANDRA-14764
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14764
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Legacy/Streaming and Messaging
>Reporter: Joey Lynch
>Assignee: Vinay Chella
>Priority: Normal
> Fix For: 4.0-beta, 4.0-triage
>
> Attachments: i-03341e1c52de6ea3e-after-queue-change.svg, 
> i-07cd92e844d66d801-after-queue-bound.svg, i-07cd92e844d66d801-hint-play.svg, 
> i-07cd92e844d66d801-uninlined-with-jvm-methods.svg, ttop.txt
>
>
> *Setup:*
>  * Cassandra: 12 (2*6) node i3.xlarge AWS instance (4 cpu cores, 30GB ram) 
> running cassandra trunk off of jasobrown/14503 jdd7ec5a2 (Jasons patched 
> internode messaging branch) vs the same footprint running 3.0.17
>  * Two datacenters with 100ms latency between them
>  * No compression, encryption, or coalescing turned on
> *Test #1:*
> ndbench sent 1.5k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so 
> 3k global replica QPS) of 4kb single partition BATCH mutations at LOCAL_ONE. 
> This represents about 250 QPS per coordinator in the first datacenter or 60 
> QPS per core. The goal was to observe P99 write and read latencies under 
> various QPS.
> *Result:*
> The good news is since the CASSANDRA-14503 changes, instead of keeping the 
> mutations on heap we put the message into hints instead and don't run out of 
> memory. The bad news is that the {{MessagingService-NettyOutbound-Thread's}} 
> would occasionally enter a degraded state where they would just spin on a 
> core. I've attached flame graphs showing the CPU state as [~jasobrown] 
> applied fixes to the {{OutboundMessagingConnection}} class.
>  *Follow Ups:*
> [~jasobrown] has committed a number of fixes onto his 
> {{jasobrown/14503-collab}} branch including:
> 1. Limiting the amount of time spent dequeuing messages if they are expired 
> (previously if messages entered the queue faster than we could dequeue them 
> we'd just inifinte loop on the consumer side)
> 2. Don't call {{dequeueMessages}} from within {{dequeueMessages}} created 
> callbacks.
> We're continuing to use CPU flamegraphs to figure out where we're looping and 
> fixing bugs as we find them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14764) Test Messaging Refactor with: 12 Node Breaking Point, compression=none, encryption=none, coalescing=off

2020-05-12 Thread Josh McKenzie (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh McKenzie updated CASSANDRA-14764:
--
Fix Version/s: (was: 4.0-rc)
   4.0-beta

> Test Messaging Refactor with: 12 Node Breaking Point, compression=none, 
> encryption=none, coalescing=off
> ---
>
> Key: CASSANDRA-14764
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14764
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Legacy/Streaming and Messaging
>Reporter: Joey Lynch
>Assignee: Vinay Chella
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: i-03341e1c52de6ea3e-after-queue-change.svg, 
> i-07cd92e844d66d801-after-queue-bound.svg, i-07cd92e844d66d801-hint-play.svg, 
> i-07cd92e844d66d801-uninlined-with-jvm-methods.svg, ttop.txt
>
>
> *Setup:*
>  * Cassandra: 12 (2*6) node i3.xlarge AWS instance (4 cpu cores, 30GB ram) 
> running cassandra trunk off of jasobrown/14503 jdd7ec5a2 (Jasons patched 
> internode messaging branch) vs the same footprint running 3.0.17
>  * Two datacenters with 100ms latency between them
>  * No compression, encryption, or coalescing turned on
> *Test #1:*
> ndbench sent 1.5k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so 
> 3k global replica QPS) of 4kb single partition BATCH mutations at LOCAL_ONE. 
> This represents about 250 QPS per coordinator in the first datacenter or 60 
> QPS per core. The goal was to observe P99 write and read latencies under 
> various QPS.
> *Result:*
> The good news is since the CASSANDRA-14503 changes, instead of keeping the 
> mutations on heap we put the message into hints instead and don't run out of 
> memory. The bad news is that the {{MessagingService-NettyOutbound-Thread's}} 
> would occasionally enter a degraded state where they would just spin on a 
> core. I've attached flame graphs showing the CPU state as [~jasobrown] 
> applied fixes to the {{OutboundMessagingConnection}} class.
>  *Follow Ups:*
> [~jasobrown] has committed a number of fixes onto his 
> {{jasobrown/14503-collab}} branch including:
> 1. Limiting the amount of time spent dequeuing messages if they are expired 
> (previously if messages entered the queue faster than we could dequeue them 
> we'd just inifinte loop on the consumer side)
> 2. Don't call {{dequeueMessages}} from within {{dequeueMessages}} created 
> callbacks.
> We're continuing to use CPU flamegraphs to figure out where we're looping and 
> fixing bugs as we find them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14764) Test Messaging Refactor with: 12 Node Breaking Point, compression=none, encryption=none, coalescing=off

2020-04-27 Thread Benjamin Lerer (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Lerer updated CASSANDRA-14764:
---
Summary: Test Messaging Refactor with: 12 Node Breaking Point, 
compression=none, encryption=none, coalescing=off  (was: Evaluate 12 Node 
Breaking Point, compression=none, encryption=none, coalescing=off)

> Test Messaging Refactor with: 12 Node Breaking Point, compression=none, 
> encryption=none, coalescing=off
> ---
>
> Key: CASSANDRA-14764
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14764
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Legacy/Streaming and Messaging
>Reporter: Joey Lynch
>Assignee: Vinay Chella
>Priority: Normal
> Fix For: 4.0-rc
>
> Attachments: i-03341e1c52de6ea3e-after-queue-change.svg, 
> i-07cd92e844d66d801-after-queue-bound.svg, i-07cd92e844d66d801-hint-play.svg, 
> i-07cd92e844d66d801-uninlined-with-jvm-methods.svg, ttop.txt
>
>
> *Setup:*
>  * Cassandra: 12 (2*6) node i3.xlarge AWS instance (4 cpu cores, 30GB ram) 
> running cassandra trunk off of jasobrown/14503 jdd7ec5a2 (Jasons patched 
> internode messaging branch) vs the same footprint running 3.0.17
>  * Two datacenters with 100ms latency between them
>  * No compression, encryption, or coalescing turned on
> *Test #1:*
> ndbench sent 1.5k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so 
> 3k global replica QPS) of 4kb single partition BATCH mutations at LOCAL_ONE. 
> This represents about 250 QPS per coordinator in the first datacenter or 60 
> QPS per core. The goal was to observe P99 write and read latencies under 
> various QPS.
> *Result:*
> The good news is since the CASSANDRA-14503 changes, instead of keeping the 
> mutations on heap we put the message into hints instead and don't run out of 
> memory. The bad news is that the {{MessagingService-NettyOutbound-Thread's}} 
> would occasionally enter a degraded state where they would just spin on a 
> core. I've attached flame graphs showing the CPU state as [~jasobrown] 
> applied fixes to the {{OutboundMessagingConnection}} class.
>  *Follow Ups:*
> [~jasobrown] has committed a number of fixes onto his 
> {{jasobrown/14503-collab}} branch including:
> 1. Limiting the amount of time spent dequeuing messages if they are expired 
> (previously if messages entered the queue faster than we could dequeue them 
> we'd just inifinte loop on the consumer side)
> 2. Don't call {{dequeueMessages}} from within {{dequeueMessages}} created 
> callbacks.
> We're continuing to use CPU flamegraphs to figure out where we're looping and 
> fixing bugs as we find them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org