Using cassandra-stress with the out-of-the-box schema I am seeing around 140k rows/second throughput using 1 client on each of 3 client machines. On the servers:

- CPU utilization: 43% usr/20% sys, 55%/28%, 70%/10% (the last numbers are the older box)
- Inbound network traffic: 174 Mbps, 190 Mbps, 178 Mbps
- Disk writes/sec: ~10k on each server
- Disk utilization is in the low single digits but spikes up to 50%
- Disk queue size is in the low single digits but spikes up into the mid hundreds - I even saw it in the thousands. I had not noticed this before.

The disk stats come from iostat -xz 1. Given the low reported utilization percentages I would not expect to see any disk queue buildup, even in the low single digits.

Going to 2 cassandra-stress clients per machine, the throughput dropped to 133k rows/sec:

- CPU utilization: 13% usr/5% sys, 15%/25%, 40%/22% on the older box
- Inbound network RX: 100 Mbps, 125 Mbps, 120 Mbps
- Disk utilization is a little lower, but with the same spiky behavior

Going to 3 cassandra-stress clients per machine, the throughput dropped to 110k rows/sec:

- CPU utilization: 15% usr/20% sys, 15%/20%, 40%/20% on the older box
- Inbound network RX dropped to 130 Mbps
- Disk utilization stayed roughly the same

I noticed that with the standard cassandra-stress schema GC is not an issue, but with my application-specific schema there is a lot of GC on the slower box. Also, with the application-specific schema I can't seem to get past 36k rows/sec. The application schema has 64 columns (mostly ints) and the key is (date, sequence#), whereas the standard stress schema has far fewer columns and no clustering column. (A rough sketch of a cassandra-stress profile for this kind of schema is at the bottom of this message.)

Thanks,

-- Eric

On Wed, Jun 14, 2017 at 1:47 AM, Eric Pederson <eric...@gmail.com> wrote:

> Shoot - I didn't see that one. I subscribe to the digest but was focusing
> on the direct replies and accidentally missed Patrick and Jeff Jirsa's
> messages. Sorry about that...
>
> I've been using a combination of cassandra-stress, cqlsh COPY FROM and a
> custom C++ application for my ingestion testing. My default setting for my
> custom client application is 96 threads, and then by default I run one
> client application process on each of 3 machines. I tried
> doubling/quadrupling the number of client threads (and doubling/tripling
> the number of client processes but keeping the threads per process the
> same) but didn't see any change. If I recall correctly I started getting
> timeouts after I went much beyond concurrent_writes, which is 384 (for a
> 48-CPU box) - meaning at 500 threads per client machine I started seeing
> timeouts. I'll try again to be sure.
>
> For the purposes of this conversation I will try to always use
> cassandra-stress to keep the number of unknowns limited. I'll run more
> cassandra-stress clients tomorrow in line with Patrick's 3-5 per server
> recommendation.
>
> Thanks!
>
> -- Eric
>
> On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> Did you try adding more client stress nodes as Patrick recommended?
>>
>> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Scratch that theory - the flamegraphs show that GC is only 3-4% of the
>>> two newer machines' overall processing, compared to 18% on the slow
>>> machine.
>>>
>>> I took that machine out of the cluster completely and recreated the
>>> keyspaces. The ingest tests now run slightly faster (!). I would have
>>> expected a linear slowdown since the load is fairly balanced across
>>> partitions. GC appears to be the bottleneck in the 3-server
>>> configuration. But in the two-server configuration the CPU/disk/network
>>> is still not fully utilized (the closest is CPU at ~45% on one ingest
>>> test).
>>> nodetool tpstats shows only blips of queueing.
>>>
>>> -- Eric
>>>
>>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>
>>>> Hi all - I wanted to follow up on this. I'm happy with the throughput
>>>> we're getting but I'm still curious about the bottleneck.
>>>>
>>>> The big thing that sticks out is that one of the nodes is logging
>>>> frequent GCInspector messages: 350-500ms every 3-6 seconds. All three
>>>> nodes in the cluster have identical Cassandra configuration, but the
>>>> node that is logging frequent GCs is an older machine with a slower CPU
>>>> and SSD. This node logs frequent GCInspectors both under load and when
>>>> compacting but otherwise unloaded.
>>>>
>>>> My theory is that the other two nodes have similar GC frequency
>>>> (because they are seeing the same basic load), but because they are
>>>> faster machines they don't spend as much time per GC and don't cross
>>>> the GCInspector threshold. Does that sound plausible? nodetool tpstats
>>>> doesn't show any queueing in the system.
>>>>
>>>> Here are flamegraphs from the system when running a cqlsh COPY FROM:
>>>>
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>
>>>> The slow node (ultva03) spends a disproportionate amount of time in GC.
>>>>
>>>> Thanks,
>>>>
>>>> -- Eric
>>>>
>>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>
>>>>> Due to a cut-and-paste error those flamegraphs were a recording of the
>>>>> whole system, not just Cassandra. Throughput is approximately 30k
>>>>> rows/sec.
>>>>>
>>>>> Here are the graphs with just the Cassandra PID:
>>>>>
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>>>
>>>>> And here are graphs during a cqlsh COPY FROM to the same table, using
>>>>> real data and MAXBATCHSIZE=2. Throughput is good at approximately 110k
>>>>> rows/sec.
>>>>>
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>>
>>>>>> Totally understood :)
>>>>>>
>>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>>>>> include all of the CPUs. Actually most of them were set that way
>>>>>> already (for example, 0000ffff,ffffffff) - it might be because
>>>>>> irqbalance is running. But for some reason the interrupts are all
>>>>>> being handled on CPU 0 anyway.
>>>>>>
>>>>>> I see this in /var/log/dmesg on the machines:
>>>>>>
>>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>>> Enabled IRQ remapping in xapic mode
>>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>>
>>>>>> In a reply to one of the comments on the SMP affinity article, the
>>>>>> author says:
>>>>>>
>>>>>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>>>>> handle up to eight cores. If you have more than eight cores, kernel
>>>>>>> will not configure IO-APIC to spread interrupts. Thus the trick I
>>>>>>> described in the article will not work.
>>>>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>>>
>>>>>> I'm not sure if either of these is relevant to my situation.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>
>>>>>>> You shouldn't need a kernel recompile. Check out the section "Simple
>>>>>>> solution for the problem" in
>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
>>>>>>> You can balance your requests across up to 8 CPUs.
>>>>>>>
>>>>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>>>>> something and my brain doesn't multitask well :)
>>>>>>>
>>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Jonathan -
>>>>>>>>
>>>>>>>> It looks like these machines are configured to use CPU 0 for all
>>>>>>>> I/O interrupts. I don't think I'm going to get the OK to compile a
>>>>>>>> new kernel for them to balance the interrupts across CPUs, but to
>>>>>>>> mitigate the problem I taskset the Cassandra process to run on all
>>>>>>>> CPUs except 0. It didn't change the performance though. Let me know
>>>>>>>> if you think it's crucial that we balance the interrupts across
>>>>>>>> CPUs and I can try to lobby for a new kernel.
>>>>>>>>
>>>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>>>>>> into a table representative of what we are going to be using.
>>>>>>>> The rows in this table are also roughly 200 bytes, spread across 64
>>>>>>>> columns, with the primary key (date, sequence_number).
>>>>>>>> Cassandra-stress was run on 3 separate client machines. Using
>>>>>>>> cassandra-stress to write to this table I see the same thing:
>>>>>>>> neither disk, CPU nor network is fully utilized.
>>>>>>>>
>>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>>>>
>>>>>>>> Re: GC: in the stress run with the parameters above, two of the
>>>>>>>> three nodes log zero or one GCInspectors. The 3rd machine, on the
>>>>>>>> other hand, logs a GCInspector every 5 seconds or so, 300-500ms each
>>>>>>>> time. I found out that the 3rd machine actually has different specs
>>>>>>>> from the other two. It's an older box with the same RAM but fewer
>>>>>>>> CPUs (32 instead of 48), a slower SSD and slower memory. The
>>>>>>>> Cassandra configuration is exactly the same. I tried running
>>>>>>>> Cassandra with only 32 CPUs on the newer boxes to see if that would
>>>>>>>> cause them to GC pause more, but it didn't.
>>>>>>>>
>>>>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>>>>> batch size to 2 in order to keep the logs clean. That also reduced
>>>>>>>> the throughput from around 100k rows/sec to 32k rows/sec. I've been
>>>>>>>> doing ingestion tests using cassandra-stress, cqlsh COPY FROM and a
>>>>>>>> custom C++ application. In most of the tests I've been using a batch
>>>>>>>> size of around 20 (unlogged, with all rows in a batch sharing the
>>>>>>>> same partition key). However, that fills the logs with batch size
>>>>>>>> warnings. I was going to raise the batch warning size but the docs
>>>>>>>> scared me away from doing that. Given that we're using unlogged,
>>>>>>>> same-partition batches, is it safe to raise the batch size warning
>>>>>>>> limit? cqlsh COPY FROM actually has very good throughput using a
>>>>>>>> small batch size, but I can't get that same throughput in
>>>>>>>> cassandra-stress or my C++ app with a batch size of 2.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -- Eric
>>>>>>>>
>>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>>>
>>>>>>>>> How many CPUs are you using for interrupts?
>>>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>>>>
>>>>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>>>>> spending its time?
>>>>>>>>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>>
>>>>>>>>> Are you tracking GC pauses?
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all:
>>>>>>>>>>
>>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing. One
>>>>>>>>>> of the things that I'm testing is ingestion throughput.
>>>>>>>>>> My server setup is:
>>>>>>>>>>
>>>>>>>>>> - 3-node cluster
>>>>>>>>>> - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>>> - 64 GB RAM per server
>>>>>>>>>> - 48 cores per server
>>>>>>>>>> - Cassandra 3.0.11
>>>>>>>>>> - 48 GB heap using G1GC
>>>>>>>>>> - 1 Gbps NICs
>>>>>>>>>>
>>>>>>>>>> Since I'm using SSDs I've tried tuning the following (one at a
>>>>>>>>>> time) but none seemed to make a lot of difference:
>>>>>>>>>>
>>>>>>>>>> - concurrent_writes=384
>>>>>>>>>> - memtable_flush_writers=8
>>>>>>>>>> - concurrent_compactors=8
>>>>>>>>>>
>>>>>>>>>> I am currently doing ingestion tests sending data from 3 clients
>>>>>>>>>> on the same subnet, using cassandra-stress. The tests use CL=ONE
>>>>>>>>>> and RF=2.
>>>>>>>>>>
>>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using
>>>>>>>>>> a large enough column size and the standard five-column
>>>>>>>>>> cassandra-stress schema. For example, -col size=fixed(400) will
>>>>>>>>>> saturate the disk and compactions will start falling behind.
>>>>>>>>>>
>>>>>>>>>> One of our main tables has a row size of approximately 200 bytes,
>>>>>>>>>> across 64 columns. When ingesting this table I don't see any
>>>>>>>>>> resource saturation. Disk utilization is around 10-15% per iostat.
>>>>>>>>>> Incoming network traffic on the servers is around 100-300 Mbps.
>>>>>>>>>> CPU utilization is around 20-70%. nodetool tpstats shows mostly
>>>>>>>>>> zeros with occasional spikes around 500 in MutationStage.
>>>>>>>>>>
>>>>>>>>>> The stress run does 10,000,000 inserts per client, each client
>>>>>>>>>> with a separate range of partition IDs. The run with 200-byte rows
>>>>>>>>>> takes about 4 minutes, with a mean latency of 4.5 ms, total GC
>>>>>>>>>> time of 21 seconds, and average GC time of 173 ms.
>>>>>>>>>>
>>>>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>>>>> But I'm curious to know where the bottleneck is. There's no
>>>>>>>>>> resource saturation and nodetool tpstats shows only occasional
>>>>>>>>>> brief queueing. Is the rest just expected latency inside of
>>>>>>>>>> Cassandra?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> -- Eric
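
For reference, here is roughly the kind of cassandra-stress user profile I have in mind for the application-style table mentioned at the top of this message. This is only a sketch: the keyspace, table and column names are placeholders, the column list is trimmed (the real table has 64 mostly-int columns), and the columnspec/insert numbers are assumptions rather than the exact settings I'm running.

keyspace: perftest
keyspace_definition: |
  CREATE KEYSPACE perftest
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

table: app_like
table_definition: |
  CREATE TABLE app_like (
    date            date,
    sequence_number bigint,
    f01 int, f02 int, f03 int, f04 int, f05 int, f06 int,
    PRIMARY KEY (date, sequence_number)
  );

columnspec:
  - name: date
    population: uniform(1..30)      # number of distinct partitions (dates)
  - name: sequence_number
    cluster: uniform(1..1000000)    # rows per partition

insert:
  partitions: fixed(1)              # each batch targets a single partition
  select: fixed(20)/1000000         # roughly 20 rows per batch
  batchtype: UNLOGGED               # unlogged, same-partition batches

It would then be driven with something like "cassandra-stress user profile=app_like.yaml ops(insert=1) cl=ONE -rate threads=96 -node <hosts>", adjusting the batch/row counts to match the real workload.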