Scratch that theory - the flamegraphs show that GC is only 3-4% of the two newer machines' overall processing, compared to 18% on the slow machine.
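
For anyone following along, here is a rough sketch of how a flamegraph like these can be captured, per the Java flame graphs post linked further down the thread. The process lookup, sample duration, and script paths are illustrative, and it assumes the Cassandra JVM runs with -XX:+PreserveFramePointer and that perf-map-agent has generated a symbol map for the Java frames:

    # sample on-CPU stacks of the Cassandra process for 60 seconds
    pid=$(pgrep -f CassandraDaemon)
    perf record -F 99 -g -p "$pid" -- sleep 60
    # fold the stacks and render the SVG with the FlameGraph scripts
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg

The GC percentages above would then just be the share of samples landing in GC threads and frames.
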
I took that machine out of the cluster completely and recreated the keyspaces. The ingest tests now run slightly faster (!). I would have expected a linear slowdown, since the load is fairly balanced across partitions.

GC appears to be the bottleneck in the 3-server configuration. But even in the two-server configuration the CPU/disk/network is still not fully utilized (the closest is CPU at ~45% on one ingest test). nodetool tpstats shows only blips of queueing.

-- Eric

On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:

> Hi all - I wanted to follow up on this. I'm happy with the throughput we're getting, but I'm still curious about the bottleneck.
>
> The big thing that sticks out is that one of the nodes is logging frequent GCInspector messages: 350-500ms every 3-6 seconds. All three nodes in the cluster have identical Cassandra configuration, but the node that is logging frequent GCs is an older machine with a slower CPU and SSD. This node logs frequent GCInspectors both under load and when compacting but otherwise unloaded.
>
> My theory is that the other two nodes have similar GC frequency (because they are seeing the same basic load), but because they are faster machines they don't spend as much time per GC and don't cross the GCInspector threshold. Does that sound plausible? nodetool tpstats doesn't show any queueing in the system.
>
> Here are flamegraphs from the system when running a cqlsh COPY FROM:
>
> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>
> The slow node (ultva03) spends a disproportionate amount of time in GC.
>
> Thanks,
>
> -- Eric
>
> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:
>
>> Due to a cut-and-paste error those flamegraphs were a recording of the whole system, not just Cassandra. Throughput is approximately 30k rows/sec.
>>
>> Here are the graphs with just the Cassandra PID:
>>
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>
>> And here are graphs from a cqlsh COPY FROM into the same table, using real data and MAXBATCHSIZE=2. Throughput is good at approximately 110k rows/sec.
>>
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
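>>
>> For reference, those COPY runs look roughly like this - the keyspace, table, and file names and the NUMPROCESSES value are placeholders; MAXBATCHSIZE=2 is the setting mentioned above:
>>
>>     cqlsh ultva01 -e "COPY myks.cars FROM 'cars.csv' WITH MAXBATCHSIZE=2 AND NUMPROCESSES=16"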
>>
>> -- Eric
>>
>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Totally understood :)
>>>
>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include all of the CPUs. Actually most of them were set that way already (for example, 0000ffff,ffffffff) - it might be because irqbalanced is running. But for some reason the interrupts are all being handled on CPU 0 anyway.
>>>
>>> I see this in /var/log/dmesg on the machines:
>>>
>>>> Your BIOS has requested that x2apic be disabled.
>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>> Enabled IRQ remapping in xapic mode
>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>
>>> In a reply to one of the comments on that smp-affinity article, the author says:
>>>
>>>> When IO-APIC configured to spread interrupts among all cores, it can handle up to eight cores. If you have more than eight cores, kernel will not configure IO-APIC to spread interrupts. Thus the trick I described in the article will not work.
>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>
>>> I'm not sure if either of them is relevant to my situation.
>>>
>>> Thanks!
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> You shouldn't need a kernel recompile. Check out the section "Simple solution for the problem" in http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux. You can balance your requests across up to 8 CPUs.
>>>>
>>>> I'll check out the flame graphs in a little bit - in the middle of something and my brain doesn't multitask well :)
>>>>
>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>
>>>>> Hi Jonathan -
>>>>>
>>>>> It looks like these machines are configured to use CPU 0 for all I/O interrupts. I don't think I'm going to get the OK to compile a new kernel for them to balance the interrupts across CPUs, but to mitigate the problem I taskset the Cassandra process to run on all CPUs except 0 (rough commands sketched after the flamegraph links below). It didn't change the performance though. Let me know if you think it's crucial that we balance the interrupts across CPUs and I can try to lobby for a new kernel.
>>>>>
>>>>> Here are flamegraphs from each node from a cassandra-stress ingest into a table representative of what we are going to be using. This table also has roughly 200-byte rows, with 64 columns and a primary key of (date, sequence_number). Cassandra-stress was run on 3 separate client machines. Using cassandra-stress to write to this table I see the same thing: neither disk, CPU, nor network is fully utilized.
>>>>>
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
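>>>>>
>>>>> To be concrete about the interrupt handling above, this is roughly what I mean - the NIC name, IRQ number, and CPU list are examples rather than the actual values on these machines:
>>>>>
>>>>>     # see which CPU is servicing the NIC interrupts
>>>>>     grep eth0 /proc/interrupts
>>>>>     # allow a given IRQ to be serviced by CPUs other than CPU 0 (mask ff = CPUs 0-7)
>>>>>     echo ff > /proc/irq/123/smp_affinity
>>>>>     # the workaround: keep Cassandra off CPU 0 so it doesn't compete with interrupt handling
>>>>>     taskset -cp 1-47 $(pgrep -f CassandraDaemon)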
>>>>>
>>>>> Re: GC: In the stress run with the parameters above, two of the three nodes log zero or one GCInspectors. On the other hand, the 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms each time. I found out that the 3rd machine actually has different specs from the other two. It's an older box with the same RAM but fewer CPUs (32 instead of 48), a slower SSD and slower memory. The Cassandra configuration is exactly the same. I tried running Cassandra with only 32 CPUs on the newer boxes to see if that would cause them to GC pause more, but it didn't.
>>>>>
>>>>> On a separate topic - for this cassandra-stress run I reduced the batch size to 2 in order to keep the logs clean. That also reduced the throughput from around 100k rows/second to 32k rows/sec. I've been doing ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom C++ application. In most of those tests I've been using a batch size of around 20 (unlogged, with all rows in a batch sharing the same partition key). However, that fills the logs with batch size warnings. I was going to raise the batch size warning threshold, but the docs scared me away from doing that. Given that we're using unlogged/same-partition batches, is it safe to raise the batch size warning limit? Actually cqlsh COPY FROM has very good throughput using a small batch size, but I can't get that same throughput in cassandra-stress or my C++ app with a batch size of 2.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>
>>>>>> How many CPUs are you using for interrupts? http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>
>>>>>> Have you tried making a flame graph to see where Cassandra is spending its time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>
>>>>>> Are you tracking GC pauses?
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all:
>>>>>>>
>>>>>>> I'm new to Cassandra and I'm doing some performance testing. One of the things that I'm testing is ingestion throughput. My server setup is:
>>>>>>>
>>>>>>> - 3 node cluster
>>>>>>> - SSD data (both commit log and sstables are on the same disk)
>>>>>>> - 64 GB RAM per server
>>>>>>> - 48 cores per server
>>>>>>> - Cassandra 3.0.11
>>>>>>> - 48 GB heap using G1GC
>>>>>>> - 1 Gbps NICs
>>>>>>>
>>>>>>> Since I'm using SSD I've tried tuning the following (one at a time), but none seemed to make a lot of difference:
>>>>>>>
>>>>>>> - concurrent_writes=384
>>>>>>> - memtable_flush_writers=8
>>>>>>> - concurrent_compactors=8
>>>>>>>
>>>>>>> I am currently doing ingestion tests, sending data from 3 clients on the same subnet using cassandra-stress. The tests use CL=ONE and RF=2.
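>>>>>>>
>>>>>>> As a rough sketch of how these runs map onto a cassandra-stress command line - the thread count and per-client population range are illustrative, and the column size shown is the fixed(400) case discussed below:
>>>>>>>
>>>>>>>     cassandra-stress write n=10000000 cl=ONE \
>>>>>>>       -schema "replication(factor=2)" \
>>>>>>>       -col "size=fixed(400)" \
>>>>>>>       -pop seq=1..10000000 \
>>>>>>>       -rate threads=200 \
>>>>>>>       -node ultva01,ultva02,ultva03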
>>>>>>>
>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a large enough column size and the standard five-column cassandra-stress schema. For example, -col size=fixed(400) will saturate the disk and compactions will start falling behind.
>>>>>>>
>>>>>>> One of our main tables has a row size that is approximately 200 bytes, across 64 columns. When ingesting this table I don't see any resource saturation. Disk utilization is around 10-15% per iostat. Incoming network traffic on the servers is around 100-300 Mbps. CPU utilization is around 20-70%. nodetool tpstats shows mostly zeros, with occasional spikes around 500 in MutationStage.
>>>>>>>
>>>>>>> The stress run does 10,000,000 inserts per client, each client with a separate range of partition IDs. The run with 200-byte rows takes about 4 minutes, with a mean latency of 4.5 ms, total GC time of 21 seconds, and average GC time of 173 ms.
>>>>>>>
>>>>>>> The overall performance is good - around 120k rows/sec ingested. But I'm curious to know where the bottleneck is. There's no resource saturation and nodetool tpstats shows only occasional brief queueing. Is the rest just expected latency inside of Cassandra?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -- Eric