Using cassandra-stress with the out-of-the-box schema I am seeing around 140k rows/second throughput using 1 client on each of 3 client machines. On the servers:

- CPU utilization: 43% usr/20% sys, 55%/28%, 70%/10% (the last numbers are the older box)
- Inbound network traffic: 174 Mbps, 190 Mbps, 178 Mbps
- Disk writes/sec: ~10k on each server
- Disk utilization is in the low single digits but spikes up to 50%
- Disk queue size is in the low single digits but spikes up into the mid hundreds - I even saw it in the thousands. I had not noticed this before.

The disk stats come from iostat -xz 1. Given the low reported utilization percentages I would not expect to see any disk queue buildup, even in the low single digits.

Going to 2 cassandra-stress clients per machine, the throughput dropped to 133k rows/sec:

- CPU utilization: 13% usr/5% sys, 15%/25%, 40%/22% on the older box
- Inbound network RX: 100 Mbps, 125 Mbps, 120 Mbps
- Disk utilization is a little lower, but with the same spiky behavior

Going to 3 cassandra-stress clients per machine, the throughput dropped to 110k rows/sec:

- CPU utilization: 15% usr/20% sys, 15%/20%, 40%/20% on the older box
- Inbound network RX dropped to 130 Mbps
- Disk utilization stayed roughly the same

I noticed that with the standard cassandra-stress schema GC is not an issue, but with my application-specific schema there is a lot of GC on the slower box. Also, with the application-specific schema I can't seem to get past 36k rows/sec. The application schema has 64 columns (mostly ints) and the key is (date, sequence#), whereas the standard stress schema has far fewer columns and no clustering column. (A rough sketch of a cassandra-stress profile for this kind of schema is at the bottom of this message.)

Thanks,

-- Eric

On Wed, Jun 14, 2017 at 1:47 AM, Eric Pederson <eric...@gmail.com> wrote:

> Shoot - I didn't see that one. I subscribe to the digest but was focusing
> on the direct replies and accidentally missed Patrick and Jeff Jirsa's
> messages. Sorry about that...
>
> I've been using a combination of cassandra-stress, cqlsh COPY FROM and a
> custom C++ application for my ingestion testing. My default setting for my
> custom client application is 96 threads, and then by default I run one
> client application process on each of 3 machines. I tried
> doubling/quadrupling the number of client threads (and doubling/tripling
> the number of client processes but keeping the threads per process the
> same) but didn't see any change. If I recall correctly I started getting
> timeouts after I went much beyond concurrent_writes, which is 384 (for a
> 48-CPU box) - meaning at 500 threads per client machine I started seeing
> timeouts. I'll try again to be sure.
>
> For the purposes of this conversation I will try to always use
> cassandra-stress to keep the number of unknowns limited. I'll run more
> cassandra-stress clients tomorrow in line with Patrick's 3-5 per server
> recommendation.
>
> Thanks!
>
> -- Eric
>
> On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> Did you try adding more client stress nodes as Patrick recommended?
>>
>> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <eric...@gmail.com> wrote:
>>
>>> Scratch that theory - the flamegraphs show that GC is only 3-4% of the
>>> two newer machines' overall processing, compared to 18% on the slow
>>> machine.
>>>
>>> I took that machine out of the cluster completely and recreated the
>>> keyspaces. The ingest tests now run slightly faster (!). I would have
>>> expected a linear slowdown since the load is fairly balanced across
>>> partitions. GC appears to be the bottleneck in the 3-server
>>> configuration. But in the two-server configuration the CPU/disk/network
>>> is still not fully utilized (the closest is CPU at ~45% on one ingest
>>> test).
>>> nodetool tpstats shows only blips of queueing.
>>>
>>> -- Eric
>>>
>>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>
>>>> Hi all - I wanted to follow up on this. I'm happy with the throughput
>>>> we're getting but I'm still curious about the bottleneck.
>>>>
>>>> The big thing that sticks out is that one of the nodes is logging
>>>> frequent GCInspector messages: 350-500ms every 3-6 seconds. All three
>>>> nodes in the cluster have identical Cassandra configuration, but the
>>>> node that is logging frequent GCs is an older machine with a slower CPU
>>>> and SSD. This node logs frequent GCInspectors both under load and when
>>>> compacting but otherwise unloaded.
>>>>
>>>> My theory is that the other two nodes have similar GC frequency
>>>> (because they are seeing the same basic load), but because they are
>>>> faster machines they don't spend as much time per GC and don't cross
>>>> the GCInspector threshold. Does that sound plausible? nodetool tpstats
>>>> doesn't show any queueing in the system.
>>>>
>>>> Here are flamegraphs from the system when running a cqlsh COPY FROM:
>>>>
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>
>>>> The slow node (ultva03) spends a disproportionate amount of time in GC.
>>>>
>>>> Thanks,
>>>>
>>>> -- Eric
>>>>
>>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>
>>>>> Due to a cut-and-paste error those flamegraphs were a recording of the
>>>>> whole system, not just Cassandra. Throughput is approximately 30k
>>>>> rows/sec.
>>>>>
>>>>> Here are the graphs with just the Cassandra PID:
>>>>>
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>>>
>>>>> And here are graphs during a cqlsh COPY FROM to the same table, using
>>>>> real data and MAXBATCHSIZE=2. Throughput is good at approximately 110k
>>>>> rows/sec.
>>>>>
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <eric...@gmail.com> wrote:
>>>>>
>>>>>> Totally understood :)
>>>>>>
>>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>>>>> include all of the CPUs. Actually most of them were set that way
>>>>>> already (for example, 0000ffff,ffffffff) - it might be because
>>>>>> irqbalance is running. But for some reason the interrupts are all
>>>>>> being handled on CPU 0 anyway.
>>>>>>
>>>>>> I see this in /var/log/dmesg on the machines:
>>>>>>
>>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>>> Enabled IRQ remapping in xapic mode
>>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>>
>>>>>> In a reply to one of the comments on the SMP affinity article, the
>>>>>> author says:
>>>>>>
>>>>>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>>>>> handle up to eight cores. If you have more than eight cores, kernel
>>>>>>> will not configure IO-APIC to spread interrupts. Thus the trick I
>>>>>>> described in the article will not work.
>>>>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>>>
>>>>>> I'm not sure if either of these is relevant to my situation.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>
>>>>>>> You shouldn't need a kernel recompile. Check out the section "Simple
>>>>>>> solution for the problem" in
>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
>>>>>>> You can balance your requests across up to 8 CPUs.
>>>>>>>
>>>>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>>>>> something and my brain doesn't multitask well :)
>>>>>>>
>>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Jonathan -
>>>>>>>>
>>>>>>>> It looks like these machines are configured to use CPU 0 for all
>>>>>>>> I/O interrupts. I don't think I'm going to get the OK to compile a
>>>>>>>> new kernel for them to balance the interrupts across CPUs, but to
>>>>>>>> mitigate the problem I taskset the Cassandra process to run on all
>>>>>>>> CPUs except 0. It didn't change the performance though. Let me know
>>>>>>>> if you think it's crucial that we balance the interrupts across
>>>>>>>> CPUs and I can try to lobby for a new kernel.
>>>>>>>>
>>>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>>>>>> into a table representative of what we are going to be using.
>>>>>>>> The rows in this table are also roughly 200 bytes, spread across 64
>>>>>>>> columns, with the primary key (date, sequence_number).
>>>>>>>> Cassandra-stress was run on 3 separate client machines. Using
>>>>>>>> cassandra-stress to write to this table I see the same thing:
>>>>>>>> neither disk, CPU nor network is fully utilized.
>>>>>>>>
>>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>>>> - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>>>>
>>>>>>>> Re: GC: in the stress run with the parameters above, two of the
>>>>>>>> three nodes log zero or one GCInspectors. The 3rd machine, on the
>>>>>>>> other hand, logs a GCInspector every 5 seconds or so, 300-500ms each
>>>>>>>> time. I found out that the 3rd machine actually has different specs
>>>>>>>> from the other two. It's an older box with the same RAM but fewer
>>>>>>>> CPUs (32 instead of 48), a slower SSD and slower memory. The
>>>>>>>> Cassandra configuration is exactly the same. I tried running
>>>>>>>> Cassandra with only 32 CPUs on the newer boxes to see if that would
>>>>>>>> cause them to GC pause more, but it didn't.
>>>>>>>>
>>>>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>>>>> batch size to 2 in order to keep the logs clean. That also reduced
>>>>>>>> the throughput from around 100k rows/sec to 32k rows/sec. I've been
>>>>>>>> doing ingestion tests using cassandra-stress, cqlsh COPY FROM and a
>>>>>>>> custom C++ application. In most of the tests I've been using a batch
>>>>>>>> size of around 20 (unlogged, with all rows in a batch sharing the
>>>>>>>> same partition key). However, that fills the logs with batch size
>>>>>>>> warnings. I was going to raise the batch warning size but the docs
>>>>>>>> scared me away from doing that. Given that we're using unlogged,
>>>>>>>> same-partition batches, is it safe to raise the batch size warning
>>>>>>>> limit? cqlsh COPY FROM actually has very good throughput using a
>>>>>>>> small batch size, but I can't get that same throughput in
>>>>>>>> cassandra-stress or my C++ app with a batch size of 2.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -- Eric
>>>>>>>>
>>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>>>
>>>>>>>>> How many CPUs are you using for interrupts?
>>>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>>>>
>>>>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>>>>> spending its time?
>>>>>>>>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>>
>>>>>>>>> Are you tracking GC pauses?
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <eric...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all:
>>>>>>>>>>
>>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing. One
>>>>>>>>>> of the things that I'm testing is ingestion throughput.
>>>>>>>>>> My server setup is:
>>>>>>>>>>
>>>>>>>>>> - 3-node cluster
>>>>>>>>>> - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>>> - 64 GB RAM per server
>>>>>>>>>> - 48 cores per server
>>>>>>>>>> - Cassandra 3.0.11
>>>>>>>>>> - 48 GB heap using G1GC
>>>>>>>>>> - 1 Gbps NICs
>>>>>>>>>>
>>>>>>>>>> Since I'm using SSDs I've tried tuning the following (one at a
>>>>>>>>>> time) but none seemed to make a lot of difference:
>>>>>>>>>>
>>>>>>>>>> - concurrent_writes=384
>>>>>>>>>> - memtable_flush_writers=8
>>>>>>>>>> - concurrent_compactors=8
>>>>>>>>>>
>>>>>>>>>> I am currently doing ingestion tests sending data from 3 clients
>>>>>>>>>> on the same subnet, using cassandra-stress. The tests use CL=ONE
>>>>>>>>>> and RF=2.
>>>>>>>>>>
>>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using
>>>>>>>>>> a large enough column size and the standard five-column
>>>>>>>>>> cassandra-stress schema. For example, -col size=fixed(400) will
>>>>>>>>>> saturate the disk and compactions will start falling behind.
>>>>>>>>>>
>>>>>>>>>> One of our main tables has a row size of approximately 200 bytes,
>>>>>>>>>> across 64 columns. When ingesting this table I don't see any
>>>>>>>>>> resource saturation. Disk utilization is around 10-15% per iostat.
>>>>>>>>>> Incoming network traffic on the servers is around 100-300 Mbps.
>>>>>>>>>> CPU utilization is around 20-70%. nodetool tpstats shows mostly
>>>>>>>>>> zeros with occasional spikes around 500 in MutationStage.
>>>>>>>>>>
>>>>>>>>>> The stress run does 10,000,000 inserts per client, each client
>>>>>>>>>> with a separate range of partition IDs. The run with 200-byte rows
>>>>>>>>>> takes about 4 minutes, with a mean latency of 4.5 ms, total GC
>>>>>>>>>> time of 21 seconds, and average GC time of 173 ms.
>>>>>>>>>>
>>>>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>>>>> But I'm curious to know where the bottleneck is. There's no
>>>>>>>>>> resource saturation and nodetool tpstats shows only occasional
>>>>>>>>>> brief queueing. Is the rest just expected latency inside of
>>>>>>>>>> Cassandra?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> -- Eric
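
For reference, here is roughly the kind of cassandra-stress user profile I have in mind for the application-style table mentioned at the top of this message. This is only a sketch: the keyspace, table and column names are placeholders, the column list is trimmed (the real table has 64 mostly-int columns), and the columnspec/insert numbers are assumptions rather than the exact settings I'm running.

keyspace: perftest
keyspace_definition: |
  CREATE KEYSPACE perftest
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

table: app_like
table_definition: |
  CREATE TABLE app_like (
    date            date,
    sequence_number bigint,
    f01 int, f02 int, f03 int, f04 int, f05 int, f06 int,
    PRIMARY KEY (date, sequence_number)
  );

columnspec:
  - name: date
    population: uniform(1..30)      # number of distinct partitions (dates)
  - name: sequence_number
    cluster: uniform(1..1000000)    # rows per partition

insert:
  partitions: fixed(1)              # each batch targets a single partition
  select: fixed(20)/1000000         # roughly 20 rows per batch
  batchtype: UNLOGGED               # unlogged, same-partition batches

It would then be driven with something like "cassandra-stress user profile=app_like.yaml ops(insert=1) cl=ONE -rate threads=96 -node <hosts>", adjusting the batch/row counts to match the real workload.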