Seeing high kswapd usage means there's a lot of churn in the page cache.
It doesn't mean you're using swap; it means the box is spending time
evicting pages from the page cache to make room for the stuff you're
reading now.  The machines don't have enough memory - they are way
undersized for a production workload.

Things that make it worse:
* high readahead (use 8KB on SSD)
* high compression chunk length when reading small rows / partitions.
Nobody specifies this; the 64KB default is awful.  I almost always switch to
4KB-16KB here (see the example below), but on these boxes you're kind of
screwed since you're already basically out of memory.
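
For illustration, here's roughly how you'd check and change both (the device
name, keyspace and table below are placeholders, not taken from this thread):

# readahead is reported in 512-byte sectors; 16 sectors = 8KB
blockdev --getra /dev/xvdb
blockdev --setra 16 /dev/xvdb

# lower the compression chunk length on a read-heavy table
cqlsh -e "ALTER TABLE my_ks.my_table
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};"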

I'd never put Cassandra in production with less than 30GB RAM and 8 cores
per box.

On Wed, Dec 5, 2018 at 10:44 AM Jon Meredith <jmeredit...@gmail.com> wrote:

> The kswapd issue is interesting; is it possible you're being affected by
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 - although I
> don't see a fix for Trusty listed there?
>
> On Wed, Dec 5, 2018 at 11:34 AM Riccardo Ferrari <ferra...@gmail.com>
> wrote:
>
>> Hi Alex,
>>
>> I saw that behaviour in the past. I can tell you the kswapd0 usage is
>> connected to the `disk_access_mode` property. On 64-bit systems it defaults
>> to mmap. That also explains why your virtual memory is so high (it somehow
>> matches the node load, right?). I cannot find a good reference, however
>> googling for "kswapd0 cassandra" will turn up quite a few resources.
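>>
>> For example, you could check the current setting and, if needed, limit mmap
>> to index files only. This is an undocumented cassandra.yaml property and the
>> effect is workload dependent, so treat it as something to test rather than a
>> fix (the config path below is just the Debian/Ubuntu default):
>>
>> grep -n disk_access_mode /etc/cassandra/cassandra.yaml
>> # if absent, 'auto' is assumed, which means mmap on 64-bit JVMs; to restrict:
>> #   disk_access_mode: mmap_index_only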
>>
>> HTH,
>>
>> On Wed, Dec 5, 2018 at 6:39 PM Oleksandr Shulgin <
>> oleksandr.shul...@zalando.de> wrote:
>>
>>> Hello,
>>>
>>> We are running the following setup on AWS EC2:
>>>
>>> Host system (AWS AMI): Ubuntu 14.04.4 LTS,
>>> Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5
>>> 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Cassandra process runs inside a docker container.
>>> Docker image is based on Ubuntu 18.04.1 LTS.
>>>
>>> Apache Cassandra 3.0.17, installed from .deb packages.
>>>
>>> $ java -version
>>> openjdk version "1.8.0_181"
>>> OpenJDK Runtime Environment (build
>>> 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
>>> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
>>>
>>> We have a total of 36 nodes.  All are r4.large instances with 2
>>> vCPUs and ~15 GB RAM.
>>> On each instance we have:
>>> - 2TB gp2 SSD EBS volume for data and commit log,
>>> - 8GB gp2 SSD EBS for system (root volume).
>>>
>>> Non-default settings in cassandra.yaml:
>>> num_tokens: 16
>>> memtable_flush_writers: 1
>>> concurrent_compactors: 1
>>> snitch: Ec2Snitch
>>>
>>> JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
>>> Garbage collection: CMS with default settings.
>>>
>>> We repair once a week using Cassandra Reaper: parallel, intensity 1, 64
>>> segments per node.  The issue also happens outside of repair time.
>>>
>>> The symptoms:
>>> ============
>>>
>>> Sporadically a node becomes unavailable for a period of time between a few
>>> minutes and a few hours.  According to our analysis, and as pointed out by
>>> the AWS support team, the unavailability is caused by exceptionally high read
>>> bandwidth on the *root* EBS volume.  I repeat, on the root volume, *not* on
>>> the data/commitlog volume.  Basically, the amount of IO exceeds the instance's
>>> bandwidth (~52MB/s) and all other network communication becomes impossible.
>>>
>>> The root volume contains the operating system, the docker container with
>>> OpenJDK and the Cassandra binaries, and the logs.
>>>
>>> Most of the time, whenever this happens it is too late to SSH into the
>>> instance to troubleshoot: it becomes completely unavailable within a very
>>> short period of time.
>>> Rebooting the affected instance helps to bring it back to life.
>>>
>>> Starting from the middle of last week we have seen this problem
>>> repeatedly 1-3 times a day, affecting different instances in a seemingly
>>> random fashion.  Most of the time it affects only one instance, but we've
>>> had one incident when 9 nodes (3 from each of the 3 Availability Zones)
>>> were down at the same time due to this exact issue.
>>>
>>> Actually, we've had the same issue previously on the same Cassandra
>>> cluster around 3 months ago (early to mid-September 2018).  At that
>>> time we were running on m4.xlarge instances (these have 4 vCPUs and 16GB
>>> RAM).
>>>
>>> As a mitigation measure we have migrated away from those to r4.2xlarge.
>>> Then we didn't observe any issues for a few weeks, so we scaled down
>>> twice: to r4.xlarge and then to r4.large.  The last migration was
>>> completed before Nov 13th.  No changes to the cluster or application
>>> have happened since then.
>>>
>>> Now, after some weeks the issue appears again...
>>>
>>> When we are not fast enough to react and reboot the affected instance,
>>> we can see that ultimately the Linux OOM killer kicks in and kills the java
>>> process running Cassandra.  After that the instance becomes available
>>> almost immediately.  This allows us to rule out other processes running in
>>> the background as potential offenders.
>>>
>>> We routinely observe Memory.HeapMemoryUsage.used between 1GB and 6GB
>>> and Memory.NonHeapMemoryUsage.used below 100MB, as reported by JMX (via
>>> Jolokia).  At the same time, Committed_AS on each host is constantly around
>>> 11-12GB, as reported by atop(1) and Prometheus.
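>>>
>>> (For reference, this is roughly how we read those numbers; the Jolokia
>>> port is specific to our setup:)
>>>
>>> curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage
>>> curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/NonHeapMemoryUsage
>>> # committed virtual memory on the host:
>>> grep Committed_AS /proc/meminfo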
>>>
>>> We are running atop with a sampling interval of 60 seconds.  After the
>>> fact we observe that the java process is the one responsible for most
>>> of the disk activity during the unavailability period.  We also see kswapd0
>>> high on the list from time to time, always with 0K reads but non-zero write
>>> bandwidth.  There is no swap space defined on these instances, so it is not
>>> really clear why kswapd appears at the top of the list at all (a measurement
>>> error?).
>>>
>>> We also attempted to troubleshoot by running jstack, jmap and pmap
>>> against the Cassandra process in the background every few minutes.  The idea
>>> was to compare dumps taken before and during unavailability, but that hasn't
>>> led to any findings so far.  Ultimately we had to stop doing this once we
>>> saw that jmap can also become stuck, burning CPU cycles.  The output of jmap
>>> is not that useful anyway, but we fear that jstack might exhibit the same
>>> behavior.  So we wanted to avoid making the issue worse than it currently
>>> is and disabled this debug sampling.
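>>>
>>> (For completeness, the sampling was essentially a loop like the following;
>>> the interval and output paths here are illustrative:)
>>>
>>> PID=$(pgrep -f CassandraDaemon)
>>> mkdir -p /var/tmp/debug
>>> while sleep 300; do
>>>   ts=$(date +%Y%m%dT%H%M%S)
>>>   jstack "$PID"       > /var/tmp/debug/jstack.$ts 2>&1
>>>   jmap -histo "$PID"  > /var/tmp/debug/jmap.$ts   2>&1
>>>   pmap -x "$PID"      > /var/tmp/debug/pmap.$ts   2>&1
>>> done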
>>>
>>> Now to my questions:
>>>
>>> 1. Is there anything in Cassandra or in the JVM that could explain suddenly
>>> reading from the non-data volume at such a high rate, for prolonged
>>> periods of time, as described above?
>>>
>>> 2. Why does JVM heap utilization never reach the 8GB that we provide
>>> to it?
>>>
>>> 3. Why is committed virtual memory so much bigger than the sum of the heap
>>> and off-heap memory reported by JMX?  To what can this difference be
>>> attributed?  I've just visited a node at random and collected the "off heap
>>> memory used" numbers reported by nodetool cfstats, and still I see only
>>> 2.6GB in total, while Committed_AS is ~12GB.  Is there a more direct way to
>>> monitor off-heap memory usage by the JVM?
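>>>
>>> (One candidate we are aware of, but have not tried, is JVM Native Memory
>>> Tracking, which only covers the JVM's own native allocations:)
>>>
>>> # add to the JVM options and restart (adds some overhead):
>>> #   -XX:NativeMemoryTracking=summary
>>> # then, on the running process:
>>> jcmd $(pgrep -f CassandraDaemon) VM.native_memory summary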
>>>
>>> 4. The only Jira issue related to Linux OOM that we've found is this one:
>>> https://issues.apache.org/jira/browse/CASSANDRA-13931  This might be
>>> related to our OOM, but it still doesn't explain the unexpected IO anomalies.
>>>
>>> I would really appreciate any hints / pointers / insights!  The more I
>>> learn about this issue, the less I understand it...
>>>
>>> Regards,
>>> --
>>> Alex
>>>
>>>

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
