The kswapd issue is interesting. Is it possible you're being affected by
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457 - although I
don't see a fix for Trusty listed there?

On Wed, Dec 5, 2018 at 11:34 AM Riccardo Ferrari <ferra...@gmail.com> wrote:

> Hi Alex,
>
> I saw that behaviour in the past. I can tell you the kswapd0 usage is
> connected to the `disk_access_mode` property. On 64-bit systems it defaults
> to mmap. That also explains why your virtual memory is so high (it somehow
> matches the node load, right?). I can't find a good reference, but if you
> google "kswapd0 cassandra" you'll find quite a few resources.
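>
> One way to check the mmap theory is to sum the file-backed mappings of the
> java process from /proc/<pid>/smaps. A rough sketch in Python (assuming
> Linux and that <pid> is the Cassandra java process):
>
>     # Sum the file-backed mmap'd regions of a process from /proc/<pid>/smaps
>     # to see how much of its virtual size is mapped file data (SSTables).
>     import re
>     import sys
>
>     def mapped_file_mb(pid):
>         total_kb = 0
>         current_is_file = False
>         with open("/proc/%s/smaps" % pid) as smaps:
>             for line in smaps:
>                 if re.match(r"^[0-9a-f]+-[0-9a-f]+ ", line):
>                     # Mapping header: the 6th field, when present, is the path.
>                     fields = line.split()
>                     current_is_file = (len(fields) >= 6
>                                        and fields[5].startswith("/"))
>                 elif current_is_file and line.startswith("Size:"):
>                     total_kb += int(line.split()[1])
>         return total_kb / 1024.0
>
>     if __name__ == "__main__":
>         print("file-backed mappings: %.0f MB" % mapped_file_mb(sys.argv[1]))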
>
> HTH,
>
> On Wed, Dec 5, 2018 at 6:39 PM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> Hello,
>>
>> We are running the following setup on AWS EC2:
>>
>> Host system (AWS AMI): Ubuntu 14.04.4 LTS,
>> Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5
>> 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Cassandra process runs inside a docker container.
>> Docker image is based on Ubuntu 18.04.1 LTS.
>>
>> Apache Cassandra 3.0.17, installed from .deb packages.
>>
>> $ java -version
>> openjdk version "1.8.0_181"
>> OpenJDK Runtime Environment (build
>> 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
>> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>> We have a total of 36 nodes.  All are r4.large instances with 2 vCPUs and
>> ~15 GB RAM.
>> On each instance we have:
>> - a 2TB gp2 SSD EBS volume for data and commit log,
>> - an 8GB gp2 SSD EBS volume for the system (root volume).
>>
>> Non-default settings in cassandra.yaml:
>> num_tokens: 16
>> memtable_flush_writers: 1
>> concurrent_compactors: 1
>> endpoint_snitch: Ec2Snitch
>>
>> JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
>> Garbage collection: CMS with default settings.
>>
>> We repair once a week using Cassandra Reaper: parallel, intensity 1, 64
>> segments per node.  The issue also happens outside of repair time.
>>
>> The symptoms:
>> ============
>>
>> Sporadically a node becomes unavailable for a period of time between a few
>> minutes and a few hours.  According to our analysis, and as pointed out by
>> the AWS support team, the unavailability is caused by exceptionally high read
>> bandwidth on the *root* EBS volume.  I repeat, on the root volume, *not* on
>> the data/commitlog volume.  Basically, the amount of IO exceeds the
>> instance's bandwidth (~52MB/s) and all other network communication becomes
>> impossible.
>>
>> The root volume contains operating system, docker container with OpenJDK
>> and Cassandra binaries, and the logs.
>>
>> Most of the time, whenever this happens, it is too late to SSH into the
>> instance to troubleshoot: it becomes completely unavailable within a very
>> short period of time.
>> Rebooting the affected instance helps to bring it back to life.
>>
>> Starting from the middle of last week we have seen this problem
>> repeatedly 1-3 times a day, affecting different instances in a seemingly
>> random fashion.  Most of the time it affects only one instance, but we've
>> had one incident when 9 nodes (3 from each of the 3 Availability Zones)
>> were down at the same time due to this exact issue.
>>
>> Actually, we've had the same issue on this Cassandra cluster before, around
>> 3 months ago (beginning to mid-September 2018).  At that time we were
>> running on m4.xlarge instances (these have 4 vCPUs and 16GB RAM).
>>
>> As a mitigation measure we have migrated away from those to r4.2xlarge.
>> Then we didn't observe any issues for a few weeks, so we have scaled down
>> two times: to r4.xlarge and then to r4.large.  The last migration was
>> completed before Nov 13th.  No changes to the cluster or application
>> happened since that time.
>>
>> Now, after some weeks the issue appears again...
>>
>> When we are not fast enough to react and reboot the affected instance, we
>> can see that ultimately the Linux OOM killer kicks in and kills the java
>> process running Cassandra.  After that the instance becomes available
>> almost immediately.  This allows us to rule out other processes running in
>> the background as potential offenders.
>>
>> We routinely observe Memory.HeapMemoryUsage.used between 1GB and 6GB
>> and Memory.NonHeapMemoryUsage.used below 100MB, as reported by JMX (via
>> Jolokia).  At the same time, Committed_AS on each host is constantly around
>> 11-12GB, as reported by atop(1) and prometheus.
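>>
>> A quick way to cross-check both numbers from one place is a small script; a
>> minimal sketch (assuming Jolokia's default agent port 8778 under /jolokia,
>> and a standard Linux /proc/meminfo):
>>
>>     # Compare JVM heap/non-heap usage (read via Jolokia's REST API) with
>>     # the host-wide Committed_AS from /proc/meminfo.
>>     # Assumes Jolokia listens on localhost:8778 (adjust the URL as needed).
>>     import json
>>     import urllib.request
>>
>>     def jolokia_memory(base="http://localhost:8778/jolokia"):
>>         url = base + "/read/java.lang:type=Memory"
>>         with urllib.request.urlopen(url) as resp:
>>             value = json.load(resp)["value"]
>>         return (value["HeapMemoryUsage"]["used"],
>>                 value["NonHeapMemoryUsage"]["used"])
>>
>>     def committed_as_bytes():
>>         with open("/proc/meminfo") as meminfo:
>>             for line in meminfo:
>>                 if line.startswith("Committed_AS:"):
>>                     return int(line.split()[1]) * 1024  # value is in kB
>>         raise RuntimeError("Committed_AS not found")
>>
>>     if __name__ == "__main__":
>>         heap, non_heap = jolokia_memory()
>>         print("heap used:     %5.1f GiB" % (heap / 2.0 ** 30))
>>         print("non-heap used: %5.1f GiB" % (non_heap / 2.0 ** 30))
>>         print("Committed_AS:  %5.1f GiB" % (committed_as_bytes() / 2.0 ** 30))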
>>
>> We are running atop with a sampling interval of 60 seconds.  After the fact
>> we observe that the java process is the one responsible for most of the
>> disk activity during the unavailability period.  We also see kswapd0 high on
>> the list from time to time, which always has 0K reads, but non-zero write
>> bandwidth.  There is no swap space defined on these instances, so it is not
>> really clear why kswapd appears at the top of the list at all (measurement
>> error?).
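>>
>> To see whether kswapd is really doing reclaim work during those windows, we
>> could also sample the reclaim counters in /proc/vmstat; a minimal sketch
>> (assuming the pgscan_kswapd*/pgsteal_kswapd* counter names, which are still
>> split per memory zone on a 4.4 kernel):
>>
>>     # Sample kswapd reclaim and major-fault counters from /proc/vmstat
>>     # once a minute and print the per-interval deltas.
>>     import time
>>
>>     PREFIXES = ("pgscan_kswapd", "pgsteal_kswapd", "pgmajfault")
>>
>>     def read_counters():
>>         counters = {}
>>         with open("/proc/vmstat") as vmstat:
>>             for line in vmstat:
>>                 name, value = line.split()
>>                 if name.startswith(PREFIXES):
>>                     counters[name] = int(value)
>>         return counters
>>
>>     if __name__ == "__main__":
>>         previous = read_counters()
>>         while True:
>>             time.sleep(60)
>>             current = read_counters()
>>             deltas = dict((k, current[k] - previous[k]) for k in current)
>>             print(time.strftime("%H:%M:%S"), deltas)
>>             previous = current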
>>
>> We also attempted to troubleshoot by running jstack, jmap and pmap
>> against the Cassandra process in the background every few minutes.  The idea
>> was to compare dumps taken before and during unavailability, but that hasn't
>> led to any findings so far.  Ultimately we had to stop doing this, once we
>> saw that jmap can also become stuck burning CPU cycles.  Now, the output of
>> jmap is not that useful anyway, but we fear that jstack might also exhibit
>> the same behavior.  So we wanted to avoid making the issue worse than it
>> currently is and disabled this debug sampling.
>>
>> Now to my questions:
>>
>> 1. Is there anything in Cassandra or in the JVM that could explain sudden
>> reads from the non-data volume at such a high rate, for prolonged periods
>> of time, as described above?
>>
>> 2. Why does JVM heap utilization never reach the 8GB that we provide to
>> it?
>>
>> 3. Why is the committed virtual memory so much bigger than the sum of the
>> heap and off-heap usage reported by JMX?  To what can this difference be
>> attributed?  I've just visited a node at random and collected the "off heap
>> memory used" numbers reported by nodetool cfstats, and still I see only
>> 2.6GB in total, while Committed_AS is ~12GB.  Is there a more direct way to
>> monitor off-heap memory usage by the JVM?  (See also the sketch after these
>> questions.)
>>
>> 4. The only Jira issue related to Linux OOM that we've found is this one:
>> https://issues.apache.org/jira/browse/CASSANDRA-13931  This might be
>> related to our OOM, but it still doesn't explain the unexpected IO anomalies.
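>>
>> For question 3, one option that might qualify as more direct is the JVM's
>> Native Memory Tracking: start the process with -XX:NativeMemoryTracking=summary
>> and poll it with jcmd.  A minimal sketch of such a poller (assuming jcmd is on
>> the PATH and NMT was enabled at startup; the flag adds some overhead):
>>
>>     # Poll JVM Native Memory Tracking via jcmd and print the headline
>>     # reserved/committed figures per category.
>>     # Assumes the JVM was started with -XX:NativeMemoryTracking=summary.
>>     import subprocess
>>     import sys
>>
>>     def nmt_summary(pid):
>>         result = subprocess.run(
>>             ["jcmd", str(pid), "VM.native_memory", "summary"],
>>             capture_output=True, text=True, check=True,
>>         )
>>         return result.stdout
>>
>>     if __name__ == "__main__":
>>         for line in nmt_summary(sys.argv[1]).splitlines():
>>             if "reserved=" in line:
>>                 print(line.strip())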
>>
>> I would really appreciate any hints / pointers / insights!  The more I
>> learn about this issue, the less I understand it...
>>
>> Regards,
>> --
>> Alex
>>
>>
