The kswapd issue is interesting. Is it possible you're being affected by https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457? I don't see a fix for Trusty listed there, though.
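If you can still reach an affected node before it locks up completely, a couple of quick checks might help confirm whether it is that class of kernel reclaim bug (illustrative commands only; the exact pgscan/pgsteal counter names vary a bit between kernel versions):

$ uname -r
$ grep -E 'pgscan_kswapd|pgscan_direct|pgsteal' /proc/vmstat
$ ps -o pid,stat,wchan:30,comm -C kswapd0

A kswapd0 that is permanently runnable while the pgscan_kswapd counters keep climbing, on a box with no swap configured, would at least point at page-cache/mmap reclaim pressure rather than at EBS itself. (On the off-heap accounting question further down, see the Native Memory Tracking note at the bottom of this mail.)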
On Wed, Dec 5, 2018 at 11:34 AM Riccardo Ferrari <ferra...@gmail.com> wrote:
> Hi Alex,
>
> I saw that behaviour in the past. I can tell you the kswapd0 usage is connected
> to the `disk_access_mode` property. On 64-bit systems it defaults to mmap. That
> also explains why your virtual memory is so high (it somehow matches the node
> load, right?). I cannot find a good reference, but googling for "kswapd0
> cassandra" will turn up quite a few resources.
>
> HTH,
>
> On Wed, Dec 5, 2018 at 6:39 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:
>
>> Hello,
>>
>> We are running the following setup on AWS EC2:
>>
>> Host system (AWS AMI): Ubuntu 14.04.4 LTS,
>> Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> The Cassandra process runs inside a Docker container.
>> The Docker image is based on Ubuntu 18.04.1 LTS.
>>
>> Apache Cassandra 3.0.17, installed from .deb packages.
>>
>> $ java -version
>> openjdk version "1.8.0_181"
>> OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
>> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>> We have a total of 36 nodes. All are r4.large instances; they have 2 vCPUs
>> and ~15 GB RAM.
>> On each instance we have:
>> - a 2 TB gp2 SSD EBS volume for data and commit log,
>> - an 8 GB gp2 SSD EBS volume for the system (root volume).
>>
>> Non-default settings in cassandra.yaml:
>> num_tokens: 16
>> memtable_flush_writers: 1
>> concurrent_compactors: 1
>> snitch: Ec2Snitch
>>
>> JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
>> Garbage collection: CMS with default settings.
>>
>> We repair once a week using Cassandra Reaper: parallel, intensity 1,
>> 64 segments per node. The issue also happens outside of repair time.
>>
>> The symptoms:
>> ============
>>
>> Sporadically a node becomes unavailable for a period of time between a few
>> minutes and a few hours. According to our analysis, and as pointed out by the
>> AWS support team, the unavailability is caused by exceptionally high read
>> bandwidth on the *root* EBS volume. I repeat: on the root volume, *not* on the
>> data/commitlog volume. Basically, the amount of IO exceeds the instance's
>> bandwidth (~52 MB/s) and all other network communication becomes impossible.
>>
>> The root volume contains the operating system, the Docker container with
>> OpenJDK and the Cassandra binaries, and the logs.
>>
>> Most of the time, when this happens it is too late to SSH into the instance to
>> troubleshoot: it becomes completely unavailable within a very short period of
>> time. Rebooting the affected instance brings it back to life.
>>
>> Starting from the middle of last week we have seen this problem repeatedly,
>> 1-3 times a day, affecting different instances in a seemingly random fashion.
>> Most of the time it affects only one instance, but we've had one incident
>> where 9 nodes (3 from each of the 3 Availability Zones) were down at the same
>> time due to this exact issue.
>>
>> Actually, we had the same issue on this same Cassandra cluster around 3 months
>> ago (beginning to mid-September 2018). At that time we were running on
>> m4.xlarge instances (these have 4 vCPUs and 16 GB RAM).
>>
>> As a mitigation measure we migrated away from those to r4.2xlarge. We then
>> didn't observe any issues for a few weeks, so we scaled down twice: to
>> r4.xlarge and then to r4.large. The last migration was completed before
>> Nov 13th. No changes to the cluster or application have happened since that
>> time.
>>
>> Now, after some weeks, the issue appears again...
>>
>> When we are not fast enough to react and reboot the affected instance, we can
>> see that ultimately the Linux OOM killer kicks in and kills the java process
>> running Cassandra. After that the instance becomes available almost
>> immediately. This allows us to rule out other processes running in the
>> background as potential offenders.
>>
>> We routinely observe Memory.HeapMemoryUsage.used between 1 GB and 6 GB and
>> Memory.NonHeapMemoryUsage.used below 100 MB, as reported by JMX (via Jolokia).
>> At the same time, Committed_AS on each host is constantly around 11-12 GB, as
>> reported by atop(1) and Prometheus.
>>
>> We are running atop with a sampling interval of 60 seconds. After the fact we
>> observe that the java process is the one responsible for most of the disk
>> activity during the unavailability period. We also see kswapd0 high on the
>> list from time to time; it always shows 0K reads but non-zero write bandwidth.
>> There is no swap space defined on these instances, so it is not really clear
>> why kswapd appears at the top of the list at all (measurement error?).
>>
>> We also attempted to troubleshoot by running jstack, jmap and pmap against the
>> Cassandra process in the background every few minutes. The idea was to compare
>> dumps taken before and during unavailability, but that hasn't led to any
>> findings so far. Ultimately we had to stop doing this once we saw that jmap
>> can also become stuck burning CPU cycles. The output of jmap is not that
>> useful anyway, but we fear that jstack might exhibit the same behavior. So we
>> wanted to avoid making the issue worse than it currently is and disabled this
>> debug sampling.
>>
>> Now to my questions:
>>
>> 1. Is there anything in Cassandra or in the JVM that could explain suddenly
>> reading from the non-data volume at such a high rate, for prolonged periods of
>> time, as described above?
>>
>> 2. Why does JVM heap utilization never reach the 8 GB that we provide to it?
>>
>> 3. Why is committed virtual memory so much bigger than the sum of heap and
>> off-heap reported by JMX? To what can this difference be attributed? I've just
>> visited a node at random and collected the "off heap memory used" numbers
>> reported by nodetool cfstats, and still I see only 2.6 GB in total, while
>> Committed_AS is ~12 GB. Is there a more direct way to monitor off-heap memory
>> usage by the JVM?
>>
>> 4. The only Jira issue we've found related to the Linux OOM killer is this
>> one: https://issues.apache.org/jira/browse/CASSANDRA-13931 This might be
>> related to our OOM, but it still doesn't explain the unexpected IO anomalies.
>>
>> I would really appreciate any hints / pointers / insights! The more I learn
>> about this issue, the less I understand it...
>>
>> Regards,
>> --
>> Alex
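Regarding question 3: a more direct way to see where the JVM's own off-heap memory goes is Native Memory Tracking. A rough sketch (it adds a few percent of overhead, so probably only worth enabling on a single canary node; the config file layout below may differ inside your Docker image):

# add to conf/jvm.options (or to JVM_OPTS in cassandra-env.sh) and restart the node
-XX:NativeMemoryTracking=summary

# then, against the running Cassandra PID
$ jcmd <pid> VM.native_memory summary

The summary breaks committed memory down into heap, thread stacks, GC structures, code cache, internal allocations and so on, which should make the gap between the 8G heap and the ~12GB Committed_AS easier to attribute. Note that NMT does not account for mmap'ed SSTables: that is where disk_access_mode defaulting to mmap, as Riccardo mentioned, shows up in the process's virtual size.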