Hi Alex,

I have seen that behaviour in the past. I can tell you the kswapd0 usage is connected to the `disk_access_mode` property: on 64-bit systems it defaults to mmap. That also explains why your virtual memory is so high (it roughly matches the data load on the node, right?). Keep in mind that kswapd also reclaims page cache, not just swap, so it can get busy under memory pressure even on a host with no swap configured. I cannot point to a good reference, but googling for "kswapd0 cassandra" will turn up quite a few resources.
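If you want to double-check what a node is actually using: the effective mode is logged once at startup (exact wording may differ between versions), and the property can be overridden in cassandra.yaml. A minimal sketch, assuming the default log location:

  $ grep -i diskaccessmode /var/log/cassandra/system.log
  INFO ... DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap

  # cassandra.yaml: note this property is undocumented; valid values
  # are auto (the default), mmap, mmap_index_only and standard
  disk_access_mode: mmap_index_only

mmap_index_only keeps only the index files memory-mapped and reads the data files through regular buffered IO; that is what people usually try first when mmap is suspected of causing memory pressure.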
HTH,

On Wed, Dec 5, 2018 at 6:39 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:

> Hello,
>
> We are running the following setup on AWS EC2:
>
> Host system (AWS AMI): Ubuntu 14.04.4 LTS,
> Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> The Cassandra process runs inside a Docker container.
> The Docker image is based on Ubuntu 18.04.1 LTS.
>
> Apache Cassandra 3.0.17, installed from .deb packages.
>
> $ java -version
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
>
> We have a total of 36 nodes. All are r4.large instances; they have 2 vCPUs and ~15 GB RAM.
> On each instance we have:
> - a 2 TB gp2 SSD EBS volume for data and commit log,
> - an 8 GB gp2 SSD EBS volume for the system (root volume).
>
> Non-default settings in cassandra.yaml:
> num_tokens: 16
> memtable_flush_writers: 1
> concurrent_compactors: 1
> snitch: Ec2Snitch
>
> JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
> Garbage collection: CMS with default settings.
>
> We repair once a week using Cassandra Reaper: parallel, intensity 1, 64 segments per node. The issue also happens outside of repair time.
>
> The symptoms:
> ============
>
> Sporadically a node becomes unavailable for a period of time between a few minutes and a few hours. According to our analysis, and as pointed out by the AWS support team, the unavailability is caused by exceptionally high read bandwidth on the *root* EBS volume. I repeat: on the root volume, *not* on the data/commitlog volume. Basically, the amount of IO exceeds the instance's bandwidth (~52 MB/s) and all other network communication becomes impossible.
>
> The root volume contains the operating system, the Docker container with OpenJDK and the Cassandra binaries, and the logs.
>
> Most of the time, whenever this happens, it is too late to SSH into the instance to troubleshoot: it becomes completely unavailable within a very short period of time. Rebooting the affected instance helps to bring it back to life.
>
> Starting from the middle of last week we have seen this problem repeatedly, 1-3 times a day, affecting different instances in a seemingly random fashion. Most of the time it affects only one instance, but we've had one incident when 9 nodes (3 from each of the 3 Availability Zones) were down at the same time due to this exact issue.
>
> Actually, we've had the same issue on this Cassandra cluster before, around 3 months ago (beginning to mid-September 2018). At that time we were running on m4.xlarge instances (these have 4 vCPUs and 16 GB RAM).
>
> As a mitigation measure we migrated away from those to r4.2xlarge. We then didn't observe any issues for a few weeks, so we scaled down twice: first to r4.xlarge and then to r4.large. The last migration was completed before Nov 13th. No changes to the cluster or application have happened since that time.
>
> Now, after some weeks, the issue appears again...
>
> When we are not fast enough to react and reboot the affected instance, we can see that ultimately the Linux OOM killer kicks in and kills the java process running Cassandra. After that, the instance becomes available almost immediately. This allows us to rule out other processes running in the background as potential offenders.
> We routinely observe Memory.HeapMemoryUsage.used between 1 GB and 6 GB and Memory.NonHeapMemoryUsage.used below 100 MB, as reported by JMX (via Jolokia). At the same time, Committed_AS on each host is constantly around 11-12 GB, as reported by atop(1) and Prometheus.
>
> We are running atop with a sampling interval of 60 seconds. After the fact, we observe that the java process is the one responsible for most of the disk activity during the unavailability period. We also see kswapd0 high on the list from time to time; it always has 0K reads, but non-zero write bandwidth. There is no swap space defined on these instances, so it is not really clear why kswapd appears at the top of the list at all (measurement error?).
>
> We also attempted to troubleshoot by running jstack, jmap and pmap against the Cassandra process in the background every few minutes. The idea was to compare dumps taken before and during the unavailability, but that hasn't led to any findings so far. Ultimately we had to stop doing this, once we saw that jmap can also become stuck burning CPU cycles. The output of jmap is not that useful anyway, but we fear that jstack might exhibit the same behavior. So, to avoid making the issue worse than it currently is, we disabled this debug sampling.
>
> Now to my questions:
>
> 1. Is there anything in Cassandra or in the JVM that could explain suddenly reading from the non-data volume at such a high rate, for prolonged periods of time, as described above?
>
> 2. Why does JVM heap utilization never reach the 8 GB that we provide to it?
>
> 3. Why is committed virtual memory so much bigger than the sum of the heap and off-heap memory reported by JMX? To what can this difference be attributed? I've just visited a node at random and collected the "off heap memory used" numbers reported by nodetool cfstats, and I still see only 2.6 GB in total, while Committed_AS is ~12 GB. Is there a more direct way to monitor off-heap memory usage by the JVM?
>
> 4. The only Jira issue related to Linux OOM we've found is this one: https://issues.apache.org/jira/browse/CASSANDRA-13931. This might be related to our OOM, but it still doesn't explain the unexpected IO anomalies.
>
> I would really appreciate any hints / pointers / insights! The more I learn about this issue, the less I understand it...
>
> Regards,
> --
> Alex
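P.S. The next time it happens, if you can still get a shell for a few seconds, sampling per-process IO counters and major page faults is much cheaper than jstack/jmap. A rough sketch (assuming pgrep can find the process by the CassandraDaemon class name; adjust the pattern to your setup):

  # bytes this process really caused to be fetched from storage
  $ cat /proc/$(pgrep -f CassandraDaemon)/io

  # major faults are page faults that had to go to disk; if pages of the
  # JVM, the jars or other mapped files get evicted under memory pressure,
  # this counter jumps
  $ ps -o pid,min_flt,maj_flt,cmd -p $(pgrep -f CassandraDaemon)

If the root-volume reads coincide with a jump in major faults, that would suggest page-cache eviction forcing re-reads of the binaries and libraries living on that volume, which would fit the mmap/memory-pressure theory.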
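P.P.S. Regarding your question 3: besides the heap, the JVM commits memory for thread stacks, the code cache, metaspace, GC data structures and direct NIO buffers, none of which show up under HeapMemoryUsage. The most direct breakdown I know of is the JVM's Native Memory Tracking. A sketch (the flag adds some overhead, so maybe enable it on a single node first):

  # add to the JVM options and restart the node:
  -XX:NativeMemoryTracking=summary

  # then query the running JVM:
  $ jcmd <pid> VM.native_memory summary

Also, as far as I know, read-only file-backed mmap regions (the mapped SSTables) are not charged to Committed_AS, so the 11-12 GB you see should be mostly anonymous/malloc memory rather than the mapped data files.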