Hi Alex,

I have seen that behaviour in the past. I can tell you the kswapd0 activity
is connected to the `disk_access_mode` property: on 64-bit systems it
defaults to mmap. That also explains why your virtual memory is so high (it
roughly matches the node's data load, right?). I cannot find a good
reference, but googling for "kswapd0 cassandra" turns up quite a few
resources.
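
If you want to confirm that on one of your nodes, a rough check along these
lines should show whether the data files are memory-mapped (the pgrep
pattern and the yaml value are from memory, so double-check against your
setup, especially since your JVM runs inside a container):

# Count mmap'ed SSTable segments for the Cassandra process; a large number
# of *-Data.db / *-Index.db mappings means mmap is in use.
CASS_PID=$(pgrep -f CassandraDaemon)
grep -c -E '(Data|Index)\.db' /proc/$CASS_PID/maps

# To restrict mapping to index files only, set this in cassandra.yaml and
# restart (accepted values are auto, mmap, mmap_index_only and standard):
#   disk_access_mode: mmap_index_only

As far as I understand, mmap'ed reads go through the page cache, so under
memory pressure kswapd gets busy reclaiming those pages even with no swap
configured, which would match what you see in atop.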

HTH,

On Wed, Dec 5, 2018 at 6:39 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> Hello,
>
> We are running the following setup on AWS EC2:
>
> Host system (AWS AMI): Ubuntu 14.04.4 LTS,
> Linux <hostname> 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5
> 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> Cassandra process runs inside a docker container.
> Docker image is based on Ubuntu 18.04.1 LTS.
>
> Apache Cassandra 3.0.17, installed from .deb packages.
>
> $ java -version
> openjdk version "1.8.0_181"
> OpenJDK Runtime Environment (build
> 1.8.0_181-8u181-b13-1ubuntu0.18.04.1-b13)
> OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
>
> We have a total of 36 nodes.  All are r4.large instances, with 2 vCPUs and
> ~15 GB RAM each.
> On each instance we have:
> - 2TB gp2 SSD EBS volume for data and commit log,
> - 8GB gp2 SSD EBS for system (root volume).
>
> Non-default settings in cassandra.yaml:
> num_tokens: 16
> memtable_flush_writers: 1
> concurrent_compactors: 1
> snitch: Ec2Snitch
>
> JVM heap/stack size options: -Xms8G -Xmx8G -Xmn800M -Xss256k
> Garbage collection: CMS with default settings.
>
> We repair once a week using Cassandra Reaper: parallel, intensity 1, 64
> segments per node.  The issue also happens outside of repair time.
>
> The symptoms:
> ============
>
> Sporadically a node becomes unavailable for a period of time between a few
> minutes and a few hours.  According to our analysis, and as pointed out by
> the AWS support team, the unavailability is caused by exceptionally high
> read bandwidth on the *root* EBS volume.  I repeat, on the root volume,
> *not* on the data/commitlog volume.  Basically, the amount of IO exceeds
> the instance's bandwidth (~52MB/s) and all other network communication
> becomes impossible.
>
> The root volume contains the operating system, the docker container with
> OpenJDK and the Cassandra binaries, and the logs.
>
> Most of the time, whenever this happens it is too late to SSH into the
> instance to troubleshoot: it becomes completely unavailable within a very
> short period of time.
> Rebooting the affected instance helps to bring it back to life.
>
> Starting from the middle of last week we have seen this problem repeatedly
> 1-3 times a day, affecting different instances in a seemingly random
> fashion.  Most of the time it affects only one instance, but we've had one
> incident when 9 nodes (3 from each of the 3 Availability Zones) were down
> at the same time due to this exact issue.
>
> Actually, we had the same issue on this same Cassandra cluster around 3
> months ago (early to mid-September 2018).  At that time we were running on
> m4.xlarge instances (these have 4 vCPUs and 16GB RAM).
>
> As a mitigation measure we migrated away from those to r4.2xlarge.  We
> then didn't observe any issues for a few weeks, so we scaled down twice:
> first to r4.xlarge and then to r4.large.  The last migration was completed
> before Nov 13th.  No changes to the cluster or application have been made
> since that time.
>
> Now, after some weeks the issue appears again...
>
> When we are not fast enough to react and reboot the affected instance, we
> can see that ultimately the Linux OOM killer kicks in and kills the java
> process running Cassandra.  After that the instance becomes available
> almost immediately.  This allows us to rule out other processes running in
> the background as potential offenders.
>
> We routinely observe Memory.HeapMemoryUsage.used between 1GB and 6GB
> and Memory.NonHeapMemoryUsage.used below 100MB, as reported by JMX (via
> Jolokia).  At the same time, Committed_AS on each host is constantly around
> 11-12GB, as reported by atop(1) and prometheus.
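
(A side note from me: if it helps to cross-check those numbers on a live
node, the Jolokia HTTP endpoint and /proc expose the same data directly.
The port and URL below assume a default Jolokia agent setup, so adjust as
needed.)

# Heap / non-heap as seen by the JVM, via the Jolokia agent:
curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage
curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/NonHeapMemoryUsage

# Committed_AS as seen by the kernel (the same value atop and prometheus report):
grep Committed_AS /proc/meminfo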
>
> We are running atop with a sampling interval of 60 seconds.  After the
> fact we observe that the java process is the one responsible for most of
> the disk activity during the unavailability period.  We also see kswapd0
> high on the list from time to time, always with 0K reads but non-zero
> write bandwidth.  There is no swap space defined on these instances, so it
> is not really clear why kswapd appears at the top of the list at all
> (measurement error?).
>
> We also attempted to troubleshoot by running jstack, jmap and pmap against
> the Cassandra process in the background every few minutes.  The idea was
> to compare dumps taken before and during unavailability, but that hasn't
> led to any findings so far.  Ultimately we had to stop doing this, once we
> saw that jmap can also become stuck burning CPU cycles.  The output of
> jmap is not that useful anyway, but we fear that jstack might expose the
> same behavior.  So we wanted to avoid making the issue worse than it
> currently is and disabled this debug sampling.
>
> Now to my questions:
>
> 1. Is there anything in Cassandra or in the JVM that could explain sudden
> reading from the non-data volume at such a high rate, for prolonged
> periods of time, as described above?
>
> 2. Why does JVM heap utilization never reach the 8GB that we provide to
> it?
>
> 3. Why is committed virtual memory so much bigger than the sum of the heap
> and off-heap reported by JMX?  To what can this difference be attributed?
> I've just visited a node at random and collected the "off heap memory
> used" numbers reported by nodetool cfstats, and still I see only 2.6GB in
> total, while Committed_AS is ~12GB.  Is there a more direct way to monitor
> off-heap memory usage by the JVM?
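
(Another note from me, regarding question 3: JVM Native Memory Tracking
gives a more direct breakdown of off-heap usage than summing up cfstats.
It needs an extra startup flag, so treat the following as a sketch to try
on a single node first; the tracking overhead is small but not zero, and
jcmd has to run as the same user, inside the container if that is where
the JVM lives.)

# Add to the JVM options and restart the node:
#   -XX:NativeMemoryTracking=summary

# Then ask the running JVM for its native memory breakdown:
jcmd $(pgrep -f CassandraDaemon) VM.native_memory summary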
>
> 4. The only Jira issue related to Linux OOM that we've found is this one:
> https://issues.apache.org/jira/browse/CASSANDRA-13931  This might be
> related to our OOM, but it still doesn't explain the unexpected IO
> anomalies.
>
> I would really appreciate any hints / pointers / insights!  The more I
> learn about this issue, the less I understand it...
>
> Regards,
> --
> Alex
>
>
