[ https://issues.apache.org/jira/browse/CASSANDRA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Heiko Sommer updated CASSANDRA-12699:
-------------------------------------
    Description: 
The Cassandra JVM process uses many gigabytes of page table memory during 
certain activities, which can lead to Linux oom-killer action with 
"java.lang.OutOfMemoryError: null" logs.
Page table memory is not reported by Linux tools such as "top" or "ps" and 
may therefore also be responsible for other spurious Cassandra issues with 
"memory eating" or crashes, e.g. CASSANDRA-8723.

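Since "top" and "ps" do not show it, the quickest way to check the page 
table size of a running process is to read /proc directly, for example 
(assuming pgrep can find the daemon by its main class name, 
CassandraDaemon):

    grep VmPTE /proc/$(pgrep -f CassandraDaemon)/status
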
The problem happens especially (or perhaps only?) during large compactions 
and anticompactions. 
Eventually all memory gets released, so there is no real leak. Still, I 
suspect that the memory mappings that fill the page table could be released 
much sooner, keeping the page table size at a small fraction of the total 
Cassandra process memory. 

How to reproduce: Record the memory use on a Cassandra node, including page 
table memory, for example using the attached script cassandraMemoryLog.sh. 
Even when there is no crash, the ramp-up and sudden release of page table 
memory is visible. 
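
For illustration, a minimal sketch of what such a logging loop might look 
like (hypothetical; the attached cassandraMemoryLog.sh may differ in the 
details):

    #!/bin/bash
    # Append selected memory stats of the Cassandra process once per minute.
    LOG=cassandra-memory.log
    while true; do
        PID=$(pgrep -f CassandraDaemon | head -n 1)
        if [ -n "$PID" ]; then
            # VmSize/VmRSS/VmPTE: virtual size, resident set, page table size
            STATS=$(awk '/^Vm(Size|RSS|PTE):/ {printf "%s%s ", $1, $2}' /proc/$PID/status)
            echo "$(date '+%Y-%m-%d %H:%M:%S') $STATS" >> "$LOG"
        fi
        sleep 60
    done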

A stacked area plot of the memory on one of our crashed nodes is attached 
(PageTableMemoryExample.png). The page table memory used by Cassandra is 
shown in red ("VmPTE").
(In the plot we also see that the sum of the measured memory portions 
sometimes exceeds the total memory. This is probably an artifact of how RSS 
is measured, perhaps including some buffers/cache memory that also counts 
toward available memory. It does not invalidate the finding that the page 
table memory grows to enormous sizes.) 

Shortly before the crash, /proc/$PID/status reported: 
                VmPeak: 6989760944 kB
                VmSize: 5742400572 kB
                VmLck:   4735036 kB
                VmHWM:   8589972 kB
                VmRSS:   7022036 kB
                VmData: 10019732 kB
                VmStk:        92 kB
                VmExe:         4 kB
                VmLib:     17584 kB
                VmPTE:   3965856 kB
                VmSwap:        0 kB
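
As a rough plausibility check (my own arithmetic, not measured data): on 
x86-64 with 4 kB pages, each page table entry takes 8 bytes, so the VmPTE 
value above corresponds to roughly 1.9 TB of mapped address space, which 
would fit a process that mmaps terabytes of SSTable files:

    # 3965856 kB of page tables -> number of PTEs -> 4 kB mapped per PTE.
    # Assumes x86-64, 4 kB pages, 8-byte PTEs; ignores upper-level tables.
    echo $((3965856 * 1024 / 8 * 4 / 1024 / 1024))   # => 1936 GB, ~1.9 TB
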
The files cassandra.yaml and cassandra-env.sh used on the node where the data 
was taken are attached. 
Please let me know if I should provide any other data or descriptions to help 
with this ticket. 

Known workarounds: Use more RAM, or limit the amount of Java heap memory. In 
the above crash, MAX_HEAP_SIZE was not set, so the default heap size for 
12 GB RAM was used (-Xms2976M, -Xmx2976M). 
We have not yet tried whether variations of heap vs. off-heap config choices 
make a difference. 
Perhaps there are other workarounds using -XX:+UseLargePages or related 
Linux settings to reduce the size of the process page table?
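
An untested sketch of what that might look like (the hugepage count is an 
assumption sized for a ~3 GB heap; note that -XX:+UseLargePages only affects 
JVM-managed memory such as the heap, not the mmap'ed data files that 
presumably dominate VmPTE here, so its effect is uncertain):

    # Reserve 1536 x 2 MB hugepages (= 3 GB) for the JVM heap; run as root.
    sysctl -w vm.nr_hugepages=1536
    # Then in cassandra-env.sh:
    JVM_OPTS="$JVM_OPTS -XX:+UseLargePages"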

I believe that we see these crashes more often than other projects because 
we have a test system with little RAM but a lot of data (~3 TB compressed 
per node), and slow CPUs, so that anti-/compactions overlap a lot. 
Ideally, Cassandra (native) code should be changed to release memory in 
smaller chunks, so that page table size cannot cause an otherwise stable 
system to crash.



> Excessive use of "hidden" Linux page table memory
> -------------------------------------------------
>
>                 Key: CASSANDRA-12699
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12699
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.2.7 on Red Hat 6.7, with Java 1.8.0_73. 
> Probably others. 
>            Reporter: Heiko Sommer
>         Attachments: PageTableMemoryExample.png, cassandra-env.sh, 
> cassandra.yaml, cassandraMemoryLog.sh
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
