[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16607574#comment-16607574 ] John Roesler commented on KAFKA-6777:
---
Hi [~habdank],

When Java processes are under extreme memory pressure, but not actually out of memory, it is expected that GC takes an increasing percentage of CPU. The GC interruptions grow more frequent and also longer, although G1GC attempts to bound the pause length.

Note that user-space code, such as Kafka, has effectively *no visibility* into when these collections occur or how long they take. From the application code's perspective, being in this state is exactly like running on a slow CPU. This is why you can't expect Kafka, or any other JVM application, to detect this state for you.

When running Kafka, or any other JVM application, you will want to monitor GC activity, as you suggested. When it passes a threshold that you're comfortable with (you suggested 40% CPU time), you would set up an alert.

I don't think it would be a good idea to just bounce the process when GC becomes an issue. Heavy GC is just an indication that you're trying to run the application with a heap that is too small for its workload. Better reactions would be to increase the heap size or decrease the workload per node.

Note that with JVM apps, you have to account not only for the memory requirements of the application itself, but also for the garbage it generates. If the heap is too small for the app's own memory requirements, then you *will* get an OOME. If the heap is big enough for the app, but not big enough for the GC's data structures, then you'll just get heavy GC and *not* an OOME.

Does this make sense?
Thanks,
-John

> Wrong reaction on Out Of Memory situation
> -----------------------------------------
>
> Key: KAFKA-6777
> URL: https://issues.apache.org/jira/browse/KAFKA-6777
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 1.0.0
> Reporter: Seweryn Habdank-Wojewodzki
> Priority: Critical
> Attachments: screenshot-1.png
>
> Dears,
> We have already encountered problems related to Out Of Memory situations in
> Kafka Broker and streaming clients many times.
> The scenario is the following. When Kafka Broker (or a Streaming Client) is
> under load and has too little memory, there are no errors in the server
> logs. One can see some cryptic entries in the GC logs, but they are
> definitely not self-explanatory. Kafka Broker (and Streaming Clients) keep
> running. Later we see in JMX monitoring that the JVM spends more and more
> time in GC; in our case it grows from e.g. 1% to 80-90% of CPU time used by
> GC.
> Next, the software collapses into zombie mode: the process is not ending. In
> such a case I would expect the process to crash (e.g. with SIGSEGV). Even
> worse, Kafka treats such a zombie process as normal and still sends
> messages, which are in fact getting lost, and the cluster does not exclude
> broken nodes. The question is how to configure Kafka to really terminate the
> JVM rather than remain in zombie mode, to give the other nodes a chance to
> know that something is dead.
> I would expect that in an Out Of Memory situation the JVM ends, if not
> gracefully then at least by crashing the process.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
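John's suggestion above, monitoring GC activity and alerting past a threshold, could be sketched with the standard `java.lang.management` beans. This is a minimal sketch, not Kafka code; the class name and the 40% threshold are illustrative:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcLoad {
    /** Total accumulated GC time (ms) across all collectors since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Approximate fraction of wall-clock uptime spent in GC. */
    static double gcFraction() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime > 0 ? (double) totalGcMillis() / uptime : 0.0;
    }

    public static void main(String[] args) {
        double f = gcFraction();
        System.out.printf("GC share of uptime: %.2f%%%n", f * 100);
        // An external alerting rule would fire when this fraction stays above
        // a chosen threshold, e.g. 0.40 as suggested in this thread.
    }
}
```

The same numbers are exposed over JMX (the `java.lang:type=GarbageCollector,*` beans), which is how the reporter's monitoring observed the 80-90% figure.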
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606816#comment-16606816 ] Seweryn Habdank-Wojewodzki commented on KAFKA-6777:
---
What do you mean by long pauses? I see the broker not doing anything for hours. Is that expected behaviour of GC?
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605926#comment-16605926 ] John Roesler commented on KAFKA-6777:
---
Yes, it certainly seems safer to generally avoid catching Errors.

> But anyhow, I would expect that a lack of critical resources like memory
> will quickly lead to a crash with a FATAL error, e.g. Out Of Memory.

Have you seen any evidence that this is not happening? Note that frequent or long GC pauses are not the same as running out of memory.
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598974#comment-16598974 ] John Roesler commented on KAFKA-6777:
---
Hi [~habdank],

It's unfortunately a common behavior of JVM applications that, when they are memory-constrained, they never actually crash but instead disappear into GC-pause oblivion. For practical purposes, we don't have any visibility into when GC pauses occur, how long they are, or even what our resident memory footprint is. This is all by design of the JVM.

However, if we are catching and swallowing OOME, or really any subclass of Error, that would not be good. An Error is by definition not recoverable and should be caught only in order to exit gracefully. I've taken a quick perusal of the code, and most of the `catch (Throwable t)` instances I see are logged and/or propagated. Some usages (such as in KafkaAdminClient.AdminClientRunnable) are suspicious.

I'm unclear on whether you are saying that when Kafka runs out of memory, it
# shuts down, but hides the reason, or
# continues running.
The latter seems unlikely: if the JVM is truly out of memory, then catching and swallowing the OOME would only work for so long; eventually some operation would attempt to allocate memory outside of a catch block and still crash the app.

Can you elaborate on why you think the culprit is a swallowed OOME rather than just normal GC hell? Is there a specific code path that you think is responsible for catching and swallowing OOMEs?

Thanks,
-John
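The "catch Throwable, but only exit gracefully on Error" pattern John describes could look roughly like the following. This is an illustrative sketch, not Kafka's actual code; the names are hypothetical:

```java
public class ErrorHandling {
    // A catch-all guard that logs every failure, but rethrows Error so that
    // a truly fatal condition such as OutOfMemoryError still kills the
    // process instead of being silently swallowed.
    static void runGuarded(Runnable task) {
        try {
            task.run();
        } catch (Throwable t) {
            System.err.println("Task failed: " + t);
            if (t instanceof Error) {
                // Error is not recoverable: propagate so the JVM can exit.
                throw (Error) t;
            }
            // Ordinary exceptions could be retried or reported here.
        }
    }

    public static void main(String[] args) {
        // A recoverable exception is logged and swallowed...
        runGuarded(() -> { throw new IllegalStateException("recoverable"); });
        // ...but an Error propagates out of the guard.
        try {
            runGuarded(() -> { throw new OutOfMemoryError("simulated"); });
        } catch (Error e) {
            System.out.println("fatal error propagated: " + e.getMessage());
        }
    }
}
```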
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437181#comment-16437181 ] Seweryn Habdank-Wojewodzki commented on KAFKA-6777:
---
Our options:
-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436445#comment-16436445 ] huxihx commented on KAFKA-6777:
---
A possible way forward is to diagnose the GC logs to see why so much CPU time is spent collecting. Which GC collector do you use, throughput GC or G1?
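To make that diagnosis possible, GC logging has to be enabled first. A hedged sketch of the relevant flags follows; the paths and rotation settings are illustrative, and the flag sets differ between JDK 8 and the unified logging introduced in JDK 9:

```shell
# JDK 8 style GC logging:
GC_LOG_OPTS_JDK8="-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/kafka/gc.log"

# JDK 9+ unified logging style (rotated, with timestamps):
GC_LOG_OPTS_JDK9="-Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime:filecount=10,filesize=10M"

echo "$GC_LOG_OPTS_JDK8"
echo "$GC_LOG_OPTS_JDK9"
```

The resulting logs can then be fed into a GC log analyzer to see pause lengths and collection frequency over time.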
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433863#comment-16433863 ] Seweryn Habdank-Wojewodzki commented on KAFKA-6777:
---
One more comment. I see quite often in Kafka that a Throwable is converted to a RuntimeException. This kind of code may lead to a situation where the OOM never appears. I made a simple example:
{code:java}
public class Main {
    public static void main(String[] args) {
        try {
            try {
                throw new OutOfMemoryError();
            } catch (Throwable t) {
                // very often in Kafka code:
                throw (RuntimeException) t;
                // end of "very often"
            }
        } catch (Exception ignore) {
        }
    }
}
{code}
Executed with:
{code:java}
-XX:OnOutOfMemoryError="echo OOM"
{code}
it leads to:
{code:java}
Process finished with exit code 0
{code}
I see no *OOM* string, and no _OutOfMemoryError_ is reported by any stack trace either. (The cast of the OutOfMemoryError to RuntimeException throws a ClassCastException, which is then silently swallowed by the {{catch (Exception ignore)}} block.)
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433799#comment-16433799 ] Seweryn Habdank-Wojewodzki commented on KAFKA-6777:
---
Thanks for the comment. The problem is that either the OutOfMemoryError is never thrown, because the algorithms try to do their best and keep loading the GC, so no message processing can happen; or the OutOfMemoryError is thrown but caught in code like catch (Throwable) {}.

The observed behaviour is that at INFO level there is no explicit error like OutOfMemoryError in the logs. I had looked at the JMX metrics: there the heap is exhausted and GC is endlessly busy, until nothing is reported to JMX anymore.

I mean, I could write a tool that reboots a Kafka node when the GC load on the CPU is higher than 40% or so, but such a tool is a workaround and not a solution to the problem. I am attaching graphs to highlight what happened.

!screenshot-1.png!
[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433726#comment-16433726 ] Mickael Maison commented on KAFKA-6777:
---
You usually do that via JVM options. For example, on Oracle's JVM you can use:
-XX:OnOutOfMemoryError="<cmd args>; <cmd args>"
[http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]

If you set it in KAFKA_OPTS, it will be picked up automatically by the tools under bin/.
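A concrete example of Mickael's suggestion might look like the following. This is a hedged sketch: the heap-dump path is illustrative, and -XX:+ExitOnOutOfMemoryError / -XX:+CrashOnOutOfMemoryError require JDK 8u92 or later. These flags make the JVM terminate on the first OutOfMemoryError instead of limping on as a zombie, which is exactly the behaviour the reporter is asking for:

```shell
# Terminate (and optionally heap-dump) on the first OutOfMemoryError,
# so the cluster can notice the dead node instead of a zombie broker.
export KAFKA_OPTS="-XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/kafka \
  -XX:+ExitOnOutOfMemoryError"
echo "$KAFKA_OPTS"
```

Note that, as discussed earlier in this thread, these flags only fire on an actual OutOfMemoryError raised by the JVM; they do not help with the GC-thrashing state where no OOME is ever thrown.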