[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-09-07 Thread John Roesler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607574#comment-16607574
 ] 

John Roesler commented on KAFKA-6777:
-

Hi [~habdank],

When Java processes are under extreme memory pressure, but not actually out of 
memory, it is expected to see GC take an increasing percentage of CPU. The GC 
interruptions will grow more frequent and also longer, although G1GC attempts 
to bound the pause length.

Note that the user-space code, such as Kafka, has effectively *no visibility* 
into when these collections occur or how long they take. From the application 
code's perspective, it's exactly like running on a slow CPU when you get into 
this state. This is why you can't expect Kafka, or any other JVM application, 
to detect this state for you.

When running Kafka, or any other JVM application, you will want to monitor GC 
activity, as you suggested, and set up an alert for when it passes a threshold 
you're comfortable with (you suggested 40% of CPU time).
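
For example, here is a minimal sketch of such a monitor, polling the standard 
GarbageCollectorMXBeans and comparing accumulated GC time against wall-clock 
time. The 40% threshold and the one-minute poll interval are placeholders, not 
recommendations, and for a broker you would typically read the same MXBeans 
remotely over JMX or through an existing metrics exporter:

{code:java}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeMonitor {
    public static void main(String[] args) throws InterruptedException {
        final double threshold = 0.40;            // placeholder: alert above 40% of wall-clock time
        long lastGcTime = 0;
        long lastWallTime = System.currentTimeMillis();
        while (true) {
            Thread.sleep(60_000);                 // placeholder: poll once a minute
            long gcTime = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcTime += gc.getCollectionTime(); // accumulated GC time in millis since JVM start
            }
            long now = System.currentTimeMillis();
            double gcFraction = (double) (gcTime - lastGcTime) / (now - lastWallTime);
            if (gcFraction > threshold) {
                System.err.println("ALERT: " + (int) (gcFraction * 100) + "% of the last interval was spent in GC");
            }
            lastGcTime = gcTime;
            lastWallTime = now;
        }
    }
}
{code}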

I don't think it would be a good idea to just bounce the process if GC is 
becoming an issue. The heavy GC is just an indication that you're trying to run 
the application with a heap that is too small for its workload. Better 
reactions would be to increase the heap size or decrease the workload per node.
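
For the broker, the heap is normally set through the KAFKA_HEAP_OPTS environment 
variable that the start scripts pick up; a hedged example (the 6 GB figure is 
purely illustrative, size the heap for your own workload):

{code}
# illustrative values only
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
bin/kafka-server-start.sh config/server.properties
{code}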

Note that with JVM apps, you have to account not only for the memory 
requirements of the application itself, but also for the garbage that it 
generates. If the heap is too small for the app's own memory requirements, then 
you *will* get an OOME. If the heap is big enough for the app, but not big 
enough for the GC's data structures, then you'll just get heavy GC and *not* an 
OOME. Does this make sense?

Thanks,

-John

> Wrong reaction on Out Of Memory situation
> -
>
> Key: KAFKA-6777
> URL: https://issues.apache.org/jira/browse/KAFKA-6777
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.0.0
>Reporter: Seweryn Habdank-Wojewodzki
>Priority: Critical
> Attachments: screenshot-1.png
>
>
> Dears,
> We have already encountered problems related to Out Of Memory situations in the 
> Kafka Broker and streaming clients many times.
> The scenario is the following.
> When the Kafka Broker (or Streaming Client) is under load and has too little 
> memory, there are no errors in the server logs. One can see some cryptic entries 
> in the GC logs, but they are definitely not self-explanatory.
> The Kafka Broker (and Streaming Clients) keeps running. Later we see in JMX 
> monitoring that the JVM spends more and more time in GC; in our case it grows from 
> e.g. 1% to 80%-90% of CPU time spent in GC.
> Next, the software collapses into zombie mode: the process does not end. In such a 
> case I would expect the process to crash (e.g. with SIGSEGV). Even worse, 
> Kafka treats such a zombie process as normal and somehow sends messages, which 
> are in fact getting lost; also, the cluster does not exclude broken nodes. The 
> question is how to configure Kafka to really terminate the JVM and not remain 
> in zombie mode, to give the other nodes a chance to know that something is 
> dead.
> I would expect that in an Out Of Memory situation the JVM is terminated, if not 
> gracefully then at least by crashing the process.





[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-09-07 Thread Seweryn Habdank-Wojewodzki (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606816#comment-16606816
 ] 

Seweryn Habdank-Wojewodzki commented on KAFKA-6777:
---

What do you mean by long pauses? 
I see the Broker not doing anything for hours. 
Is that expected behaviour of GC?



[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-09-06 Thread John Roesler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605926#comment-16605926
 ] 

John Roesler commented on KAFKA-6777:
-

Yes, it certainly seems safer to generally avoid catching Errors.

> But anyhow, I would expect, that lack of critical resources like memory will 
>quickly lead to crash with FATAL error e.g. Out Of Memory.

Have you seen some evidence that this is not happening? Note that frequent or 
long GC pauses are not the same as running out of memory.



[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-08-31 Thread John Roesler (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16598974#comment-16598974
 ] 

John Roesler commented on KAFKA-6777:
-

Hi [~habdank],

It's unfortunately a common behavior with JVM applications that when they are 
memory-constrained they never actually crash, but instead disappear into 
gc-pause oblivion. For practical purposes, we don't have any visibility into 
when GC pauses occur, how long they are, or even what our resident memory 
footprint is. This is all by design of the JVM.

However, if we are catching and swallowing OOME, or really any subclass of 
Error, it would not be good. Error is by definition not recoverable and should 
be caught only to gracefully exit.

I've taken a quick look through the code, and most of the `catch (Throwable t)` 
instances I see are logged and/or propagated. Some usages (such as in 
KafkaAdminClient.AdminClientRunnable) look suspicious.
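
To illustrate the pattern I would consider safe (this is just a sketch, not a 
quote of the actual Kafka code; the names are placeholders): log the Throwable, 
but rethrow Errors so the thread dies visibly instead of limping on:

{code:java}
public final class SafeCatchExample {
    public static void runOnce(Runnable doWork) {
        try {
            doWork.run();
        } catch (Throwable t) {
            System.err.println("Unexpected error in runnable: " + t);
            if (t instanceof Error) {
                // Errors (OutOfMemoryError, StackOverflowError, ...) are not recoverable;
                // rethrow instead of swallowing so the failure stays visible.
                throw (Error) t;
            }
            // non-fatal exceptions can be handled or propagated here as appropriate
        }
    }
}
{code}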

 

I'm unclear on whether you are saying that when Kafka runs out of memory, it 
 # shuts down, but hides the reason
 # or continues running

The latter seems unlikely, since if the JVM is truly out of memory, then 
catching and swallowing the OOME would only work for so long; it seems like 
eventually some operation would attempt to allocate memory outside of a catch 
block and still crash the app.

 

Can you elaborate on the reason you think that the culprit is a swallowed OOME 
instead of just normal GC hell?

Is there a specific code path that you think is responsible for catching and 
swallowing OOMEs?

Thanks,

-John



[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-04-13 Thread Seweryn Habdank-Wojewodzki (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437181#comment-16437181
 ] 

Seweryn Habdank-Wojewodzki commented on KAFKA-6777:
---

Our options:
-server
-XX:+UseG1GC
-XX:MaxGCPauseMillis=20
-XX:InitiatingHeapOccupancyPercent=35
-XX:+DisableExplicitGC
-Djava.awt.headless=true



[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-04-12 Thread huxihx (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436445#comment-16436445
 ] 

huxihx commented on KAFKA-6777:
---

A possible way forward is to analyze the GC logs to see why so much CPU time is 
spent on collection. Which GC collector do you use: the throughput collector or G1?
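
For reference, on a Java 8 JVM the GC log can usually be enabled with flags 
along these lines (the path is a placeholder; on Java 9+ the unified -Xlog:gc* 
option replaces these):

{code}
-Xloggc:/path/to/kafkaServer-gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
{code}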



[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-04-11 Thread Seweryn Habdank-Wojewodzki (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433863#comment-16433863
 ] 

Seweryn Habdank-Wojewodzki commented on KAFKA-6777:
---

One more comment. I see quite often in Kafka that a Throwable is converted to a 
RuntimeException.
This kind of code may lead to a situation where an OOM error never surfaces.

I made a simple example:
{code:java}
public class Main {
    public static void main( String[] args ) {
        try {
            try {
                throw new OutOfMemoryError();
            // very often in Kafka code:
            } catch ( Throwable t ) {
                // The cast fails, because OutOfMemoryError is not a RuntimeException;
                // the original error is replaced by a ClassCastException and lost.
                throw ( RuntimeException ) t;
            }
            // end of very often
        } catch ( Exception ignore ) {
            // The ClassCastException lands here and is silently swallowed.
        }
    }
}
{code}
Executed with:
{code:java}
-XX:OnOutOfMemoryError="echo OOM"
{code}
leads to:
{code:java}
Process finished with exit code 0
{code}
I see no *OOM* string, and no _OutOfMemoryError_ shows up in any stack trace.
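
As a general defensive measure (not something I am claiming Kafka does today), a 
process-level safety net can be installed so that any Error which does escape a 
thread halts the JVM instead of leaving a zombie. A minimal sketch; of course it 
only helps if the Error is not swallowed first, which is exactly the concern above:

{code:java}
public class HaltOnError {
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
            System.err.println("Uncaught " + throwable + " in thread " + thread.getName());
            if (throwable instanceof Error) {
                // halt() skips shutdown hooks, which might themselves need memory;
                // the exit code is arbitrary here
                Runtime.getRuntime().halt(1);
            }
        });
    }
}
{code}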



[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-04-11 Thread Seweryn Habdank-Wojewodzki (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433799#comment-16433799
 ] 

Seweryn Habdank-Wojewodzki commented on KAFKA-6777:
---

Thanks for the comment.

The problem is that either the OutOfMemoryError is never thrown, because the 
algorithms try to do their best and keep the GC fully loaded, and then no message 
processing can happen,

or the OutOfMemoryError is thrown, but caught in code like catch(Throwable) {}.

The observed behaviour is that at INFO log level there is no explicit error 
like OutOfMemoryError. 
I have looked at the JMX metrics: the heap is exhausted and the GC is endlessly 
busy, until eventually nothing is reported to JMX any more.

I could write a tool to reboot a Kafka node when the GC load on the CPU is higher 
than 40% or so, but such a tool is a workaround and not a solution to the 
problem.

I am attaching graphs to highlight what happened.

 !screenshot-1.png! 



[jira] [Commented] (KAFKA-6777) Wrong reaction on Out Of Memory situation

2018-04-11 Thread Mickael Maison (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433726#comment-16433726
 ] 

Mickael Maison commented on KAFKA-6777:
---

You usually do that via JVM options.

For example on Oracle's JVM, you can use: -XX:OnOutOfMemoryError="<cmd args>; 
<cmd args>"
[http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]

If you set it in KAFKA_OPTS, it will be automatically picked up by the tools 
under /bin
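
A hedged example of what that could look like (the kill command and the newer 
-XX:+ExitOnOutOfMemoryError flag, available since roughly JDK 8u92, are 
illustrations; exact quoting may need adjusting for your shell and scripts):

{code}
# run a command when the JVM raises an OutOfMemoryError
export KAFKA_OPTS='-XX:OnOutOfMemoryError="kill -9 %p"'

# or, on JDK 8u92 and later, let the JVM terminate itself on an OOME
export KAFKA_OPTS='-XX:+ExitOnOutOfMemoryError'
{code}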




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)