[ 
https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433799#comment-16433799
 ] 

Seweryn Habdank-Wojewodzki edited comment on KAFKA-6777 at 4/11/18 12:17 PM:
-----------------------------------------------------------------------------

Thanks for the comment.

The problem is that either the OutOfMemoryError is never thrown, because the 
algorithms try to do their best and keep loading the GC, so that later no 
message processing can happen, as most of the CPU is used by the GC.

Or the OutOfMemoryError is thrown, but is caught by code like catch(Throwable) {}
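
To illustrate the second case, here is a minimal sketch of how an empty 
catch(Throwable) block can hide an OutOfMemoryError. The class and method names 
are hypothetical and are not taken from the Kafka code base; the error is thrown 
manually so that the demo does not have to exhaust the heap.

```java
public class SwallowedOomDemo {

    // Stand-in for real work; throws the error directly (assumption for the demo).
    public static void allocate() {
        throw new OutOfMemoryError("simulated: Java heap space");
    }

    // Anti-pattern: Throwable is the superclass of Error, so this also
    // catches OutOfMemoryError. Nothing is logged and nothing is rethrown.
    public static boolean swallowing() {
        try {
            allocate();
        } catch (Throwable t) {
            // intentionally empty, as in catch(Throwable) {}
        }
        return true; // the caller sees a normal return
    }

    // Alternative: catch only Exception, so that Errors propagate to the caller.
    public static void propagating() {
        try {
            allocate();
        } catch (Exception e) {
            System.err.println("recoverable failure: " + e);
        }
    }

    public static void main(String[] args) {
        System.out.println("swallowing() returned: " + swallowing());
        try {
            propagating();
        } catch (OutOfMemoryError e) {
            System.out.println("propagating() let the error through: " + e.getMessage());
        }
    }
}
```

In the first variant nothing reaches the logs, which would match the missing 
error at INFO level; whether Kafka actually contains such a code path is not 
confirmed here.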

The observed behaviour is that the INFO-level logs contain no explicit error 
like OutOfMemoryError. 
In the JMX metrics I saw that the heap is exhausted and the GC is endlessly 
busy, until eventually nothing is reported to JMX either.

I mean, I could write a tool that reboots a Kafka node when the GC load on the 
CPU is higher than 40% or so, but such a tool is a workaround and not a 
solution to the problem.
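
For reference, such a workaround could be sketched roughly as below. This is 
only an assumption of how the check might look: it runs inside the monitored 
JVM and uses the standard GarbageCollectorMXBean API, whereas an external tool 
would read the same beans over a remote JMX connection. The class name, the 
10 second interval and the halt-on-threshold reaction are hypothetical. GC time 
is measured against wall-clock time, which only approximates the CPU share 
shown in the graphs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadWatchdog {

    // Accumulated collection time in milliseconds over all collectors of this JVM.
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Fraction of a wall-clock interval that was spent in GC.
    public static double gcFraction(long gcDeltaMillis, long wallDeltaMillis) {
        if (wallDeltaMillis <= 0) {
            return 0.0;
        }
        return (double) gcDeltaMillis / wallDeltaMillis;
    }

    // Samples forever and halts the JVM once the GC fraction exceeds the threshold.
    public static void watch(double threshold, long intervalMillis) throws InterruptedException {
        long lastGc = totalGcTimeMillis();
        long lastWall = System.nanoTime();
        while (true) {
            Thread.sleep(intervalMillis);
            long gc = totalGcTimeMillis();
            long wall = System.nanoTime();
            double fraction = gcFraction(gc - lastGc, (wall - lastWall) / 1_000_000);
            if (fraction > threshold) {
                System.err.printf("GC used %.0f%% of the last interval, halting%n", fraction * 100);
                Runtime.getRuntime().halt(1); // hard exit, no shutdown hooks
            }
            lastGc = gc;
            lastWall = wall;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Single one-second sample, so that the demo terminates.
        long gcBefore = totalGcTimeMillis();
        long wallBefore = System.nanoTime();
        Thread.sleep(1_000);
        long wallMillis = (System.nanoTime() - wallBefore) / 1_000_000;
        System.out.println("GC fraction: " + gcFraction(totalGcTimeMillis() - gcBefore, wallMillis));
        // A long-running check would call watch(0.40, 10_000) from a dedicated thread.
    }
}
```

An in-process thread like this may itself be starved once the heap is 
exhausted, which is one more reason to treat it as a workaround only.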

I am attaching graphs to highlight what happened.

In the image below there are metrics from 2 Kafka nodes. The green one was 
dead/zombie when GC time reached 80%. The "drop" in the value is only a 
presentation matter.

!screenshot-1.png! 


was (Author: habdank):
Thanks for the comment.

The problem is that either the OutOfMemoryError is never thrown, because the 
algorithms try to do their best and keep loading the GC, and then no message 
processing can happen.

Or the OutOfMemoryError is thrown, but is caught by code like catch(Throwable) {}

The observed behaviour is that the INFO-level logs contain no explicit error 
like OutOfMemoryError. 
In the JMX metrics I saw that the heap is exhausted and the GC is endlessly 
busy, until eventually nothing is reported to JMX either.

I mean, I could write a tool that reboots a Kafka node when the GC load on the 
CPU is higher than 40% or so, but such a tool is a workaround and not a 
solution to the problem.

I am attaching graphs to highlight what happened.

In the image below there are metrics from 2 Kafka nodes. The green one was 
dead/zombie when GC time reached 80%. The "drop" in the value is only a 
presentation matter.

 !screenshot-1.png! 

> Wrong reaction on Out Of Memory situation
> -----------------------------------------
>
>                 Key: KAFKA-6777
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6777
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.0.0
>            Reporter: Seweryn Habdank-Wojewodzki
>            Priority: Critical
>         Attachments: screenshot-1.png
>
>
> Dear all,
> We have already encountered problems related to Out Of Memory situations 
> many times, both in the Kafka Broker and in streaming clients.
> The scenario is the following.
> When the Kafka Broker (or a Streaming Client) is under load and has too 
> little memory, there are no errors in the server logs. One can see some 
> cryptic entries in the GC logs, but they are definitely not self-explanatory.
> The Kafka Broker (or Streaming Client) keeps working. Later we see in JMX 
> monitoring that the JVM spends more and more time in GC. In our case the 
> share of CPU time used by GC grows from e.g. 1% to 80%-90%.
> Next, the software collapses into zombie mode: the process does not end. In 
> such a case I would expect the process to crash (e.g. with SIGSEGV). Even 
> worse, Kafka treats such a zombie process as normal and somehow keeps sending 
> messages, which are in fact getting lost; the cluster also does not exclude 
> the broken nodes. The question is how to configure Kafka to really terminate 
> the JVM and not remain in zombie mode, to give the other nodes a chance to 
> know that something is dead.
> I would expect that in an Out Of Memory situation the JVM is ended, if not 
> gracefully then at least by crashing the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
