[ https://issues.apache.org/jira/browse/KAFKA-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16598974#comment-16598974 ]

John Roesler commented on KAFKA-6777:
-------------------------------------

Hi [~habdank],

It's unfortunately a common behavior with JVM applications that when they are 
memory-constrained they never actually crash, but instead disappear into 
gc-pause oblivion. For practical purposes, we don't have any visibility into 
when GC pauses occur, how long they are, or even what our resident memory 
footprint is. This is all by design of the JVM.

However, if we are catching and swallowing OOME, or really any subclass of 
Error, it would not be good. Error is by definition not recoverable and should 
be caught only to gracefully exit.

I've taken a quick look through the code, and most of the `catch (Throwable t)` 
instances I see are logged and/or propagated. Some usages (such as in 
KafkaAdminClient.AdminClientRunnable) are suspicious.
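
To make the distinction concrete, here is a minimal sketch of the two patterns
(purely illustrative: the class and logger names are made up, this is not
Kafka's actual code):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class WorkerLoop {
    private static final Logger log = LoggerFactory.getLogger(WorkerLoop.class);

    // "Log and propagate": the Error is logged for diagnosis but rethrown,
    // so the JVM dies instead of limping along with a poisoned heap.
    void runOnce(Runnable task) {
        try {
            task.run();
        } catch (Throwable t) {
            log.error("Unexpected error in worker loop", t);
            if (t instanceof Error) {
                throw (Error) t; // never swallow OOME and friends
            }
            // non-fatal exceptions can be handled or retried here
        }
    }

    // The suspicious variant: an OutOfMemoryError caught here simply
    // disappears, and the loop keeps running in a broken state.
    void runOnceSwallowing(Runnable task) {
        try {
            task.run();
        } catch (Throwable t) {
            log.error("Unexpected error in worker loop", t); // logged, but not rethrown
        }
    }
}
```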

 

I'm unclear on whether you are saying that when Kafka runs out of memory, it
 1. shuts down, but hides the reason, or
 2. continues running.

The latter seems unlikely, since if the JVM is truly out of memory, then 
catching and swallowing the OOME would only work for so long; it seems like 
eventually some operation would attempt to allocate memory outside of a catch 
block and still crash the app.
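
For example, a contrived snippet like the following (hypothetical, not anything
from the Kafka codebase) "survives" the first OOME only because the catch block
swallows it; the heap is still full, so the next sizable allocation outside the
catch block throws again and kills the app:

```java
import java.util.ArrayList;
import java.util.List;

public class SwallowedOome {
    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<>();
        try {
            while (true) {
                retained.add(new byte[1024 * 1024]); // allocate until the heap fills up
            }
        } catch (OutOfMemoryError swallowed) {
            // "handled", but everything allocated so far is still strongly referenced
        }
        // The very next allocation outside the catch block will almost certainly
        // throw OutOfMemoryError again, and this time nothing catches it.
        byte[] next = new byte[64 * 1024 * 1024];
        System.out.println("allocated " + next.length + " bytes");
    }
}
```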

 

Can you elaborate on the reason you think that the culprit is a swallowed OOME 
instead of just normal GC hell?

Is there a specific code path that you think is responsible for catching and 
swallowing OOMEs?

Thanks,

-John

> Wrong reaction on Out Of Memory situation
> -----------------------------------------
>
>                 Key: KAFKA-6777
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6777
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.0.0
>            Reporter: Seweryn Habdank-Wojewodzki
>            Priority: Critical
>         Attachments: screenshot-1.png
>
>
> Dears,
> We have already encountered problems related to Out Of Memory situations in the
> Kafka Broker and streaming clients many times.
> The scenario is the following.
> When the Kafka Broker (or Streaming Client) is under load and has too little
> memory, there are no errors in the server logs. One can see some cryptic entries
> in the GC logs, but they are definitely not self-explanatory.
> The Kafka Broker (and Streaming Clients) keeps running. Later we see in JMX
> monitoring that the JVM spends more and more time in GC; in our case the share
> of CPU time used by GC grows from e.g. 1% to 80%-90%.
> Next, the software collapses into a zombie mode: the process does not end. In
> such a case I would expect the process to crash (e.g. with a SIGSEGV). Even
> worse, Kafka treats such a zombie process as normal and somehow still sends
> messages, which in fact get lost; also, the cluster does not exclude broken
> nodes. The question is how to configure Kafka to really terminate the JVM and
> not remain in zombie mode, to give the other nodes a chance to know that
> something is dead.
> I would expect that in an Out Of Memory situation the JVM is terminated, if not
> gracefully then at least by the process crashing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
