Boyang Chen created KAFKA-10237:
-----------------------------------
Summary: Properly handle in-memory stores OOM
Key: KAFKA-10237
URL: https://issues.apache.org/jira/browse/KAFKA-10237
Project: Kafka
Issue Type: Improvement
Components: streams
Reporter: Boyang Chen
We have seen the in-memory store buffer too much data and eventually hit an OOM.
Generally speaking, an OOM gives no real indication of the underlying problem and
makes debugging harder for the user, since the thread that fails may not be the
actual culprit causing the memory explosion. If we could provide better protection
against hitting the memory limit, or at least give a clear error message, end-user
debugging would be much simpler.
To make this work, we need to enforce a memory limit below the heap size and
take action when it is hit. The first question is whether the limit should be
numeric, such as 100 MB or 500 MB, or a percentage, such as 60% or 80% of total
memory.
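
For illustration, here is a minimal sketch of how an effective limit could be derived from the two options; the class name and constants are hypothetical, and nothing like them exists in Streams today:

{code:java}
public class StoreMemoryLimit {

    // Hypothetical absolute limit, e.g. 500 MB.
    private static final long ABSOLUTE_LIMIT_BYTES = 500L * 1024 * 1024;

    // Hypothetical percentage limit, e.g. 80% of the JVM max heap.
    private static final double HEAP_FRACTION = 0.80;

    public static long effectiveLimitBytes() {
        final long maxHeap = Runtime.getRuntime().maxMemory();
        final long percentageLimit = (long) (maxHeap * HEAP_FRACTION);
        // A percentage limit scales with whatever heap the user configured, while an
        // absolute limit is easier to reason about but can exceed a small heap entirely.
        return Math.min(ABSOLUTE_LIMIT_BYTES, percentageLimit);
    }

    public static void main(final String[] args) {
        System.out.println("Effective store memory limit: " + effectiveLimitBytes() + " bytes");
    }
}
{code}

Taking the minimum of the two is only one possibility; we could also expose just one of the knobs.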
The second question is about the action itself. One approach would be to crash
the store immediately and tell the user to increase their application capacity.
The other approach would be to open an on-disk store on the fly and offload the
data to it.
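
As a rough sketch of approach #1, the in-memory store could track an approximate byte footprint and fail fast with a descriptive exception instead of letting the JVM throw a bare OutOfMemoryError; everything below is hypothetical and not existing Streams code:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class BoundedInMemoryStore {

    private final long limitBytes;
    private long approximateBytes = 0L;
    private final Map<String, byte[]> map = new HashMap<>();

    public BoundedInMemoryStore(final long limitBytes) {
        this.limitBytes = limitBytes;
    }

    public void put(final String key, final byte[] value) {
        final byte[] previous = map.put(key, value);
        approximateBytes += key.length() + value.length;
        if (previous != null) {
            approximateBytes -= key.length() + previous.length;
        }
        if (approximateBytes > limitBytes) {
            // Approach #1: crash with a clear message so the failing thread actually
            // points at the store that blew up, rather than an arbitrary allocation site.
            throw new IllegalStateException(
                "In-memory store exceeded its configured limit of " + limitBytes
                    + " bytes; increase the application capacity or switch to a persistent store");
        }
        // Approach #2 would instead start spilling entries to an on-disk store here.
    }

    public byte[] get(final String key) {
        return map.get(key);
    }
}
{code}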
Personally, I'm in favor of approach #2 because it has minimal impact on the
running application. However, it is more complex and potentially requires
significant work to define the proper behavior, such as the default store
configuration, how to manage its lifecycle, and so on.
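
For reference, users can already opt into an on-disk (RocksDB-backed) store explicitly when materializing state, which is roughly the end state that approach #2 would reach automatically; the topic and store names below are made up:

{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

public class PersistentStoreExample {

    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();
        // Materialize the table on a persistent store instead of an in-memory one,
        // so the state is bounded by disk rather than by the heap.
        builder.table(
            "input-topic",
            Materialized.<String, String>as(Stores.persistentKeyValueStore("bounded-state"))
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.String()));
        System.out.println(builder.build().describe());
    }
}
{code}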