[
https://issues.apache.org/jira/browse/KAFKA-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040961#comment-18040961
]
KC.H commented on KAFKA-19912:
------------------------------
h2. How to Reproduce
{code:java}
// org.apache.kafka.clients.consumer.internals.ApplicationEventHandlerTest.java
@Test
public void testDelayInInitializeResources() throws InterruptedException {
    assertInitializeResourcesError(
        TimeoutException.class,
        () -> {
            // Sleep for twice the initialization timeout so that initialization times out.
            long delayMs = initializationTimeoutMs * 2;
            org.apache.kafka.common.utils.Utils.sleep(delayMs);
            return networkClientDelegate;
        }
    );
    // Add this: keep the test running so the leaked thread has time to fill the heap.
    TimeUnit.MINUTES.sleep(1000);
}
{code}
h2. Root Cause
The OOM is caused by a combination of three factors:
# {*}Thread Leak (Already Fixed){*}: Before the fix, when an exception
occurred during initialization, the {{ConsumerNetworkThread}} couldn't be
properly closed, causing it to enter an infinite loop in the {{runOnce()}}
method.
# {*}Mock Object Behavior{*}: In tests, {{NetworkClientDelegate}} is a mock
object. Unlike production code where {{poll()}} respects the timeout parameter,
the mock returns immediately without any waiting.
# {*}Mockito Invocation Recording{*}: Mockito records every method invocation
on a mock in its {{invocationContainer}}. Because the infinite loop calls mock
methods as fast as it can, these records accumulate without bound and quickly
fill up the heap.
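
The interaction between factors 2 and 3 can be illustrated with a small, dependency-free sketch. It does not use Mockito and is not Mockito's implementation: {{Delegate}}, {{Invocation}}, and {{spin()}} are hypothetical names, and a {{java.lang.reflect.Proxy}} stands in for the mock. The stand-in returns immediately regardless of the timeout argument and retains one record per call, so the number of retained records grows as fast as the loop can spin. The loop is bounded here so that the example terminates.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

public class InvocationRecordingSketch {

    /** Hypothetical stand-in for the mocked delegate; only poll() matters here. */
    interface Delegate {
        void poll(long timeoutMs);
    }

    /** One retained record per call, loosely analogous to what a mocking framework keeps. */
    static final class Invocation {
        final String method;
        final Object[] args;

        Invocation(String method, Object[] args) {
            this.method = method;
            this.args = args;
        }
    }

    /** Runs a bounded version of the spin loop and returns how many records were retained. */
    public static int spin(int iterations) {
        List<Invocation> recorded = new ArrayList<>();

        // The handler records the call and returns at once, ignoring the timeout argument.
        InvocationHandler handler = (proxy, method, args) -> {
            recorded.add(new Invocation(method.getName(), args));
            return null;
        };

        Delegate delegate = (Delegate) Proxy.newProxyInstance(
            Delegate.class.getClassLoader(),
            new Class<?>[] {Delegate.class},
            handler);

        // In the leaked thread this loop never ends; it is bounded here so the sketch terminates.
        for (int i = 0; i < iterations; i++) {
            delegate.poll(1000L);
        }
        return recorded.size();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int retained = spin(1_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(retained + " invocations retained in " + elapsedMs + " ms");
    }
}
```

With a real {{poll()}} that blocks for up to its timeout, the same loop would add records far more slowly. With a stand-in that never blocks, every iteration adds a record that stays reachable.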
> Investigate OOM error on builds due to leaked thread
> ----------------------------------------------------
>
> Key: KAFKA-19912
> URL: https://issues.apache.org/jira/browse/KAFKA-19912
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer, unit tests
> Reporter: Kirk True
> Assignee: Kirk True
> Priority: Major
> Attachments: heap-dump.png
>
>
> Builds of Kafka began failing in the GitHub Actions environment because of an
> OOM error. KAFKA-19898 was filed and one of the client unit tests was
> identified as the culprit. [A fix|https://github.com/apache/kafka/pull/20930]
> was pushed quickly to resolve the issue. However, since the test is only
> executed once per run, it's not clear how a small number of dangling threads
> could have caused the entire build to experience problems.