[jira] [Commented] (KAFKA-16430) The group-metadata-manager thread is always in a loading state and occupies one CPU, unable to end.
[ https://issues.apache.org/jira/browse/KAFKA-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833389#comment-17833389 ] Gao Fei commented on KAFKA-16430:
-
[~chia7712] What do you mean? Does "newer Kafka script" refer to using the new version of the kafka-consumer-groups.sh client script? The problem here, however, is on the Kafka broker side.

> The group-metadata-manager thread is always in a loading state and occupies
> one CPU, unable to end.
> ---
>
> Key: KAFKA-16430
> URL: https://issues.apache.org/jira/browse/KAFKA-16430
> Project: Kafka
> Issue Type: Bug
> Components: group-coordinator
> Affects Versions: 2.4.0
> Reporter: Gao Fei
> Priority: Blocker
>
> I deployed three broker instances and suddenly found that clients were
> unable to consume data from certain topic partitions. I first logged in to
> the broker acting as coordinator for the group and ran the following command
> to inspect the consumer group:
> {code:java}
> ./bin/kafka-consumer-groups.sh --bootstrap-server localhost:9093 --describe --group mygroup{code}
> which failed with:
> {code:java}
> Error: Executing consumer group command failed due to
> org.apache.kafka.common.errors.CoordinatorLoadInProgressException: The
> coordinator is loading and hence can't process requests.{code}
> The broker appeared to be stuck in a loop, permanently in the loading state.
> At the same time, the top command showed the "group-metadata-manager-0"
> thread constantly consuming 100% of a CPU. The loop never ended, making it
> impossible to consume topic partition data on that node. At this point, I
> suspected the issue was related to the __consumer_offsets partition data
> files loaded by this thread.
> Finally, after restarting the broker instance, everything returned to normal.
> It is very strange: if there were a problem with the __consumer_offsets
> partition data files, the broker should have failed to start. Why did it
> recover automatically after a restart? And why did this endless loop loading
> the __consumer_offsets partition data occur in the first place?
> We encountered this issue in our production environment with Kafka versions
> 2.2.1 and 2.4.0, and I believe it may also affect other versions.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
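For background on what the "loading" state involves: when a broker becomes coordinator for a __consumer_offsets partition, it replays that partition into an in-memory group/offset cache, and consumer-group requests fail with CoordinatorLoadInProgressException until the replay reaches the end of the log. A minimal sketch of such a compacted-log replay; the record layout here is hypothetical, not Kafka's actual __consumer_offsets wire format:

```python
# Hedged sketch of compacted-log replay into an offsets cache.
# Records are (group, topic_partition, offset); a None offset is a tombstone.

def replay_offsets(records):
    """Replay records in log order: later records win, tombstones delete."""
    cache = {}
    for group, topic_partition, offset in records:
        key = (group, topic_partition)
        if offset is None:
            cache.pop(key, None)   # tombstone removes the entry
        else:
            cache[key] = offset    # newer commit overrides the older one
    return cache

log = [
    ("mygroup", "t-0", 10),
    ("mygroup", "t-0", 42),        # later commit overrides the earlier one
    ("mygroup", "t-1", 7),
    ("mygroup", "t-1", None),      # tombstone: offset for t-1 is removed
]
cache = replay_offsets(log)        # -> {("mygroup", "t-0"): 42}
```

If a replay loop like this never reaches the log end (for example, because it fails to make progress past a problematic batch and keeps retrying it), the coordinator would stay in the loading state indefinitely, which would match the permanent 100%-CPU symptom described above.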
[jira] [Created] (KAFKA-16430) The group-metadata-manager thread is always in a loading state and occupies one CPU, unable to end.
Gao Fei created KAFKA-16430:
-
Summary: The group-metadata-manager thread is always in a loading state and occupies one CPU, unable to end.
Key: KAFKA-16430
URL: https://issues.apache.org/jira/browse/KAFKA-16430
Project: Kafka
Issue Type: Bug
Components: group-coordinator
Affects Versions: 2.4.0
Reporter: Gao Fei
[jira] [Created] (KAFKA-15902) Topic partitions cannot be automatically cleaned up, leading to disk space occupation
Gao Fei created KAFKA-15902:
-
Summary: Topic partitions cannot be automatically cleaned up, leading to disk space occupation
Key: KAFKA-15902
URL: https://issues.apache.org/jira/browse/KAFKA-15902
Project: Kafka
Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Gao Fei

We are unable to determine the cause, but the faulty node's process log keeps repeating the following error:
{code:java}
ERROR Uncaught exception in scheduled task 'kafka-log-retention' (kafka.utils.KafkaScheduler)
java.nio.BufferOverflowException
    at java.base/java.nio.Buffer.nextPutIndex(Buffer.java:674)
    at java.base/java.nio.DirectByteBuffer.putLong(DirectByteBuffer.java:882)
    at kafka.log.TimeIndex.$anonfun$maybeAppend$1(TimeIndex.scala:134)
    at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:114)
    at kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:506)
    at kafka.log.Log.$anonfun$roll$8(Log.scala:2066)
    at kafka.log.Log.$anonfun$roll$8$adapted(Log.scala:2066)
    at scala.Option.foreach(Option.scala:437)
    at kafka.log.Log.$anonfun$roll$2(Log.scala:2066)
    at kafka.log.Log.roll(Log.scala:2482)
    at kafka.log.Log.$anonfun$deleteSegments$2(Log.scala:1859)
    at kafka.log.Log.deleteSegments(Log.scala:2482)
    at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1847)
    at kafka.log.Log.deleteOldSegments(Log.scala:1916)
    at kafka.log.LogManager.$anonfun$cleanupLogs$3(LogManager.scala:1092)
    at kafka.log.LogManager.$anonfun$cleanupLogs$3$adapted(LogManager.scala:1089)
    at scala.collection.immutable.List.foreach(List.scala:333)
    at kafka.log.LogManager.cleanupLogs(LogManager.scala:1089)
    at kafka.log.LogManager.$anonfun$startupWithConfigOverrides$2(LogManager.scala:429)
    at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
{code}
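On the mechanism of the trace above: the BufferOverflowException comes from ByteBuffer.putLong on the memory-mapped time index, which fails when fewer than 8 bytes remain between the buffer's position and its limit. A toy model of that failure mode, mimicking java.nio.ByteBuffer's overflow check; this is not Kafka's TimeIndex, just the buffer arithmetic:

```python
# Toy model of java.nio.ByteBuffer's put_long overflow check: an append to
# a fixed-capacity buffer fails once fewer than 8 bytes remain, analogous
# to appending a timestamp entry to an already-full time index.

class BufferOverflowError(Exception):
    """Stands in for java.nio.BufferOverflowException."""

class FixedBuffer:
    def __init__(self, capacity):
        self.data = bytearray(capacity)
        self.position = 0

    def put_long(self, value):
        if self.position + 8 > len(self.data):
            raise BufferOverflowError("fewer than 8 bytes remain")
        self.data[self.position:self.position + 8] = value.to_bytes(8, "big")
        self.position += 8

buf = FixedBuffer(capacity=16)   # room for exactly two 8-byte entries
buf.put_long(1)
buf.put_long(2)
try:                             # a third append overflows
    buf.put_long(3)
    overflowed = False
except BufferOverflowError:
    overflowed = True
```

In the real trace the append happens while rolling a segment during retention, so each run of the kafka-log-retention task hits the same full index and dies again, which would explain why the segments are never cleaned up.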
[jira] [Commented] (KAFKA-14088) KafkaChannel memory leak
[ https://issues.apache.org/jira/browse/KAFKA-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634689#comment-17634689 ] Gao Fei commented on KAFKA-14088:
-
Is it the following issue? That advisory only mentions versions 2.8 and above — are older versions not affected?
h2. [CVE-2022-34917|https://nvd.nist.gov/vuln/detail/CVE-2022-34917] UNAUTHENTICATED CLIENTS MAY CAUSE OUTOFMEMORYERROR ON BROKERS

> KafkaChannel memory leak
> ---
>
> Key: KAFKA-14088
> URL: https://issues.apache.org/jira/browse/KAFKA-14088
> Project: Kafka
> Issue Type: Bug
> Components: network
> Affects Versions: 2.2.1, 2.4.1, 2.5.1, 2.6.1, 2.7.1, 2.8.1, 3.1.1, 3.2.1
> Environment: kafka version: 2.2.1; openjdk (openj9): jdk1.8; heap memory: 6.4GB; MaxDirectSize: 8GB; about 150+ topics, each with about 3 partitions
> Reporter: Gao Fei
> Priority: Minor
>
> The Kafka broker reports OutOfMemoryError: Java heap space and
> OutOfMemoryError: Direct buffer memory at the same time. A memory dump shows
> that the largest objects are KafkaChannel -> NetworkReceive -> HeapByteBuffer:
> there are about 4 such KafkaChannels, each around 1.5GB, while the total heap
> allocation is only 6.4GB.
> It is strange that a single KafkaChannel occupies so much heap memory. Isn't
> each batch request written to disk by the RequestHandler threads? Normally
> this memory in KafkaChannel should be released continuously, but it is not.
> Why is there such a large HeapByteBuffer object in KafkaChannel, and what
> does it store? Shouldn't the socket communication here mostly use direct
> memory? Why is so much heap memory used instead, and why is it not released?
> The business data volume is not very large and differs per customer: some
> customers hit this OOM in their environment, while other customers with
> larger data volumes do not.
>
> java.lang.OutOfMemoryError: Direct buffer memory
>     at java.nio.Bits.reserveMemory(Bits.java:693)
>     at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>     at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>     at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:174)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:195)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:103)
>     at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:117)
>     at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
>     at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
>     at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:651)
>     at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:572)
>     at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
>     at kafka.network.Processor.poll(SocketServer.scala:863)
>     at kafka.network.Processor.run(SocketServer.scala:762)
>     at java.lang.Thread.run(Thread.java:745)
> java.lang.OutOfMemoryError: Java heap space
>     at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>     at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
>     at org.apache.kafka.common.MemoryPool$1.tryAllocate(MemoryPool.java:30)
>     at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:112)
>     at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
>     at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
>     at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:651)
>     at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:572)
>     at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
>     at kafka.network.Processor.poll(SocketServer.scala:863)
>     at kafka.network.Processor.run(SocketServer.scala:762)
>     at java.lang.Thread.run(Thread.java:745)
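On the mechanism behind those HeapByteBuffer allocations: Kafka requests are length-prefixed — the broker reads a 4-byte size and then allocates a buffer of that size for the payload, so a bogus size field from a non-Kafka client can inflate the allocation before any request validation runs. A sketch of that framing and of how a size cap bounds the allocation (in the real broker the `socket.request.max.bytes` setting plays this role); the parsing code is a simplified stand-in, not Kafka's NetworkReceive:

```python
import struct

MAX_REQUEST_BYTES = 1024  # stand-in for the broker's socket.request.max.bytes

def read_frame(data, max_size=MAX_REQUEST_BYTES):
    """Read a 4-byte big-endian size prefix, then allocate a payload buffer."""
    (size,) = struct.unpack(">i", data[:4])
    if size < 0 or size > max_size:
        raise ValueError(f"invalid request size {size}")
    return bytearray(size)  # the allocation an attacker wants to inflate

# A well-formed 16-byte frame allocates a 16-byte payload buffer.
ok = read_frame(struct.pack(">i", 16) + b"\x00" * 16)

# Garbage from a non-Kafka client, decoded as a size: the 4 bytes of
# b"GET " claim a ~1.2 GB payload (0x47455420 bytes).
(claimed,) = struct.unpack(">i", b"GET ")
try:
    read_frame(b"GET ")
    rejected = False
except ValueError:
    rejected = True
```

Note that a size cap alone does not fully close the hole: many unauthenticated connections each holding a large-but-under-the-cap receive buffer can still exhaust the heap, which is roughly the scenario CVE-2022-34917 describes.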
[jira] [Updated] (KAFKA-14088) KafkaChannel memory leak
[ https://issues.apache.org/jira/browse/KAFKA-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gao Fei updated KAFKA-14088:
Affects Version/s: 3.2.1, 3.1.1, 2.8.1, 2.7.1, 2.6.1, 2.5.1, 2.4.1
[jira] [Commented] (KAFKA-14088) KafkaChannel memory leak
[ https://issues.apache.org/jira/browse/KAFKA-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570586#comment-17570586 ] Gao Fei commented on KAFKA-14088:
-
[~ijuma] I experimented with version 2.8.1 and the latest version 3.2.0, and both exhibit this problem.
[jira] [Commented] (KAFKA-14088) KafkaChannel memory leak
[ https://issues.apache.org/jira/browse/KAFKA-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569832#comment-17569832 ] Gao Fei commented on KAFKA-14088:
-
Later, by reading the Kafka source code and correlating it with the logs, we found that Kafka had likely received a large number of malformed packets. On a long-lived connection, Kafka buffers these malformed packets as if they were normal data until memory runs out; only later does processing discover that the packets have an invalid format and cannot be handled. In testing, running nmap -p 9092 -T4 -A -v against the broker's IP quickly reproduces the memory overflow: the malformed probes it generates make Kafka run out of memory almost immediately. After consulting the customer, it turned out they had indeed run a vulnerability scanning tool on site, and each scan crashed Kafka.

Can this be avoided by using SASL? And when Kafka receives such a malformed message, could it detect the invalid data format and close the connection directly, instead of buffering large amounts of data first?

Some of the error log output:
{code:java}
[2022-07-21 14:33:18,664] ERROR Exception while processing request from 177.177.113.129:6667-172.36.28.103:65440-406 (kafka.network.Processor)
org.apache.kafka.common.errors.InvalidRequestException: Error parsing request header. Our best guess of the apiKey is: 27265
Caused by: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'client_id': Error reading string of length 513, only 103 bytes available
    at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:77)
    at org.apache.kafka.common.requests.RequestHeader.parse(RequestHeader.java:121)
    at kafka.network.Processor.$anonfun$processCompletedReceives$1(SocketServer.scala:844)
    at kafka.network.Processor.$anonfun$processCompletedReceives$1$adapted(SocketServer.scala:840)
    at kafka.network.Processor$$Lambda$1000/0x58005440.apply(Unknown Source)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at kafka.network.Processor.processCompletedReceives(SocketServer.scala:840)
    at kafka.network.Processor.run(SocketServer.scala:731)
    at java.lang.Thread.run(Thread.java:823)
[2022-07-21 14:33:18,727] ERROR Closing socket for 177.177.113.129:6667-172.36.28.103:30646-406 because of error (kafka.network.Processor)
org.apache.kafka.common.errors.InvalidRequestException: Unknown API key -173
[2022-07-21 14:33:18,727] ERROR Exception while processing request from 177.177.113.129:6667-172.36.28.103:30646-406 (kafka.network.Processor)
org.apache.kafka.common.errors.InvalidRequestException: Unknown API key -173
[2022-07-21 14:39:56,995] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Bits.java:703)
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:128)
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
    at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241)
    at sun.nio.ch.IOUtil.read(IOUtil.java:195)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:103)
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:117)
    at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
    at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
    at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:651)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:572)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
    at kafka.network.Processor.poll(SocketServer.scala:830)
    at kafka.network.Processor.run(SocketServer.scala:730)
    at java.lang.Thread.run(Thread.java:823)
{code}
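The "best guess of the apiKey is: 27265" and "Unknown API key -173" lines are consistent with non-Kafka bytes being decoded as a request header: after the 4-byte size prefix, the broker reads a signed 16-bit apiKey, so arbitrary scanner bytes map to arbitrary key values. A quick illustration; the byte values are chosen only to reproduce the numbers in the log, with no claim they were the actual bytes on the wire:

```python
import struct

def api_key(header_bytes):
    """Decode the first two header bytes as a signed big-endian int16 apiKey,
    the way a length-prefixed protocol would."""
    (key,) = struct.unpack(">h", header_bytes[:2])
    return key

# Two-byte sequences that happen to decode to the values in the error log:
guess = api_key(b"\x6a\x81")   # 0x6A81 -> 27265, the "best guess" apiKey
bogus = api_key(b"\xff\x53")   # 0xFF53 -> -173 as a signed int16
```

Closing the connection on an unknown apiKey (which the broker does, per the "Closing socket ... because of error" line) is cheap; the expensive part is the receive-buffer allocation that has already happened based on the untrusted size prefix, which is presumably why the OutOfMemoryError can still follow.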
[jira] [Created] (KAFKA-14088) KafkaChannel memory leak
Gao Fei created KAFKA-14088:
-
Summary: KafkaChannel memory leak
Key: KAFKA-14088
URL: https://issues.apache.org/jira/browse/KAFKA-14088
Project: Kafka
Issue Type: Bug
Components: network
Affects Versions: 2.2.1
Environment: kafka version: 2.2.1; openjdk (openj9): jdk1.8; heap memory: 6.4GB; MaxDirectSize: 8GB; about 150+ topics, each with about 3 partitions
Reporter: Gao Fei