[
https://issues.apache.org/jira/browse/HDDS-9032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duong updated HDDS-9032:
------------------------
Description:
Ozone datanode today uses two separate Netty memory pool (or
PooledByteBufAllocator) instances.
The first pool instance is created with NettyServer (used by the Ratis server
and the Replication server). All NettyServer instances share the same
PooledByteBufAllocator instance (that is, the same direct memory pool), which
is created and cached by ByteBufAllocatorPreferDirectHolder.allocator. It is
obtained through "io.grpc.netty.Utils#getByteBufAllocator".
{code:java}
public static ByteBufAllocator getByteBufAllocator(boolean forceHeapBuffer) {
  if (Boolean.parseBoolean(
      System.getProperty("org.apache.ratis.thirdparty.io.grpc.netty.useCustomAllocator", "true"))) {
    boolean defaultPreferDirect = PooledByteBufAllocator.defaultPreferDirect();
    logger.log(
        Level.FINE,
        String.format(
            "Using custom allocator: forceHeapBuffer=%s, defaultPreferDirect=%s",
            forceHeapBuffer,
            defaultPreferDirect));
    if (forceHeapBuffer || !defaultPreferDirect) {
      return ByteBufAllocatorPreferHeapHolder.allocator;
    } else {
      return ByteBufAllocatorPreferDirectHolder.allocator;
    }
    // The excerpt ends here; the rest of the method is not quoted.
{code}
The second instance comes from the use of
[CodecBuffer|https://github.com/apache/ozone/blob/master/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/db/CodecBuffer.java#L85-86].
CodecBuffer uses the default Netty memory pool, created and cached by
PooledByteBufAllocator.DEFAULT, to create temporary buffers for decoding and
encoding data (from/to storage such as RocksDB, or the network).
{code:java}
private static final ByteBufAllocator POOL = PooledByteBufAllocator.DEFAULT;
{code}
Netty has a decent set of configs to keep a memory pool's usage within the JVM
limits, i.e. maxMemory and maxDirectMemory. However, because multiple pool
instances exist, their combined usage can exceed those limits, so Ozone
service instances are prone to OutOfMemoryError.
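The effect of per-pool limits can be illustrated with a small, self-contained
model. This is only a sketch under stated assumptions: DirectMemoryBudget and
Pool are hypothetical stand-ins (not Netty or Ozone classes), the numbers are
arbitrary, and Netty's real sizing logic is more involved. It shows that one
pool capped at the JVM-wide limit stays within it, while two pools that each
apply the same cap independently can exhaust it.

```java
public final class PoolBudgetSketch {

    /** Hypothetical stand-in for the JVM-wide direct memory limit. */
    static final class DirectMemoryBudget {
        private final long limit;
        private long reserved;

        DirectMemoryBudget(long limit) {
            this.limit = limit;
        }

        /** Reserves bytes if they fit. The real JVM throws OutOfMemoryError instead of returning false. */
        boolean tryReserve(long bytes) {
            if (reserved + bytes > limit) {
                return false;
            }
            reserved += bytes;
            return true;
        }
    }

    /** Hypothetical pool that checks only its own cap before asking the JVM for memory. */
    static final class Pool {
        private final DirectMemoryBudget jvm;
        private final long cap;
        private long used;

        Pool(DirectMemoryBudget jvm, long cap) {
            this.jvm = jvm;
            this.cap = cap;
        }

        boolean hasRoom(long bytes) {
            return used + bytes <= cap;
        }

        /** Returns false when the JVM-wide budget refuses the reservation. */
        boolean allocate(long bytes) {
            if (!jvm.tryReserve(bytes)) {
                return false;
            }
            used += bytes;
            return true;
        }
    }

    /** Fills each pool up to its own cap; returns true if the JVM-wide budget ran out first. */
    static boolean exhaustsBudget(long chunk, Pool... pools) {
        for (Pool pool : pools) {
            while (pool.hasRoom(chunk)) {
                if (!pool.allocate(chunk)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long limit = 64; // arbitrary units, assumption for illustration
        long chunk = 16;

        DirectMemoryBudget single = new DirectMemoryBudget(limit);
        System.out.println("one pool exhausts budget: "
            + exhaustsBudget(chunk, new Pool(single, limit)));

        DirectMemoryBudget shared = new DirectMemoryBudget(limit);
        System.out.println("two pools exhaust budget: "
            + exhaustsBudget(chunk, new Pool(shared, limit), new Pool(shared, limit)));
    }
}
```

In this model the single-pool case reports false and the two-pool case reports
true, which mirrors the failure mode described above.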
Below are some examples of OOMEs we've seen from Netty.
E1: From sending a container for re-replication.
{code:java}
Exception in thread "ContainerReplicationThread-2" java.lang.OutOfMemoryError: Direct buffer memory
    at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
    at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
    at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:676)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:197)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:139)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:129)
    at org.apache.ratis.thirdparty.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:396)
    at org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
    at org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:124)
    at org.apache.ratis.thirdparty.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
    at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:230)
    at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:169)
    at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:142)
    at org.apache.ratis.thirdparty.io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
    at org.apache.ratis.thirdparty.io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
    at org.apache.ratis.thirdparty.io.grpc.internal.DelayedStream.writeMessage(DelayedStream.java:278)
    at org.apache.ratis.thirdparty.io.grpc.internal.RetriableStream.sendMessage(RetriableStream.java:545)
    at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.sendMessageInternal(ClientCallImpl.java:521)
    at org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.sendMessage(ClientCallImpl.java:507)
    at org.apache.ratis.thirdparty.io.grpc.internal.DelayedClientCall.sendMessage(DelayedClientCall.java:324)
    at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$CallToStreamObserverAdapter.onNext(ClientCalls.java:374)
    at org.apache.hadoop.ozone.container.replication.SendContainerOutputStream.sendPart(SendContainerOutputStream.java:46)
    at org.apache.hadoop.ozone.container.replication.GrpcOutputStream.flushBuffer(GrpcOutputStream.java:136)
    at org.apache.hadoop.ozone.container.replication.GrpcOutputStream.write(GrpcOutputStream.java:96)
    at org.apache.commons.io.output.ProxyOutputStream.write(ProxyOutputStream.java:92)
    at org.apache.commons.compress.utils.CountingOutputStream.write(CountingOutputStream.java:48)
    at org.apache.commons.compress.utils.FixedLengthBlockOutputStream$BufferAtATimeOutputChannel.write(FixedLengthBlockOutputStream.java:244)
    at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.writeBlock(FixedLengthBlockOutputStream.java:92)
    at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.maybeFlush(FixedLengthBlockOutputStream.java:86)
    at org.apache.commons.compress.utils.FixedLengthBlockOutputStream.write(FixedLengthBlockOutputStream.java:122)
    at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.write(TarArchiveOutputStream.java:462)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1310)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:978)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1282)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:953)
    at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.includeFile(TarContainerPacker.java:265)
    at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.includePath(TarContainerPacker.java:255)
    at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.pack(TarContainerPacker.java:167)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.packContainerToDestination(KeyValueContainer.java:940)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.exportContainerData(KeyValueContainer.java:650)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.exportContainer(KeyValueHandler.java:1046)
    at org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.exportContainer(ContainerController.java:167)
    at org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:62)
    at org.apache.hadoop.ozone.container.replication.PushReplicator.replicate(PushReplicator.java:67)
    at org.apache.hadoop.ozone.container.replication.MeasuredReplicator.replicate(MeasuredReplicator.java:83)
    at org.apache.hadoop.ozone.container.replication.ReplicationTask.runTask(ReplicationTask.java:122)
    at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:357)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
    Suppressed: java.io.IOException: This archive contains unclosed entries.
        at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.finish(TarArchiveOutputStream.java:291)
        at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.close(TarArchiveOutputStream.java:309)
        at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.pack(TarContainerPacker.java:169)
{code}
E2: From processing a ReadChunk command (Ratis).
{code:java}
2023-07-16 23:07:11,132 [ChunkReader-14] ERROR org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService: Got exception when processing ContainerCommandRequestProto cmdType: ReadChunk
traceID: ""
containerID: 10047
datanodeUuid: "eb5580f7-03cf-4b14-a19e-8c350fb90a6e"
readChunk {
  blockID {
    containerID: 10047
    localID: 111677748019225467
    blockCommitSequenceId: 0
  }
  chunkData {
    chunkName: "111677748019225467_chunk_13"
    offset: 12582912
    len: 1048576
    checksumData {
      type: CRC32
      bytesPerChecksum: 1048576
      checksums: "\334}w\374"
    }
  }
  readChunkVersion: V1
}
encodedToken: "VgoCZG4SJmNvbklEOiAxMDA0NyBsb2NJRDogMTExNjc3NzQ4MDE5MjI1NDY3GJjz7byWMSgBKAIoBDCc2Kp_OhYIg4_B_obNkNywARDIgtLMovHGp4IBIHmmC2gLI4CkpaO54_a4aoIAqmeczb5AmGk9RWEurGBHEEhERFNfQkxPQ0tfVE9LRU4mY29uSUQ6IDEwMDQ3IGxvY0lEOiAxMTE2Nzc3NDgwMTkyMjU0Njc"
java.lang.OutOfMemoryError: Direct buffer memory
    at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
    at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
    at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:676)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:197)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:139)
    at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:129)
    at org.apache.ratis.thirdparty.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:396)
    at org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
    at org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:124)
    at org.apache.ratis.thirdparty.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
    at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:230)
    at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:169)
    at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:142)
    at org.apache.ratis.thirdparty.io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl.sendMessageInternal(ServerCallImpl.java:172)
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl.sendMessage(ServerCallImpl.java:154)
    at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:380)
    at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:58)
    at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:50)
    at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
    at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
    at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:333)
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:316)
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:835)
    at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
{code}
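Both traces fail in java.nio.Bits.reserveMemory, i.e. in the JVM's own
accounting for direct buffers. The following is a minimal sketch, using only
the standard java.lang.management API, of how that accounting might be
observed from inside a process. DirectMemoryProbe is a hypothetical helper
name and is not part of Ozone. The counter covers buffers obtained through
ByteBuffer.allocateDirect, as in the traces; depending on configuration, a
library may also allocate off-heap memory in ways this counter does not
reflect.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public final class DirectMemoryProbe {

    /** Returns the bytes used by the JVM's "direct" buffer pool, or -1 if that pool is not reported. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Allocate 1 MiB of direct memory and keep a reference so it stays live.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        long after = directMemoryUsed();
        System.out.println("direct memory used before: " + before + " bytes");
        System.out.println("direct memory used after:  " + after + " bytes");
        System.out.println("buffer capacity: " + buffer.capacity() + " bytes");
    }
}
```

Logging this value periodically, next to the configured maximum direct memory,
could help show how close a datanode is to the limit before an OOME occurs.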
> CodecBuffer results in Ozone Datanode using 2 separate Netty memory pool
> instances
> ----------------------------------------------------------------------------------
>
> Key: HDDS-9032
> URL: https://issues.apache.org/jira/browse/HDDS-9032
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Duong
> Priority: Major
>
> Ozone datanode today uses 2 separated Netty memory pool (or
> PooledByteBufAllocator) instances.
> First pool instance is created with NettyServer (used by Ratis server and
> Replication server). All NettyServer instances share the same
> PooledByteBufAllocator instances (or the same direct memory pool) which is
> created and cached by ByteBufAllocatorPreferDirectHolder.allocator. This
> resolves to the usage of "io.grpc.netty.Utils#getByteBufAllocator".
> {code:java}
> public static ByteBufAllocator getByteBufAllocator(boolean forceHeapBuffer) {
> if (Boolean.parseBoolean(
>
> System.getProperty("org.apache.ratis.thirdparty.io.grpc.netty.useCustomAllocator",
> "true"))) {
> boolean defaultPreferDirect =
> PooledByteBufAllocator.defaultPreferDirect();
> logger.log(
> Level.FINE,
> String.format(
> "Using custom allocator: forceHeapBuffer=%s,
> defaultPreferDirect=%s",
> forceHeapBuffer,
> defaultPreferDirect));
> if (forceHeapBuffer || !defaultPreferDirect) {
> return ByteBufAllocatorPreferHeapHolder.allocator;
> } else {
> return ByteBufAllocatorPreferDirectHolder.allocator;
> } {code}
> The second instance is created from the usage of
> [CodecBuffer|https://github.com/apache/ozone/blob/master/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/db/CodecBuffer.java#L85-86].
> This CodecBuffer uses the default Netty memory pool, created and cached by
> PooledByteBufAllocator.DEFAULT, to create temporary caches to decode/encode
> data (from/to storage like RocksDb or network).
> {code:java}
> private static final ByteBufAllocator POOL
> = PooledByteBufAllocator.DEFAULT; {code}
>
> Netty has a decent set of config to ensure its memory pool usage doesn't
> exceed the JVM limits, aka, maxMemory and maxDirectMemory. However, as
> there're multiple pool instances exists, Ozone service instances are prone to
> OutOfMemoryError.
> Below are some example of OOME we've seen from Netty.
> E1: from sending container for re-replication.
> {code:java}
> Exception in thread "ContainerReplicationThread-2"
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at
> java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:676)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:197)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:139)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:129)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:396)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
> at
> org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:124)
> at
> org.apache.ratis.thirdparty.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:230)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:169)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:142)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.DelayedStream.writeMessage(DelayedStream.java:278)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.RetriableStream.sendMessage(RetriableStream.java:545)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.sendMessageInternal(ClientCallImpl.java:521)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.sendMessage(ClientCallImpl.java:507)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.DelayedClientCall.sendMessage(DelayedClientCall.java:324)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$CallToStreamObserverAdapter.onNext(ClientCalls.java:374)
> at
> org.apache.hadoop.ozone.container.replication.SendContainerOutputStream.sendPart(SendContainerOutputStream.java:46)
> at
> org.apache.hadoop.ozone.container.replication.GrpcOutputStream.flushBuffer(GrpcOutputStream.java:136)
> at
> org.apache.hadoop.ozone.container.replication.GrpcOutputStream.write(GrpcOutputStream.java:96)
> at
> org.apache.commons.io.output.ProxyOutputStream.write(ProxyOutputStream.java:92)
> at
> org.apache.commons.compress.utils.CountingOutputStream.write(CountingOutputStream.java:48)
> at
> org.apache.commons.compress.utils.FixedLengthBlockOutputStream$BufferAtATimeOutputChannel.write(FixedLengthBlockOutputStream.java:244)
> at
> org.apache.commons.compress.utils.FixedLengthBlockOutputStream.writeBlock(FixedLengthBlockOutputStream.java:92)
> at
> org.apache.commons.compress.utils.FixedLengthBlockOutputStream.maybeFlush(FixedLengthBlockOutputStream.java:86)
> at
> org.apache.commons.compress.utils.FixedLengthBlockOutputStream.write(FixedLengthBlockOutputStream.java:122)
> at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.write(TarArchiveOutputStream.java:462)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1310)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:978)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1282)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:953)
> at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.includeFile(TarContainerPacker.java:265)
> at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.includePath(TarContainerPacker.java:255)
> at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.pack(TarContainerPacker.java:167)
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.packContainerToDestination(KeyValueContainer.java:940)
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.exportContainerData(KeyValueContainer.java:650)
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.exportContainer(KeyValueHandler.java:1046)
> at org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.exportContainer(ContainerController.java:167)
> at org.apache.hadoop.ozone.container.replication.OnDemandContainerReplicationSource.copyData(OnDemandContainerReplicationSource.java:62)
> at org.apache.hadoop.ozone.container.replication.PushReplicator.replicate(PushReplicator.java:67)
> at org.apache.hadoop.ozone.container.replication.MeasuredReplicator.replicate(MeasuredReplicator.java:83)
> at org.apache.hadoop.ozone.container.replication.ReplicationTask.runTask(ReplicationTask.java:122)
> at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:357)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> Suppressed: java.io.IOException: This archive contains unclosed entries.
> at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.finish(TarArchiveOutputStream.java:291)
> at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.close(TarArchiveOutputStream.java:309)
> at org.apache.hadoop.ozone.container.keyvalue.TarContainerPacker.pack(TarContainerPacker.java:169)
> {code}
> E2: From processing a ReadChunk command (Ratis)
> {code:java}
> 2023-07-16 23:07:11,132 [ChunkReader-14] ERROR org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService: Got exception when processing ContainerCommandRequestProto cmdType: ReadChunk
> traceID: ""
> containerID: 10047
> datanodeUuid: "eb5580f7-03cf-4b14-a19e-8c350fb90a6e"
> readChunk {
>   blockID {
>     containerID: 10047
>     localID: 111677748019225467
>     blockCommitSequenceId: 0
>   }
>   chunkData {
>     chunkName: "111677748019225467_chunk_13"
>     offset: 12582912
>     len: 1048576
>     checksumData {
>       type: CRC32
>       bytesPerChecksum: 1048576
>       checksums: "\334}w\374"
>     }
>   }
>   readChunkVersion: V1
> }
> encodedToken: "VgoCZG4SJmNvbklEOiAxMDA0NyBsb2NJRDogMTExNjc3NzQ4MDE5MjI1NDY3GJjz7byWMSgBKAIoBDCc2Kp_OhYIg4_B_obNkNywARDIgtLMovHGp4IBIHmmC2gLI4CkpaO54_a4aoIAqmeczb5AmGk9RWEurGBHEEhERFNfQkxPQ0tfVE9LRU4mY29uSUQ6IDEwMDQ3IGxvY0lEOiAxMTE2Nzc3NDgwMTkyMjU0Njc"
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
> at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:318)
> at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
> at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:676)
> at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
> at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:197)
> at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:139)
> at org.apache.ratis.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:129)
> at org.apache.ratis.thirdparty.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:396)
> at org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
> at org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:124)
> at org.apache.ratis.thirdparty.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
> at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:230)
> at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:169)
> at org.apache.ratis.thirdparty.io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:142)
> at org.apache.ratis.thirdparty.io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
> at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl.sendMessageInternal(ServerCallImpl.java:172)
> at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl.sendMessage(ServerCallImpl.java:154)
> at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:380)
> at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:58)
> at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:50)
> at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
> at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
> at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
> at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:333)
> at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:316)
> at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:835)
> at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {code}