[
https://issues.apache.org/jira/browse/HDDS-14964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arun Sarin updated HDDS-14964:
------------------------------
Priority: Critical (was: Major)
> Write failures with ContainerNotOpenException during large file (100GB)
> concurrent parallel writes at high cluster utilization (~90%)
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDDS-14964
> URL: https://issues.apache.org/jira/browse/HDDS-14964
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Arun Sarin
> Priority: Critical
>
> During a large-scale deletion test, the cluster was being filled with 100GB
> files nested across various directory structures. Writes were progressing
> successfully until ~90% disk utilization was reached on approximately 80% of
> the DataNodes. At that point, all large file writes started failing with
> {{ContainerNotOpenException}}.
> Key observation: Small file writes on the same cluster succeeded without
> issue during this same time window. Only concurrent parallel writes of large
> files (100GB) were failing. This suggests the issue is not that there are
> zero writable containers on the cluster, but rather that the client - while
> in the middle of writing a large file - encounters a container that has
> transitioned to {{CLOSED}} state and is unable to recover or allocate a new
> one successfully in that context.
> *Root cause hypothesis:* When writing a 100GB file, many chunks are sent
> sequentially to the same container. At high cluster utilization, a container
> can transition to {{CLOSED}} state mid-write (e.g., due to being full or the
> SCM closing it). The client hits {{ContainerNotOpenException}} on a
> {{WriteChunk}} call. While retry/failover logic exists, under high disk
> pressure the client is unable to get a new OPEN container allocated in time,
> causing the write to fail entirely.
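A minimal sketch of the retry path described in the hypothesis above, assuming the client excludes each container reported as CLOSED and asks for a replacement until a retry budget runs out. Every type and method name here (`Allocator`, `ChunkWriter`, `ContainerClosedException`, `writeChunkWithRetry`) is hypothetical and is not Ozone's client code; the allocator stand-in deliberately knows nothing about container state, so it can hand back a container that is already CLOSED by the time the write arrives.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these types are Ozone classes. */
class ClosedContainerRetrySketch {

  /** Stand-in for a datanode rejecting a write to a CLOSED container. */
  static class ContainerClosedException extends Exception {
    final long containerId;

    ContainerClosedException(long containerId) {
      super("Container " + containerId + " in CLOSED state");
      this.containerId = containerId;
    }
  }

  /** Stand-in for block allocation: returns a container ID not in the exclude set. */
  interface Allocator {
    long allocate(Set<Long> excluded) throws Exception;
  }

  /** Stand-in for the WriteChunk call. */
  interface ChunkWriter {
    void write(long containerId, int chunkIndex) throws ContainerClosedException;
  }

  /** Writes one chunk, switching containers on CLOSED until maxRetries is exhausted. */
  static long writeChunkWithRetry(Allocator allocator, ChunkWriter writer,
      long containerId, int chunkIndex, int maxRetries) throws Exception {
    Set<Long> excluded = new HashSet<>();
    long current = containerId;
    for (int attempt = 0; ; attempt++) {
      try {
        writer.write(current, chunkIndex);
        return current; // ID of the container that accepted the chunk
      } catch (ContainerClosedException e) {
        if (attempt >= maxRetries) {
          throw e; // retry budget exhausted: the write fails mid-stream
        }
        excluded.add(e.containerId);
        current = allocator.allocate(excluded);
      }
    }
  }

  /** Demo: containers 20677 and 20678 are CLOSED, 20679 is OPEN. Returns -1 on failure. */
  static long demo(int maxRetries) {
    Set<Long> closed = Set.of(20677L, 20678L);
    List<Long> all = List.of(20677L, 20678L, 20679L);
    Allocator allocator = excluded -> all.stream()
        .filter(id -> !excluded.contains(id))
        .findFirst()
        .orElseThrow(() -> new IllegalStateException("no allocable container"));
    ChunkWriter writer = (id, chunk) -> {
      if (closed.contains(id)) {
        throw new ContainerClosedException(id);
      }
    };
    try {
      return writeChunkWithRetry(allocator, writer, 20677L, 53, maxRetries);
    } catch (Exception e) {
      return -1L;
    }
  }

  public static void main(String[] args) {
    System.out.println(demo(3)); // prints 20679
    System.out.println(demo(1)); // prints -1
  }
}
```

With a budget of one retry the demo fails even though an OPEN container exists, which is the shape of failure the hypothesis describes: the more containers close under disk pressure, the more retries a single chunk needs.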
> *Error stack:*
> {code:java}
> java.util.concurrent.CompletionException:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 098be32c-a26e-4fe9-a491-c7d17b9bb04b@group-D9554F414C68:
> Container 20677 in CLOSED state
> at
> org.apache.ratis.client.impl.RaftClientImpl.handleRaftException(RaftClientImpl.java:373)
> at
> org.apache.ratis.client.impl.OrderedAsync.lambda$send$3(OrderedAsync.java:175)
> at
> java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
> at
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> at
> java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
> at
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:105)
> at
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:66)
> at
> org.apache.ratis.util.SlidingWindow$RequestMap.setReply(SlidingWindow.java:147)
> at
> org.apache.ratis.util.SlidingWindow$Client.receiveReply(SlidingWindow.java:351)
> at
> org.apache.ratis.client.impl.OrderedAsync.lambda$sendRequestWithRetry$5(OrderedAsync.java:210)
> at
> java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:718)
> at
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> at
> java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.lambda$onNext$0(GrpcClientProtocolClient.java:325)
> at java.base/java.util.Optional.ifPresent(Optional.java:178)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:381)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:325)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:308)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:551)
> at
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:661)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:648)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 098be32c-a26e-4fe9-a491-c7d17b9bb04b@group-D9554F414C68:
> Container 20677 in CLOSED state
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.validateContainerCommand(HddsDispatcher.java:581)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.startTransaction(ContainerStateMachine.java:488)
> at
> org.apache.ratis.server.impl.RaftServerImpl.writeAsyncImpl(RaftServerImpl.java:987)
> at
> org.apache.ratis.server.impl.RaftServerImpl.writeAsync(RaftServerImpl.java:960)
> at
> org.apache.ratis.server.impl.RaftServerImpl.replyFuture(RaftServerImpl.java:953)
> at
> org.apache.ratis.server.impl.RaftServerImpl.submitClientRequestAsync(RaftServerImpl.java:930)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitClientRequestAsync$11(RaftServerImpl.java:919)
> at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:118)
> at
> org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitClientRequestAsync$12(RaftServerImpl.java:919)
> at
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> ... 3 more
> Caused by:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException:
> Container 20677 in CLOSED state
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
> at
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at
> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
> at
> java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
> at
> org.apache.ratis.util.ReflectionUtils.instantiateException(ReflectionUtils.java:265)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toStateMachineException(ClientProtoUtils.java:455)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toStateMachineException(ClientProtoUtils.java:441)
> at
> org.apache.ratis.client.impl.ClientProtoUtils.toRaftClientReply(ClientProtoUtils.java:408)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:313)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:308)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:551)
> at
> org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onMessage(ForwardingClientCallListener.java:33)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:661)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInContext(ClientCallImpl.java:648)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
> at
> org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
> ... 3 more
> 26/03/02 23:09:43 ERROR impl.OrderedAsync: Failed to send request,
> message=cmdType: WriteChunk
> traceID: ""
> containerID: 20677
> datanodeUuid: "c15cc65e-f4eb-43f0-b27c-6b6a3fbc5a86"
> writeChunk {
> blockID {
> containerID: 20677
> localID: 117883640217932227
> blockCommitSequenceId: 686720
> replicaIndex: 0
> }
> chunkData {
> chunkName: "117883640217932227_chunk_53"
> offset: 218103808
> len: 4194304
> checksumData {
> type: CRC32
> bytesPerChecksum: 1048576
> checksums: "\216\301\375\321"
> checksums: "\300\2022\n"
> checksums: "\334\002L0"
> checksums: "\335m=\322"
> }
> }
> }
> {code}
> *Steps to Reproduce:*
> # Set up an Apache Ozone cluster with sufficient DataNodes (e.g., 10+).
> # Write 100GB files concurrently across various directory structures using
> Freon (or similar tool) targeting petabyte-scale data.
> # Continue writing until ~90% disk utilization is reached on the majority
> (approximately 80%) of the DataNodes.
> # Observe that large file writes begin failing with
> {{ContainerNotOpenException}}, while small file writes on the same
> cluster continue to succeed.
> *Expected Behavior:*
> Large file writes should either:
> * Successfully obtain a new OPEN container when the current container
> transitions to CLOSED mid-write and continue writing, OR
> * Fail early with a clear, user-facing error (e.g., "Insufficient cluster
> space") before the write is initiated, rather than failing mid-stream after
> chunks have already been sent.
> *Actual Behavior:*
> Large file (100GB) writes fail mid-stream with:
> java.util.concurrent.CompletionException:
> org.apache.ratis.protocol.exceptions.StateMachineException:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException
> from Server 098be32c-a26e-4fe9-a491-c7d17b9bb04b@group-D9554F414C68:
> Container 20677 in CLOSED state
> at
> org.apache.ratis.client.impl.RaftClientImpl.handleRaftException(RaftClientImpl.java:373)
> at
> org.apache.ratis.client.impl.OrderedAsync.lambda$send$3(OrderedAsync.java:175)
> ...
> Caused by:
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException:
> Container 20677 in CLOSED state
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.validateContainerCommand(HddsDispatcher.java:581)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.startTransaction(ContainerStateMachine.java:488)
> ...
>
> Failed {{WriteChunk}} request details:
> * Container ID: 20677
> * Block local ID: 117883640217932227
> * Chunk: {{117883640217932227_chunk_53}}, offset: 218103808, len: 4194304
> *Proposed Improvement:*
> Similar to how quota validation is performed before a key {{PUT}} (checking
> namespace/space quota against available capacity), a pre-write cluster-level
> space check should be introduced before initiating a large file write.
> Specifically:
> * Before allocating block pipelines for a write, check whether the cluster
> has sufficient OPEN/allocable containers to accommodate the expected write
> size.
> * If the cluster is approaching full capacity and cannot allocate the
> required containers, fail fast with a clear error (e.g.,
> {{ClusterStorageFullException}} or similar) rather than allowing the write to
> proceed and fail mid-stream with a confusing
> {{ContainerNotOpenException}}.
> This would provide better UX, avoid partial writes/orphaned blocks, and align
> with the existing quota-check pattern already present at the OM layer.
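One way the proposed fail-fast check could look, as a rough sketch under stated assumptions: the caller already knows the expected write size, the container size, and how many containers can still be allocated. The class, the method names, and the 5 GiB container size used in the example are illustrative assumptions; `ClusterStorageFullException` is only the name suggested above and is defined locally here, not an existing Ozone class. Replication and competing concurrent writers are ignored.

```java
import java.io.IOException;

/** Hypothetical sketch of a pre-write space check; not an existing Ozone API. */
class PreWriteSpaceCheckSketch {

  /** Local stand-in for the exception name suggested in the proposal. */
  static class ClusterStorageFullException extends IOException {
    ClusterStorageFullException(String message) {
      super(message);
    }
  }

  /** Number of containers a write would occupy: ceil(writeSize / containerSize). */
  static long containersNeeded(long writeSizeBytes, long containerSizeBytes) {
    if (writeSizeBytes < 0 || containerSizeBytes <= 0) {
      throw new IllegalArgumentException("sizes must be non-negative and container size positive");
    }
    return (writeSizeBytes + containerSizeBytes - 1) / containerSizeBytes;
  }

  /** Fails before any chunk is sent if the write cannot fit in the allocable containers. */
  static void checkBeforeWrite(long writeSizeBytes, long containerSizeBytes,
      long allocableContainers) throws ClusterStorageFullException {
    long needed = containersNeeded(writeSizeBytes, containerSizeBytes);
    if (needed > allocableContainers) {
      throw new ClusterStorageFullException("Insufficient cluster space: write needs "
          + needed + " containers, only " + allocableContainers + " allocable");
    }
  }

  public static void main(String[] args) throws IOException {
    long gib = 1L << 30;
    // A 100 GiB write with an assumed 5 GiB container size needs 20 containers.
    System.out.println(containersNeeded(100 * gib, 5 * gib)); // prints 20
    checkBeforeWrite(100 * gib, 5 * gib, 20); // passes: exactly enough containers
    try {
      checkBeforeWrite(100 * gib, 5 * gib, 19);
    } catch (ClusterStorageFullException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

A check like this is only advisory: the count of allocable containers can change between the check and the write, so the mid-write recovery path would still be needed.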