Wei-Chiu Chuang created HDDS-14830:
--------------------------------------
Summary: Graceful interruption handling of EC reconstruction
Key: HDDS-14830
URL: https://issues.apache.org/jira/browse/HDDS-14830
Project: Apache Ozone
Issue Type: Bug
Components: ECOfflineRecovery, Ozone Datanode
Reporter: Wei-Chiu Chuang
[https://github.com/apache/ozone/blob/master/hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java#L553]
{code:java}
Objects.requireNonNull(ioException, "ioException == null"); {code}
This null check can fail if the thread is interrupted: ioException is then
still null when the check runs, so requireNonNull throws a
NullPointerException. See the log messages below:
{code:java}
2026-03-11 13:59:37,187 ERROR [ContainerReplicationThread-1]-org.apache.hadoop.hdds.scm.XceiverClientGrpc: Command execution was interrupted
java.lang.InterruptedException
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:385)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2005)
at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:414)
at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$1(XceiverClientGrpc.java:357)
at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:349)
at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:330)
at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:546)
at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createRecoveringContainer(ContainerProtocolCalls.java:499)
at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.createRecoveringContainer(ECContainerOperationClient.java:169)
at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:164)
at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2026-03-11 13:59:37,188 WARN [ContainerReplicationThread-0]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator: Exception while reconstructing the container 72838. Cleaning up all the recovering containers in the reconstruction process.
java.lang.NullPointerException
at java.base/java.util.Objects.requireNonNull(Objects.java:222)
at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:454)
at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$1(XceiverClientGrpc.java:357)
at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:349)
at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:330)
at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:546)
at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createRecoveringContainer(ContainerProtocolCalls.java:499)
at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.createRecoveringContainer(ECContainerOperationClient.java:169)
at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:164)
at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
{code}
This is not a huge problem, because the datanode was crashing anyway, but the
interruption should be handled gracefully instead of surfacing as a
NullPointerException.
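
One possible shape for more graceful handling, as a minimal sketch rather than the actual XceiverClientGrpc code: it assumes the retry loop records an IOException only for failed attempts, so an interrupt leaves nothing recorded. The class InterruptAwareRetry, the Attempt interface, and sendWithRetry are hypothetical names used for illustration only. The idea is to restore the interrupt flag and fail fast with an InterruptedIOException, so the null check is never reached on the interrupted path.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class InterruptAwareRetry {

  /** Hypothetical stand-in for sending one command to one datanode. */
  public interface Attempt<T> {
    CompletableFuture<T> send(String datanode);
  }

  /** Tries each datanode in turn; stops immediately if the thread is interrupted. */
  public static <T> T sendWithRetry(List<String> datanodes, Attempt<T> attempt)
      throws IOException {
    IOException ioException = null;
    for (String dn : datanodes) {
      try {
        return attempt.send(dn).get();
      } catch (ExecutionException e) {
        // A real failure: remember it and try the next datanode.
        ioException = new IOException("Failed to execute command on " + dn, e.getCause());
      } catch (InterruptedException e) {
        // Restore the flag for callers, then fail fast instead of retrying.
        Thread.currentThread().interrupt();
        InterruptedIOException iioe =
            new InterruptedIOException("Command execution was interrupted on " + dn);
        iioe.initCause(e);
        throw iioe;
      }
    }
    if (ioException == null) {
      // Only reachable when the datanode list is empty.
      throw new IOException("No datanode available to execute the command");
    }
    throw ioException;
  }

  public static void main(String[] args) {
    // Simulate the task being cancelled before the first attempt completes.
    Thread.currentThread().interrupt();
    try {
      InterruptAwareRetry.<String>sendWithRetry(
          Arrays.asList("dn1", "dn2"), dn -> new CompletableFuture<String>());
    } catch (InterruptedIOException e) {
      System.out.println("interrupted: " + e.getMessage());
    } catch (IOException e) {
      System.out.println("failed: " + e.getMessage());
    }
  }
}
```

With this shape, the caller sees an IOException subtype that names the interruption, and the interrupt flag stays set so the thread pool can still observe the cancellation.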