[jira] [Created] (HDDS-14830) Graceful interruption handling of EC reconstruction

Wei-Chiu Chuang (Jira) Thu, 12 Mar 2026 12:12:58 -0700

Wei-Chiu Chuang created HDDS-14830:
--------------------------------------

             Summary: Graceful interruption handling of EC reconstruction 
                 Key: HDDS-14830
                 URL: https://issues.apache.org/jira/browse/HDDS-14830
             Project: Apache Ozone
          Issue Type: Bug
          Components: ECOfflineRecovery, Ozone Datanode
            Reporter: Wei-Chiu Chuang



[https://github.com/apache/ozone/blob/master/hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java#L553]
{code:java}
Objects.requireNonNull(ioException, "ioException == null");
{code}
This null check can fail if the thread is interrupted. See the log messages below:

 
{code:java}
2026-03-11 13:59:37,187 ERROR [ContainerReplicationThread-1]-org.apache.hadoop.hdds.scm.XceiverClientGrpc: Command execution was interrupted
java.lang.InterruptedException
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:385)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2005)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:414)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$1(XceiverClientGrpc.java:357)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:349)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:330)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:546)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createRecoveringContainer(ContainerProtocolCalls.java:499)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.createRecoveringContainer(ECContainerOperationClient.java:169)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:164)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
        at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2026-03-11 13:59:37,188 WARN [ContainerReplicationThread-0]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator: Exception while reconstructing the container 72838. Cleaning up all the recovering containers in the reconstruction process.
java.lang.NullPointerException
        at java.base/java.util.Objects.requireNonNull(Objects.java:222)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:454)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$1(XceiverClientGrpc.java:357)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
        at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:349)
        at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:330)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:546)
        at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createRecoveringContainer(ContainerProtocolCalls.java:499)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.createRecoveringContainer(ECContainerOperationClient.java:169)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:164)
        at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
        at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
 {code}
 

Not a huge problem because the datanode was crashing anyway, but we should 
always strive to handle exceptions more gracefully.
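
One possible shape for more graceful handling, as a minimal sketch rather than the actual XceiverClientGrpc code: when the wait on the future is interrupted, restore the interrupt flag and throw an InterruptedIOException right away, so the retry loop never reaches a null check with no recorded failure. The class InterruptAwareRetry, the method sendWithRetry, and its parameters are hypothetical names introduced only for illustration.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.Function;

// Hypothetical stand-in for a retry loop; not the real Ozone client code.
final class InterruptAwareRetry {
  private InterruptAwareRetry() { }

  /** Tries each target in turn and returns the first successful reply. */
  static <T, R> R sendWithRetry(List<T> targets,
      Function<T, CompletableFuture<R>> sender) throws IOException {
    IOException lastFailure = null;
    for (T target : targets) {
      try {
        return sender.apply(target).get();
      } catch (ExecutionException e) {
        // Remember the failure and move on to the next target.
        lastFailure =
            new IOException("Command failed on " + target, e.getCause());
      } catch (InterruptedException e) {
        // Restore the interrupt flag and stop retrying. The caller sees a
        // checked exception instead of a NullPointerException later on.
        Thread.currentThread().interrupt();
        InterruptedIOException interrupted =
            new InterruptedIOException("Command execution was interrupted");
        interrupted.initCause(e);
        throw interrupted;
      }
    }
    if (lastFailure == null) {
      // Only reachable when the target list is empty.
      throw new IOException("No targets to send the command to");
    }
    throw lastFailure;
  }
}
```

Because InterruptedIOException is a subclass of IOException, callers that already handle IOException would not need a signature change under this approach, and callers that care can still tell an interruption apart from a failed command.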



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
