[ https://issues.apache.org/jira/browse/HDDS-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HDDS-14830:
----------------------------------
Labels: pull-request-available (was: )
> Graceful interruption handling of EC reconstruction
> ----------------------------------------------------
>
> Key: HDDS-14830
> URL: https://issues.apache.org/jira/browse/HDDS-14830
> Project: Apache Ozone
> Issue Type: Bug
> Components: ECOfflineRecovery, Ozone Datanode
> Reporter: Wei-Chiu Chuang
> Assignee: Chi-Hsuan Huang
> Priority: Minor
> Labels: pull-request-available
>
> [https://github.com/apache/ozone/blob/master/hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java#L553]
> {code:java}
> Objects.requireNonNull(ioException, "ioException == null"); {code}
> This null check can fail if the thread is interrupted. See the log messages
> below:
>
> {code:java}
> 2026-03-11 13:59:37,187 ERROR [ContainerReplicationThread-1]-org.apache.hadoop.hdds.scm.XceiverClientGrpc: Command execution was interrupted
> java.lang.InterruptedException
> at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:385)
> at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2005)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:414)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$1(XceiverClientGrpc.java:357)
> at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
> at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:349)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:330)
> at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:546)
> at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createRecoveringContainer(ContainerProtocolCalls.java:499)
> at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.createRecoveringContainer(ECContainerOperationClient.java:169)
> at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:164)
> at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
> at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2026-03-11 13:59:37,188 WARN [ContainerReplicationThread-0]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator: Exception while reconstructing the container 72838. Cleaning up all the recovering containers in the reconstruction process.
> java.lang.NullPointerException
> at java.base/java.util.Objects.requireNonNull(Objects.java:222)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:454)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$1(XceiverClientGrpc.java:357)
> at org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
> at org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:349)
> at org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:330)
> at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:546)
> at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createRecoveringContainer(ContainerProtocolCalls.java:499)
> at org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.createRecoveringContainer(ECContainerOperationClient.java:169)
> at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:164)
> at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
> at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> {code}
>
> Not a huge problem because the datanode was crashing anyway, but we should
> always strive to handle exceptions more gracefully.
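
The failure mode described above can be illustrated with a small standalone sketch. It is not the Ozone code: the class `InterruptAwareRetry`, the methods `sendFragile`, `sendGraceful`, and `outcome`, and the choice of `InterruptedIOException` are all hypothetical, picked only to show how a retry loop that swallows `InterruptedException` can reach `Objects.requireNonNull(ioException, ...)` with nothing recorded, and one possible way to surface the interruption instead.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Hypothetical, simplified model of a retry loop over replicas.
class InterruptAwareRetry {

  // Fragile variant: an interruption is noted but no IOException is recorded,
  // so the null check after the loop throws NullPointerException.
  static String sendFragile(List<CompletableFuture<String>> replicas)
      throws IOException {
    IOException ioException = null;
    for (CompletableFuture<String> reply : replicas) {
      try {
        return reply.get();
      } catch (ExecutionException e) {
        ioException = new IOException(e.getCause());
      } catch (InterruptedException e) {
        // Restores the interrupt flag but leaves ioException untouched.
        Thread.currentThread().interrupt();
      }
    }
    Objects.requireNonNull(ioException, "ioException == null");
    throw ioException;
  }

  // Graceful variant (one possible approach): stop retrying on interruption
  // and report it as an InterruptedIOException with the original cause.
  static String sendGraceful(List<CompletableFuture<String>> replicas)
      throws IOException {
    IOException ioException = null;
    for (CompletableFuture<String> reply : replicas) {
      try {
        return reply.get();
      } catch (ExecutionException e) {
        ioException = new IOException(e.getCause());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        InterruptedIOException interrupted =
            new InterruptedIOException("Command execution was interrupted");
        interrupted.initCause(e);
        throw interrupted;
      }
    }
    if (ioException == null) {
      throw new IOException("No replica was available to send the command to");
    }
    throw ioException;
  }

  // Runs one variant on an interrupted thread against a reply that never
  // completes, and returns the simple name of the exception that escapes.
  static String outcome(boolean graceful) {
    List<CompletableFuture<String>> replicas =
        List.of(new CompletableFuture<String>());
    Thread.currentThread().interrupt(); // simulate a cancelled task
    try {
      return graceful ? sendGraceful(replicas) : sendFragile(replicas);
    } catch (IOException | RuntimeException e) {
      return e.getClass().getSimpleName();
    } finally {
      Thread.interrupted(); // clear the flag so the caller is unaffected
    }
  }

  public static void main(String[] args) {
    System.out.println("fragile:  " + outcome(false));
    System.out.println("graceful: " + outcome(true));
  }
}
```

In this sketch the caller of the graceful variant sees a checked `IOException` subtype that names the interruption, and the interrupt flag stays set so the owning thread pool can still observe it. Whether the actual fix should throw, return early, or use a different exception type is a design decision for the patch itself.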