[
https://issues.apache.org/jira/browse/HDFS-17488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ZanderXu resolved HDFS-17488.
-----------------------------
Fix Version/s: 3.5.0
Resolution: Fixed
> DN can fail IBRs with NPE when a volume is removed
> --------------------------------------------------
>
> Key: HDFS-17488
> URL: https://issues.apache.org/jira/browse/HDFS-17488
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Reporter: Felix N
> Assignee: Felix N
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
>
>
> Error logs
> {code:java}
> 2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
>     at org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> The root cause is in BPOfferService#notifyNamenodeBlock, which can be called
> on a block that belongs to a volume that has already been removed. Because
> the volume is gone, the storage lookup in the snippet below returns null,
>
> {code:java}
> private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
>     String delHint, String storageUuid, boolean isOnTransientStorage) {
>   checkBlock(block);
>   final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
>       block.getLocalBlock(), status, delHint);
>   final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
>
>   // storage == null here because it's already removed earlier.
>   for (BPServiceActor actor : bpServices) {
>     actor.getIbrManager().notifyNamenodeBlock(info, storage,
>         isOnTransientStorage);
>   }
> }
> {code}
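> so IBRs with a null storage are now pending. When sendIBRs later tries to
> send them, the null storage leads to the NPE in the stack trace above.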
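
To illustrate the failure mode described above, here is a minimal, self-contained sketch of a null-storage guard. It is not the change committed for HDFS-17488 and it uses no real Hadoop classes: IbrNullStorageSketch, Storage, notifyBlock and the "DS-1" storage ID are all hypothetical stand-ins. The assumption is simply that a report whose storage lookup returns null should be dropped rather than queued.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model only; none of these types are the real Hadoop classes.
final class IbrNullStorageSketch {
  /** Stand-in for DatanodeStorage: carries nothing but a storage ID. */
  static final class Storage {
    final String id;
    Storage(String id) { this.id = id; }
  }

  /** Stand-in for the dataset's storage registry, keyed by storage UUID. */
  private final Map<String, Storage> storages = new HashMap<>();
  /** Pending reports, recorded as "storageId:blockId" strings. */
  private final List<String> pending = new ArrayList<>();

  public void addVolume(String uuid) { storages.put(uuid, new Storage(uuid)); }
  public void removeVolume(String uuid) { storages.remove(uuid); }

  /** Returns true if a report was queued, false if it was skipped. */
  public boolean notifyBlock(long blockId, String storageUuid) {
    Storage storage = storages.get(storageUuid); // null once the volume is removed
    if (storage == null) {
      // Assumed guard: never queue a report that has no storage attached.
      return false;
    }
    pending.add(storage.id + ":" + blockId);
    return true;
  }

  public int pendingCount() { return pending.size(); }

  public static void main(String[] args) {
    IbrNullStorageSketch dn = new IbrNullStorageSketch();
    dn.addVolume("DS-1");
    System.out.println(dn.notifyBlock(1L, "DS-1")); // true
    dn.removeVolume("DS-1");
    System.out.println(dn.notifyBlock(2L, "DS-1")); // false
    System.out.println(dn.pendingCount());          // 1
  }
}
```

With a guard of this kind, nothing with a null storage ever reaches the pending queue, so the send path has nothing to dereference. Whether dropping the report silently is acceptable, or whether it should at least be logged, is a design decision this sketch does not settle.
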
> notifyNamenodeBlock can be triggered on such blocks because of a race in
> DirectoryScanner#reconcile:
> {code:java}
> public void reconcile() throws IOException {
>   LOG.debug("reconcile start DirectoryScanning");
>   scan();
>   // If a volume is removed here after scan() already finished running,
>   // diffs is stale and checkAndUpdate will run on a removed volume
>   // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
>   // long
>   int loopCount = 0;
>   synchronized (diffs) {
>     for (final Map.Entry<String, ScanInfo> entry : diffs.getEntries()) {
>       dataset.checkAndUpdate(entry.getKey(), entry.getValue());
>       ...
> }
> {code}
> Inside checkAndUpdate, memBlockInfo is null because all in-memory block
> metadata was dropped when the volume was removed, but diskFile still exists.
> DataNode#notifyNamenodeDeletedBlock (and, further down the line,
> notifyNamenodeBlock) is then called on this block.
>
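
The race described above can be modeled in a few lines. The sketch below is an assumption-laden illustration, not the committed fix and not real Hadoop code: StaleDiffSketch, its volume names and its methods are hypothetical. It shows a scan result that goes stale when a volume is removed, and a reconcile loop that re-checks volume liveness before processing each entry.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the scan/reconcile race; not the real DirectoryScanner.
final class StaleDiffSketch {
  /** Volumes currently attached to the (model) datanode. */
  private final Set<String> liveVolumes = new HashSet<>();
  /** Block ID -> volume, as recorded by the most recent scan. */
  private final Map<Long, String> diffs = new LinkedHashMap<>();

  public void addVolume(String volume) { liveVolumes.add(volume); }
  public void removeVolume(String volume) { liveVolumes.remove(volume); }

  /** Models scan(): snapshots which volume each differing block was found on. */
  public void scan(Map<Long, String> blocksOnDisk) {
    diffs.clear();
    diffs.putAll(blocksOnDisk);
  }

  /** Models the reconcile loop; returns the block IDs actually processed. */
  public List<Long> reconcile() {
    List<Long> processed = new ArrayList<>();
    for (Map.Entry<Long, String> entry : diffs.entrySet()) {
      if (!liveVolumes.contains(entry.getValue())) {
        // Assumed guard: the volume vanished after scan(), so the entry is stale.
        continue;
      }
      processed.add(entry.getKey());
    }
    return processed;
  }

  public static void main(String[] args) {
    StaleDiffSketch scanner = new StaleDiffSketch();
    scanner.addVolume("vol-A");
    scanner.addVolume("vol-B");
    Map<Long, String> found = new LinkedHashMap<>();
    found.put(1L, "vol-A");
    found.put(2L, "vol-B");
    scanner.scan(found);
    scanner.removeVolume("vol-B"); // removal lands between scan and reconcile
    System.out.println(scanner.reconcile()); // [1]
  }
}
```

Re-checking at reconcile time narrows the window but does not close it on its own in a concurrent setting; a real implementation would also need to consider the locking that surrounds volume removal.
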