[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205861#comment-16205861
 ] 

Weiwei Yang commented on HDFS-12638:
------------------------------------

Hi [~yangjiandan]

Thanks for narrowing down the root cause and providing a test case. I believe 
this problem can occur whenever truncate runs under the *copy-on-truncate* 
schema, e.g. during a rolling upgrade, when an upgrade is not finalized, or 
when the file is in a snapshot. That code path creates a new block for the 
truncation while the old block is left behind in the blocks map. When the file 
gets deleted, the old block becomes an orphan block.

Further, I read quite a few JIRAs about similar problems, such as HDFS-7611, 
HDFS-8113 and HDFS-4867. It looks like the way we deal with such blocks (when 
they can reasonably be orphan blocks) is to simply add a check to avoid the 
NPE. For example, in {{BlockManager#dumpBlockMeta}}:

{code}
if (block instanceof BlockInfo) {
  BlockCollection bc = getBlockCollection((BlockInfo) block);
  String fileName = (bc == null) ? "[orphaned]" : bc.getName();
  out.print(fileName + ": ");
}
{code}

Most places already handle the case like this, so I would suggest using a 
similar fix to resolve this issue. A few suggestions:

# Add a check in {{BlockManager#scheduleReplication}} to avoid the NPE
# Review the call in {{BlockManager#chooseExcessReplicates}}; most likely it 
needs a check too
# Add a check in {{NamenodeFsck}} to fix the NPE when running {{fsck -blockId}} 
against an orphan block
# Add a javadoc note that {{BlockManager#getBlockCollection}} may return 
null
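To make the guard pattern concrete, here is a self-contained sketch of the null check. {{BlockInfo}}, {{BlockCollection}}, and the blocks map below are simplified stand-ins invented for illustration, not the real org.apache.hadoop.hdfs classes:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins only; the real HDFS types have far more state.
public class OrphanBlockCheck {
    static class BlockInfo {
        final long blockId;
        BlockInfo(long blockId) { this.blockId = blockId; }
    }

    static class BlockCollection {
        final String name;
        BlockCollection(String name) { this.name = name; }
        String getName() { return name; }
    }

    // Simplified blocks map: block id -> owning file (block collection).
    private final Map<Long, BlockCollection> blocksMap = new HashMap<>();

    void addBlock(long id, BlockCollection bc) { blocksMap.put(id, bc); }

    // May return null when the block is orphaned, i.e. its file was
    // deleted while the old block stayed behind in the blocks map.
    BlockCollection getBlockCollection(BlockInfo block) {
        return blocksMap.get(block.blockId);
    }

    // The dumpBlockMeta-style guard: test for null instead of
    // dereferencing the collection unconditionally and hitting an NPE.
    String fileNameOf(BlockInfo block) {
        BlockCollection bc = getBlockCollection(block);
        return (bc == null) ? "[orphaned]" : bc.getName();
    }

    public static void main(String[] args) {
        OrphanBlockCheck bm = new OrphanBlockCheck();
        bm.addBlock(1L, new BlockCollection("/user/foo/data.txt"));
        System.out.println(bm.fileNameOf(new BlockInfo(1L)));
        System.out.println(bm.fileNameOf(new BlockInfo(2L)));
    }
}
```

The same check would apply at each call site listed above: treat a null return from {{getBlockCollection}} as an orphan and skip the work rather than dereference it.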

Please let me know if this makes sense, [~yangjiandan], [~kihwal], [~daryn].





> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12638
>                 URL: https://issues.apache.org/jira/browse/HDFS-12638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.8.2
>            Reporter: Jiandan Yang 
>         Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> Active NameNode exits due to an NPE. I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null. Looking through the history, I found 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for 
> whether BlockCollection is null.
> NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
>         at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
