[ https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272083#comment-16272083 ]
Erik Krogen commented on HDFS-12638:
------------------------------------

I investigated this further and think that Konstantin's v002 patch should actually solve the problem. Actually, HDFS-9754 does not change the invariant that [~shv] mentioned. A few notes from the investigation:

* After incremental block deletion was added, it was already possible for a block that is not associated with a valid INode to be present in the blocksMap. In {{FSNamesystem#delete()}}, we first call {{FSDirDeleteOp#delete()}} within the write lock, release the write lock, and then call {{BlockManager#removeBlock()}} (which removes the block from the blocksMap) on each block later on. Within {{FSDirDeleteOp#delete()}}, all INodes being deleted are removed from the inodesMap (see {{FSDirDeleteOp#deleteInternal()}}, which calls {{FSNamesystem#removeLeasesAndINodes()}}).
* This scenario meant that places such as {{BlockManager#scheduleReconstruction()}} had to check whether there was a BlockCollection associated with the block, which they previously did by checking {{FSNamesystem#getBlockCollection(blkInfo.getBlockCollectionId()) != null}}. HDFS-9754 replaced this call with {{BlockInfo#isDeleted()}}, which means that whenever we remove an INode from the inodesMap, we must also call {{BlockInfo#delete()}} to indicate that the block no longer has a valid BlockCollection associated with it (this is currently done within {{INodeFile#clearFile()}}, called by {{INode#destroyAndCollectBlocks()}}, called by {{FSDirDeleteOp#unprotectedDelete()}}); see the sketch after this list.
* HDFS-9754 did not properly mark copy-on-truncate blocks with {{BlockInfo#delete()}}, so the {{BlockInfo#isDeleted()}} check does not catch them, causing {{BlockManager#scheduleReconstruction()}} to throw an NPE when it uses {{FSNamesystem#getBlockCollection(blkInfo)}} (since it assumes there is a valid BlockCollection associated with the block).
* Konstantin's patch correctly invalidates copy-on-truncate blocks, so it should fix this NPE, at least for the case of copy-on-truncate blocks.

So +1 from me (non-binding) on the logic of the v002 patch. We should also try to get a unit test in for this.
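To make the invariant concrete, here is a minimal, self-contained sketch of the failure mode. These are simplified stand-ins written for illustration, not the actual Hadoop classes; the class, field, and map names here are mine. It shows why, once the {{isDeleted()}} guard replaces the old null-BlockCollection check, every path that removes an INode must also call {{BlockInfo#delete()}}:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class IsDeletedGuardSketch {
  /** Simplified stand-in for BlockInfo: a deleted block has no valid BC id. */
  static class BlockInfo {
    static final long INVALID_BC_ID = -1;
    long bcId;
    BlockInfo(long bcId) { this.bcId = bcId; }
    void delete() { bcId = INVALID_BC_ID; }              // analogue of BlockInfo#delete()
    boolean isDeleted() { return bcId == INVALID_BC_ID; }
  }

  /** Simplified stand-in for the bcId -> BlockCollection mapping. */
  static final Map<Long, String> blockCollections = new HashMap<>();

  /** Sketch of the post-HDFS-9754 guard in scheduleReconstruction(). */
  static void scheduleReconstruction(BlockInfo b) {
    if (b.isDeleted()) {
      System.out.println("block marked deleted, skipping"); // safe path
      return;
    }
    // Past the guard the BlockCollection is assumed valid, so a stale
    // block that was never marked deleted dereferences null here.
    String bc = blockCollections.get(b.bcId);
    System.out.println("scheduling reconstruction for " + bc.toUpperCase());
  }

  public static void main(String[] args) {
    // Normal case: block with a live BlockCollection.
    blockCollections.put(1L, "fileA");
    BlockInfo blk = new BlockInfo(1L);
    scheduleReconstruction(blk);

    // Correct delete path: INode removed AND BlockInfo#delete() called.
    blockCollections.remove(1L);
    blk.delete();
    scheduleReconstruction(blk);                         // guard catches it

    // Copy-on-truncate bug: the file is gone but delete() was never
    // called, so isDeleted() is false and the guard is bypassed -> NPE,
    // analogous to the ReplicationMonitor stack trace below.
    BlockInfo truncated = new BlockInfo(2L);             // 2L never added to the map
    try {
      scheduleReconstruction(truncated);
    } catch (NullPointerException e) {
      System.out.println("NPE, as seen in the ReplicationMonitor");
    }
  }
}
{code}

In the sketch the third call bypasses the guard exactly the way an unmarked copy-on-truncate block would, which is the path the v002 patch closes by invalidating those blocks.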
> NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12638
>                 URL: https://issues.apache.org/jira/browse/HDFS-12638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.8.2
>            Reporter: Jiandan Yang
>            Priority: Blocker
>         Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, OphanBlocksAfterTruncateDelete.jpg
>
> The active NameNode exits due to an NPE. I can confirm that the BlockCollection passed in when creating the ReplicationWork is null, but I do not know why it is null. Looking through the history, I found that [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for whether the BlockCollection is null.
> NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
>         at java.lang.Thread.run(Thread.java:834)
> {code}