[ https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210498#comment-16210498 ]

Weiwei Yang commented on HDFS-12638:
------------------------------------

Hi [~daryn]

bq. it sounds like you want to avoid the symptom (NPE) 

Yes, at the very least I think our code should tolerate such orphaned blocks 
instead of failing the NN with an NPE like this.

bq. rather than address the orphaned block (root cause)?

The test case explains where these orphaned blocks come from; the steps are 
summarized in the table below, and a rough reproduction sketch follows it:

||Step#||Operation||Explanation||
|1|Create a file|this creates a few blocks, e.g. B1|
|2|Run a rolling upgrade (or create a snapshot)|this is the condition that triggers the copy-on-truncate behavior|
|3|Truncate the file to a smaller size|the "copy-on-truncate" scheme is used: it creates a new block B2, copies the required bytes from B1 to B2, and updates the inode reference to B2|
|4|Delete the file|this deletes the inode reference and removes B2 from the blocks map, leaving B1 behind so that the snapshot can still read its original block B1|
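
For reference, here is a minimal sketch of that sequence as a 
{{MiniDFSCluster}} program, only to illustrate the ordering of the four steps 
above; the paths, sizes and replication factor are arbitrary, and a real test 
would also wait for the truncate recovery to finish before deleting the file.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class TruncateOrphanBlockRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster =
        new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
    try {
      cluster.waitActive();
      DistributedFileSystem dfs = cluster.getFileSystem();
      Path dir = new Path("/test");
      Path file = new Path(dir, "f");

      // Step 1: create a file, which allocates its first block (B1).
      dfs.mkdirs(dir);
      DFSTestUtil.createFile(dfs, file, 4096L, (short) 3, 0L);

      // Step 2: take a snapshot so the truncate below has to copy-on-truncate.
      dfs.allowSnapshot(dir);
      dfs.createSnapshot(dir, "s0");

      // Step 3: truncate to a smaller size; a new block (B2) is created and
      // the inode now references B2 while the snapshot still references B1.
      // With a snapshot in place this typically returns false and the copy
      // finishes asynchronously.
      boolean done = dfs.truncate(file, 1024L);
      System.out.println("truncate completed synchronously: " + done);

      // Step 4: delete the file; B2 goes away with the inode, but B1 stays
      // behind in the blocks map for the snapshot reader.
      dfs.delete(file, false);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}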

Please also see {{FSDirTruncateOp#shouldCopyOnTruncate}}. I have read the 
truncate design doc, and this seems to be the intended behavior: we cannot 
delete the old block, otherwise the snapshot would no longer be able to read 
it. I assume that when the snapshot gets deleted, these blocks will also be 
removed from the blocks map, but until then we need to live with such orphaned 
blocks. Any thoughts on this?
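
To make the first point above concrete, the kind of guard I have in mind looks 
roughly like the following, inside the per-block loop of 
{{BlockManager#computeReplicationWorkForBlocks}}. This is only a hypothetical 
fragment, not the attached patch, and the exact calls 
({{getBlockCollection(block)}}, {{neededReplications.remove(block, priority)}}) 
stand in for whatever the 2.8 code actually uses at that point.

{code:java}
// Hypothetical guard, not the attached patch: tolerate blocks whose owning
// file (BlockCollection) is already gone, instead of letting the NPE surface
// later in ReplicationWork#chooseTargets and kill the ReplicationMonitor.
BlockCollection bc = getBlockCollection(block);
if (bc == null) {
  // Orphaned block left behind by copy-on-truncate; it remains in the blocks
  // map only for the snapshot reader and needs no replication work.
  neededReplications.remove(block, priority);
  continue;
}
{code}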

Thanks


> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12638
>                 URL: https://issues.apache.org/jira/browse/HDFS-12638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.8.2
>            Reporter: Jiandan Yang 
>         Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> The active NameNode exits due to an NPE. I can confirm that the 
> BlockCollection passed in when creating ReplicationWork is null, but I do 
> not know why the BlockCollection is null. Looking through the history, I 
> found that [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] 
> removed the check for whether the BlockCollection is null.
> NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
>         at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
