[ 
https://issues.apache.org/jira/browse/HDFS-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660000#comment-14660000
 ] 

Walter Su commented on HDFS-8827:
---------------------------------

If you are going to delete 1 replica, you have excessTypes.length = 1.
The situation here is that you are going to delete 2 replicas, but you still have excessTypes.length = 1.
Once you run out of {{excessTypes}}, you get an NPE.
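
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the Hadoop code: the class name, the "node:TYPE" string encoding, and the simplified {{chooseReplicaToDelete}} signature are all assumptions made for illustration. It only shows that when the caller asks for more deletions than there are entries in excessTypes, the chooser has nothing left to match and returns null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the real logic; names and types are illustrative only.
class ExcessTypesSketch {

    // Returns the first candidate ("node:TYPE") whose storage type is still
    // present in excessTypes, consuming that entry; returns null if none matches.
    static String chooseReplicaToDelete(List<String> candidates, List<String> excessTypes) {
        for (String candidate : candidates) {
            String type = candidate.substring(candidate.indexOf(':') + 1);
            if (excessTypes.remove(type)) {
                return candidate;
            }
        }
        return null;
    }

    // Tries to delete toDelete replicas and returns how many were actually deleted.
    static int deleteExcess(List<String> candidates, List<String> excessTypes, int toDelete) {
        List<String> remaining = new ArrayList<>(candidates);
        int deleted = 0;
        for (int i = 0; i < toDelete; i++) {
            String chosen = chooseReplicaToDelete(remaining, excessTypes);
            if (chosen == null) {
                // excessTypes ran out; dereferencing chosen here would throw an NPE.
                break;
            }
            remaining.remove(chosen);
            deleted++;
        }
        return deleted;
    }
}
```

With two candidates on DISK, a single "DISK" entry in excessTypes, and toDelete = 2, the second call to the chooser returns null. The null check in this sketch is what keeps it from throwing.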

The contiguous version never handles the NPE situation, because excessTypes.length 
is always right there. But in the striped version it is not.

Four situations:
1. a blockGroup has a missing internalBlk.
2. a blockGroup has a redundant internalBlk.
3. a blockGroup has a missing internalBlk, and a redundant internalBlk.
4. a blockGroup has a missing internalBlk, and 2 redundant internalBlks.
Right now,
situation #1 will be caught by ReplicationMonitor and handled by ECWorker.
situation #2 will be caught by ReplicationMonitor and handled by 
processOverReplicatedBlock(..).
situation #3 will NOT be caught by ReplicationMonitor.
situation #4 will be caught by ReplicationMonitor and handled by 
processOverReplicatedBlock(..). It's buggy because fileReplication=9, so 
excessTypes.length = 1, but the handler tries to delete 2.
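
The arithmetic behind #3 and #4 can be sketched as below. This is only an illustration under stated assumptions: the class and method names are hypothetical, a block group is modeled as the list of reported internal block indices, and the group size of 9 comes from the fileReplication=9 mentioned above. The last value is the excess as seen by a pure count comparison, which is one plausible reading of why #3 goes unnoticed (9 reported, 9 expected) and why #4 yields an excess of 1 while 2 internalBlks are redundant.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a block group; not the Hadoop data structures.
class BlockGroupCountSketch {

    // Returns {missing, redundant, excessByCount} for the reported internal
    // block indices, assuming every index lies in [0, groupSize).
    static int[] classify(List<Integer> reportedIndices, int groupSize) {
        Set<Integer> distinct = new HashSet<>(reportedIndices);
        int missing = groupSize - distinct.size();
        int redundant = reportedIndices.size() - distinct.size();
        // What a count-only check would see as excess.
        int excessByCount = Math.max(0, reportedIndices.size() - groupSize);
        return new int[] {missing, redundant, excessByCount};
    }
}
```

For example, with index 8 missing and indices 0 and 1 reported twice (situation #4), there are 10 reported internalBlks, so the count-based excess is 1 even though 2 are redundant.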

You got the NPE because you have a #4.

processOverReplicatedBlock(..) is *greedy*, so it tries to delete every redundant 
blk to avoid situations #3 and #4. That's the best it can do, so #3 and #4 are now 
very unlikely to happen. But that's not enough: #3 and #4 can still happen, and they 
need a fix.

I'll upload a patch for #4 here, maybe later, and open a jira for #3.

Thanks [~tfukudom] for reporting this.

> Erasure Coding: When namenode processes over replicated striped block, NPE 
> will be occur in ReplicationMonitor
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8827
>                 URL: https://issues.apache.org/jira/browse/HDFS-8827
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Takuya Fukudome
>            Assignee: Takuya Fukudome
>         Attachments: HDFS-8827.1.patch, HDFS-8827.2.patch, HDFS-8827.3.patch, 
> processing-over-replica-npe.log
>
>
> In our test cluster, when the namenode processed over-replicated striped blocks, 
> a null pointer exception (NPE) occurred. This happened in the following situation: 1) 
> some datanodes shut down. 2) the namenode recovers the block groups which lost 
> internal blocks. 3) the stopped datanodes are restarted. 4) the namenode processes 
> the over-replicated striped blocks. 5) the NPE occurs.
> I think BlockPlacementPolicyDefault#chooseReplicaToDelete will return null in 
> this situation, which causes this NPE problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
