[ https://issues.apache.org/jira/browse/HDFS-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660000#comment-14660000 ]
Walter Su commented on HDFS-8827:
---------------------------------

If you are going to delete 1 replica, you have excessTypes.length = 1. The situation here is that you are going to delete 2 replicas, but you still have excessTypes.length = 1. When you run out of {{excessTypes}}, you get an NPE.

The contiguous version never handles the NPE situation, because there excessTypes.length is always right. For the striped version it is not.

Four situations:
1. a blockGroup has a missing internalBlk.
2. a blockGroup has a redundant internalBlk.
3. a blockGroup has a missing internalBlk and a redundant internalBlk.
4. a blockGroup has a missing internalBlk and 2 redundant internalBlks.

Right now:
situation #1 is caught by ReplicationMonitor and handled by ECWorker.
situation #2 is caught by ReplicationMonitor and handled by processOverReplicatedBlock(..).
situation #3 is NOT caught by ReplicationMonitor.
situation #4 is caught by ReplicationMonitor and handled by processOverReplicatedBlock(..). It's buggy because fileReplication=9, so excessTypes.length = 1, but the handler tries to delete 2.

You got the NPE because you have a #4.

processOverReplicatedBlock(..) is *greedy*: it tries to delete every redundant blk to avoid situations #3 and #4. That's the best it can do, so #3 and #4 are now very unlikely to happen. But that's not enough. #3 and #4 can still happen, and they need a fix.

I'll upload a patch for #4 here, maybe later, and open a jira for #3.

Thanks [~tfukudom] for reporting this.
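To make the mismatch in situations #3 and #4 concrete, here is a minimal standalone sketch. It is not Hadoop code: the class {{BlockGroupRedundancySketch}} and its methods are hypothetical names, and it assumes the excess is derived from the total number of reported internal blocks minus the expected group width (9 here, matching fileReplication=9), which is one plausible reading of why excessTypes.length comes out as 1.

```java
/**
 * Hypothetical illustration only; none of these names exist in Hadoop.
 * A block group is modeled as the list of internal block indices that
 * datanodes currently report, with possible gaps and duplicates.
 */
final class BlockGroupRedundancySketch {

    /**
     * Returns {missing, redundant}: how many indices in [0, groupWidth)
     * are not reported at all, and how many reported copies are duplicates.
     * Assumes every reported index is within [0, groupWidth).
     */
    static int[] countMissingAndRedundant(int[] reportedIndices, int groupWidth) {
        int[] copies = new int[groupWidth];
        for (int idx : reportedIndices) {
            copies[idx]++;
        }
        int missing = 0;
        int redundant = 0;
        for (int c : copies) {
            if (c == 0) {
                missing++;
            } else {
                redundant += c - 1;
            }
        }
        return new int[] {missing, redundant};
    }

    /**
     * Excess as a replica-count style check would see it (an assumption):
     * total reported blocks minus the expected width, never negative.
     */
    static int excessByTotal(int[] reportedIndices, int groupWidth) {
        return Math.max(0, reportedIndices.length - groupWidth);
    }

    public static void main(String[] args) {
        int width = 9;
        // Situation #3: index 8 missing, index 7 reported twice.
        int[] s3 = {0, 1, 2, 3, 4, 5, 6, 7, 7};
        // Situation #4: index 8 missing, indices 6 and 7 reported twice.
        int[] s4 = {0, 1, 2, 3, 4, 5, 6, 7, 6, 7};

        for (int[] group : new int[][] {s3, s4}) {
            int[] counts = countMissingAndRedundant(group, width);
            System.out.println("missing=" + counts[0]
                + " redundant=" + counts[1]
                + " excessByTotal=" + excessByTotal(group, width));
        }
    }
}
```

Under these assumptions, situation #3 has 1 redundant internalBlk but a total-based excess of 0, so a count-based check sees nothing to do. Situation #4 has 2 redundant internalBlks but a total-based excess of 1, so a greedy handler that walks all redundant blks runs out of excess entries after the first deletion.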
> Erasure Coding: When namenode processes over replicated striped block, NPE will be occur in ReplicationMonitor
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8827
>                 URL: https://issues.apache.org/jira/browse/HDFS-8827
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Takuya Fukudome
>            Assignee: Takuya Fukudome
>         Attachments: HDFS-8827.1.patch, HDFS-8827.2.patch, HDFS-8827.3.patch, processing-over-replica-npe.log
>
>
> In our test cluster, when the namenode processed over-replicated striped blocks, a null pointer exception (NPE) occurred. This happened in the following situation:
> 1) some datanodes shut down.
> 2) the namenode recovers the block group which lost internal blocks.
> 3) the stopped datanodes are restarted.
> 4) the namenode processes over-replicated striped blocks.
> 5) the NPE occurs.
> I think BlockPlacementPolicyDefault#chooseReplicaToDelete returns null in this situation, which causes this NPE problem.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)