[ 
https://issues.apache.org/jira/browse/HDFS-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431674#comment-13431674
 ] 

Uma Maheswara Rao G commented on HDFS-3772:
-------------------------------------------

{quote}
It can resolve this problem when we know the hang is blocked by this issue 
clearly. But if the cluster just hang in safe mode and users can not judge 
whether is blocked by this issue.
{quote}
But the admin is the one who modified this min replication property, so in 
that case it will definitely hang. I feel this should be clear to admins 
when they modify it.

Persisting this parameter at the file level would be really unnecessary, 
since this parameter is common for all files.

So, the other option would be to persist this parameter in the image along 
with the LAYOUT_VERSION, etc. We can bump the layout version number, and 
while loading the image, if the version number is this new one, read that 
min replication from the image file and use it for the safemode validations.

But I am not sure whether we really need this or not.
So, before moving on, let's take others' opinions as well.
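The layout-version idea above could look roughly like this. A minimal sketch only: the version constant, method names, and stream handling here are assumptions for illustration, not actual HDFS image-format code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: persist min replication in the image header behind a
// bumped layout version, and fall back to the configured value for old images.
public class ImageMinReplicationSketch {
    // HDFS layout versions are negative and decrease when bumped;
    // this particular value is an assumption for the example.
    static final int MIN_REPLICATION_LAYOUT_VERSION = -41;

    static void saveHeader(DataOutput out, int layoutVersion,
            short minReplication) throws IOException {
        out.writeInt(layoutVersion);
        if (layoutVersion <= MIN_REPLICATION_LAYOUT_VERSION) {
            // New layout: the min replication travels with the image.
            out.writeShort(minReplication);
        }
    }

    static short loadMinReplication(DataInput in, short configuredDefault)
            throws IOException {
        int layoutVersion = in.readInt();
        if (layoutVersion <= MIN_REPLICATION_LAYOUT_VERSION) {
            // Use the persisted value for the safemode validations.
            return in.readShort();
        }
        // Old image: the value was never persisted; use the configured one.
        return configuredDefault;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        saveHeader(new DataOutputStream(buf),
                MIN_REPLICATION_LAYOUT_VERSION, (short) 2);
        short loaded = loadMinReplication(
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray())),
                (short) 1);
        System.out.println(loaded); // persisted value wins for the new layout
    }
}
```

With this, safemode would validate against the min replication the blocks were actually written under, rather than whatever the admin last configured.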

                
> HDFS NN will hang in safe mode and never come out if we change the 
> dfs.namenode.replication.min bigger.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3772
>                 URL: https://issues.apache.org/jira/browse/HDFS-3772
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Yanbo Liang
>
> If the NN restarts with a new minimum replication 
> (dfs.namenode.replication.min), any files created with the old replication 
> count are expected to be bumped up to the new minimum automatically upon 
> restart. However, in reality, if the NN restarts with a new minimum 
> replication that is bigger than the old one, the NN will hang in safemode 
> and never come out.
> The corresponding test case passes only because we are missing some test 
> coverage. This was discussed in HDFS-3734.
> If the NN receives enough reported blocks satisfying the new minimum 
> replication, it will exit safe mode. However, if we set a bigger minimum 
> replication, there will not be enough blocks satisfying that limit.
> Look at this code segment in FSNamesystem.java:
> {code}
> private synchronized void incrementSafeBlockCount(short replication) {
>   if (replication == safeReplication) {
>     this.blockSafe++;
>     checkMode();
>   }
> }
> {code}
> The DNs report blocks to the NN, and if the replication equals 
> safeReplication (which is assigned from the new minimum replication), we 
> increment blockSafe. But if we set a bigger minimum replication, all the 
> blocks whose replication is lower than it cannot satisfy this equality. 
> Yet the NN has actually received complete block information. This causes 
> blockSafe not to increment as usual, so it never reaches the amount needed 
> to exit safe mode, and then the NN hangs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
