[jira] [Commented] (HDFS-2815) Namenode is not coming out of safemode when we perform ( NN crash + restart ) . Also FSCK report shows blocks missed.

Suresh Srinivas (Commented) (JIRA) Sat, 11 Feb 2012 12:37:23 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206257#comment-13206257
 ]


Suresh Srinivas commented on HDFS-2815:
---------------------------------------

bq. Linking HDFS-173, the patch that added the problematic code.
HDFS-173 is not the cause. Before HDFS-173, the following was the sequence:
# Delete the directory, files and blocks while holding the lock. This could 
trigger the deletion of blocks at the datanodes.
# Then add the editlog entry outside the lock.

As this jira discussion demonstrates, if the NN crashes between the above 
steps, blocks may already have been deleted on the DNs, yet no record of the 
deletion exists in the editlog.

With HDFS-173, the behavior changed to:
# Delete the directory, files and blocks while holding the lock. This could 
trigger the deletion of blocks at the datanodes *if the number of blocks is 
small*.
# Then add the editlog entry outside the lock.
# *New change:* delete the blocks if the number of blocks is large.

Note that the part Uma is talking about is from step 1, which is still the 
old behavior.

The patch now proposes deleting the blocks only after the deletion has been 
recorded in the editlog, as in step 3 of HDFS-173. I think this sounds fine.
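
To make the ordering issue concrete, here is a minimal sketch of the two 
orderings with a simulated crash. It is a toy model, not NameNode code: the 
class DeleteOrderingSketch and all of its fields and methods are hypothetical 
names made up for illustration, and the "editlog" and "datanodes" are just 
in-memory sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only; none of these names are real HDFS APIs. */
public class DeleteOrderingSketch {

    /** Blocks the durable editlog still references (what a restarted NN expects). */
    private final Set<Long> blocksInDurableLog = new HashSet<>();
    /** Blocks physically present on the simulated datanodes. */
    private final Set<Long> blocksOnDataNodes = new HashSet<>();

    public DeleteOrderingSketch(List<Long> blocks) {
        blocksInDurableLog.addAll(blocks);
        blocksOnDataNodes.addAll(blocks);
    }

    /** Invalidate first, log afterwards: a crash in between loses the delete record. */
    public void deleteInvalidateFirst(List<Long> blocks, boolean crashBeforeLogSync) {
        blocksOnDataNodes.removeAll(blocks);      // DNs are told to delete the blocks
        if (crashBeforeLogSync) {
            return;                               // simulated kill -9 before the sync
        }
        blocksInDurableLog.removeAll(blocks);     // delete is persisted to the editlog
    }

    /** Log and sync first, invalidate only afterwards. */
    public void deleteLogFirst(List<Long> blocks, boolean crashAfterLogSync) {
        blocksInDurableLog.removeAll(blocks);     // delete is persisted to the editlog
        if (crashAfterLogSync) {
            return;                               // simulated kill -9 after the sync
        }
        blocksOnDataNodes.removeAll(blocks);      // only now are DNs told to delete
    }

    /** Blocks a restarted NN would wait for but that no DN can report. */
    public Set<Long> missingAfterRestart() {
        Set<Long> missing = new HashSet<>(blocksInDurableLog);
        missing.removeAll(blocksOnDataNodes);
        return missing;
    }

    public static void main(String[] args) {
        List<Long> blocks = Arrays.asList(1L, 2L, 3L);

        DeleteOrderingSketch invalidateFirst = new DeleteOrderingSketch(blocks);
        invalidateFirst.deleteInvalidateFirst(blocks, true);
        System.out.println("invalidate-first missing: "
                + invalidateFirst.missingAfterRestart().size());   // prints 3

        DeleteOrderingSketch logFirst = new DeleteOrderingSketch(blocks);
        logFirst.deleteLogFirst(blocks, true);
        System.out.println("log-first missing: "
                + logFirst.missingAfterRestart().size());          // prints 0
    }
}
```

In the log-first case a crash can leave blocks on the datanodes that the 
namenode no longer references. In this model that is harmless for safemode, 
since the restarted namenode does not wait for them.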

                
> Namenode is not coming out of safemode when we perform ( NN crash + restart ) 
> .  Also FSCK report shows blocks missed.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2815
>                 URL: https://issues.apache.org/jira/browse/HDFS-2815
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0, 0.24.0, 0.23.1, 1.0.0
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>            Priority: Critical
>         Attachments: HDFS-2815.patch
>
>
> When testing HA (internal) with continuous switches at about 5-minute 
> intervals, I found some *blocks missing*, and the namenode went into 
> safemode after the next switch.
>    
>    After analysis, I found that these files had already been deleted by 
> clients, but I don't see any delete command logs in the namenode log files. 
> However, the namenode added those blocks to invalidateSets and the DNs 
> deleted the blocks.
>    When the namenode restarted, it went into safemode, expecting some more 
> blocks before it could come out of safemode.
>    The reason could be that the file is deleted in memory and its blocks are 
> added to invalidates, and only after this does the namenode try to sync the 
> edits to the editlog file. By that time the NN has already asked the DNs to 
> delete those blocks. The namenode then shuts down before persisting to the 
> editlog (the log is behind).
>    For this reason we may not get the INFO logs about the delete, and when we 
> restart the namenode (in my scenario it is again a switch), the namenode 
> still expects these deleted blocks, as the delete request was not persisted 
> to the editlog beforehand.
>    I reproduced this scenario with debug points. *I feel we should not add 
> the blocks to invalidates before persisting to the editlog*. 
>     Note: for the switch, we used kill -9 (force kill).
>   I am currently on version 0.20.2. The same was verified in 0.23 as well, in 
> a normal crash + restart scenario.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
