[ https://issues.apache.org/jira/browse/HDFS-4832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Prakash updated HDFS-4832:
-------------------------------

    Attachment: HDFS-4832.patch

Thanks for your review Kihwal. I've updated the patch.
bq. isInStartupSafeMode() returns true for any auto safe mode. E.g. if the 
resource checker puts NN in safe mode, it will return true.
I have filed HDFS-4862 to fix this. The method name is unfortunately contrary 
to its behavior.
{quote}
The existing code drained scheduled work in safe mode, but the patch makes it 
immediately stop sending scheduled work to DNs. This seems to be correct 
behavior for safe mode, but that work can be sent out after leaving safe mode. 
That may not be ideal. E.g. if NN is suffering from a flaky DNS, DNs will appear 
dead, come back, and die again, generating a lot of invalidation and replication 
work. Admins may put NN in safe mode to safely pass the storm. When they do, 
the unnecessary work needs to stop rather than be delayed. Please make sure 
unintended damage does not occur after leaving safe mode.
{quote}
UnderReplicatedBlocks is the priority queue maintained for neededReplications, 
and it is updated when nodes join or are marked dead. However, once 
BlockManager.computeReplicationWorkForBlocks() is called, the ReplicationWork is 
transferred to the DatanodeDescriptor's replicateBlocks queue, from which it 
will not be rescinded. computeReplicationWorkForBlocks() is called every 
replicationRecheckInterval, which defaults to 3 seconds. Can we please handle 
this in a separate JIRA?
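
To make the hand-off concrete, here is a toy model of the two queues. It is 
only a sketch, not the actual BlockManager code: the class 
ReplicationHandoffSketch and its methods onReplicaLost, onReplicaRestored and 
computeReplicationWork are hypothetical names made up for illustration, and a 
single queue stands in for all DatanodeDescriptors.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

/** Toy model of the hand-off; none of these members are real Hadoop APIs. */
class ReplicationHandoffSketch {
    // Stand-in for UnderReplicatedBlocks (neededReplications).
    final Set<String> neededReplications = new LinkedHashSet<>();
    // Stand-in for one DatanodeDescriptor's replicateBlocks queue.
    final Queue<String> replicateBlocks = new ArrayDeque<>();

    /** A DN was marked dead: the block becomes under-replicated. */
    void onReplicaLost(String block) {
        neededReplications.add(block);
    }

    /**
     * A DN rejoined: work is withdrawn only if it is still waiting in
     * neededReplications. Returns true if something was withdrawn.
     */
    boolean onReplicaRestored(String block) {
        return neededReplications.remove(block);
    }

    /** Stand-in for computeReplicationWorkForBlocks(): hands work to the DN queue. */
    void computeReplicationWork() {
        replicateBlocks.addAll(neededReplications);
        neededReplications.clear();
    }

    public static void main(String[] args) {
        ReplicationHandoffSketch nn = new ReplicationHandoffSketch();
        nn.onReplicaLost("blk_1");
        nn.onReplicaLost("blk_2");
        nn.onReplicaRestored("blk_1");  // before scheduling: withdrawn
        nn.computeReplicationWork();    // blk_2 is handed to the DN queue
        nn.onReplicaRestored("blk_2");  // after scheduling: nothing to withdraw
        System.out.println("still queued for DN: " + nn.replicateBlocks);
        // prints "still queued for DN: [blk_2]"
    }
}
```

In this model blk_2 stays queued for the datanode even though its replica came 
back, which is the case I would like to address in the separate JIRA.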
                
> Namenode doesn't change the number of missing blocks in safemode when DNs 
> rejoin or leave
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-4832
>                 URL: https://issues.apache.org/jira/browse/HDFS-4832
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 0.23.7, 2.0.5-beta
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>            Priority: Critical
>         Attachments: HDFS-4832.patch, HDFS-4832.patch, HDFS-4832.patch
>
>
> Courtesy Karri VRK Reddy!
> {quote}
> 1. Namenode lost datanodes causing missing blocks
> 2. Namenode was put in safe mode
> 3. Datanode restarted on dead nodes 
> 4. Waited for lots of time for the NN UI to reflect the recovered blocks.
> 5. Forced NN out of safe mode and suddenly,  no more missing blocks anymore.
> {quote}
> I was able to replicate this on 0.23 and trunk. I set 
> dfs.namenode.heartbeat.recheck-interval to 1 and killed the DN to simulate 
> "lost" datanode. The opposite case also has problems (i.e. Datanode failing 
> when NN is in safemode, doesn't lead to a missing blocks message)
> Without the NN updating this list of missing blocks, the grid admins will not 
> know when to take the cluster out of safemode.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
