[ 
https://issues.apache.org/jira/browse/HDFS-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641135#comment-13641135
 ] 

Varun Sharma commented on HDFS-4721:
------------------------------------

I don't think so. If I grep for the file name, all I have is the following:

2013-04-24 05:40:30,282 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
allocateBlock: 
/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238.
 BP-889095791-10.171.1.40-1366491606582 
blk_-2482251885029951704_11942{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.168.12.138:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}
2013-04-24 05:40:31,655 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: 
/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238
 for DFSClient_NONMAPREDUCE_-1195338611_41
2013-04-24 06:14:43,623 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.  
Holder: DFSClient_NONMAPREDUCE_-1195338611_41, pendingcreates: 1], 
src=/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238
 from client DFSClient_NONMAPREDUCE_-1195338611_41
2013-04-24 06:14:43,623 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  
Holder: DFSClient_NONMAPREDUCE_-1195338611_41, pendingcreates: 1], 
src=/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238
2013-04-24 06:14:43,623 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.internalReleaseLease: File 
/hbase/.logs/ip-10-170-15-97.ec2.internal,60020,1366780717760-splitting/ip-10-170-15-97.ec2.internal%2C60020%2C1366780717760.1366782030238
 has not been closed. Lease recovery is in progress. RecoveryId = 12012 for 
block blk_-2482251885029951704_11942{blockUCState=UNDER_RECOVERY, 
primaryNodeIndex=0, replicas=[ReplicaUnderConstruction[10.170.15.97:50010|RBW], 
ReplicaUnderConstruction[10.168.12.138:50010|RBW], 
ReplicaUnderConstruction[10.170.6.131:50010|RBW]]}

So there is only one lease recovery call. One second after this, recoverLease 
returns true; I am not sure why, though...
                
> Speed up lease/block recovery when DN fails and a block goes into recovery
> --------------------------------------------------------------------------
>
>                 Key: HDFS-4721
>                 URL: https://issues.apache.org/jira/browse/HDFS-4721
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.0.3-alpha
>            Reporter: Varun Sharma
>             Fix For: 2.0.4-alpha
>
>         Attachments: 4721-hadoop2.patch, 4721-trunk.patch, 
> 4721-trunk-v2.patch, 4721-v2.patch, 4721-v3.patch, 4721-v4.patch, 
> 4721-v5.patch, 4721-v6.patch, 4721-v7.patch, 4721-v8.patch
>
>
> This was observed while doing HBase WAL recovery. HBase uses append to write 
> to its write-ahead log. So initially the pipeline is set up as
> DN1 --> DN2 --> DN3
> This WAL needs to be read when DN1 fails since it houses the HBase 
> regionserver for the WAL.
> HBase first recovers the lease on the WAL file. During recovery, we choose 
> DN1 as the primary DN to do the recovery even though DN1 has failed and is 
> not heartbeating any more.
> Avoiding the stale DN1 would speed up recovery and reduce hbase MTTR. There 
> are two options.
> a) Build on HDFS-3703: if stale node detection is turned on, we do not 
> choose stale datanodes (typically no heartbeat for 20-30 seconds) as 
> primary DN(s).
> b) We sort the replicas in order of last heartbeat and always pick the ones 
> which gave the most recent heartbeat.
> Going to the dead datanode increases lease + block recovery time, since the 
> block goes into UNDER_RECOVERY state even though no one is actively 
> recovering it. Please let me know if this makes sense and, if so, whether we 
> should move forward with a) or b).
> Thanks
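
For illustration only, options a) and b) from the description could be sketched roughly as below. This is not the code from any of the attached patches. The Replica class, the method names and the 30-second stale interval are all assumptions made up for the example; none of them are HDFS classes or settings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PrimarySelection {
    /** Hypothetical stand-in for a replica location; not an HDFS class. */
    static final class Replica {
        final String addr;
        final long lastHeartbeatMillis;

        Replica(String addr, long lastHeartbeatMillis) {
            this.addr = addr;
            this.lastHeartbeatMillis = lastHeartbeatMillis;
        }
    }

    /** Option a): take the first replica in pipeline order that is not stale. */
    static Optional<Replica> pickSkippingStale(List<Replica> replicas,
                                               long nowMillis,
                                               long staleIntervalMillis) {
        for (Replica r : replicas) {
            if (nowMillis - r.lastHeartbeatMillis <= staleIntervalMillis) {
                return Optional.of(r);
            }
        }
        // Every replica is stale; the caller has to decide on a fallback.
        return Optional.empty();
    }

    /** Option b): order replicas so the most recent heartbeat comes first. */
    static List<Replica> sortByMostRecentHeartbeat(List<Replica> replicas) {
        List<Replica> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator
                .comparingLong((Replica r) -> r.lastHeartbeatMillis)
                .reversed());
        return sorted;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        // DN1 has failed and last heartbeated 45 seconds ago (made-up numbers).
        List<Replica> pipeline = Arrays.asList(
                new Replica("DN1", now - 45_000L),
                new Replica("DN2", now - 2_000L),
                new Replica("DN3", now - 1_000L));

        // Assumed stale interval of 30 seconds, taken from the description.
        Optional<Replica> a = pickSkippingStale(pipeline, now, 30_000L);
        System.out.println("a) primary: " + a.map(r -> r.addr).orElse("none"));

        Replica b = sortByMostRecentHeartbeat(pipeline).get(0);
        System.out.println("b) primary: " + b.addr);
    }
}
```

In this sketch both options avoid the failed DN1. Option a) keeps pipeline order among the live nodes, while option b) always prefers whichever node heartbeated last.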

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
