[ https://issues.apache.org/jira/browse/HBASE-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641542#comment-13641542 ]

Varun Sharma commented on HBASE-8389:
-------------------------------------

Hi Nicholas,

First, I configure the HDFS cluster as follows (a configuration sketch follows 
the list):

dfs.socket.timeout = 3sec
dfs.socket.write.timeout = 5sec
ipc.client.connect.timeout = 1sec
ipc.client.connect.max.retries.on.timeouts = 2 (hence 3 connect attempts in total)
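
For reference, a minimal sketch of these settings applied to a Hadoop client
Configuration (property names as listed above; the millisecond conversions and
the snippet itself are only illustrative):

    import org.apache.hadoop.conf.Configuration;

    // Sketch: the same timeouts expressed in milliseconds on a client Configuration.
    Configuration conf = new Configuration();
    conf.setInt("dfs.socket.timeout", 3000);              // 3 sec read timeout
    conf.setInt("dfs.socket.write.timeout", 5000);        // 5 sec write timeout
    conf.setInt("ipc.client.connect.timeout", 1000);      // 1 sec connect timeout
    conf.setInt("ipc.client.connect.max.retries.on.timeouts", 2); // 2 retries -> 3 attempts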

The connect timeout is low since connecting should really be very fast unless 
something major is wrong. Our clusters are housed within the same AZ on Amazon 
EC2, and it is very rare to see these timeouts get hit even on EC2, which is 
known for poor I/O performance. For the most part, I see these timeouts kick in 
during failures. Note that these timeouts are only used for avoiding bad 
datanodes and not for marking nodes as dead/stale, so I think they are okay for 
quick failovers - we already have high timeouts for dead node detection and the 
ZooKeeper session (tens of seconds); a property sketch follows the list below.

stale node timeout = 20 seconds
dead node timeout = 10 minutes
ZooKeeper session timeout = 30 seconds
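
To map these to configuration, the properties I believe govern them are
sketched below (names taken from Hadoop 2.0 / HBase 0.94 defaults, not
double-checked against our configs, so treat them as an assumption):

    // Sketch only: assumed property names for the stale/dead/ZK timeouts above.
    Configuration conf = new Configuration();
    conf.setLong("dfs.namenode.stale.datanode.interval", 20 * 1000L);      // stale after 20 sec without a heartbeat
    conf.setInt("dfs.namenode.heartbeat.recheck-interval", 5 * 60 * 1000); // dead-node detection is derived from this (~10 min by default)
    conf.setInt("zookeeper.session.timeout", 30 * 1000);                   // HBase ZooKeeper session timeout, 30 sec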

HDFS is Hadoop 2.0 with HDFS-3703, HDFS-3912 and HDFS-4721. The approach is the 
following:

a) A node is failed artificially, either by
  1) using iptables to allow only ssh traffic and drop all other traffic, or
  2) suspending the processes

b) Even though we configure stale detection to be faster than HBase detection, 
let's assume that does not play out. The node is not marked stale.

c) Lease recovery attempt #1
   i) We choose a good primary node for the recovery, since it's likely that the 
bad node has the worst possible heartbeat (HDFS-4721)
   ii) But we point it to recover from all 3 nodes, since we are considering the 
worst case where no node is marked stale
   iii) The primary tries to reconcile the block with all 3 nodes and hits 
either
        a) dfs.socket.timeout = 3 seconds - if the process is suspended
        b) ipc.client.connect.timeout over 3 connect attempts, i.e. 3 * 1 second 
+ 3 * 1 second of sleep = 6 seconds - if we firewall the host using iptables

d) If we use a timeout of 4 seconds, the first recovery attempt does not finish 
in time and we initiate lease recovery attempt #2
   i) Either we rinse and repeat c), or
   ii) the node is now marked stale and the block is instantly recovered from 
the remaining two replicas

I think we could either adjust the 4-second timeout to, say, 8 seconds and 
mostly get the first attempt to succeed, or otherwise just wait for stale node 
detection, after which block recovery is fairly quick thanks to HDFS-4721. The 
sketch below works through the per-attempt arithmetic.
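
To put numbers on it, here is a rough worst-case estimate for a single recovery 
attempt under the settings above (a sketch; the 1-second sleep between connect 
attempts is my reading of the IPC client's retry behaviour):

    // Worst-case time for the primary to give up on the dead replica, per recovery attempt.
    int dfsSocketTimeoutMs = 3000;   // dfs.socket.timeout: suspended-process case
    int connectTimeoutMs   = 1000;   // ipc.client.connect.timeout
    int connectAttempts    = 3;      // 1 initial attempt + 2 retries on timeout
    int retrySleepMs       = 1000;   // assumed sleep between connect attempts
    int firewalledCaseMs   = connectAttempts * (connectTimeoutMs + retrySleepMs); // 6,000 ms
    int worstCaseMs        = Math.max(dfsSocketTimeoutMs, firewalledCaseMs);      // 6,000 ms
    // A 4-second lease recovery timeout cannot cover this, which is why the first
    // attempt is expected to fail; ~8 seconds would comfortably cover one attempt.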

I will try to test these values tomorrow by rebooting some nodes...

                
> HBASE-8354 forces Namenode into loop with lease recovery requests
> -----------------------------------------------------------------
>
>                 Key: HBASE-8389
>                 URL: https://issues.apache.org/jira/browse/HBASE-8389
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Varun Sharma
>            Assignee: Varun Sharma
>            Priority: Critical
>             Fix For: 0.94.8
>
>         Attachments: 8389-0.94.txt, 8389-0.94-v2.txt, 8389-0.94-v3.txt, 
> 8389-0.94-v4.txt, 8389-0.94-v5.txt, 8389-0.94-v6.txt, 8389-trunk-v1.txt, 
> 8389-trunk-v2.patch, 8389-trunk-v2.txt, 8389-trunk-v3.txt, nn1.log, nn.log, 
> sample.patch
>
>
> We ran HBase 0.94.3 patched with HBASE-8354 and observed too many outstanding 
> lease recoveries because of the short retry interval of 1 second between 
> lease recoveries.
> The namenode gets into the following loop:
> 1) Receives a lease recovery request and initiates recovery, choosing a 
> primary datanode, every second
> 2) A lease recovery is successful and the namenode tries to commit the block 
> under recovery as finalized - this takes < 10 seconds in our environment 
> since we run with tight HDFS socket timeouts.
> 3) At step 2), there is a more recent recovery enqueued because of the 
> aggressive retries. This causes the committed block to get preempted, and we 
> enter a vicious cycle.
> So we do: <initiate_recovery> --> <commit_block> --> 
> <commit_preempted_by_another_recovery>
> This loop is paused after 300 seconds, which is the 
> "hbase.lease.recovery.timeout". Hence the MTTR we are observing is 5 minutes, 
> which is terrible. Our ZK session timeout is 30 seconds and the HDFS stale 
> node detection timeout is 20 seconds.
> Note that before the patch, we did not call recoverLease so aggressively - 
> also, it seems that the HDFS namenode is pretty dumb in that it keeps 
> initiating new recoveries for every call. Before the patch, we call 
> recoverLease, assume that the block was recovered, try to get the file, it 
> has zero length since it's under recovery, we fail the task and retry until 
> we get a non-zero length. So things just work.
> Fixes:
> 1) Expecting recovery to occur within 1 second is too aggressive. We need to 
> have a more generous timeout. The timeout needs to be configurable since 
> typically, the recovery takes as much time as the DFS timeouts. The primary 
> datanode doing the recovery tries to reconcile the blocks and hits the 
> timeouts when it tries to contact the dead node. So the recovery is as fast 
> as the HDFS timeouts.
> 2) We have another issue I reported in HDFS-4721. The Namenode chooses the 
> stale datanode to perform the recovery (since it's still alive). Hence the 
> first recovery request is bound to fail. So if we want a tight MTTR, we 
> either need something like HDFS-4721 or we need something like this
>   recoverLease(...)
>   sleep(1000)
>   recoverLease(...)
>   sleep(configuredTimeout)
>   recoverLease(...)
>   sleep(configuredTimeout)
> Where configuredTimeout should be large enough to let the recovery happen but 
> the first timeout is short so that we get past the moot recovery in step #1.
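>
> A minimal sketch of that pattern against the HDFS client API (assuming 
> DistributedFileSystem#recoverLease(Path); firstPauseMs, configuredTimeoutMs 
> and the WAL path value are illustrative, not actual configuration keys, and 
> IOException handling is omitted):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.hdfs.DistributedFileSystem;
>
>   // Short first pause to get past the doomed first recovery, then longer waits.
>   Configuration conf = new Configuration();
>   DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
>   Path walPath = new Path("/hbase/.logs/example-wal");  // illustrative WAL file path
>   long firstPauseMs = 1000;          // short: the first attempt is expected to fail
>   long configuredTimeoutMs = 8000;   // long enough for a real recovery attempt
>   long pause = firstPauseMs;
>   boolean recovered = dfs.recoverLease(walPath);
>   while (!recovered) {
>     try {
>       Thread.sleep(pause);
>     } catch (InterruptedException ie) {
>       Thread.currentThread().interrupt();
>       break;
>     }
>     recovered = dfs.recoverLease(walPath);
>     pause = configuredTimeoutMs;     // subsequent waits use the larger timeout
>   }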
>  
