[ https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501838#comment-13501838 ]

nkeywal commented on HBASE-5843:
--------------------------------

New scenario on a datanode issue during a WAL write:

Scenario: with a replication factor of 2, start 2 DNs and 1 RS, then do a first 
put. Start a new DN and unplug the second one. Do another put and measure the 
time of this second put (a timing sketch is below).
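
Here is a minimal sketch of how the second put can be timed with the HBase 
client API. The table/family names and the timing code are illustrative, not 
the actual test code; the table is assumed to already exist.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TimedPut {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "test");   // pre-created table with family 'f'
      table.setAutoFlush(true);                  // each put hits the WAL immediately

      put(table, "row1");                        // first put: 2 DNs + 1 RS alive

      // start a new DN and unplug the second one here, then:
      long start = System.currentTimeMillis();
      put(table, "row2");                        // the measured second put
      System.out.println("second put: " +
          (System.currentTimeMillis() - start) + " ms");
      table.close();
    }

    private static void put(HTable table, String row) throws Exception {
      Put p = new Put(Bytes.toBytes(row));
      p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      table.put(p);
    }
  }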

HBase trunk / HDFS 1.1: ~5 minutes
HBase trunk / HDFS 2 branch: ~40 seconds
HBase trunk / HDFS 2.0.2-alpha-rc3: ~40 seconds


The time in HDFS 1.1 is spent in:
~66 seconds: waiting for the connection timeout (SocketTimeoutException: 66000 
millis while waiting for the channel to be ready for read).
Then we have two nested retry loops:
- 6 retries: Failed recovery attempt #0 from primary datanode x.y.z.w:11011 -> 
NoRouteToHostException
- 10 sub-retries: Retrying connect to server: deadbox/x.y.z.w:11021. Already 
tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

There are more or less 4 seconds between two sub-retries, so the total time is 
around:
66 + 6 * (~4 * 10) = ~300 seconds. That's our 5 minutes.

If we change the HDFS code to use "RetryPolicies.TRY_ONCE_THEN_FAIL" instead of 
the default "RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 
SECONDS)", the put succeeds in ~80 seconds. A sketch of the two policies 
follows.
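
For reference, a sketch of the two policies, just showing how they are built 
with org.apache.hadoop.io.retry.RetryPolicies; it does not show where DFSClient 
actually installs them.

  import java.util.concurrent.TimeUnit;

  import org.apache.hadoop.io.retry.RetryPolicies;
  import org.apache.hadoop.io.retry.RetryPolicy;

  public class RetryPolicySketch {
    public static void main(String[] args) {
      // Default used by the ipc.Client when (re)connecting to the dead
      // datanode: up to 10 attempts with a fixed 1 second sleep between them.
      RetryPolicy slow =
          RetryPolicies.retryUpToMaximumCountWithFixedSleep(10, 1, TimeUnit.SECONDS);

      // The modification tested above: give up after the first failed connect.
      RetryPolicy fast = RetryPolicies.TRY_ONCE_THEN_FAIL;

      System.out.println(slow + " / " + fast);
    }
  }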

Conclusion:
- the time with HDFS 2.x is in line with what we have for other scenarios 
(~40s), so it's acceptable today.
- the time with HDFS 1.x is much less satisfying (~5 minutes!), but it could 
easily be decreased to ~80s with an HDFS modification.

Some points to think about:
- Maybe we could decrease the timeout for the WAL: we usually write much less 
data than for a memstore flush, so more aggressive settings for the WAL make 
sense. There is a (bad) side effect: we may get more false positives, which 
could decrease performance and would increase the workload when the cluster is 
globally unstable. So in the long term it makes sense, but maybe today it is 
too early (see the configuration sketch after this list).
- While the Namenode will consider the datanode as stale after 30s, we still 
keep trying. Again, it makes sense to lower the global workload, but it's a 
little bit annoying... There could be optimizations if the datanode state were 
shared with the DFSClients.
- There are some cases that could be handled faster: ConnectionRefused means 
the box is there but the port is not open, so there is no need to retry here. 
NoRouteToHostException could be considered critical enough to stop trying as 
well. Again, this is trading global workload against reactivity.
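
A rough sketch of what more aggressive WAL-only settings could look like, 
assuming we give the WAL writer its own Configuration. The key names are the 
usual HDFS client socket timeouts, and the 10s values are purely illustrative.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class WalTimeoutSketch {
    public static void main(String[] args) {
      // Dedicated Configuration for the WAL writer only, so the aggressive
      // timeouts do not affect memstore flushes or compactions.
      Configuration walConf = HBaseConfiguration.create();

      // HDFS client read/connect timeout: 60s by default, plus a small
      // per-datanode extension (hence the ~66s observed above).
      walConf.setInt("dfs.client.socket-timeout", 10000);  // HDFS 2 key
      walConf.setInt("dfs.socket.timeout", 10000);         // HDFS 1 key

      // Write timeout on the datanode pipeline, lowered as well.
      walConf.setInt("dfs.datanode.socket.write.timeout", 10000);

      // A FileSystem created from walConf would then be used only for the WAL.
      System.out.println(walConf.get("dfs.client.socket-timeout"));
    }
  }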

                
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>
> A part of the approach is described here: 
> https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by an added delay to execute a 
> query, whatever the failure.
> - this delay is always less than 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks such as stopping/starting a cluster.
