[ https://issues.apache.org/jira/browse/HBASE-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665613#comment-13665613 ]
Nicolas Liochon commented on HBASE-8449:
----------------------------------------

Increase the hbase.lease.recovery.timeout default to 15 minutes, i.e. more than a standard hdfs recovery.

hbase.lease.recovery.dfs.timeout: it should not be less than 10s imho. It's not only a question of dfs timeout; the NN also does not seem to like multiple calls to recoverLease. I tested multiple calls again, and the datanode logs were complaining about "situation that should never occurs". Ok, that was with multiple calls at an interval of 1 second, but it seems to be all luck.

+ * 1. Call recoverLease.
+ * 2. If it returns true, break.
+ * 3. If it returns false, wait a few seconds and then call it again.
+ * 4. If it returns true, break.
+ * 5. If it returns false, wait for what we think the datanode socket timeout is
+ * (configurable) and then try again.
+ * 6. If it returns true, break.
+ * 7. If it returns false, repeat starting at step 5. above.

I would propose:

the master:
- if HDFS-4754 is there, the master marks the node as stale as the first step of the recovery.
- the master calls recoverLease as a part of the distributed split. We can enhance it in another jira to give higher priority to closed wals vs. wals being recovered.

the region server:
- calls isFileClosed, if it's there. If true, returns.
- calls recoverLease. If true, returns.
- if isFileClosed is available, loops on it with a 1s sleep.
- else loops on recoverLease with a 70s (configurable) sleep.

> Refactor recoverLease retries and pauses informed by findings over in hbase-8389
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-8449
>                 URL: https://issues.apache.org/jira/browse/HBASE-8449
>             Project: HBase
>          Issue Type: Bug
>          Components: Filesystem Integration
>   Affects Versions: 0.94.7, 0.95.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.95.1
>
>         Attachments: 8449.txt, 8449v2.txt, 8449v3.txt, 8449v4.txt
>
>
> HBASE-8359 is an interesting issue that roams near and far. This issue is
> about making use of the findings handily summarized on the end of hbase-8359
> which have it that trunk needs refactor around how it does its recoverLease
> handling (and that the patch committed against HBASE-8359 is not what we want
> going forward).
> This issue is about making a patch that adds a lag between recoverLease
> invocations where the lag is related to dfs timeouts -- the hdfs-side dfs
> timeout -- and optionally makes use of the isFileClosed API if it is
> available (a facility that is not yet committed to a branch near you and
> unlikely to be within your locality with a good while to come).
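
The region-server steps proposed in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch attached to the issue: the `LeaseClient` interface and the `maxLoops` bound are hypothetical names introduced here so the example runs without a cluster. In real code the two calls would presumably map to `DistributedFileSystem.recoverLease(Path)` and, where the HDFS version has it, `isFileClosed(Path)`, and the loop would more likely be bounded by hbase.lease.recovery.timeout than by an iteration count.

```java
final class LeaseRecoverySketch {

  /** Hypothetical stand-in for the HDFS calls; not a real HBase or HDFS type. */
  public interface LeaseClient {
    boolean recoverLease();          // true once the lease is recovered and the file is closed
    boolean supportsIsFileClosed();  // false on HDFS versions without the isFileClosed API
    boolean isFileClosed();          // only called when supportsIsFileClosed() is true
  }

  /**
   * Follows the proposed region-server sequence:
   * check isFileClosed (if available), call recoverLease once, then either
   * poll isFileClosed with a short sleep or re-call recoverLease with a long one.
   * maxLoops is an assumption of this sketch, used to keep the loop finite.
   */
  public static boolean recover(LeaseClient client, long closedPollMs,
                                long recoverLeasePauseMs, int maxLoops) {
    boolean canPoll = client.supportsIsFileClosed();
    if (canPoll && client.isFileClosed()) {
      return true;                   // nothing to recover
    }
    if (client.recoverLease()) {
      return true;                   // recovered on the first call
    }
    try {
      for (int i = 0; i < maxLoops; i++) {
        if (canPoll) {
          Thread.sleep(closedPollMs);        // 1s in the proposal
          if (client.isFileClosed()) {
            return true;
          }
        } else {
          Thread.sleep(recoverLeasePauseMs); // 70s (configurable) in the proposal
          if (client.recoverLease()) {
            return true;
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();    // preserve the interrupt for the caller
    }
    return false;                    // gave up; the caller decides what to do next
  }

  public static void main(String[] args) {
    final int[] polls = {0};
    // Fake client: the file reports closed on the third isFileClosed call.
    LeaseClient fake = new LeaseClient() {
      public boolean recoverLease() { return false; }
      public boolean supportsIsFileClosed() { return true; }
      public boolean isFileClosed() { return ++polls[0] >= 3; }
    };
    // Short sleeps so the demo finishes quickly.
    System.out.println("recovered=" + recover(fake, 10L, 10L, 5));
  }
}
```

Note that when isFileClosed is available this sketch calls recoverLease only once, which fits the observation in the comment that the NN does not seem to like repeated recoverLease calls.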