[ 
https://issues.apache.org/jira/browse/HBASE-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665613#comment-13665613
 ] 

Nicolas Liochon commented on HBASE-8449:
----------------------------------------

Increase the hbase.lease.recovery.timeout default to 15 minutes, i.e. more than a 
standard hdfs recovery.
hbase.lease.recovery.dfs.timeout: it should not be less than 10s imho. It's not 
only a question of the dfs timeout; the NN also seems not to like multiple 
calls to recoverLease. I tested multiple calls again, and the datanode logs 
were complaining about "situation that should never occurs". OK, that was with 
multiple calls at an interval of 1 second, but it seems to be a matter of 
luck.

+   * 1. Call recoverLease.
+   * 2. If it returns true, break.
+   * 3. If it returns false, wait a few seconds and then call it again.
+   * 4. If it returns true, break.
+   * 5. If it returns false, wait for what we think the datanode socket timeout is
+   * (configurable) and then try again.
+   * 6. If it returns true, break.
+   * 7. If it returns false, repeat starting at step 5. above.
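A rough Java sketch of the seven steps quoted above, to make the pause 
schedule concrete. This is not the code from the patch: the class name, the 
method name, the injected sleeper and the maxAttempts bound are all 
assumptions made for illustration (the quoted steps loop without a bound). 
The BooleanSupplier stands in for a call such as 
DistributedFileSystem.recoverLease(Path).

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Hypothetical sketch of the quoted retry steps; not the HBase implementation.
final class RecoverLeaseRetrySketch {

    /**
     * Calls recoverLease until it returns true or maxAttempts is reached.
     * Returns the number of calls made on success, or -1 if every call failed.
     * maxAttempts is an assumed safety bound that the quoted steps do not have.
     */
    public static int recoverWithPauses(BooleanSupplier recoverLease,
                                        LongConsumer sleeper,
                                        long firstPauseMs,
                                        long dfsTimeoutMs,
                                        int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (recoverLease.getAsBoolean()) {
                return attempt;                      // steps 2, 4 and 6: done
            }
            if (attempt == maxAttempts) {
                break;                               // no point pausing after the last call
            }
            // Step 3: short pause after the first failure.
            // Steps 5 and 7: pause for the assumed datanode socket timeout afterwards.
            sleeper.accept(attempt == 1 ? firstPauseMs : dfsTimeoutMs);
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake recoverLease that succeeds on the third call; the sleeper only logs.
        int used = recoverWithPauses(
            () -> ++calls[0] >= 3,
            ms -> System.out.println("would sleep " + ms + " ms"),
            4_000L, 61_000L, 10);
        System.out.println("recoverLease calls: " + used);
    }
}
```

Passing the sleeper in keeps the schedule testable without real waits; a real 
caller would pass something that wraps Thread.sleep and handles interruption.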


I would propose:
the master:
   - If HDFS-4754 is there, the master marks the node as stale as the first 
step of the recovery.
   - The master calls recoverLease as a part of the distributed split. We can 
enhance it in another JIRA to give higher priority to closed WALs vs. WALs 
being recovered.

the region server:
    - Calls isFileClosed, if it's there. If true, return.
    - Calls recoverLease. If true, return.
    - If isFileClosed is available, loops on it with a 1s sleep.
    - Else loops on recoverLease with a 70s (configurable) sleep.
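
The region server side of the proposal could look roughly like the sketch 
below. Everything named here is hypothetical: LeaseOps is a stand-in for the 
HDFS calls (recoverLease, plus isFileClosed where the HDFS version offers it), 
the hasIsFileClosed flag replaces whatever availability check the real code 
would use, and maxWaits is an assumed bound that the proposal itself does not 
mention.

```java
import java.util.function.LongConsumer;

// Hypothetical sketch of the proposed region server flow; not HBase code.
final class RegionServerLeaseRecoverySketch {

    /** Stand-in for the HDFS operations on one WAL file. */
    public interface LeaseOps {
        boolean recoverLease();
        boolean isFileClosed();
    }

    static final long CLOSED_POLL_MS = 1_000L;     // "1s sleep" from the proposal
    static final long RECOVER_PAUSE_MS = 70_000L;  // "70s (configurable)" from the proposal

    /** Returns true once the file is known to be closed or the lease is recovered. */
    public static boolean recover(LeaseOps fs, boolean hasIsFileClosed,
                                  LongConsumer sleeper, int maxWaits) {
        // Step 1: if isFileClosed exists and the file is already closed, we are done.
        if (hasIsFileClosed && fs.isFileClosed()) {
            return true;
        }
        // Step 2: a single recoverLease call; true means the lease is recovered.
        if (fs.recoverLease()) {
            return true;
        }
        for (int i = 0; i < maxWaits; i++) {
            if (hasIsFileClosed) {
                // Step 3: poll isFileClosed instead of calling recoverLease again.
                sleeper.accept(CLOSED_POLL_MS);
                if (fs.isFileClosed()) {
                    return true;
                }
            } else {
                // Step 4: no isFileClosed, so retry recoverLease with a long pause.
                sleeper.accept(RECOVER_PAUSE_MS);
                if (fs.recoverLease()) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With isFileClosed available, recoverLease is called exactly once, which 
matches the concern above about the NN not liking repeated recoverLease calls.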




                
> Refactor recoverLease retries and pauses informed by findings over in 
> hbase-8389
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-8449
>                 URL: https://issues.apache.org/jira/browse/HBASE-8449
>             Project: HBase
>          Issue Type: Bug
>          Components: Filesystem Integration
>    Affects Versions: 0.94.7, 0.95.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.95.1
>
>         Attachments: 8449.txt, 8449v2.txt, 8449v3.txt, 8449v4.txt
>
>
> HBASE-8359 is an interesting issue that roams near and far.  This issue is 
> about making use of the findings handily summarized on the end of hbase-8359 
> which have it that trunk needs a refactor around how it does its recoverLease 
> handling (and that the patch committed against HBASE-8359 is not what we want 
> going forward).
> This issue is about making a patch that adds a lag between recoverLease 
> invocations where the lag is related to dfs timeouts -- the hdfs-side dfs 
> timeout -- and optionally makes use of the isFileClosed API if it is 
> available (a facility that is not yet committed to a branch near you and 
> unlikely to be within your locality for a good while to come).
