[ 
https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011203#comment-13011203
 ] 

Sean Sechrist commented on HBASE-3686:
--------------------------------------

I did a little more testing and it turns out this problem isn't limited to the 
misconfiguration.

You'll also lose rows if you kill -9 a region server in the middle of a scan. 
In HTable.ClientScanner.next(), there's a skipFirst boolean that is supposed to 
skip the first row that was "already let out on a previous invocation". But 
instead of just skipping the first row, 
getConnection().getRegionServerWithRetries(callable) is called an extra time, 
which will skip [caching] rows.

So I think fixing it to skip only 1 row will also fix the problem when there's 
a misconfiguration, and sending the timeout to the server won't be needed.
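
To illustrate the difference, here is a toy model, not the actual HBase 
client code. Rows are plain integers, fetch() is a hypothetical stand-in for 
one scanner call that returns up to [caching] rows, and the scanner is lost 
once and reopened at the last row already returned. The class and method names 
are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: rows are the integers 0..totalRows-1.
class SkipFirstSketch {

    // Hypothetical stand-in for one scanner call: up to `caching` rows from startRow.
    static List<Integer> fetch(int startRow, int caching, int totalRows) {
        List<Integer> batch = new ArrayList<>();
        for (int r = startRow; r < Math.min(startRow + caching, totalRows); r++) {
            batch.add(r);
        }
        return batch;
    }

    // Scans every row. The scanner is lost once, after at least failAfterRows
    // rows were handed out, and is reopened at the last row already returned.
    static List<Integer> scan(int totalRows, int caching, int failAfterRows, boolean buggy) {
        List<Integer> seen = new ArrayList<>();
        int next = 0;           // next row the server-side scanner would return
        boolean failed = false; // the simulated failure happens at most once
        while (next < totalRows) {
            List<Integer> batch;
            if (!failed && !seen.isEmpty() && seen.size() >= failAfterRows) {
                failed = true;
                next = seen.get(seen.size() - 1); // restart row is inclusive
                batch = fetch(next, caching, totalRows);
                next += batch.size();
                if (buggy) {
                    // Bug being modelled: the whole batch is thrown away
                    // by fetching again, instead of dropping one row.
                    batch = fetch(next, caching, totalRows);
                    next += batch.size();
                } else {
                    // Intended behaviour: drop only the row already returned.
                    batch = batch.subList(1, batch.size());
                }
            } else {
                batch = fetch(next, caching, totalRows);
                next += batch.size();
            }
            seen.addAll(batch);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("fixed: " + scan(20, 5, 5, false).size() + " rows");
        System.out.println("buggy: " + scan(20, 5, 5, true).size() + " rows");
    }
}
```

With 20 rows and a caching value of 5, the fixed path returns all 20 rows in 
this model, while the buggy path returns 16 because rows 5 through 8 are 
discarded along with the already-seen row 4.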

> Scanner timeout on RegionServer but Client won't know what happened
> -------------------------------------------------------------------
>
>                 Key: HBASE-3686
>                 URL: https://issues.apache.org/jira/browse/HBASE-3686
>             Project: HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.89.20100924
>            Reporter: Sean Sechrist
>            Priority: Minor
>
> This can cause rows to be lost from a scan.
> See this thread where the issue was brought up: 
> http://search-hadoop.com/m/xITBQ136xGJ1
> If hbase.regionserver.lease.period is higher on the client than the server we 
> can get this series of events: 
> 1. Client is scanning along happily, and does something slow.
> 2. Scanner times out on region server
> 3. Client calls HTable.ClientScanner.next()
> 4. The region server throws an UnknownScannerException
> 5. Client catches the exception and sees that the elapsed time is not longer 
> than its hbase.regionserver.lease.period config, so it doesn't throw a 
> ScannerTimeoutException. Instead, it treats it like a NSRE.
> Right now the workaround is to make sure the configs are consistent. 
> A possible fix would be to use whatever the region server's scanner timeout 
> is, rather than the local one.
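
The decision in step 5 can be sketched roughly as below. This is a simplified 
model under assumptions, not the client's real code: classify() and the Outcome 
enum are hypothetical names, and the 60 s / 120 s lease values are made-up 
numbers chosen only to show the mismatch.

```java
// Simplified model of the client-side check described in the issue.
class LeaseCheckSketch {

    enum Outcome { SCANNER_TIMEOUT, RETRY_AS_NSRE }

    // The client only reports a scanner timeout when the time since its last
    // call exceeds the lease period it was given; otherwise it retries.
    static Outcome classify(long elapsedMs, long leasePeriodMs) {
        return elapsedMs > leasePeriodMs ? Outcome.SCANNER_TIMEOUT : Outcome.RETRY_AS_NSRE;
    }

    public static void main(String[] args) {
        long serverLeaseMs = 60_000L;  // assumed server-side value
        long clientLeaseMs = 120_000L; // assumed (larger) client-side value
        long elapsedMs = 90_000L;      // client was slow for 90 s

        // Server has already expired the scanner (90 s > 60 s), but the
        // client compares against its own, larger value and retries.
        System.out.println("client config: " + classify(elapsedMs, clientLeaseMs));

        // Comparing against the server's value gives the timeout instead.
        System.out.println("server config: " + classify(elapsedMs, serverLeaseMs));
    }
}
```

In this model the first call yields RETRY_AS_NSRE and the second yields 
SCANNER_TIMEOUT, which is the gap the proposed fix is meant to close.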

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
