Oops, my bad, the related JIRA was: https://issues.apache.org/jira/browse/HBASE-2161
I am suggesting that we remove the special client-side code in
loadCache() of ClientScanner that traps the UnknownScannerException and
then deliberately checks whether it comes from a lease timeout (and not
from a region move), in order to throw a ScannerTimeoutException instead
of letting the code continue, resetting the scanner, and restarting from
the last successful retrieve (the way it works for an
UnknownScannerException due to a region move).
By just removing the special handling that tries to single out an
UnknownScannerException due to a lease timeout, we should have a
resolution to HBASE-2161, and to our Trafodion issue.
We are still protected against dead clients that would cause a resource
leak at the region server, since we keep the lease timeout mechanism.
Not sure if I have overlooked something, since code is usually there for
a reason :-)...
Regards,
Eric

-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
Sent: Thursday, August 27, 2015 3:23 PM
To: HBase Dev List <dev@hbase.apache.org>
Subject: Re: Question on hbase.client.scanner.timeout.period

On Tue, Aug 25, 2015 at 8:03 AM, Eric Owhadi <eric.owh...@esgyn.com> wrote:

> Hello St.Ack,
> Thanks for your pointer, but I had already investigated JIRA
> https://issues.apache.org/jira/browse/HBASE-13090
> Unfortunately, this heartbeat will protect against rpc timeout, not
> the server side lease timeout that we are experiencing right now. I have
> not seen an active JIRA fixing our issue.
> Only https://issues.apache.org/jira/browse/HBASE-6121 is complaining
> about the exact same issue, but was never resolved.
>
>
Which issue? https://issues.apache.org/jira/browse/HBASE-6121 seems unrelated.

> The heartbeat JIRA in 13090 protects against the situation where the
> server scanner takes so long to retrieve the highly filtered
> information that it exceeds the RPC timeout (hbase.rpc.timeout).
> The timeout we are experiencing is the
> hbase.client.scanner.timeout.period,
> also known by its deprecated name hbase.regionserver.lease.period. The
> mechanism is different: here, region server scanners want to protect
> themselves against dead clients that would not perform "close", and
> allow releasing server side scanner resources. To do that, a lease
> mechanism is implemented, and if more than
> hbase.regionserver.lease.period elapses between 2 next() calls, the
> server side scanner will have been force-closed by this lease timeout
> safety mechanism. On a late next() call, the client will receive a
> DNRIOE of type UnknownScannerException, and the client will assess that
> it is most likely coming from the lease timeout (and not from a region
> move), therefore throwing an exception instead of resetting the scanner
> (as it does in the region move scenario).
>
> HBase 1.1 does not address, as far as I have researched, the
> hbase.client.scanner.timeout.period issue we are facing.
>
>
Can you not have the high-level query that is being fed by a scan do
HBASE-13333? That is, tickle the ongoing scan on occasion just to say
that I'm still alive?

Otherwise, what would you suggest? A scan that does not time out? Or the
client being able to set a timeout in the Scan passed to the server?

Sorry for late reply,
St.Ack

> And yes, we will move to HBase 1.1, and 1.0, as Cloudera and
> Hortonworks have a version mismatch on the next official builds
> Trafodion will support.
>
> So my question is still open?
>
> Best regards,
> Eric Owhadi
>
>
>
> -----Original Message-----
> From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
> Sent: Monday, August 24, 2015 11:07 PM
> To: HBase Dev List
> Subject: Re: Question on hbase.client.scanner.timeout.period
>
> On Mon, Aug 24, 2015 at 4:48 PM, Eric Owhadi <eric.owh...@esgyn.com> wrote:
>
> > Hello everyone,
> > We have been facing a situation on Trafodion where we are hitting
> > the hbase.client.scanner.timeout.period scenario:
> > basically, when doing queries that require spilling to disk because
> > of the high complexity of what is involved, the underlying HBase
> > scanner serving one of the operations involved in the complex query
> > cannot call next() within the specified timeout... too busy taking
> > care of other business.
> > This is a legit scenario, and I was wondering why, in the code,
> > special care is taken to make sure that on the client side, if a
> > DNRIOE of type UnknownScannerException shows up and the
> > hbase.client.scanner.timeout.period time has elapsed, we throw a
> > ScannerTimeoutException, instead of just letting it go and resetting
> > the scanner.
> >
>
> Scanners were redone in hbase 1.1. Can Trafodion come up onto hbase 1.1?
> See https://blogs.apache.org/hbase/entry/scan_improvements_in_hbase_1
> for summary.
> St.Ack
>
>
> >
> > I imagine that the lease timeout implementation on the region server
> > side is supposed to protect against resource leaks of scanner objects
> > server side. But I am not sure why we would make the client side
> > throw this timeout exception, when in fact what just happened was
> > that the client was too busy to call next() on time.
> >
> > I am sure there is a reason, but cannot figure it out :-).
> >
> > BTW, I found this JIRA, talking about the exact same thing:
> > https://issues.apache.org/jira/browse/HBASE-6121 but with no resolution.
> >
> >
> > Any help understanding the reason for the timeout thrown client side
> > instead of an automatic reset would be much appreciated,
> > Best regards,
> > Eric Owhadi
> >
>
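
To make the two behaviors discussed in this thread concrete, here is a minimal, self-contained Java sketch. It is a toy simulation only and does not use or reproduce the HBase client code: ToyServer, Policy, scan() and the two exception classes are hypothetical stand-ins invented for illustration, and time is passed in explicitly rather than read from a clock. THROW_ON_LEASE_TIMEOUT mimics the behavior described above (an unknown scanner after the timeout has elapsed becomes a timeout exception), while ALWAYS_RESET mimics the proposal (reopen the scanner after the last row successfully retrieved, as for a region move).

```java
import java.util.ArrayList;
import java.util.List;

public class ScannerLeaseSketch {

    /** Hypothetical stand-in for the "scanner no longer exists on the server" error. */
    static class UnknownScannerException extends RuntimeException {}

    /** Hypothetical stand-in for the client-side timeout error. */
    static class ScannerTimeoutException extends RuntimeException {
        ScannerTimeoutException(String message) { super(message); }
    }

    /** Toy region server holding a single scanner guarded by a lease. */
    static class ToyServer {
        final long leasePeriodMs;
        boolean open;
        int nextRow;
        long lastAccessMs;
        int scannersOpened;

        ToyServer(long leasePeriodMs) { this.leasePeriodMs = leasePeriodMs; }

        void openScanner(int startRow, long nowMs) {
            open = true;
            nextRow = startRow;
            lastAccessMs = nowMs;
            scannersOpened++;
        }

        int next(long nowMs) {
            // Lease check: a scanner left idle for too long is force-closed,
            // which protects the server against clients that never call close.
            if (open && nowMs - lastAccessMs > leasePeriodMs) {
                open = false;
            }
            if (!open) {
                throw new UnknownScannerException();
            }
            lastAccessMs = nowMs;
            return nextRow++;
        }
    }

    enum Policy { THROW_ON_LEASE_TIMEOUT, ALWAYS_RESET }

    /** Calls next() once per entry of callTimesMs and returns the rows read. */
    static List<Integer> scan(ToyServer server, Policy policy, long timeoutMs, long[] callTimesMs) {
        List<Integer> rows = new ArrayList<>();
        if (callTimesMs.length == 0) {
            return rows;
        }
        int lastRow = -1;
        long lastNextMs = callTimesMs[0];
        server.openScanner(0, lastNextMs);
        for (long now : callTimesMs) {
            try {
                lastRow = server.next(now);
            } catch (UnknownScannerException e) {
                // Behavior questioned in the thread: treat a late call as fatal.
                if (policy == Policy.THROW_ON_LEASE_TIMEOUT && now - lastNextMs > timeoutMs) {
                    throw new ScannerTimeoutException(
                        (now - lastNextMs) + "ms passed since the last next() call");
                }
                // Region-move style recovery: reopen after the last row we got.
                server.openScanner(lastRow + 1, now);
                lastRow = server.next(now);
            }
            rows.add(lastRow);
            lastNextMs = now;
        }
        return rows;
    }

    public static void main(String[] args) {
        // The client is "busy" between t=10ms and t=500ms; the lease is 100ms.
        long[] callTimesMs = {0L, 10L, 500L, 510L};

        ToyServer server = new ToyServer(100L);
        System.out.println(scan(server, Policy.ALWAYS_RESET, 100L, callTimesMs));
        System.out.println("scanners opened: " + server.scannersOpened);

        try {
            scan(new ToyServer(100L), Policy.THROW_ON_LEASE_TIMEOUT, 100L, callTimesMs);
        } catch (ScannerTimeoutException e) {
            System.out.println("timeout: " + e.getMessage());
        }
    }
}
```

In this sketch the server-side lease still closes the idle scanner under both policies, so the protection against dead clients is unchanged; the only difference is whether the late client sees an exception or a transparently reopened scanner. Whether the real client can always resume safely from the last retrieved row is exactly the open question raised above, and the sketch does not answer it.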