Hello St.Ack,

Thanks for the pointer, but I had already looked into https://issues.apache.org/jira/browse/HBASE-13090. Unfortunately, that heartbeat protects against the RPC timeout, not against the server-side lease timeout that we are hitting right now. I have not seen an active JIRA that fixes our issue. Only https://issues.apache.org/jira/browse/HBASE-6121 complains about the exact same issue, and it was never resolved.

The heartbeat from HBASE-13090 covers the situation where the server-side scanner takes so long to retrieve heavily filtered data that it exceeds the RPC timeout (hbase.rpc.timeout). The timeout we are hitting is hbase.client.scanner.timeout.period, formerly known under the now-deprecated name hbase.regionserver.lease.period.

The mechanism is different: here, region server scanners want to protect themselves against dead clients that never perform "close", so that server-side scanner resources can be released. To do that, a lease mechanism is implemented: if more than hbase.regionserver.lease.period elapses between two next() calls, the lease expires and the server-side scanner is forcibly closed. On the late next() call, the client receives a DoNotRetryIOException (DNRIOE) of type UnknownScannerException. The client then concludes that it most likely comes from the lease timeout (and not from a region move), and therefore throws an exception instead of resetting the scanner (which is what it does in the region move scenario).

As far as I have researched, HBase 1.1 does not address the hbase.client.scanner.timeout.period issue we are facing. And yes, we will move to HBase 1.1, and also to 1.0, since Cloudera and Hortonworks have a version mismatch in the next official builds that Trafodion will support.

So my question is still open.
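
To make the behaviour concrete, here is a toy model of the lease mechanism as I understand it. It is only a sketch in Python, not HBase code: ToyRegionServer, ToyClientScanner, forget() and the two exception classes are names made up for illustration, and time is passed in explicitly (in milliseconds) so the sequence of events is easy to follow.

```python
class UnknownScannerError(Exception):
    """Toy stand-in for the server-side UnknownScannerException."""


class ScannerTimeoutError(Exception):
    """Toy stand-in for the timeout exception thrown on the client side."""


class ToyRegionServer:
    """Hypothetical server: one lease per open scanner."""

    def __init__(self, lease_period_ms):
        self.lease_period_ms = lease_period_ms
        self._last_touch = {}  # scanner id -> time of the last call (ms)
        self._next_id = 0

    def open_scanner(self, now_ms):
        self._next_id += 1
        self._last_touch[self._next_id] = now_ms
        return self._next_id

    def forget(self, scanner_id):
        # Simulates losing the scanner for another reason, e.g. a region move.
        self._last_touch.pop(scanner_id, None)

    def next(self, scanner_id, now_ms):
        # Expire every lease that has run out (the guard against dead clients).
        for sid, touched in list(self._last_touch.items()):
            if now_ms - touched > self.lease_period_ms:
                del self._last_touch[sid]
        if scanner_id not in self._last_touch:
            raise UnknownScannerError(scanner_id)
        self._last_touch[scanner_id] = now_ms  # a successful next() renews the lease
        return "row"


class ToyClientScanner:
    """Hypothetical client: resets on an unknown scanner unless the timeout elapsed."""

    def __init__(self, server, timeout_period_ms, now_ms):
        self.server = server
        self.timeout_period_ms = timeout_period_ms
        self.scanner_id = server.open_scanner(now_ms)
        self.last_next_ms = now_ms
        self.resets = 0

    def next(self, now_ms):
        try:
            row = self.server.next(self.scanner_id, now_ms)
        except UnknownScannerError:
            elapsed = now_ms - self.last_next_ms
            if elapsed > self.timeout_period_ms:
                # The client assumes the lease expired and gives up.
                raise ScannerTimeoutError(f"{elapsed} ms since the last next()")
            # Otherwise it assumes something like a region move: reopen and retry.
            self.scanner_id = self.server.open_scanner(now_ms)
            self.resets += 1
            row = self.server.next(self.scanner_id, now_ms)
        self.last_next_ms = now_ms
        return row
```

In this model, a client that is merely slow (busy spilling to disk, in our case) ends up in the ScannerTimeoutError branch, while a scanner lost through forget() is silently reopened. My question is why the first case could not take the same reset path as the second.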

Best regards,
Eric Owhadi

-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
Sent: Monday, August 24, 2015 11:07 PM
To: HBase Dev List
Subject: Re: Question on hbase.client.scanner.timeout.period

On Mon, Aug 24, 2015 at 4:48 PM, Eric Owhadi <eric.owh...@esgyn.com> wrote:

> Hello everyone,
> We have been facing a situation on trafodion, where we are hitting the
> hbase.client.scanner.timeout.period scenario:
> basically, when doing queries that require spilling to disk because of
> high complexity of what is involved, the underlying hbase scanner
> serving one of the operation involved in the complex query cannot call
> the next() within the timeout specified... too busy taking care of other
> business.
> This is legit scenario, and I was wondering why in the code, special
> care is done to make sure that client side, if a DNRIOE of type
> unknownScannerException shows up, and the
> hbase.client.scanner.timeout.period time elapsed, we make sure to
> throw a scannerTimeoutException, instead of just let it go and reset
> scanner.
>

Scanners were redone in hbase 1.1. Can Trafodion come up onto hbase 1.1? See https://blogs.apache.org/hbase/entry/scan_improvements_in_hbase_1 for summary.

St.Ack

> I imagine that the lease time out implementation on region server side
> is supposed to protect from resource leak of scanner object server
> side. But I am not sure why we would make it so that client side throw
> this timeout exception, when in fact what just happened was that
> client was too busy to call next() on time.
>
> I am sure there is a reason, but cannot figure it out :-).
>
> BTW, I found this JIRA, talking about exact same thing:
> https://issues.apache.org/jira/browse/HBASE-6121 but with no resolution.
>
> Any help understanding the reason of the timeout thrown client side
> instead of an automatic reset would be much appreciated,
> Best regards,
> Eric Owhadi
>