OK, will do. Not yet sure if it is easy; will know on Monday :-). I was struggling today to see how to regression test this without putting breakpoints in the trafodion code to simulate a busy client not calling next() on time...
Eric
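One possible way to exercise the lease expiry without breakpoints or real sleeps is to drive it from an injected clock. The sketch below is only a toy model and assumes nothing about HBase internals: FakeLeasedScanner, LeaseExpiredException and BusyClientSimulation are hypothetical names, and the lease check merely mimics the server-side behaviour described in this thread.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the error a client sees once the lease has expired. */
class LeaseExpiredException extends RuntimeException {
    LeaseExpiredException(String message) { super(message); }
}

/** Toy model of a server-side scanner guarded by a lease; this is not HBase code. */
class FakeLeasedScanner {
    private final AtomicLong clockMs;   // manual clock shared with the test
    private final long leasePeriodMs;   // plays the role of the scanner timeout period
    private final int rowCount;
    private long lastCallMs;
    private int nextRow = 0;

    FakeLeasedScanner(AtomicLong clockMs, long leasePeriodMs, int rowCount) {
        this.clockMs = clockMs;
        this.leasePeriodMs = leasePeriodMs;
        this.rowCount = rowCount;
        this.lastCallMs = clockMs.get(); // the lease starts when the scanner is opened
    }

    /** Returns the next row index, or -1 when exhausted; fails if the lease ran out. */
    int next() {
        long now = clockMs.get();
        if (now - lastCallMs > leasePeriodMs) {
            throw new LeaseExpiredException("no next() call for " + (now - lastCallMs) + " ms");
        }
        lastCallMs = now; // each successful call renews the lease
        return nextRow < rowCount ? nextRow++ : -1;
    }
}

/** Simulates a client that stays busy for busyMs of fake time after every row. */
class BusyClientSimulation {
    /** Returns true if the lease expired before the scan completed. */
    static boolean leaseExpires(long leasePeriodMs, long busyMs, int rowCount) {
        AtomicLong clock = new AtomicLong(0);
        FakeLeasedScanner scanner = new FakeLeasedScanner(clock, leasePeriodMs, rowCount);
        try {
            while (scanner.next() != -1) {
                clock.addAndGet(busyMs); // fake time passes; no sleep or breakpoint needed
            }
            return false;
        } catch (LeaseExpiredException e) {
            return true;
        }
    }
}
```

With a 60000 ms lease, BusyClientSimulation.leaseExpires(60000, 1000, 5) completes the scan, while leaseExpires(60000, 61000, 5) hits the expiry on the second next() call. A real regression test would still need a hook of this kind in the actual client or region server, which this sketch does not provide.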
-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
Sent: Friday, August 28, 2015 6:35 PM
To: HBase Dev List <dev@hbase.apache.org>
Subject: Re: Question on hbase.client.scanner.timeout.period

On Fri, Aug 28, 2015 at 11:31 AM, Eric Owhadi <eric.owh...@esgyn.com> wrote:

> That sounds good, but given trafodion needs to work on current and
> future released versions of HBase, unpatched, I will first implement a
> ClientScannerTrafodion (to be deprecated), inheriting from
> ClientScanner, that will just override loadCache(), and make sure
> that the code that picks the right scanner based on the scan
> object is bypassed to force getting the ClientScannerTrafodion when
> appropriate.
> Not very elegant, but I need to take trafodion deployment
> requirements into consideration.
> Then, if we do not discover any side effects during our QA related to
> this code, I will port the fix to HBase to deprecate the custom scanner
> (probably first on HBase 2.0, then will let the community decide if
> this fix is worth backporting...). It will be a first for me,
> but that's great, I'll take your offer to help ;-)...
>

Sweet. Suggest opening an umbrella issue in hbase to implement this feature. Reference HBASE-2161 (it is closed now). Link the trafodion issue to it. A subtask could have the implementation in hbase 2.0; another could be the backport.

Is it easy to insert your T*ClientScanner?
St.Ack

> Regards,
> Eric
>
> -----Original Message-----
> From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
> Sent: Thursday, August 27, 2015 3:55 PM
> To: HBase Dev List <dev@hbase.apache.org>
> Subject: Re: Question on hbase.client.scanner.timeout.period
>
> On Thu, Aug 27, 2015 at 1:39 PM, Eric Owhadi <eric.owh...@esgyn.com> wrote:
>
> > Oops, my bad, the related JIRA was:
> > https://issues.apache.org/jira/browse/HBASE-2161
> >
> > I am referring to the special client-side code in loadCache() of
> > ClientScanner that traps the UnknownScannerException and then
> > deliberately checks whether it comes from a lease timeout (and not
> > from a region move), in order to throw a ScannerTimeoutException
> > instead of letting the code go on, reset the scanner and restart
> > from the last successful retrieve (the way it works for an
> > UnknownScannerException due to a region move).
> > By just removing the special handling that tries to single out an
> > UnknownScannerException due to a lease timeout, we should have a
> > resolution to JIRA 2161, and to our trafodion issue.
> >
> > We are still protected against a dead client that would cause a
> > resource leak at the region server, since we keep the lease timeout
> > mechanism.
> >
> > Not sure if I have overlooked something, as usually, code is here
> > for a reason :-)...
> >
>
> Your proposal sounds good to me.
>
> Scanner works the way it does because it has always worked this way (smile).
> A while back, one of the lads suggested we do like dynamodb and have
> the scanner keep no state on the serverside; the scan next would just
> supply all necessary context. It was argued against because serverside
> setup is so costly. Your suggestion is similar, only we do it only if
> the Scanner has timed out.
>
> Suggest we keep the current semantic in 1.x at least. We could flip to
> your behavior in 2.x. Meantime, you'd have to ask for it when you set
> up your Scan object by setting a flag.
>
> Would that work? If you want to have a go at it, I could help out on
> the issue.
>
> St.Ack
>
> > Regards,
> > Eric
> >
> > -----Original Message-----
> > From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
> > Sent: Thursday, August 27, 2015 3:23 PM
> > To: HBase Dev List <dev@hbase.apache.org>
> > Subject: Re: Question on hbase.client.scanner.timeout.period
> >
> > On Tue, Aug 25, 2015 at 8:03 AM, Eric Owhadi <eric.owh...@esgyn.com> wrote:
> >
> > > Hello St.Ack,
> > > Thanks for your pointer, but I had already investigated JIRA
> > > https://issues.apache.org/jira/browse/HBASE-13090
> > > Unfortunately, this heartbeat will protect against the rpc timeout,
> > > not the server side lease timeout that we are experiencing right now.
> > > I have not seen an active JIRA fixing our issue.
> > > Only https://issues.apache.org/jira/browse/HBASE-6121 is
> > > complaining about the exact same issue, but was never resolved.
> > >
> >
> > Which issue? https://issues.apache.org/jira/browse/HBASE-6121 seems
> > unrelated.
> >
> > > The heartbeat JIRA in 13090 protects against the situation where the
> > > server scanner takes so long to retrieve highly filtered information
> > > that it exceeds the RPC timeout (hbase.rpc.timeout).
> > >
> > > The timeout we are experiencing is the
> > > hbase.client.scanner.timeout.period,
> > > also known by its deprecated name hbase.regionserver.lease.period. The
> > > mechanism is different: here, region server scanners want to
> > > protect themselves against dead clients that would not perform
> > > "close", and allow releasing server side scanner resources. To do
> > > that, a lease mechanism is implemented, and if more than
> > > hbase.regionserver.lease.period elapses between 2 next() calls, the
> > > server side scanner will have been forcibly closed by this lease
> > > timeout safety mechanism.
> > > On a late next() call, the client will receive a
> > > DNRIOE of type UnknownScannerException, and the client will assess
> > > that it is coming most likely from the lease timeout (and not from
> > > a region move), therefore throwing an exception instead of resetting
> > > the scanner (as it does in the region move scenario).
> > >
> > > HBase 1.1 does not address, as far as I have researched, the
> > > hbase.client.scanner.timeout.period issue we are facing.
> > >
> >
> > Can you not have the high-level query that is being fed by a scan do
> > HBASE-13333? That is, tickle the ongoing scan on occasion just to
> > say that I'm still alive?
> >
> > Otherwise, what would you suggest? A scan that does not time out? Or
> > the client being able to set a timeout in the Scan passed to the server?
> >
> > Sorry for late reply,
> > St.Ack
> >
> > > And yes, we will move to HBase 1.1, and 1.0 as Cloudera and
> > > Hortonworks are having version mismatch on the next official
> > > builds trafodion will support.
> > >
> > > So my question is still open?
> > >
> > > Best regards,
> > > Eric Owhadi
> > >
> > > -----Original Message-----
> > > From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
> > > Sent: Monday, August 24, 2015 11:07 PM
> > > To: HBase Dev List
> > > Subject: Re: Question on hbase.client.scanner.timeout.period
> > >
> > > On Mon, Aug 24, 2015 at 4:48 PM, Eric Owhadi <eric.owh...@esgyn.com> wrote:
> > >
> > > > Hello everyone,
> > > > We have been facing a situation on trafodion, where we are
> > > > hitting the hbase.client.scanner.timeout.period scenario:
> > > > basically, when doing queries that require spilling to disk
> > > > because of the high complexity of what is involved, the underlying
> > > > hbase scanner serving one of the operations involved in the
> > > > complex query cannot call next() within the timeout
> > > > specified... too busy taking care of other business.
> > > > This is a legit scenario, and I was wondering why, in the code,
> > > > special care is taken to make sure that client side, if a DNRIOE
> > > > of type UnknownScannerException shows up, and the
> > > > hbase.client.scanner.timeout.period time has elapsed, we make sure
> > > > to throw a ScannerTimeoutException, instead of just letting it go
> > > > and resetting the scanner.
> > > >
> > >
> > > Scanners were redone in hbase 1.1. Can Trafodion come up onto hbase 1.1?
> > > See
> > > https://blogs.apache.org/hbase/entry/scan_improvements_in_hbase_1
> > > for summary.
> > > St.Ack
> > >
> > > > I imagine that the lease timeout implementation on the region
> > > > server side is supposed to protect from a resource leak of scanner
> > > > objects server side. But I am not sure why we would make it so
> > > > that the client side throws this timeout exception, when in fact what
> > > > just happened was that the client was too busy to call next() on time.
> > > >
> > > > I am sure there is a reason, but cannot figure it out :-).
> > > >
> > > > BTW, I found this JIRA, talking about the exact same thing:
> > > > https://issues.apache.org/jira/browse/HBASE-6121 but with no
> > > > resolution.
> > > >
> > > > Any help understanding the reason for the timeout thrown client
> > > > side instead of an automatic reset would be much appreciated,
> > > > Best regards, Eric Owhadi
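
For readers following the thread, the client-side choice under discussion can be sketched as a toy model: when the server no longer knows the scanner, either surface a timeout to the caller (the behaviour described above) or reopen the scan after the last row received (the region move path, and the proposed behaviour). This is a minimal sketch under stated assumptions, not the actual ClientScanner.loadCache() code: FlakyRowSource, ScannerGoneException, ClientTimeoutException, ResumingScan and the resetOnTimeout flag are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the server reporting that the scanner is no longer known. */
class ScannerGoneException extends RuntimeException {}

/** Hypothetical stand-in for the timeout exception surfaced to the caller. */
class ClientTimeoutException extends RuntimeException {}

/** Toy row source that fails once at a given row, as if the server had closed the scanner. */
class FlakyRowSource {
    private final int rowCount;
    private int failAtRow;   // row index at which to fail once; -1 disables the failure
    private int position = 0;

    FlakyRowSource(int rowCount, int failAtRow) {
        this.rowCount = rowCount;
        this.failAtRow = failAtRow;
    }

    /** Reopens the scan at the given row, like starting a new scanner after the last row seen. */
    void reopenAt(int row) { position = row; }

    /** Returns the next row index, or -1 when the scan is exhausted. */
    int next() {
        if (position == failAtRow) {
            failAtRow = -1; // fail only once
            throw new ScannerGoneException();
        }
        return position < rowCount ? position++ : -1;
    }
}

class ResumingScan {
    /**
     * Reads all rows. timedOut says whether the client judged the failure to be a
     * lease timeout; resetOnTimeout is the hypothetical opt-in flag for resuming anyway.
     */
    static List<Integer> scanAll(FlakyRowSource source, boolean timedOut, boolean resetOnTimeout) {
        List<Integer> rows = new ArrayList<>();
        int resumeFrom = 0;
        while (true) {
            try {
                int row = source.next();
                if (row == -1) {
                    return rows;
                }
                rows.add(row);
                resumeFrom = row + 1; // remember where to restart
            } catch (ScannerGoneException e) {
                if (timedOut && !resetOnTimeout) {
                    throw new ClientTimeoutException(); // behaviour questioned in this thread
                }
                source.reopenAt(resumeFrom); // same recovery as for a region move
            }
        }
    }
}
```

In this model, scanAll(new FlakyRowSource(5, 3), true, false) throws ClientTimeoutException, while the same call with resetOnTimeout set to true returns rows 0 through 4 with no gap or duplicate. Keeping the flag off by default mirrors the suggestion to preserve the current semantic in 1.x.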