I'm wondering if the removal and re-add of the lease is racy. We used to just refresh the lease.
In the patch provided I don't remove the lease and add it back; instead I just
refresh it on the way out. If you apply the patch and the LeaseExceptions go
away, then we will know this works for you.

I've applied this patch to our internal build as part of tracking down what
might be spurious LeaseExceptions. I've been blaming the clients, but maybe
that is wrong.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

----- Original Message -----
> From: Mikael Sitruk <[email protected]>
> To: [email protected]; Andrew Purtell <[email protected]>
> Cc:
> Sent: Wednesday, February 15, 2012 11:32 PM
> Subject: Re: LeaseException while extracting data via pig/hbase integration
>
> Andy hi
>
> Not sure what you mean by "Does something like the below help?" The current
> code running is pasted below; line numbers are slightly different from yours.
> It seems very close to the first file (revision "a") in your extract.
>
> Mikael.S
>
> public Result[] next(final long scannerId, int nbRows) throws IOException {
>   String scannerName = String.valueOf(scannerId);
>   InternalScanner s = this.scanners.get(scannerName);
>   if (s == null) throw new UnknownScannerException("Name: " + scannerName);
>   try {
>     checkOpen();
>   } catch (IOException e) {
>     // If checkOpen failed, server not running or filesystem gone,
>     // cancel this lease; filesystem is gone or we're closing or something.
>     try {
>       this.leases.cancelLease(scannerName);
>     } catch (LeaseException le) {
>       LOG.info("Server shutting down and client tried to access missing scanner " +
>         scannerName);
>     }
>     throw e;
>   }
>   Leases.Lease lease = null;
>   try {
>     // Remove lease while its being processed in server; protects against case
>     // where processing of request takes > lease expiration time.
>     lease = this.leases.removeLease(scannerName);
>     List<Result> results = new ArrayList<Result>(nbRows);
>     long currentScanResultSize = 0;
>     List<KeyValue> values = new ArrayList<KeyValue>();
>     for (int i = 0; i < nbRows
>         && currentScanResultSize < maxScannerResultSize; i++) {
>       requestCount.incrementAndGet();
>       // Collect values to be returned here
>       boolean moreRows = s.next(values);
>       if (!values.isEmpty()) {
>         for (KeyValue kv : values) {
>           currentScanResultSize += kv.heapSize();
>         }
>         results.add(new Result(values));
>       }
>       if (!moreRows) {
>         break;
>       }
>       values.clear();
>     }
>     // Below is an ugly hack where we cast the InternalScanner to be a
>     // HRegion.RegionScanner. The alternative is to change InternalScanner
>     // interface but its used everywhere whereas we just need a bit of info
>     // from HRegion.RegionScanner, IF its filter if any is done with the scan
>     // and wants to tell the client to stop the scan. This is done by passing
>     // a null result.
>     return ((HRegion.RegionScanner) s).isFilterDone() && results.isEmpty() ? null
>       : results.toArray(new Result[0]);
>   } catch (Throwable t) {
>     if (t instanceof NotServingRegionException) {
>       this.scanners.remove(scannerName);
>     }
>     throw convertThrowableToIOE(cleanup(t));
>   } finally {
>     // We're done. On way out readd the above removed lease. Adding resets
>     // expiration time on lease.
>     if (this.scanners.containsKey(scannerName)) {
>       if (lease != null) this.leases.addLease(lease);
>     }
>   }
> }
>
> On Thu, Feb 16, 2012 at 3:10 AM, Andrew Purtell <[email protected]> wrote:
>
>> Hmm...
>>
>> Does something like the below help?
>>
>>
>> diff --git a/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java b/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
>> index f9627ed..0cee8e3 100644
>> --- a/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
>> +++ b/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
>> @@ -2137,11 +2137,7 @@ public class HRegionServer implements HRegionInterface, HBaseRPCErrorHandler,
>>        }
>>        throw e;
>>      }
>> -    Leases.Lease lease = null;
>>      try {
>> -      // Remove lease while its being processed in server; protects against case
>> -      // where processing of request takes > lease expiration time.
>> -      lease = this.leases.removeLease(scannerName);
>>        List<Result> results = new ArrayList<Result>(nbRows);
>>        long currentScanResultSize = 0;
>>        List<KeyValue> values = new ArrayList<KeyValue>();
>> @@ -2197,10 +2193,9 @@ public class HRegionServer implements HRegionInterface, HBaseRPCErrorHandler,
>>        }
>>        throw convertThrowableToIOE(cleanup(t));
>>      } finally {
>> -      // We're done. On way out readd the above removed lease. Adding resets
>> -      // expiration time on lease.
>> +      // We're done. On way out reset expiration time on lease.
>>        if (this.scanners.containsKey(scannerName)) {
>> -        if (lease != null) this.leases.addLease(lease);
>> +        this.leases.renewLease(scannerName);
>>        }
>>      }
>>    }
>>
>>
>> Best regards,
>>
>>   - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>>
>> ----- Original Message -----
>> > From: Jean-Daniel Cryans <[email protected]>
>> > To: [email protected]
>> > Cc:
>> > Sent: Wednesday, February 15, 2012 10:17 AM
>> > Subject: Re: LeaseException while extracting data via pig/hbase integration
>> >
>> > You would have to grep the lease's id; in your first email it was
>> > "-7220618182832784549".
>> >
>> > About the time it takes to process each row, I meant client (pig) side,
>> > not in the RS.
>> >
>> > J-D
>> >
>> > On Tue, Feb 14, 2012 at 1:33 PM, Mikael Sitruk <[email protected]> wrote:
>> >> Please see answer inline
>> >> Thanks
>> >> Mikael.S
>> >>
>> >> On Tue, Feb 14, 2012 at 8:30 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> >>
>> >>> On Tue, Feb 14, 2012 at 2:01 AM, Mikael Sitruk <[email protected]> wrote:
>> >>> > Hi,
>> >>> > Well no, I can't figure out what the problem is, but I saw that someone
>> >>> > else had the same problem (see email: "LeaseException despite high
>> >>> > hbase.regionserver.lease.period").
>> >>> > What I can tell is the following:
>> >>> > Last week the problem was consistent.
>> >>> > 1. I updated hbase.regionserver.lease.period=300000 (5 mins), restarted
>> >>> > the cluster and still got the problem; the maps got this exception even
>> >>> > before the 5 mins (some after 1 min and 20 sec).
>> >>>
>> >>> That's extremely suspicious. Are you sure the setting is getting picked
>> >>> up? :)
>> >> I hope so :-)
>> >>>
>> >>> You should be able to tell when the lease really expires by simply
>> >>> grepping for the number in the region server log, it should give you a
>> >>> good idea of what your lease period is.
>> >> Grepping for which value? The lease period configured here, 300000? It does
>> >> not return anything; I also tried in a current execution where some maps
>> >> were ok and some were not.
>> >>>
>> >>> > 2. The problem occurs only on jobs that extract a large number of
>> >>> > columns (>150 cols per row)
>> >>>
>> >>> What's your scanner caching set to?
>> >>> Are you spending a lot of time processing each row?
>> >> From the job configuration generated by pig I can see caching set to 1;
>> >> regarding the processing time of each row I have no clue how much time it
>> >> spent. The data for each row is 150 columns of 2k each. This is approx
>> >> 5 blocks to bring.
>> >>>
>> >>> > 3. The problem never occurred when only 1 map per server is running (I
>> >>> > have 8 CPUs with hyper-threading enabled = 16, so using only 1 map per
>> >>> > machine is just a waste). (At this stage I was thinking perhaps there is
>> >>> > a multi-threading problem.)
>> >>>
>> >>> More mappers would pull more data from the region servers so more
>> >>> concurrency from the disks; using more mappers might just slow you
>> >>> down enough that you hit the issue.
>> >>
>> >> Today I ran with 8 mappers and some failed and some didn't (2 of 4); they
>> >> got the lease exception after 5 mins. I will try to check the
>> >> logs/sar/metric files for additional info.
>> >>
>> >>> >
>> >>> > This week I got a slightly different behavior, after having restarted
>> >>> > the servers. The extract was able to run ok in most of the runs even
>> >>> > with 4 maps running (per server); I got the exception only once, and the
>> >>> > job was not killed as in other runs last week.
>> >>>
>> >>> If the client got an UnknownScannerException before the timeout
>> >>> expires (the client also keeps track of it, although it may have a
>> >>> different configuration), it will recreate the scanner.
>> >>
>> >> No, this is not the case.
>> >>
>> >>> Which reminds me, are your regions moving around? If so, and your
>> >>> clients don't know about the high timeout, then they might let the
>> >>> exception pass on to your own code.
>> >>
>> >> Regions are presplit ahead of time; I do not have any region split during
>> >> the run. Region size is set to 8GB, storefile is around 3.5G.
>> >>
>> >> The test was run after major compaction, so the number of store files is
>> >> 1 per RS/family.
>> >>
>> >>> J-D
>> >>>
>> >>
>> >> --
>> >> Mikael.S
>> >
>
>
> --
> Mikael.S
>
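To make the race described at the top of the thread concrete, here is a minimal,
self-contained sketch. It is not HBase code; ToyLeases and its methods are
illustrative stand-ins for a scanner-lease registry. With remove-then-re-add,
the lease is absent from the registry for the whole time a next() call is being
served, so a second concurrent call on the same scanner finds nothing and fails
with the equivalent of a LeaseException; renewing in place keeps the entry
present and only pushes its expiration forward.

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;

  // Toy lease registry for illustration only -- not the HBase Leases class.
  class ToyLeases {
    private final ConcurrentMap<String, Long> expirations =
        new ConcurrentHashMap<String, Long>();
    private final long periodMs;

    ToyLeases(long periodMs) { this.periodMs = periodMs; }

    void addLease(String name) {
      expirations.put(name, System.currentTimeMillis() + periodMs);
    }

    // Remove-then-re-add pattern: while one request has the lease "checked out",
    // the entry is missing, so a concurrent request for the same scanner lands here.
    void removeLease(String name) {
      if (expirations.remove(name) == null) {
        // Analogous to the LeaseException the thread is chasing.
        throw new IllegalStateException("lease '" + name + "' does not exist");
      }
    }

    // Renew-in-place pattern: the entry never leaves the map, so concurrent
    // requests still find it; only its expiration time moves forward.
    void renewLease(String name) {
      if (expirations.replace(name, System.currentTimeMillis() + periodMs) == null) {
        throw new IllegalStateException("lease '" + name + "' does not exist");
      }
    }
  }

Two next() calls interleaved as removeLease(A), removeLease(A), addLease(A),
addLease(A) make the second call fail even though the scanner is healthy; with
renewLease the same interleaving is harmless, which is the behavior Andy's
patch above switches the region server to.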

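On the client-side points raised in the thread (the Pig-generated job scanning
with caching = 1, and whether hbase.regionserver.lease.period is really being
picked up), the following is a hedged sketch of how a hand-written scan client
against the 0.90/0.92-era API would set those knobs. The table name is a
placeholder and the numeric values are examples only, not recommendations; when
the scan is driven by Pig's HBaseStorage, the caching value would instead come
from the loader's configuration or hbase.client.scanner.caching.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;

  public class ScanLeaseExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // Scanner lease period in ms. The region servers must be (re)started with
      // the same value in hbase-site.xml; setting it here does not change the
      // server-side lease, but the client reads it to decide when a scanner has
      // timed out on its side (J-D's point about a "different configuration").
      conf.setLong("hbase.regionserver.lease.period", 300000L);

      HTable table = new HTable(conf, "extract_table");  // placeholder table name
      try {
        Scan scan = new Scan();
        // The Pig job ran with caching = 1. A larger value means fewer next()
        // RPCs per mapper, but each cached batch must be consumed by the client
        // within the lease period or the server-side lease expires.
        scan.setCaching(100);
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // process the ~150 columns of the row here
          }
        } finally {
          scanner.close();
        }
      } finally {
        table.close();
      }
    }
  }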