Andrew hi, can you please paste the method itself and not the output of the patch, so I will be able to test it (hopefully)?

Thanks,
Mikael.S
On Fri, Feb 17, 2012 at 12:23 AM, Mikael Sitruk <[email protected]> wrote:

OK, I understand you now, but I think the lines are different, so can you paste the method (full content instead of the patch) into the email? I will compile and check.

Mikael.S

On Thu, Feb 16, 2012 at 7:49 PM, Andrew Purtell <[email protected]> wrote:

I'm wondering if the removal and re-add of the lease is racy. We used to just refresh the lease.

In the patch provided I don't remove the lease and add it back; instead I just refresh it on the way out. If you apply the patch and the LeaseExceptions go away, then we will know this works for you. I've applied this patch to our internal build as part of tracking down what might be spurious LeaseExceptions. I've been blaming the clients, but maybe that is wrong.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
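A rough illustration of the kind of race a remove-then-re-add pattern can run into (this is only a guess at what Andy has in mind, and the class below is a toy model, not HBase's actual Leases implementation): while the lease is out of the map during a long-running next() call, any concurrent path that looks up the same lease, for example a close() on the same scanner or the expiration handler, finds nothing and surfaces a LeaseException; a renew leaves the lease in place the whole time while still pushing out its expiration.

import java.util.concurrent.ConcurrentHashMap;

// Toy model only; names and behavior are illustrative, not HBase's Leases class.
public class ToyLeases {
  private final ConcurrentHashMap<String, Long> expirations =
      new ConcurrentHashMap<String, Long>();
  private final long leasePeriodMs;

  public ToyLeases(long leasePeriodMs) {
    this.leasePeriodMs = leasePeriodMs;
  }

  public void addLease(String name) {
    expirations.put(name, Long.valueOf(System.currentTimeMillis() + leasePeriodMs));
  }

  // Remove-then-re-add: between removeLease() and the later addLease(),
  // any other caller that touches the same lease finds nothing and throws.
  public long removeLease(String name) throws LeaseException {
    Long exp = expirations.remove(name);
    if (exp == null) throw new LeaseException("lease '" + name + "' does not exist");
    return exp.longValue();
  }

  // Renew: the lease never leaves the map, so concurrent lookups still see it,
  // while the expiration time is still pushed out.
  public void renewLease(String name) throws LeaseException {
    Long renewed = Long.valueOf(System.currentTimeMillis() + leasePeriodMs);
    if (expirations.replace(name, renewed) == null) {
      throw new LeaseException("lease '" + name + "' does not exist");
    }
  }

  public static class LeaseException extends Exception {
    public LeaseException(String msg) { super(msg); }
  }
}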
----- Original Message -----
From: Mikael Sitruk <[email protected]>
To: [email protected]; Andrew Purtell <[email protected]>
Cc:
Sent: Wednesday, February 15, 2012 11:32 PM
Subject: Re: LeaseException while extracting data via pig/hbase integration

Andy hi

Not sure what you mean by "Does something like the below help?" The current code running is pasted below; the line numbers are slightly different from yours. It seems very close to the first file (revision "a") in your extract.

Mikael.S

public Result[] next(final long scannerId, int nbRows) throws IOException {
  String scannerName = String.valueOf(scannerId);
  InternalScanner s = this.scanners.get(scannerName);
  if (s == null) throw new UnknownScannerException("Name: " + scannerName);
  try {
    checkOpen();
  } catch (IOException e) {
    // If checkOpen failed, server not running or filesystem gone,
    // cancel this lease; filesystem is gone or we're closing or something.
    try {
      this.leases.cancelLease(scannerName);
    } catch (LeaseException le) {
      LOG.info("Server shutting down and client tried to access missing scanner " +
        scannerName);
    }
    throw e;
  }
  Leases.Lease lease = null;
  try {
    // Remove lease while its being processed in server; protects against case
    // where processing of request takes > lease expiration time.
    lease = this.leases.removeLease(scannerName);
    List<Result> results = new ArrayList<Result>(nbRows);
    long currentScanResultSize = 0;
    List<KeyValue> values = new ArrayList<KeyValue>();
    for (int i = 0; i < nbRows
        && currentScanResultSize < maxScannerResultSize; i++) {
      requestCount.incrementAndGet();
      // Collect values to be returned here
      boolean moreRows = s.next(values);
      if (!values.isEmpty()) {
        for (KeyValue kv : values) {
          currentScanResultSize += kv.heapSize();
        }
        results.add(new Result(values));
      }
      if (!moreRows) {
        break;
      }
      values.clear();
    }
    // Below is an ugly hack where we cast the InternalScanner to be a
    // HRegion.RegionScanner. The alternative is to change InternalScanner
    // interface but its used everywhere whereas we just need a bit of info
    // from HRegion.RegionScanner, IF its filter if any is done with the scan
    // and wants to tell the client to stop the scan. This is done by passing
    // a null result.
    return ((HRegion.RegionScanner) s).isFilterDone() && results.isEmpty() ? null
        : results.toArray(new Result[0]);
  } catch (Throwable t) {
    if (t instanceof NotServingRegionException) {
      this.scanners.remove(scannerName);
    }
    throw convertThrowableToIOE(cleanup(t));
  } finally {
    // We're done. On way out readd the above removed lease. Adding resets
    // expiration time on lease.
    if (this.scanners.containsKey(scannerName)) {
      if (lease != null) this.leases.addLease(lease);
    }
  }
}

On Thu, Feb 16, 2012 at 3:10 AM, Andrew Purtell <[email protected]> wrote:

Hmm...

Does something like the below help?

diff --git a/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java b/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
index f9627ed..0cee8e3 100644
--- a/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
+++ b/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
@@ -2137,11 +2137,7 @@ public class HRegionServer implements HRegionInterface, HBaseRPCErrorHandler,
       }
       throw e;
     }
-    Leases.Lease lease = null;
     try {
-      // Remove lease while its being processed in server; protects against case
-      // where processing of request takes > lease expiration time.
-      lease = this.leases.removeLease(scannerName);
       List<Result> results = new ArrayList<Result>(nbRows);
       long currentScanResultSize = 0;
       List<KeyValue> values = new ArrayList<KeyValue>();
@@ -2197,10 +2193,9 @@ public class HRegionServer implements HRegionInterface, HBaseRPCErrorHandler,
       }
       throw convertThrowableToIOE(cleanup(t));
     } finally {
-      // We're done. On way out readd the above removed lease. Adding resets
-      // expiration time on lease.
+      // We're done. On way out reset expiration time on lease.
       if (this.scanners.containsKey(scannerName)) {
-        if (lease != null) this.leases.addLease(lease);
+        this.leases.renewLease(scannerName);
       }
     }
   }

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
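For completeness, applying that diff to the method pasted above should leave next() looking roughly like the sketch below (the row-collection loop is unchanged and elided; this has not been compiled against any particular revision, so treat it as an approximation rather than the authoritative patched method):

public Result[] next(final long scannerId, int nbRows) throws IOException {
  String scannerName = String.valueOf(scannerId);
  InternalScanner s = this.scanners.get(scannerName);
  if (s == null) throw new UnknownScannerException("Name: " + scannerName);
  try {
    checkOpen();
  } catch (IOException e) {
    // If checkOpen failed, cancel this lease; filesystem is gone or we're closing.
    try {
      this.leases.cancelLease(scannerName);
    } catch (LeaseException le) {
      LOG.info("Server shutting down and client tried to access missing scanner " +
        scannerName);
    }
    throw e;
  }
  // No removeLease() here any more; the lease stays registered while the
  // request is being processed.
  try {
    List<Result> results = new ArrayList<Result>(nbRows);
    // ... row-collection loop exactly as in the method pasted above ...
    return ((HRegion.RegionScanner) s).isFilterDone() && results.isEmpty() ? null
        : results.toArray(new Result[0]);
  } catch (Throwable t) {
    if (t instanceof NotServingRegionException) {
      this.scanners.remove(scannerName);
    }
    throw convertThrowableToIOE(cleanup(t));
  } finally {
    // We're done. On way out reset expiration time on lease.
    if (this.scanners.containsKey(scannerName)) {
      this.leases.renewLease(scannerName);
    }
  }
}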
----- Original Message -----
From: Jean-Daniel Cryans <[email protected]>
To: [email protected]
Cc:
Sent: Wednesday, February 15, 2012 10:17 AM
Subject: Re: LeaseException while extracting data via pig/hbase integration

You would have to grep the lease's id; in your first email it was "-7220618182832784549".

About the time it takes to process each row, I meant on the client (pig) side, not in the RS.

J-D

On Tue, Feb 14, 2012 at 1:33 PM, Mikael Sitruk <[email protected]> wrote:

Please see answers inline.
Thanks
Mikael.S

On Tue, Feb 14, 2012 at 8:30 PM, Jean-Daniel Cryans <[email protected]> wrote:

On Tue, Feb 14, 2012 at 2:01 AM, Mikael Sitruk <[email protected]> wrote:

Mikael: Hi. Well no, I can't figure out what the problem is, but I saw that someone else had the same problem (see the email "LeaseException despite high hbase.regionserver.lease.period"). What I can tell is the following: last week the problem was consistent.

Mikael: 1. I updated hbase.regionserver.lease.period=300000 (5 mins) and restarted the cluster, and still got the problem; the maps got this exception even before the 5 mins (some after 1 min and 20 sec).

J-D: That's extremely suspicious. Are you sure the setting is getting picked up? :)

Mikael: I hope so :-)

J-D: You should be able to tell when the lease really expires by simply grepping for the number in the region server log; it should give you a good idea of what your lease period is.

Mikael: Grepping on which value? The lease period configured here, 300000? It does not return anything. I also tried on the current execution, where some maps were OK and some were not.

Mikael: 2. The problem occurs only on jobs that extract a large number of columns (>150 cols per row).

J-D: What's your scanner caching set to? Are you spending a lot of time processing each row?

Mikael: From the job configuration generated by Pig I can see caching set to 1. Regarding the processing time of each row, I have no clue how much time it spent; the data for each row is 150 columns of 2k each, which is approximately 5 blocks to bring.
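As a side note on the caching question: with roughly 150 columns of 2 KB each (about 300 KB per row), a caching value of 1 means one round trip per row, while a very large value makes each next() call heavy and stretches the time between calls. A moderate value is usually the sweet spot. Below is a generic sketch for the 0.90/0.92-era client API; the property name and classes are given as I remember them, so double-check against the version in use.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;

public class ScanCachingExample {
  public static void main(String[] args) {
    // Raise the client-wide default (the default in this era is 1 row per next() RPC).
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.scanner.caching", 100);

    // Or set it per scan, e.g. on the Scan handed to TableInputFormat.
    // ~150 columns x 2 KB = ~300 KB per row, so 100 rows is roughly 30 MB per RPC.
    Scan scan = new Scan();
    scan.setCaching(100);
    scan.setCacheBlocks(false); // block caching is usually disabled for full scans from MR
  }
}

Depending on the Pig version, HBaseStorage may also expose a -caching option, so the job-generated configuration does not have to stay at 1; that is worth checking in the HBaseStorage documentation for the release in use.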
Mikael: 3. The problem never occurred when only 1 map per server is running (I have 8 CPUs with hyper-threading enabled = 16, so using only 1 map per machine is just a waste). At this stage I was thinking there is perhaps a multi-threading problem.

J-D: More mappers would pull more data from the region servers, so more concurrency from the disks; using more mappers might just slow you down enough that you hit the issue.

Mikael: Today I ran with 8 mappers and some failed and some didn't (2 of 4); they got the lease exception after 5 mins. I will try to check the logs/sar/metric files for additional info.

Mikael: This week I got a slightly different behavior, after having restarted the servers. The extracts were able to run OK in most of the runs, even with 4 maps running (per server); I got the exception only once, and the job was not killed as in other runs last week.

J-D: If the client got an UnknownScannerException before the timeout expires (the client also keeps track of it, although it may have a different configuration), it will recreate the scanner.

Mikael: No, this is not the case.

J-D: Which reminds me, are your regions moving around? If so, and your clients don't know about the high timeout, then they might let the exception pass on to your own code.

Mikael: Regions are presplit ahead; I do not have any region splits during the run. Region size is set to 8 GB, and the store files are around 3.5 GB. The test was run after major compaction, so the number of store files is 1 per RS/family.

J-D

--
Mikael.S
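One thing that may be worth double-checking, given J-D's point that the client keeps its own copy of the timeout: make sure the configuration the Pig/MapReduce tasks actually load carries the same hbase.regionserver.lease.period as the region servers (as far as I can tell, clients of this era read the scanner timeout from that same key). A trivial hypothetical check, run with the same classpath and hbase-site.xml as the tasks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LeasePeriodCheck {
  public static void main(String[] args) {
    // Prints the lease period the loaded configuration resolves to, so it can be
    // compared with the 300000 ms set for the region servers.
    // 60000 ms is the stock default for hbase.regionserver.lease.period.
    Configuration conf = HBaseConfiguration.create();
    long leasePeriod = conf.getLong("hbase.regionserver.lease.period", 60000L);
    System.out.println("hbase.regionserver.lease.period = " + leasePeriod + " ms");
  }
}

The same check against the configuration the region servers start with is arguably the more important one, since the maps are failing well before the configured 5 minutes, which points at the servers not picking up the new value, as J-D suspected.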
