You would have to grep for the lease's id; in your first email it was "-7220618182832784549".
About the time it takes to process each row, I meant on the client (Pig) side, not in the RS.

J-D

On Tue, Feb 14, 2012 at 1:33 PM, Mikael Sitruk <[email protected]> wrote:
> Please see answers inline.
> Thanks
> Mikael.S
>
> On Tue, Feb 14, 2012 at 8:30 PM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> On Tue, Feb 14, 2012 at 2:01 AM, Mikael Sitruk <[email protected]> wrote:
>> > Hi,
>> > Well no, I can't figure out what the problem is, but I saw that someone
>> > else had the same problem (see the email "LeaseException despite high
>> > hbase.regionserver.lease.period").
>> > What I can tell is the following:
>> > Last week the problem was consistent.
>> > 1. I updated hbase.regionserver.lease.period=300000 (5 mins), restarted the
>> > cluster and still got the problem; the maps got this exception even before
>> > the 5 mins (some after 1 min and 20 sec).
>>
>> That's extremely suspicious. Are you sure the setting is getting picked up? :)
>
> I hope so :-)
>
>> You should be able to tell when the lease really expires by simply
>> grepping for the number in the region server log, it should give you a
>> good idea of what your lease period is.
>
> Grepping for which value? The lease configured here, 300000? It does not
> return anything. I also tried on the current execution, where some maps were
> OK and some were not.
>
>> > 2. The problem occurs only on jobs that extract a large number of
>> > columns (>150 cols per row).
>>
>> What's your scanner caching set to? Are you spending a lot of time
>> processing each row?
>
> From the job configuration generated by Pig I can see caching is set to 1.
> Regarding the processing time of each row, I have no clue how much time it
> spends. The data for each row is 150 columns of 2k each; this is approx 5
> blocks to bring.
>
>> > 3. The problem never occurred when only 1 map per server is running (I
>> > have 8 CPUs with hyper-threading enabled = 16, so using only 1 map per
>> > machine is just a waste). (At this stage I was thinking perhaps there is
>> > a multi-threading problem.)
>>
>> More mappers would pull more data from the region servers, so more
>> concurrency from the disks; using more mappers might just slow you down
>> enough that you hit the issue.
>
> Today I ran with 8 mappers and some failed and some didn't (2 of 4); they
> got the lease exception after 5 mins. I will try to check the
> logs/sar/metrics files for additional info.
>
>> > This week I got a slightly different behavior, after having restarted the
>> > servers. The extracts were able to run OK in most of the runs, even with
>> > 4 maps running (per server); I got the exception only once, but the job
>> > was not killed as in other runs last week.
>>
>> If the client got an UnknownScannerException before the timeout expires
>> (the client also keeps track of it, although it may have a different
>> configuration), it will recreate the scanner.
>
> No, this is not the case.
>
>> Which reminds me, are your regions moving around? If so, and your clients
>> don't know about the high timeout, then they might let the exception pass
>> on to your own code.
>
> Regions are presplit ahead of time, and I do not have any region splits
> during the run; region size is set to 8GB, and store files are around 3.5GB.
>
> The test was run after major compaction, so the number of store files is 1
> per RS/family.
>
>> J-D
>
> --
> Mikael.S
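
For anyone else hitting this, here is a rough client-side sketch of the two knobs discussed in the thread (the scanner lease period as the client sees it, and scanner caching), using the plain HBase Java API rather than Pig. The table name "mytable", family "cf", caching value of 10, and class name are made-up placeholders, not the poster's actual setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanWithLease {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client keeps its own copy of the lease period, so it should see
        // the same 300000 ms the region servers were restarted with; otherwise
        // it may surface the exception instead of recreating the scanner.
        conf.setLong("hbase.regionserver.lease.period", 300000);

        HTable table = new HTable(conf, "mytable");
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf"));
        // With ~150 columns of 2k per row, keep caching small enough that the
        // time spent processing one cached batch on the client side stays well
        // under the lease period (the Pig job here was defaulting to 1).
        scan.setCaching(10);

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // process the row...
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }

The trade-off is that a larger caching value means fewer round trips to the region server, but the lease on the open scanner keeps counting down while the client works through each cached batch, so slow per-row processing on the Pig side can still outlast it.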
