On Tue, Feb 14, 2012 at 2:01 AM, Mikael Sitruk <[email protected]> wrote:
> Hi,
> Well no, I can't figure out what the problem is, but I saw that someone
> else had the same problem (see email: "LeaseException despite high
> hbase.regionserver.lease.period").
> What I can tell is the following:
> Last week the problem was consistent.
> 1. I updated hbase.regionserver.lease.period=300000 (5 mins), restarted the
> cluster and still got the problem; the map got this exception even before
> the 5 mins (some after 1 min and 20 sec).

That's extremely suspicious. Are you sure the setting is getting picked up? :)

You should be able to tell when the lease really expires by simply grepping for the number in the region server log; that should give you a good idea of what your lease period actually is.
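Another quick sanity check is to dump what the loaded Configuration actually resolves for that property; a minimal sketch, assuming the 0.90-style Java client (the class name here is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Hypothetical one-off checker: prints the lease period that the
// hbase-site.xml on the current classpath actually resolves to.
public class LeasePeriodCheck {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // The second argument is the fallback if the property is absent;
        // 60000 (one minute) is the shipped default in this era.
        long lease = conf.getLong("hbase.regionserver.lease.period", 60000);
        System.out.println("hbase.regionserver.lease.period = " + lease);
        // If this prints 60000 rather than your 300000, the processes
        // are not seeing the hbase-site.xml you edited.
    }
}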
> 2. The problem occurs only on jobs that extract a large number of
> columns (>150 cols per row).

What's your scanner caching set to? Are you spending a lot of time processing each row? (There's a rough sketch of this knob at the bottom of this mail.)

> 3. The problem never occurred when only 1 map per server is running (I have
> 8 CPUs with hyper-threading enabled = 16, so using only 1 map per machine is
> just a waste). (At this stage I was thinking perhaps there is a
> multi-threaded problem.)

More mappers pull more data from the region servers, which means more concurrent load on the disks; that extra contention might slow each mapper down just enough that you hit the timeout.

> This week I got a slightly different behavior, after having restarted the
> servers. The extracts were able to run OK in most of the runs, even with 4
> maps running (per server); I got the exception only once, and the job was
> not killed as in other runs last week.

If the client gets an UnknownScannerException before its own timeout expires (the client keeps track of the lease period too, although it may have a different configuration), it will recreate the scanner. Which reminds me, are your regions moving around? If so, and your clients don't know about the high timeout, they might let the exception pass on to your own code.
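In case it helps, here's a rough sketch of the two client-side knobs discussed above; the class name and values are illustrative only, not a recommendation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;

// Illustrative setup, not your actual job.
public class ScanTuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();

        // The client tracks scanner leases with this same property, so if
        // your job's classpath carries a stale hbase-site.xml the client
        // and the region servers will disagree on when a lease is gone.
        conf.setLong("hbase.regionserver.lease.period", 300000L);

        Scan scan = new Scan();
        // How many rows each next() round trip fetches. With wide rows
        // (>150 columns) and slow per-row processing, a big value means a
        // long gap between round trips, which is exactly what burns the
        // lease; try a smaller value. (This can also be set cluster-wide
        // with hbase.client.scanner.caching.)
        scan.setCaching(10);
    }
}

You'd then hand a Scan configured like that to TableMapReduceUtil.initTableMapperJob as usual.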
J-D