Ok, I don't have that log anymore, but since the problem was reproduced in
another log (which I kept), here is the grep:

2012-02-08 14:13:02,970 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.regionserver.LeaseException: lease
'-6992210222685255354' does not exist
        at
org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1862)
        at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)


Then 2 minutes later:
2012-02-08 14:15:30,660 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-6992210222685255354, 1) from 10.232.83.82:43263:
output error
2012-02-08 14:15:30,661 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 59 on 60020 caught: java.nio.channels.ClosedChannelException
        at
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

Finally, 5 minutes later:
2012-02-08 14:20:30,662 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-6992210222685255354 lease expired
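For reference, the excerpts above were pulled by grepping the region server log for the lease id. A self-contained sketch (the log path and surrounding line contents here are made up for illustration; on the cluster it was run against the real region server log):

```shell
# Write a few sample region-server lines to a scratch file, then grep the
# lease id out of it, the same way it was done against the real log.
cat > /tmp/rs-sample.log <<'EOF'
2012-02-08 14:13:02,970 ERROR ... LeaseException: lease '-6992210222685255354' does not exist
2012-02-08 14:14:00,000 INFO  ... some unrelated line
2012-02-08 14:20:30,662 INFO  ... Scanner -6992210222685255354 lease expired
EOF
# "--" keeps grep from treating the leading dash of the id as an option.
grep -- '-6992210222685255354' /tmp/rs-sample.log
```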

So if I understand the log correctly, it takes at least 2 minutes to
process one row?

** Nevertheless, what is suspect is that 7 minutes after the first
exception the RS indicates that the lease has expired, while earlier it
complained that the lease did not exist.
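The gaps can be checked directly from the three timestamps in the excerpts above (a quick sketch; it assumes GNU date, whose `-d` flag differs on BSD/macOS):

```shell
# Convert the three log timestamps to epoch seconds and print the gaps.
t1=$(date -d '2012-02-08 14:13:02' +%s)   # LeaseException
t2=$(date -d '2012-02-08 14:15:30' +%s)   # ClosedChannelException
t3=$(date -d '2012-02-08 14:20:30' +%s)   # "lease expired"
echo "exception -> output error: $(( (t2 - t1) / 60 )) min"   # 2 min
echo "exception -> expiry:       $(( (t3 - t1) / 60 )) min"   # 7 min
```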

Regarding your second question, "how long does it take for Pig to process
one row": I can't say exactly, but the job completed successfully in 93
minutes and extracted 1,300,000 rows -> so one row took an average of
0.0043 sec.
There were 101 maps and no reducers. I presume that to give a better
average I would need to know the running time of each map and how many
rows it extracted.
** Is the math correct?
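Checking the arithmetic with the figures quoted above (the rows-per-map number assumes rows were spread evenly across the maps, which the real job may not have done):

```shell
# 93 minutes, 1,300,000 rows, 101 maps -- all figures from the job above.
awk 'BEGIN {
  secs = 93 * 60                                   # 5580 seconds of wall clock
  rows = 1300000
  maps = 101
  printf "avg per row:  %.4f sec\n", secs / rows   # 0.0043
  printf "rows per map: %d\n",       rows / maps   # 12871
}'
```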

I've also checked the GC around the time the exception occurred; no
special problem was present there. GC ran every 130 ms, and no full GC
occurred.

Mikael.S

On Wed, Feb 15, 2012 at 8:17 PM, Jean-Daniel Cryans <jdcry...@apache.org>wrote:

> You would have to grep the lease's id, in your first email it was
> "-7220618182832784549".
>
> About the time it takes to process each row, I meant client (pig) side
> not in the RS.
>
> J-D
>
> On Tue, Feb 14, 2012 at 1:33 PM, Mikael Sitruk <mikael.sit...@gmail.com>
> wrote:
> > Please see answer inline
> > Thanks
> > Mikael.S
> >
> > On Tue, Feb 14, 2012 at 8:30 PM, Jean-Daniel Cryans <jdcry...@apache.org
> >wrote:
> >
> >> On Tue, Feb 14, 2012 at 2:01 AM, Mikael Sitruk <mikael.sit...@gmail.com
> >
> >> wrote:
> >> > Hi,
> >> > Well no, I can't figure out what the problem is, but I saw that
> >> > someone else had the same problem (see email: "LeaseException despite
> >> > high hbase.regionserver.lease.period").
> >> > What I can tell is the following:
> >> > Last week the problem was consistent.
> >> > 1. I updated hbase.regionserver.lease.period=300000 (5 mins) and
> >> > restarted the cluster, and still got the problem; the maps got this
> >> > exception even before the 5 mins (some after 1 min and 20 sec).
> >>
> >> That's extremely suspicious. Are you sure the setting is getting picked
> >> up? :) I hope so :-)
> >>
> >> You should be able to tell when the lease really expires by simply
> >> grepping for the number in the region server log, it should give you a
> >> good idea of what your lease period is.
> >> Grepping for which value? The lease configured here, 300000? It does
> >> not return anything; I also tried in a current execution where some
> >> maps were OK and some were not.
> >>
> >> 2. The problem occurs only on job that will extract a large number of
> >> > columns (>150 cols per row)
> >>
> >> What's your scanner caching set to? Are you spending a lot of time
> >> processing each row? From the job configuration generated by Pig I can
> >> see caching set to 1. Regarding the processing time of each row, I have
> >> no clue how much time it spent; the data for each row is 150 columns of
> >> 2 KB each. This is approximately 5 blocks to bring.
> >>
> >> > 3. The problem never occurred when only 1 map per server is running (i
> >> have
> >> > 8 CPU with hyper-threaded enabled = 16, so using only 1 map per
> machine
> >> is
> >> > just a waste), (at this stage I was thinking perhaps there is a
> >> > multi-threaded problem)
> >>
> >> More mappers would pull more data from the region servers so more
> >> concurrency from the disks, using more mappers might just slow you
> >> down enough that you hit the issue.
> >>
> > Today I ran with 8 mappers and some failed and some didn't (2 of 4);
> > they got the lease exception after 5 minutes. I will try to check the
> > logs/sar/metric files for additional info.
> >
> >>
> >> >
> >> > This week I got a slightly different behavior after having restarted
> >> > the servers. The extracts were able to run OK in most of the runs,
> >> > even with 4 maps running (per server); I got the exception only once,
> >> > but the job was not killed as in other runs last week.
> >>
> >> If the client got an UnknownScannerException before the timeout
> >> expires (the client also keeps track of it, although it may have a
> >> different configuration), it will recreate the scanner.
> >>
> > No this is not the case.
> >
> >>
> >> Which reminds me, are your regions moving around? If so, and your
> >> clients don't know about the high timeout, then they might let the
> >> exception pass on to your own code.
> >>
> > Regions are pre-split ahead of time; I do not have any region splits
> > during the run. Region size is set to 8 GB, and the store file is
> > around 3.5 GB.
> >
> > The test was run after a major compaction, so the number of store files
> > is 1 per RS/family.
> >
> >
> >>
> >> J-D
> >>
> >
> >
> >
> > --
> > Mikael.S
>



-- 
Mikael.S
