Re: HBase scanner LeaseException

Vincent Barat Thu, 22 Nov 2012 11:50:10 -0800

My problem seems more closely related to the one described here: http://www.nosql.se/tags/hbase-rpc-timeout/

I don't really understand why next() on our scanners would be called less than once per 60s, and I actually suspect this is NOT the case: we never had any scanner timeout exception when we were running 0.90.3; the issue appeared only with 0.92.
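
To make the 60s condition concrete, here is a rough sketch of the arithmetic, not a measurement from our cluster. It assumes the client only issues the next scanner RPC after it has processed a full batch of cached rows. The function names and the 0.1 s per-row processing time are purely hypothetical; the 60 s window and the caching values (64 and 1024) are the figures mentioned in this thread.

```python
def seconds_between_next_calls(caching_rows, seconds_per_row):
    """Approximate gap between two scanner next() RPCs, assuming the
    client processes every cached row before asking for more."""
    return caching_rows * seconds_per_row


def lease_at_risk(caching_rows, seconds_per_row, lease_seconds=60.0):
    """True if the estimated gap exceeds the lease window (assumed 60 s)."""
    return seconds_between_next_calls(caching_rows, seconds_per_row) > lease_seconds


if __name__ == "__main__":
    # Hypothetical per-row client processing time of 0.1 s.
    for caching in (64, 1024):
        gap = seconds_between_next_calls(caching, 0.1)
        print(caching, gap, lease_at_risk(caching, 0.1))
```

Under these assumptions, only a large caching value combined with slow per-row processing pushes the gap past the window, which is why I doubt this is what happens in our case.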


Anyway, increasing hbase.rpc.timeout seems to work.

We will continue our investigation, but my guess is that there is an issue in 0.92 related to how HBase handles scanner leases.


Best regards,

On 21/11/12 09:23, Vincent Barat wrote:

On 21/11/12 06:05, Stack wrote:
On Tue, Nov 20, 2012 at 8:21 AM, Vincent Barat <vincent.ba...@gmail.com> wrote:
We have changed some parameters on our 16(!) region servers: 1 GB more -Xmx, more RPC handlers (from 10 to 30), and a longer timeout, but nothing seems to improve the response time:
Have you taken a look at the perf chapter, Vincent?
http://hbase.apache.org/book.html#performance

Did you carry forward your old hbase-default.xml or did you remove it
(0.92 should have defaults in hbase-X.X.X.jar -- some defaults will
have changed)?
We use the new default settings for HBase, with just a few changes: more RPC handlers and a longer timeout (but the latter was a bad idea).
- Scans with HBase 0.92 are 3x SLOWER than with HBase 0.90.3
Any scan caching going on?
Yes, the scanner caching is set between 64 and 1024 depending on the need.
- A lot of simultaneous gets leads to a huge slowdown of batch put & random read response time
Are the gets returning lots of data? (If you thread dump the server at this time -- see at top of the regionserver UI -- can you see what we are hung up on? Are all handlers occupied?)
We will check this...
... despite the fact that our RS CPU load is really low (10%)
As has been suggested earlier, perhaps up the handlers?
Note: we have not (yet) activated MSlabs, nor direct read on HDFS.
MSlab will help you avoid stop-the-world GCs.  Direct read of HDFS
should speed up random access.
OK, I guess we will give it a try, but as a second step.

Thanks for your help
St.Ack
Any ideas, please? I'm really stuck on this issue.

Best regards,

On 16/11/12 20:55, Vincent Barat wrote:
Hi,
Until now (as previously with 0.90.3) we were using the default value (10).
We are now trying to increase it to 30 to see if it is better.

Thanks for your concern

On 16/11/12 18:13, Ted Yu wrote:
Vincent:
What's the value for hbase.regionserver.handler.count?

I assume you keep the same value as that from 0.90.3

Thanks

On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <vincent.ba...@gmail.com> wrote:
On 16/11/12 01:56, Stack wrote:
On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <gper...@ubikod.com> wrote:
It happens when several tables are being compacted and/or when there are several scanners running.
Does it happen for a particular region? Anything you can tell about the server looking in your cluster monitoring? Is it running hot? What do the hbase regionserver stats in UI say? Anything interesting about compaction queues or requests?
Hi, thanks for your answer Stack. I will take the lead on that thread from now on.
It does not happen on any particular region. Actually, things have gotten better now that compactions have been performed on all tables and have stopped.
Nevertheless, we face a dramatic decrease in performance (especially on random gets) across the overall cluster:
Despite the fact that we doubled our number of region servers (from 8 to 16), and despite the fact that the CPU load on these region servers is only about 10% to 30%, performance is really bad: very often a slight increase in requests leads to clients blocked on requests and very long response times. It looks like contention or a deadlock somewhere in the HBase client and C code.
If you look at the thread dump, are all handlers occupied serving requests? These timed-out requests couldn't get into the server?
We will investigate that and report back to you.
Before the timeouts, we observe an increasing CPU load on a single region server, and if we add region servers and wait for rebalancing, we always have the same region server causing problems like these:
2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc version=1, client version=29, methodsFingerPrint=54742778 from <ip>:45334: output error
2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
    at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
With the same access patterns, we did not have this issue in HBase 0.90.3.
The above is the other side of the timeout -- the client is gone.

Can you explain the rising CPU?
No, there is no explanation (no high access rate on a given region, for example).
But this specific problem went away when we finished compactions.


Is it iowait on this box because of compactions? Bad disk? Always same regionserver or issue moves around?

Sorry for all the questions.  0.92 should be better than 0.90
Our experience is currently the exact opposite: for us, 0.92 seems to be times slower than 0.90.3.

   generally (0.94 even better still -- can you go there?).
We can go to 0.94 but unfortunately, we CANNOT GO BACK (the same way we cannot go back to 0.90.3, since there is apparently a modification of the format of the ROOT table).
The upgrade works, but the downgrade does not. And we are afraid of having even more "new" problems with 0.94 and being forced to roll back to 0.90.3 (with some days of data loss).

Thanks for your reply; we will continue to investigate.



Interesting that these issues show up post upgrade. I can't think of a reason why the different versions would bring this on...

St.Ack
