On 21/11/12 18:39, Stack wrote:
> So Vincent, the servers are quiet? Which would match your low CPU
> observation. Clients are unable to send them load for some reason?
> How many disks? What is your block cache hit number (see regionserver
> log -- it gets printed every so often .... or in the below I see 99%
> so your numbers should be good coming out of the regionserver).
It does not seem to be a load issue: as you say, CPU is low and RPC
handlers are underused.
We have plenty of disk space, and our block cache hit ratio is 99% on all
region servers.
Today we tried removing some region servers (yes: we had only 8
before moving to 0.92, and we added 8 more because we thought it was
a performance issue).
We now have 12 of them, and performance is actually similar (just
more CPU load of course, but similar response times).
> 600 regions is a lot per server. You should put it on your TODO list
> to have less per server -- bigger regions which you can do now that you
> are on 0.92.
This is definitely on our TODO list. Nevertheless, our 8 RS (0.90.3)
before the move had more than 1100 regions each, without any issue!
We increased our region size 4x (we now use the default 1GB setting),
and we plan to merge some tables.
> If you major compact -- do it when site is less heavily loaded -- does
> your performance go up?
> Are all query types slow or just certain types?
Actually, things are OK for a while (say 2 to 4 ms response time), then
we get "scanner lease" exceptions... We cannot figure out what
triggers this exception (we thought it was contention somewhere, or
a server slowdown, but our latest investigation seems to point to a bug
between server and clients).
Here is a typical set of exceptions we get from time to time:
client (a PIG script using HBaseStorage):
----------------------------------
2012-11-21 14:47:29,925 | ERROR | main | Launcher | Backend error
message
org.apache.hadoop.hbase.regionserver.LeaseException:
org.apache.hadoop.hbase.regionserver.LeaseException: lease
'4537659031468873643' does not exist
at
org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2117)
at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1326)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at
java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
at
org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:84)
at
org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:39)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1325)
at
org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1293)
at
org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:133)
at
org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:142)
at
org.apache.pig.backend.hadoop.hbase.HBaseTableInputFormat$HBaseTableRecordReader.nextKeyValue(HBaseTableInputFormat.java:162)
at
org.apache.pig.backend.hadoop.hbase.HBaseStorage.getNext(HBaseStorage.java:452)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194)
at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at
org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Region server:
-----------
2012-11-21 14:45:55,199 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.regionserver.LeaseException: lease
'4537659031468873643' does not exist
at
org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2117)
at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
--
at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1326)
2012-11-21 14:45:57,320 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow):
{"processingtimems":63895,"call":"next(4537659031468873643, 512), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.124.45.132:19289","starttimems":1353509093424,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
2012-11-21 14:45:57,320 WARN org.apache.hadoop.ipc.HBaseServer: IPC
Server Responder, call next(4537659031468873643, 512), rpc
version=1, client version=29, methodsFingerPrint=54742778 from
10.124.45.132:19289: output error
2012-11-21 14:45:57,323 WARN org.apache.hadoop.ipc.HBaseServer: IPC
Server handler 14 on 60020 caught:
java.nio.channels.ClosedChannelException
at
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
at
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
at
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
I don't understand this strange responseTooSlow /
ClosedChannelException sequence. If you can explain what is happening
here, it would help a lot.
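
To line up the slow call with the lease error, the processing time in the
responseTooSlow entry can be compared with the scanner lease period. Below
is a minimal Python sketch of that check. It assumes the lease period is
60000 ms (the value I believe hbase.regionserver.lease.period defaults to;
not verified against our hbase-site.xml). The function name and the
threshold constant are hypothetical, introduced only for illustration.

```python
import json
import re

# Assumption: scanner lease period in milliseconds. Check the value of
# hbase.regionserver.lease.period in hbase-site.xml before relying on it.
ASSUMED_LEASE_PERIOD_MS = 60000


def parse_response_too_slow(line):
    """Extract the interesting fields from a (responseTooSlow) log line."""
    # The JSON payload starts at the first opening brace of the line.
    payload = json.loads(line[line.index("{"):])
    # The "call" field looks like: next(<scanner id>, <row count>), rpc ...
    match = re.match(r"next\((\d+), (\d+)\)", payload["call"])
    return {
        "processing_ms": payload["processingtimems"],
        "scanner_id": match.group(1) if match else None,
        "rows_requested": int(match.group(2)) if match else None,
        "client": payload["client"],
    }


# The log entry from the region server above, joined onto one line.
LINE = (
    "2012-11-21 14:45:57,320 WARN org.apache.hadoop.ipc.HBaseServer: "
    "(responseTooSlow): "
    '{"processingtimems":63895,"call":"next(4537659031468873643, 512), '
    'rpc version=1, client version=29, methodsFingerPrint=54742778",'
    '"client":"10.124.45.132:19289","starttimems":1353509093424,'
    '"queuetimems":0,"class":"HRegionServer","responsesize":6,'
    '"method":"next"}'
)

info = parse_response_too_slow(LINE)
exceeded = info["processing_ms"] > ASSUMED_LEASE_PERIOD_MS
print(info["scanner_id"], info["processing_ms"], exceeded)
```

The scanner id in the slow call is the same one named in the LeaseException,
and the call took 63895 ms, which is above the assumed lease period. So the
lease could have expired while that next() call was still running. This is
only a consistency check on the log entries, not a diagnosis.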
Best regards, and thank you for your concern.