Hi,

When we use MapReduce to dump data from a fairly large table in HBase, one region server crashes and then another. MapReduce is deployed together with HBase.
1) From the region server log, both "next" and "multi" operations are in progress. Is the scanner timeout caused by a read/write conflict?
2) The region server has 24 cores, and the maximum number of map tasks is also 24; the table has about 30 regions (each about 0.5 GB) on the region server. Is MapReduce using all the CPU, so that the region server slows down and then times out?
3) hbase.regionserver.handler.count is currently 10 (the default). Should it be increased?

Please give us some advice.

Thanks,
Wei

Log information:

[Regionserver rs21:]
2013-03-11 18:36:28,148 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/adcbg21.machine.wisdom.com,60020,1363010589837/rs21%2C60020%2C1363010589837.1363025554488, entries=22417, filesize=127539793. for /hbase/.logs/rs21,60020,1363010589837/rs21%2C60020%2C1363010589837.1363026988052
2013-03-11 18:37:39,481 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 28183ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-03-11 18:37:40,163 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":29830,"call":"next(1656517918313948447, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.127.21:56058","starttimems":1363027030280,"queuetimems":4602,"class":"HRegionServer","responsesize":2774484,"method":"next"}
2013-03-11 18:37:40,163 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":31195,"call":"next(-8353194140406556404, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.127.21:56529","starttimems":1363027028804,"queuetimems":3634,"class":"HRegionServer","responsesize":2270919,"method":"next"}
2013-03-11 18:37:40,163 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":30965,"call":"next(2623756537510669130, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.127.21:56146","starttimems":1363027028807,"queuetimems":3484,"class":"HRegionServer","responsesize":2753299,"method":"next"}
2013-03-11 18:37:40,236 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":31023,"call":"next(5293572780165196795, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.127.21:56069","starttimems":1363027029086,"queuetimems":3589,"class":"HRegionServer","responsesize":2722543,"method":"next"}
2013-03-11 18:37:40,368 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":31160,"call":"next(-4285417329791344278, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.127.21:56586","starttimems":1363027029204,"queuetimems":3707,"class":"HRegionServer","responsesize":2938870,"method":"next"}
2013-03-11 18:37:43,652 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":31249,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@2d19985a), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.109.21:35342","starttimems":1363027031505,"queuetimems":5720,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-03-11 18:37:49,108 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":38813,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@19c59a2e), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.125.11:57078","starttimems":1363027030273,"queuetimems":4663,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-03-11 18:37:50,410 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":38893,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@40022ddb), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.109.20:51698","starttimems":1363027031505,"queuetimems":5720,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-03-11 18:37:50,642 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":40037,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6b8bc8cf), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.125.11:57078","starttimems":1363027030601,"queuetimems":4818,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-03-11 18:37:51,529 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":10880,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6928d7b), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.125.11:57076","starttimems":1363027060645,"queuetimems":34763,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-03-11 18:37:51,776 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":41327,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@354baf25), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.125.11:57076","starttimems":1363027030411,"queuetimems":4680,"class":"HRegionServer","responsesize":0,"method":"multi"}
2013-03-11 18:38:32,361 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":10204,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6d86b477), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.125.10:36950","starttimems":1363027102044,"queuetimems":11027,"class":"HRegionServer","responsesize":0,"method":"multi"}
----------------------------------------------------------
[master:]
2013-03-11 18:35:39,386 WARN org.apache.hadoop.conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
2013-03-11 18:38:25,892 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=10 regions=477 average=47.7 mostloaded=52 leastloaded=45
2013-03-11 18:39:42,002 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [rs21,60020,1363010589837]
2013-03-11 18:39:42,007 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for rs21,60020,1363010589837
2013-03-11 18:39:42,024 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [rs21,60020,1363010589837]
2013-03-11 18:39:42,033 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting logs in [hdfs://rs26/hbase/.logs/rs21,60020,1363010589837-splitting]
2013-03-11 18:39:42,179 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Frs26%3A8020%2Fhbase%2F.logs%2Frs21%2C60020%2C1363010589837-splitting%2Frs21%252C60020%252C1363010589837.1363010594599 acquired by rs19,1363010590987
----------------------------------------------------------
[Regionserver rs21:]
2013-03-11 18:40:06,326 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(8419833035992682478, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778 from 10.20.127.21:33592: output error
2013-03-11 18:40:06,326 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -1069130278416755239 lease expired
2013-03-11 18:40:06,327 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -7554305902624086957 lease expired
2013-03-11 18:40:06,327 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -1817452922125791171 lease expired
2013-03-11 18:40:06,327 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -7601125239682768076 lease expired
2013-03-11 18:40:06,327 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 82186ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-03-11 18:40:06,327 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 5506385887933665130 lease expired
2013-03-11 18:40:06,327 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 8405483593065293761 lease expired
2013-03-11 18:40:06,327 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 8270919548717867130 lease expired
2013-03-11 18:40:06,327 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -5350053253744349360 lease expired
2013-03-11 18:40:06,328 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -2774223416111392810 lease expired
2013-03-11 18:40:06,328 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 5293572780165196795 lease expired
2013-03-11 18:40:06,328 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 2904518513855545553 lease expired
2013-03-11 18:40:06,328 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 121972859714825295 lease expired
2013-03-11 18:40:06,328 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(1316499555392112856, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778 from 10.20.127.21:33552: output error
2013-03-11 18:40:06,328 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 751003440851341708 lease expired
2013-03-11 18:40:06,328 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 3456313588148401866 lease expired
2013-03-11 18:40:06,328 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -1893617732870830965 lease expired
2013-03-11 18:40:06,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -677968643251998870 lease expired
2013-03-11 18:40:06,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 2623756537510669130 lease expired
2013-03-11 18:40:06,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -4453586756904422814 lease expired
2013-03-11 18:40:06,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 4513044921501336208 lease expired
2013-03-11 18:40:06,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 8419833035992682478 lease expired
2013-03-11 18:40:06,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 6790476379016482048 lease expired
2013-03-11 18:40:06,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 1316499555392112856 lease expired
2013-03-11 18:40:06,337 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(6790476379016482048, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778 from 10.20.127.21:33603: output error
2013-03-11 18:40:06,485 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 86375ms for sessionid 0x13c9789d2289cd1, closing socket connection and attempting reconnect
2013-03-11 18:40:06,493 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60020 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
    at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1732)
    at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1675)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:940)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1019)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:425)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1365)
2013-03-11 18:40:06,489 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server Responder: doAsyncWrite threw exception java.io.IOException: Broken pipe
2013-03-11 18:40:06,517 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020 caught: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcher.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:69)
    at sun.nio.ch.IOUtil.write(IOUtil.java:40)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
    at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1732)
    at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1675)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:940)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1019)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:425)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1365)
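
In case it helps anyone reading the logs above: the (responseTooSlow) entries carry a JSON payload, so the "next" and "multi" calls can be compared side by side. Below is a minimal Python sketch of that, not something we run in production. The function name summarize_slow_calls and the SAMPLE string are made up for illustration; the two sample entries are copied from the region server log above, and the field names are the ones that appear in those payloads.

```python
import json
import re
from collections import defaultdict

# The JSON payload that follows "(responseTooSlow):". The payloads in the log
# above contain no nested braces, so a non-greedy match up to the first "}"
# is enough, even when several entries sit on one physical line.
SLOW_RE = re.compile(r"\(responseTooSlow\): (\{.*?\})")


def summarize_slow_calls(log_text):
    """Return {method: (count, max processingtimems, max queuetimems)}."""
    stats = defaultdict(lambda: [0, 0, 0])
    for match in SLOW_RE.finditer(log_text):
        try:
            payload = json.loads(match.group(1))
        except ValueError:
            continue  # skip truncated or otherwise malformed entries
        entry = stats[payload["method"]]
        entry[0] += 1
        entry[1] = max(entry[1], payload["processingtimems"])
        entry[2] = max(entry[2], payload["queuetimems"])
    return {method: tuple(values) for method, values in stats.items()}


# Two entries copied from the region server log above (illustrative input).
SAMPLE = (
    '2013-03-11 18:37:40,163 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): '
    '{"processingtimems":29830,"call":"next(1656517918313948447, 1000), rpc version=1, '
    'client version=29, methodsFingerPrint=54742778","client":"10.20.127.21:56058",'
    '"starttimems":1363027030280,"queuetimems":4602,"class":"HRegionServer",'
    '"responsesize":2774484,"method":"next"} '
    '2013-03-11 18:37:43,652 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): '
    '{"processingtimems":31249,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@2d19985a), '
    'rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.109.21:35342",'
    '"starttimems":1363027031505,"queuetimems":5720,"class":"HRegionServer",'
    '"responsesize":0,"method":"multi"}'
)

if __name__ == "__main__":
    for method, (count, max_proc, max_queue) in sorted(summarize_slow_calls(SAMPLE).items()):
        print(method, "count:", count, "max processing ms:", max_proc, "max queue ms:", max_queue)
```

Running the same function over the full region server log would show whether the reads and writes slow down in the same time window, which is what question 1 is about. It only summarizes the entries and does not by itself say what caused the pauses.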