Thanks J-D. Sorry for the long delay in replying! I checked the logs of the .META. regionserver, and it is indeed the problem you described. But I found another problem.
The .META. region changed its address, but for a long time afterwards CatalogTracker still cached the old address. So when another regionserver (not the .META. regionserver) split a region, the put it sent over IPC to update .META. went to the old regionserver.

From the HMaster log, we can see that at 09:02:34 the .META. region was opened on 100-13:

2011-05-25 08:37:03,908 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region .META.,,1.1028785192 on 157-5-100-9,20020,1306257984044
2011-05-25 09:02:34,334 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region .META.,,1.1028785192 on 157-5-100-13,20020,1306266506022
2011-05-25 09:15:57,649 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=157-5-100-9:20020; java.io.EOFException
2011-05-25 09:15:57,649 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Current cached META location is not valid, resetting

From the log of regionserver 100-13, at 09:15:44 we can see that the .META. address cached in CatalogTracker was still 100-9:

2011-05-25 09:15:44,232 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of failed split of ufdr,0065286138106876#4228000,1306260358978.37875b35a870957da534ad29fd2944d5.;
java.io.IOException: Server not running, aborting
        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2352)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1653)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)

2011-05-25 09:15:44,232 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception running postOpenDeployTasks; region=11dc72d94c7a5a3d19b0c0c3c49624a5
java.io.IOException: Call to 157-5-100-9/157.5.100.9:20020 failed on local exception: java.nio.channels.ClosedByInterruptException
        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:806)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy8.getRegionInfo(Unknown Source)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:424)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:272)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:364)
        at org.apache.hadoop.hbase.catalog.MetaEditor.updateRegionLocation(MetaEditor.java:142)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1354)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler$PostOpenDeployTasksThread.run(OpenRegionHandler.java:215)
Caused by: java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:341)
        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:143)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:518)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:751)
        ... 9 more
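To make the timing issue concrete, here is a minimal sketch of a cache that is only invalidated after a call against the stale address has already failed, rather than as soon as the location changes. This is illustrative only; the class and method names are made up for the example and are not the actual CatalogTracker code:

  // Illustrative only, NOT the real org.apache.hadoop.hbase.catalog.CatalogTracker.
  // It models a tracker that keeps using a stale cached location until a call fails.
  import java.io.IOException;
  import java.util.concurrent.atomic.AtomicReference;

  public class StaleMetaLocationSketch {

      /** Hypothetical stand-in for the cached ".META." server address. */
      static class MetaLocationCache {
          private final AtomicReference<String> cached =
                  new AtomicReference<>("157-5-100-9:20020");

          /** Returns the cached address without re-checking it; this is the problem. */
          String getCachedLocation() {
              return cached.get();
          }

          /** Only called after an RPC against the cached address has already failed. */
          void resetAfterFailedVerification(String actualLocation) {
              cached.set(actualLocation);
          }
      }

      /** Pretend RPC: the old server rejects calls because it is aborting. */
      static void putToMeta(String address) throws IOException {
          if (address.startsWith("157-5-100-9")) {
              throw new IOException("Server not running, aborting");
          }
          System.out.println("META updated via " + address);
      }

      public static void main(String[] args) throws IOException {
          MetaLocationCache cache = new MetaLocationCache();
          // .META. has actually moved to 100-13, but nothing told the cache yet.
          try {
              putToMeta(cache.getCachedLocation());   // goes to the old server and fails
          } catch (IOException e) {
              System.out.println("Failed against stale location: " + e.getMessage());
              cache.resetAfterFailedVerification("157-5-100-13:20020");
          }
          putToMeta(cache.getCachedLocation());       // retry now reaches the new server
      }
  }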
Is this a bug? I think the cached location should be invalidated and reset as soon as the .META. address changes, but it is not.

Thanks!
Jieshan Bean

-----------------------------------------------------------------------------------------------------

Oh I'm so dumb, I didn't see that this is coming from _another_ region server.

What happens when splitting is that we try to update the .META. table, which requires talking to another region server, except that in this case that RS was aborting, meaning that the split transaction had to be rolled back.

Looking at the logs of the RS that carried .META. at that time would give you more clues into what really happened.

J-D
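Roughly, the flow being described is: do the local split work, then update .META. via an RPC to whichever regionserver hosts it, and roll everything back if that RPC fails. A simplified sketch of that flow, using hypothetical names rather than HBase's actual SplitTransaction/MetaEditor API:

  // Simplified sketch of the split flow; names are illustrative, not real HBase classes.
  import java.io.IOException;

  public class SplitFlowSketch {

      interface MetaService {
          /** Records the two daughter regions in .META.; served by whichever RS hosts .META. */
          void updateMetaForSplit(String parent, String daughterA, String daughterB) throws IOException;
      }

      /** Executes a split: local work first, then the remote .META. update. */
      static void splitRegion(String parent, MetaService meta) {
          String daughterA = parent + "_a";
          String daughterB = parent + "_b";
          // ... local split work: close the parent, create daughter region dirs, etc. ...
          try {
              // This is the step that has to talk to ANOTHER region server (the .META. host).
              meta.updateMetaForSplit(parent, daughterA, daughterB);
              System.out.println("Split of " + parent + " committed");
          } catch (IOException e) {
              // If the .META. host is aborting, or we were pointed at a stale host,
              // the whole split transaction must be rolled back.
              System.out.println("Running rollback of failed split of " + parent + "; " + e.getMessage());
              // ... undo the local split work here ...
          }
      }

      public static void main(String[] args) {
          splitRegion("ufdr,0065286138106876", (p, a, b) -> {
              throw new IOException("Server not running, aborting");
          });
      }
  }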
On Tue, May 31, 2011 at 5:24 PM, bijieshan <bijies...@huawei.com> wrote:
> I paste the logs below. To keep them short, I deleted some similar, adjacent entries:
>
>> 2011-05-25 09:15:44,232 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of failed split of ufdr,0065286138106876#4228000,1306260358978.37875b35a870957da534ad29fd2944d5.;
>> java.io.IOException: Server not running, aborting
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2352)
>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1653)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>>
>> [description]
>> Before that time, everything seemed OK (the server received many open-region requests and handled them), except that several exceptions occurred during that period.
>>
>> [logs]
>> 2011-05-25 09:15:36,214 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,110018,1305617626846.f22eafaf06c3be7f7e5a447cec40c7fa.
>> 2011-05-25 09:15:36,214 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,111336,1305617636659.b47f1cf09b10aecaee6805a985646b30.
>> 2011-05-25 09:15:36,215 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,280884,1305617952747.7cac8ccc9c9770ef0e48807e07bc8bec.
>> 2011-05-25 09:15:36,215 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,280446,1305617949439.b332cfff565d77b2e2d4f1a2c52494b0.
>> 2011-05-25 09:15:36,216 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: ufdr,101974,1305617623149.b60c71565f2c73b81477fe25ba80378b.
>> 2011-05-25 09:15:36,413 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x32ffbc7e6e00321 Attempting to transition node 11dc72d94c7a5a3d19b0c0c3c49624a5 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:36,423 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x32ffbc7e6e00321 Successfully transitioned node 11dc72d94c7a5a3d19b0c0c3c49624a5 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,645 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x32ffbc7e6e00321 Attempting to transition node c4c657deda864fa2018b38388ee928e6 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,653 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x32ffbc7e6e00321 Attempting to transition node d2546cde1d6546133af74f25252b652c from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,660 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x32ffbc7e6e00321 Successfully transitioned node c4c657deda864fa2018b38388ee928e6 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,662 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x32ffbc7e6e00321 Attempting to transition node bd07db88de68c1022c5bba9e5471c9f9 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>> 2011-05-25 09:15:37,668 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:20020-0x32ffbc7e6e00321 Successfully transitioned node d2546cde1d6546133af74f25252b652c from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>>
>> [description]
>> From the code, I saw that this message can only be shown after the server has aborted. It is determined by HRegionServer#abortRequested: if that flag is true and something calls HRegionServer#checkOpen, this message is printed.
>>
>>   /**
>>    * Called to verify that this server is up and running.
>>    *
>>    * @throws IOException
>>    */
>>   protected void checkOpen() throws IOException {
>>     if (this.stopped || this.abortRequested) {
>>       throw new IOException("Server not running"
>>           + (this.abortRequested ? ", aborting" : ""));
>>     }
>>     if (!fsOk) {
>>       throw new IOException("File system not available");
>>     }
>>   }
>>
>> As I said, no abort information was found in the logs. Here is the abort() method that sets the flag:
>>
>>   public void abort(String reason, Throwable cause) {
>>     if (cause != null) {
>>       LOG.fatal("ABORTING region server " + this + ": " + reason, cause);
>>     } else {
>>       LOG.fatal("ABORTING region server " + this + ": " + reason);
>>     }
>>     this.abortRequested = true;
>>     this.reservedSpace.clear();
>>     if (this.metrics != null) {
>>       LOG.info("Dump of metrics: " + this.metrics);
>>     }
>>     stop(reason);
>>   }
>>
>> Since there is no prior "ABORTING" entry in the logs, how did the sudden "java.io.IOException: Server not running, aborting" message occur?
>> Thanks!
>>
>> Jieshan Bean.
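One reading consistent with J-D's reply above: the "Server not running, aborting" IOException is thrown inside checkOpen() on the remote regionserver (the old .META. host, which was aborting), and it only appears in the splitting regionserver's log because the RPC layer carries the remote exception back to the caller; the caller itself never aborted. A standalone sketch of that propagation pattern, in plain Java rather than HBase's actual RPC code:

  // Standalone illustration (not HBase code): a server-side check throws, and the
  // "RPC" layer hands the same message back to the caller, so the caller's log shows
  // "Server not running, aborting" even though the caller itself never aborted.
  import java.io.IOException;

  public class RemoteAbortMessageSketch {

      /** Stands in for the remote regionserver that is shutting down. */
      static class RemoteServer {
          volatile boolean abortRequested = true;   // the old .META. host was aborting

          void checkOpen() throws IOException {
              if (abortRequested) {
                  throw new IOException("Server not running, aborting");
              }
          }

          void put(String row) throws IOException {
              checkOpen();                          // rejects the call before doing any work
              // ... apply the edit ...
          }
      }

      /** Stands in for the RPC client on the splitting regionserver. */
      static void callRemotePut(RemoteServer remote, String row) {
          try {
              remote.put(row);
          } catch (IOException remoteException) {
              // The caller logs the remote server's message; nothing aborted locally.
              System.out.println("Running rollback of failed split of " + row
                      + "; " + remoteException.getMessage());
          }
      }

      public static void main(String[] args) {
          callRemotePut(new RemoteServer(), "ufdr,0065286138106876");
      }
  }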