First issue: UnknownHostException is unforgiving. Your machines need to be able to resolve and talk to haddop2-zk3 (is that a typo? The connectString in your log mixes haddop2-zk3 and haddop2-zk2 with hadoop2-zk1), and it seems that at least that one can't. The region server dies because we usually try to "fail fast" in HBase.
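To check name resolution from each region server before starting it, something like the following could help. It is only a minimal sketch: the class QuorumCheck and its methods are hypothetical helpers, not part of HBase or ZooKeeper, and the default connect string is simply the one from the log below. It relies only on java.net.InetAddress.getByName, the same call that throws in the stack trace.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of HBase): reports quorum hosts that do not resolve. */
public class QuorumCheck {

    /** Splits "host1:2181,host2:2181" into host names, dropping the optional port. */
    public static List<String> hostsOf(String connectString) {
        List<String> hosts = new ArrayList<>();
        for (String entry : connectString.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int colon = trimmed.lastIndexOf(':');
            hosts.add(colon >= 0 ? trimmed.substring(0, colon) : trimmed);
        }
        return hosts;
    }

    /** Returns the hosts for which name resolution fails on this machine. */
    public static List<String> unresolvable(String connectString) {
        List<String> bad = new ArrayList<>();
        for (String host : hostsOf(connectString)) {
            try {
                InetAddress.getByName(host); // same lookup that fails in the trace
            } catch (UnknownHostException e) {
                bad.add(host);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Default value copied from the connectString in the quoted log.
        String quorum = args.length > 0 ? args[0]
                : "haddop2-zk3:2181,haddop2-zk2:2181,hadoop2-zk1:2181";
        System.out.println("Unresolvable hosts: " + unresolvable(quorum));
    }
}
```

Run it on every machine in the source cluster. Any host it prints needs a DNS or /etc/hosts entry (or a spelling fix in the peer's quorum setting) before the region server can connect to the peer.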
Second issue: There's not enough information. All I see is a region server shutting down, and the reason why is probably earlier in the log.

Third issue: see https://issues.apache.org/jira/browse/HBASE-3664

Fourth issue: the timeout is now 3 minutes in 0.90 (the sessionTimeout=180000 in your log), so that is how long it can take before the master notices a dead region server.

J-D

On Tue, Mar 22, 2011 at 10:39 AM, Eran Kutner <e...@gigya.com> wrote:
> Hi, > I'm trying to use replication between two HBase clusters and I'm > encountering all kinds of crashes and weird behavior. > > First, it seems that starting a region server when the peer ZKs are > not available will cause the server to fail to start: > > 2011-03-22 08:31:56,647 INFO > org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication > is now started > 2011-03-22 08:31:56,668 WARN > org.apache.hadoop.hbase.zookeeper.ZKConfig: > java.net.UnknownHostException: haddop2-zk3 > at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) > at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850) > at > java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201) > at java.net.InetAddress.getAllByName0(InetAddress.java:1154) > at java.net.InetAddress.getAllByName(InetAddress.java:1084) > at java.net.InetAddress.getAllByName(InetAddress.java:1020) > at java.net.InetAddress.getByName(InetAddress.java:970) > at > org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206) > at > org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182) > at > 
org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142) > at > org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563) > at java.lang.Thread.run(Thread.java:662) > > 2011-03-22 08:31:56,669 WARN > org.apache.hadoop.hbase.zookeeper.ZKConfig: > java.net.UnknownHostException: haddop2-zk2 > at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) > at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850) > at > java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201) > at java.net.InetAddress.getAllByName0(InetAddress.java:1154) > at java.net.InetAddress.getAllByName(InetAddress.java:1084) > at java.net.InetAddress.getAllByName(InetAddress.java:1020) > at java.net.InetAddress.getByName(InetAddress.java:970) > at > org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206) > at > org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142) > at > 
org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563) > at java.lang.Thread.run(Thread.java:662) > > 2011-03-22 08:31:56,669 INFO org.apache.zookeeper.ZooKeeper: > Initiating client connection, > connectString=haddop2-zk3:2181,haddop2-zk2:2181,hadoop2-zk1:2181 > sessionTimeout=180000 watcher=connection to cluster: 1 > 2011-03-22 08:31:56,670 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Failed > initialization > 2011-03-22 08:31:56,670 ERROR > org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init > java.net.UnknownHostException: haddop2-zk3 > at java.net.InetAddress.getAllByName0(InetAddress.java:1158) > at java.net.InetAddress.getAllByName(InetAddress.java:1084) > at java.net.InetAddress.getAllByName(InetAddress.java:1020) > at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386) > at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331) > at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377) > at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142) > at > 
org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563) > at java.lang.Thread.run(Thread.java:662) > 2011-03-22 08:31:56,675 FATAL > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region > server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1300797113247, > load=(requests=0, regions=0, usedHeap=24, maxHeap=987): Unhandled > exception: haddop2-zk3 > java.net.UnknownHostException: haddop2-zk3 > at java.net.InetAddress.getAllByName0(InetAddress.java:1158) > at java.net.InetAddress.getAllByName(InetAddress.java:1084) > at java.net.InetAddress.getAllByName(InetAddress.java:1020) > at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386) > at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331) > at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377) > at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142) > at > org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75) > at > 
org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563) > at java.lang.Thread.run(Thread.java:662) > 2011-03-22 08:31:56,675 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Unhandled > exception: haddop2-zk3 > 2011-03-22 08:31:56,675 INFO org.apache.hadoop.ipc.HBaseServer: > Stopping server on 60020 > 2011-03-22 08:31:56,679 INFO > org.apache.hadoop.hbase.regionserver.StoreFile: Allocating > LruBlockCache with maximum size 197.5m > 2011-03-22 08:31:56,683 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server > at: hadoop1-s05.farm-ny.gigya.com,60020,1300797113247 > 2011-03-22 08:31:56,683 DEBUG > org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@508aeb74 > 2011-03-22 08:31:56,684 INFO > org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing > leases > 2011-03-22 08:31:56,684 INFO > org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed > leases > 2011-03-22 08:31:56,684 INFO > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: > Closed zookeeper sessionid=0x22e669588a20058 > 2011-03-22 08:31:56,692 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x22e669588a20058 closed > 2011-03-22 08:31:56,692 INFO org.apache.zookeeper.ClientCnxn: > EventThread shut down > 2011-03-22 08:31:56,700 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x12e669588b8004d closed > 2011-03-22 08:31:56,700 INFO org.apache.zookeeper.ClientCnxn: > EventThread shut down > 2011-03-22 08:31:56,702 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook > starting; 
hbase.shutdown.hook=true; > fsShutdownHook=Thread[Thread-15,5,main] > 2011-03-22 08:31:56,702 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown > hook > 2011-03-22 08:31:56,702 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs > shutdown hook thread. > 2011-03-22 08:31:56,804 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook > finished. > > > Second, it seems that when I'm shutting down a region server on the > peer cluster, region servers on the source cluster connected to it are > also shutting down: > 2011-03-22 09:03:34,541 INFO > org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing > leases > 2011-03-22 09:03:34,541 INFO > org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed > leases > 2011-03-22 09:03:34,644 INFO > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: > Closed zookeeper sessionid=0x12e669588b80050 > 2011-03-22 09:03:34,653 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x12e669588b80050 closed > 2011-03-22 09:03:34,653 INFO org.apache.zookeeper.ClientCnxn: > EventThread shut down > 2011-03-22 09:03:34,662 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x22e669588a2005d closed > 2011-03-22 09:03:34,662 INFO org.apache.zookeeper.ClientCnxn: > EventThread shut down > 2011-03-22 09:03:34,664 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Closing source 1 because: Region server is closing > 2011-03-22 09:03:39,377 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Source exiting 1 > 2011-03-22 09:03:39,431 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Source exiting 1 > 2011-03-22 09:03:39,431 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 > exiting > 2011-03-22 09:03:39,432 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook > starting; hbase.shutdown.hook=true; > 
fsShutdownHook=Thread[Thread-15,5,main] > 2011-03-22 09:03:39,432 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown > hook > 2011-03-22 09:03:39,432 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs > shutdown hook thread. > 2011-03-22 09:03:39,433 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook > finished. > > > > Third, sometimes it crashes without any reason I can understand. See > the attached log dump. It includes the entire load process from start > to shutdown of the region server. When I configure "stop_replication" > everything is OK, here's what happens after "start_replication": > 2011-03-22 09:38:59,199 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Replication is disabled, sleeping 1000 times 10 > 2011-03-22 09:38:59,333 INFO > org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication > is now started > 2011-03-22 09:39:09,202 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Opening log for replication > hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 at 124 > 2011-03-22 09:39:09,215 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > currentNbOperations:0 and seenEntries:1 and size: 191 > 2011-03-22 09:39:09,215 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: > Going to report log > #hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 for position 315 > in > hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373/hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 > 2011-03-22 09:39:09,224 FATAL > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region > server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1300799918373, > load=(requests=0, regions=3, usedHeap=41, maxHeap=987): Writing > replication status > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode > = NoNode for > 
/hbase/replication/rs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373/1/hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:102) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038) > at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:708) > at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:751) > at > org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:432) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:131) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:332) > 2011-03-22 09:39:09,225 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: > request=0.0, regions=3, stores=5, storefiles=5, storefileIndexSize=1, > memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=41, > maxHeap=987, blockCacheSize=1702768, blockCacheFree=205390992, > blockCacheCount=3, blockCacheHitCount=15, blockCacheMissCount=3, > blockCacheEvictedCount=0, blockCacheHitRatio=83, > blockCacheHitCachingRatio=83 > 2011-03-22 09:39:09,225 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Writing > replication status > 2011-03-22 09:39:09,225 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: > Removing 0 logs in the list: [] > 2011-03-22 09:39:09,225 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Nothing to replicate, sleeping 1000 times 10 > 2011-03-22 09:39:10,996 INFO org.apache.hadoop.ipc.HBaseServer: > Stopping server on 60020 > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 0 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: 
PRI > IPC Server handler 2 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 18 on 60020: exiting > 2011-03-22 09:39:10,998 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping > infoServer > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 30 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 5 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: > Stopping IPC Server Responder > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 4 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 6 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 20 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 15 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 9 on 60020: exiting > 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 21 on 60020: exiting > 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 49 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 14 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 8 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 17 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 44 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 8 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 42 on 60020: exiting > 
2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 41 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 40 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 39 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 38 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 37 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 36 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 34 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 33 on 60020: exiting > 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 32 on 60020: exiting > 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 31 on 60020: exiting > 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 29 on 60020: exiting > 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 28 on 60020: exiting > 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 27 on 60020: exiting > 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 26 on 60020: exiting > 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 25 on 60020: exiting > 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 24 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 23 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 13 on 60020: exiting > 2011-03-22 09:39:10,998 INFO 
org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 7 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 16 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 12 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 6 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 11 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 5 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 4 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 0 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 9 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 7 on 60020: exiting > 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: > Stopping IPC Server listener on 60020 > 2011-03-22 09:39:11,004 INFO org.mortbay.log: Stopped > SelectChannelConnector@0.0.0.0:60030 > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 35 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 43 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 45 on 60020: exiting > 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 46 on 60020: exiting > 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 47 on 60020: exiting > 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 48 on 60020: exiting > 2011-03-22 09:39:11,010 INFO > org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting. 
> 2011-03-22 09:39:11,010 DEBUG > org.apache.hadoop.hbase.regionserver.wal.HLog: > regionserver60020.logSyncer interrupted while waiting for sync > requests > 2011-03-22 09:39:11,010 INFO > org.apache.hadoop.hbase.regionserver.wal.HLog: > regionserver60020.logSyncer exiting > 2011-03-22 09:39:11,010 DEBUG > org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in > hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373 > 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 22 on 60020: exiting > 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 10 on 60020: exiting > 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 3 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 3 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 2 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI > IPC Server handler 1 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 19 on 60020: exiting > 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC > Server handler 1 on 60020: exiting > 2011-03-22 09:39:11,011 DEBUG > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: > Processing close of -ROOT-,,0.70236052 > 2011-03-22 09:39:11,011 INFO > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: > regionserver60020.cacheFlusher exiting > 2011-03-22 09:39:11,010 DEBUG > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: > Processing close of > TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c. 
> 2011-03-22 09:39:11,010 DEBUG > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: > Processing close of > TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0. > 2011-03-22 09:39:11,009 INFO > org.apache.hadoop.hbase.regionserver.CompactSplitThread: > regionserver60020.compactor exiting > 2011-03-22 09:39:11,012 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: Closing > TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.: > disabling compactions & flushes > 2011-03-22 09:39:11,009 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: > regionserver60020.majorCompactionChecker exiting > 2011-03-22 09:39:11,012 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for > region TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0. > 2011-03-22 09:39:11,012 DEBUG > org.apache.hadoop.hbase.regionserver.Store: closed Settings > 2011-03-22 09:39:11,012 DEBUG > org.apache.hadoop.hbase.regionserver.Store: closed default > 2011-03-22 09:39:11,012 INFO > org.apache.hadoop.hbase.regionserver.HRegion: Closed > TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0. > 2011-03-22 09:39:11,012 DEBUG > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: > Closed region > TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0. 
> 2011-03-22 09:39:11,012 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: Closing > TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.: > disabling compactions & flushes > 2011-03-22 09:39:11,011 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: Closing > -ROOT-,,0.70236052: disabling compactions & flushes > 2011-03-22 09:39:11,012 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for > region > TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c. > 2011-03-22 09:39:11,012 DEBUG > org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for > region -ROOT-,,0.70236052 > 2011-03-22 09:39:11,013 DEBUG > org.apache.hadoop.hbase.regionserver.Store: closed Data > 2011-03-22 09:39:11,013 DEBUG > org.apache.hadoop.hbase.regionserver.Store: closed default > 2011-03-22 09:39:11,013 DEBUG > org.apache.hadoop.hbase.regionserver.Store: closed info > 2011-03-22 09:39:11,013 INFO > org.apache.hadoop.hbase.regionserver.HRegion: Closed > TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c. > 2011-03-22 09:39:11,013 INFO > org.apache.hadoop.hbase.regionserver.HRegion: Closed > -ROOT-,,0.70236052 > 2011-03-22 09:39:11,013 DEBUG > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: > Closed region > TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c. 
> 2011-03-22 09:39:11,013 DEBUG > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: > Closed region -ROOT-,,0.70236052 > 2011-03-22 09:39:11,066 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server > at: hadoop1-s05.farm-ny.gigya.com,60020,1300799918373 > 2011-03-22 09:39:11,066 DEBUG > org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@2eb0a3f5 > 2011-03-22 09:39:11,066 INFO > org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing > leases > 2011-03-22 09:39:11,066 INFO > org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed > leases > 2011-03-22 09:39:11,169 INFO > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: > Closed zookeeper sessionid=0x22e669588a20061 > 2011-03-22 09:39:11,181 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x22e669588a20061 closed > 2011-03-22 09:39:11,181 INFO org.apache.zookeeper.ClientCnxn: > EventThread shut down > 2011-03-22 09:39:11,189 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x12e669588b80057 closed > 2011-03-22 09:39:11,189 INFO org.apache.zookeeper.ClientCnxn: > EventThread shut down > 2011-03-22 09:39:11,190 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Closing source 1 because: Region server is closing > 2011-03-22 09:39:12,414 INFO > org.apache.hadoop.hbase.regionserver.Leases: > regionserver60020.leaseChecker closing leases > 2011-03-22 09:39:12,415 INFO > org.apache.hadoop.hbase.regionserver.Leases: > regionserver60020.leaseChecker closed leases > 2011-03-22 09:39:19,229 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Source exiting 1 > 2011-03-22 09:39:19,229 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 > exiting > 2011-03-22 09:39:19,230 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook > starting; hbase.shutdown.hook=true; > 
fsShutdownHook=Thread[Thread-15,5,main] > 2011-03-22 09:39:19,230 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown > hook > 2011-03-22 09:39:19,230 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs > shutdown hook thread. > 2011-03-22 09:39:19,231 INFO > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook > finished. > > > > Fourth and probably worst of all, it seems that when the servers are > crashing this way the master still thinks they are alive so the region > is not transitioned and is therefore inaccessible. How long should it > normally take the master to detect a dead region server? > > Any help on what's going on would be greatly appreciated. > > -eran >