First issue: UnknownHostException is unforgiving, your machines need
to be able to talk to haddop2-zk3 (is that a typo?)  and it seems that
at least that one can't. The reason the machine dies is that we
usually try to "fail fast" in HBase.

Second issue: There's not enough information, all I see is a region
server shutting down and the reason why is probably before that.

Third issue: https://issues.apache.org/jira/browse/HBASE-3664

Fourth issue: it's now 3 minutes in 0.90 for the timeout to happen.

J-D

On Tue, Mar 22, 2011 at 10:39 AM, Eran Kutner <e...@gigya.com> wrote:
> Hi,
> I'm trying to use replication between two HBase clusters and I'm
> encountering all kinds of crashes and weird behavior.
>
> First, it seems that starting a region server when the peer ZKs are
> not available will cause the server to fail to start:
>
> 2011-03-22 08:31:56,647 INFO
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
> is now started
> 2011-03-22 08:31:56,668 WARN
> org.apache.hadoop.hbase.zookeeper.ZKConfig:
> java.net.UnknownHostException: haddop2-zk3
>         at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>         at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
>         at 
> java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1084)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1020)
>         at java.net.InetAddress.getByName(InetAddress.java:970)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
>         at java.lang.Thread.run(Thread.java:662)
>
> 2011-03-22 08:31:56,669 WARN
> org.apache.hadoop.hbase.zookeeper.ZKConfig:
> java.net.UnknownHostException: haddop2-zk2
>         at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
>         at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:850)
>         at 
> java.net.InetAddress.getAddressFromNameService(InetAddress.java:1201)
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1154)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1084)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1020)
>         at java.net.InetAddress.getByName(InetAddress.java:970)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:206)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZKConfig.getZKQuorumServersString(ZKConfig.java:250)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:113)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
>         at java.lang.Thread.run(Thread.java:662)
>
> 2011-03-22 08:31:56,669 INFO org.apache.zookeeper.ZooKeeper:
> Initiating client connection,
> connectString=haddop2-zk3:2181,haddop2-zk2:2181,hadoop2-zk1:2181
> sessionTimeout=180000 watcher=connection to cluster: 1
> 2011-03-22 08:31:56,670 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Failed
> initialization
> 2011-03-22 08:31:56,670 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
> java.net.UnknownHostException: haddop2-zk3
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1158)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1084)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1020)
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386)
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331)
>         at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
>         at java.lang.Thread.run(Thread.java:662)
> 2011-03-22 08:31:56,675 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1300797113247,
> load=(requests=0, regions=0, usedHeap=24, maxHeap=987): Unhandled
> exception: haddop2-zk3
> java.net.UnknownHostException: haddop2-zk3
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1158)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1084)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1020)
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:386)
>         at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331)
>         at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.connect(ZKUtil.java:97)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:119)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.getPeer(ReplicationZookeeper.java:288)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectToPeer(ReplicationZookeeper.java:253)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.connectExistingPeers(ReplicationZookeeper.java:182)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.<init>(ReplicationZookeeper.java:142)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.<init>(Replication.java:75)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1092)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:875)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryReportForDuty(HRegionServer.java:1472)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:563)
>         at java.lang.Thread.run(Thread.java:662)
> 2011-03-22 08:31:56,675 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Unhandled
> exception: haddop2-zk3
> 2011-03-22 08:31:56,675 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping server on 60020
> 2011-03-22 08:31:56,679 INFO
> org.apache.hadoop.hbase.regionserver.StoreFile: Allocating
> LruBlockCache with maximum size 197.5m
> 2011-03-22 08:31:56,683 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server
> at: hadoop1-s05.farm-ny.gigya.com,60020,1300797113247
> 2011-03-22 08:31:56,683 DEBUG
> org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog
> tracker org.apache.hadoop.hbase.catalog.CatalogTracker@508aeb74
> 2011-03-22 08:31:56,684 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
> leases
> 2011-03-22 08:31:56,684 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
> leases
> 2011-03-22 08:31:56,684 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Closed zookeeper sessionid=0x22e669588a20058
> 2011-03-22 08:31:56,692 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x22e669588a20058 closed
> 2011-03-22 08:31:56,692 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 08:31:56,700 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x12e669588b8004d closed
> 2011-03-22 08:31:56,700 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 08:31:56,702 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting; hbase.shutdown.hook=true;
> fsShutdownHook=Thread[Thread-15,5,main]
> 2011-03-22 08:31:56,702 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> 2011-03-22 08:31:56,702 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
> shutdown hook thread.
> 2011-03-22 08:31:56,804 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> finished.
>
>
> Second, it seems that when I'm shutting down a region server on the
> peer cluster region servers on the source cluster connect to it are
> also shutting down:
> 2011-03-22 09:03:34,541 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
> leases
> 2011-03-22 09:03:34,541 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
> leases
> 2011-03-22 09:03:34,644 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Closed zookeeper sessionid=0x12e669588b80050
> 2011-03-22 09:03:34,653 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x12e669588b80050 closed
> 2011-03-22 09:03:34,653 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 09:03:34,662 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x22e669588a2005d closed
> 2011-03-22 09:03:34,662 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 09:03:34,664 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Closing source 1 because: Region server is closing
> 2011-03-22 09:03:39,377 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Source exiting 1
> 2011-03-22 09:03:39,431 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Source exiting 1
> 2011-03-22 09:03:39,431 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> exiting
> 2011-03-22 09:03:39,432 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting; hbase.shutdown.hook=true;
> fsShutdownHook=Thread[Thread-15,5,main]
> 2011-03-22 09:03:39,432 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> 2011-03-22 09:03:39,432 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
> shutdown hook thread.
> 2011-03-22 09:03:39,433 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> finished.
>
>
>
> Third, sometimes it crashes without any reason I can understand. See
> the attached log dump. It includes the entire load process from start
> to shutdown of the region server. When I configure "stop_replication"
> everything is OK, here's what happens after "start_replication":
> 2011-03-22 09:38:59,199 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Replication is disabled, sleeping 1000 times 10
> 2011-03-22 09:38:59,333 INFO
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
> is now started
> 2011-03-22 09:39:09,202 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Opening log for replication
> hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 at 124
> 2011-03-22 09:39:09,215 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> currentNbOperations:0 and seenEntries:1 and size: 191
> 2011-03-22 09:39:09,215 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Going to report log
> #hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876 for position 315
> in 
> hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373/hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876
> 2011-03-22 09:39:09,224 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server serverName=hadoop1-s05.farm-ny.gigya.com,60020,1300799918373,
> load=(requests=0, regions=3, usedHeap=41, maxHeap=987): Writing
> replication status
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode
> = NoNode for 
> /hbase/replication/rs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373/1/hadoop1-s05.farm-ny.gigya.com%3A60020.1300799921876
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:708)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:751)
>         at 
> org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:432)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:131)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:332)
> 2011-03-22 09:39:09,225 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=3, stores=5, storefiles=5, storefileIndexSize=1,
> memstoreSize=0, compactionQueueSize=0, flushQueueSize=0, usedHeap=41,
> maxHeap=987, blockCacheSize=1702768, blockCacheFree=205390992,
> blockCacheCount=3, blockCacheHitCount=15, blockCacheMissCount=3,
> blockCacheEvictedCount=0, blockCacheHitRatio=83,
> blockCacheHitCachingRatio=83
> 2011-03-22 09:39:09,225 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Writing
> replication status
> 2011-03-22 09:39:09,225 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Removing 0 logs in the list: []
> 2011-03-22 09:39:09,225 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Nothing to replicate, sleeping 1000 times 10
> 2011-03-22 09:39:10,996 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping server on 60020
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 0 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 2 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 18 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping
> infoServer
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 30 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 5 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping IPC Server Responder
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 4 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 6 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 20 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 15 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 9 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 21 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 49 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 14 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 8 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 17 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 44 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 8 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 42 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 41 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 40 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 39 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 38 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 37 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 36 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 34 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 33 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 32 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 31 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 29 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 28 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 27 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 26 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 25 on 60020: exiting
> 2011-03-22 09:39:11,001 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 24 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 23 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 13 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 7 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 16 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 12 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 6 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 11 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 5 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 4 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 0 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 9 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 7 on 60020: exiting
> 2011-03-22 09:39:10,997 INFO org.apache.hadoop.ipc.HBaseServer:
> Stopping IPC Server listener on 60020
> 2011-03-22 09:39:11,004 INFO org.mortbay.log: Stopped
> SelectChannelConnector@0.0.0.0:60030
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 35 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 43 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 45 on 60020: exiting
> 2011-03-22 09:39:11,000 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 46 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 47 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 48 on 60020: exiting
> 2011-03-22 09:39:11,010 INFO
> org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
> 2011-03-22 09:39:11,010 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> regionserver60020.logSyncer interrupted while waiting for sync
> requests
> 2011-03-22 09:39:11,010 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> regionserver60020.logSyncer exiting
> 2011-03-22 09:39:11,010 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
> hdfs://hadoop1-m1:8020/hbase/.logs/hadoop1-s05.farm-ny.gigya.com,60020,1300799918373
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 22 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 10 on 60020: exiting
> 2011-03-22 09:39:10,999 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 3 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 3 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 2 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: PRI
> IPC Server handler 1 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 19 on 60020: exiting
> 2011-03-22 09:39:10,998 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server handler 1 on 60020: exiting
> 2011-03-22 09:39:11,011 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Processing close of -ROOT-,,0.70236052
> 2011-03-22 09:39:11,011 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> regionserver60020.cacheFlusher exiting
> 2011-03-22 09:39:11,010 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Processing close of
> TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
> 2011-03-22 09:39:11,010 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Processing close of
> TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
> 2011-03-22 09:39:11,009 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> regionserver60020.compactor exiting
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.:
> disabling compactions & flushes
> 2011-03-22 09:39:11,009 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
> regionserver60020.majorCompactionChecker exiting
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
> region TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed Settings
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed default
> 2011-03-22 09:39:11,012 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Closed
> TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Closed region 
> TEST_UserSettings,,1300103207136.c438541b556672c4f4486416baa371f0.
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.:
> disabling compactions & flushes
> 2011-03-22 09:39:11,011 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Closing
> -ROOT-,,0.70236052: disabling compactions & flushes
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
> region 
> TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
> 2011-03-22 09:39:11,012 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
> region -ROOT-,,0.70236052
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed Data
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed default
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.Store: closed info
> 2011-03-22 09:39:11,013 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Closed
> TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
> 2011-03-22 09:39:11,013 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Closed
> -ROOT-,,0.70236052
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Closed region 
> TEST_Msg,MZVWNHOSUZYUOYNQKDIAVSCQEPHXVWVXIMGLGXGSSXQTZQMOZCZDCQAUWFSXARWYEBMBRCJMXPHXBIQNDTYTWRURMMOBFISBBSPYEKWWSNGMJCSOPFUGTDBMGUPFOIHOXGWI\x00,1300193788355.4ca0c6cf6654b8f6fd7e3bbba0b9fc6c.
> 2011-03-22 09:39:11,013 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler:
> Closed region -ROOT-,,0.70236052
> 2011-03-22 09:39:11,066 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server
> at: hadoop1-s05.farm-ny.gigya.com,60020,1300799918373
> 2011-03-22 09:39:11,066 DEBUG
> org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog
> tracker org.apache.hadoop.hbase.catalog.CatalogTracker@2eb0a3f5
> 2011-03-22 09:39:11,066 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closing
> leases
> 2011-03-22 09:39:11,066 INFO
> org.apache.hadoop.hbase.regionserver.Leases: regionserver60020 closed
> leases
> 2011-03-22 09:39:11,169 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Closed zookeeper sessionid=0x22e669588a20061
> 2011-03-22 09:39:11,181 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x22e669588a20061 closed
> 2011-03-22 09:39:11,181 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 09:39:11,189 INFO org.apache.zookeeper.ZooKeeper: Session:
> 0x12e669588b80057 closed
> 2011-03-22 09:39:11,189 INFO org.apache.zookeeper.ClientCnxn:
> EventThread shut down
> 2011-03-22 09:39:11,190 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Closing source 1 because: Region server is closing
> 2011-03-22 09:39:12,414 INFO
> org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020.leaseChecker closing leases
> 2011-03-22 09:39:12,415 INFO
> org.apache.hadoop.hbase.regionserver.Leases:
> regionserver60020.leaseChecker closed leases
> 2011-03-22 09:39:19,229 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Source exiting 1
> 2011-03-22 09:39:19,229 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> exiting
> 2011-03-22 09:39:19,230 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting; hbase.shutdown.hook=true;
> fsShutdownHook=Thread[Thread-15,5,main]
> 2011-03-22 09:39:19,230 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> 2011-03-22 09:39:19,230 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs
> shutdown hook thread.
> 2011-03-22 09:39:19,231 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> finished.
>
>
>
> Forth and probably worst of all, it seems that when the servers are
> crashing this way the master still thinks they are alive so the region
> is not transitioned and is therefor inaccessible. How long should it
> normally take the master to detect a dead region server?
>
> Any help on what's going on would be greatly appreciated.
>
> -eran
>

Reply via email to