Hello, Tatsuya.
Yesterday we ran into the same problem: the master was down.
Here is part of the HBase master log from after HBase became unreachable from
both the HBase shell and the Java HBase client.
> 2009-10-29 00:00:36,920 INFO org.apache.hadoop.hbase.master.ServerManager:
> 1 region servers, 0 dead, average load 5.0
> 2009-10-29 00:00:37,150 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: 127.0.0.1:53169,
> regionname: -ROOT-,,0, startKe$
> 2009-10-29 00:00:37,151 WARN org.apache.hadoop.hbase.master.BaseScanner:
> Scan ROOT region
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:831)
> at
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:712)
> at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
> at $Proxy2.openScanner(Unknown Source)
> at
> org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:160)
> at
> org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
> at
> org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
> at
> org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:136)
> at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>
This repeats until the end of the log.
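From the master log it looks like the region server at 127.0.0.1:53169 simply refuses connections. To tell a dead region server process apart from a networking problem, a small connect probe like the sketch below can help (host and port are taken from the log above; the class name and timeout are only illustrative):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) {
        // Region server host and port as seen in the master log above.
        InetSocketAddress rs = new InetSocketAddress("127.0.0.1", 53169);
        Socket socket = new Socket();
        try {
            socket.connect(rs, 3000); // 3 second connect timeout
            System.out.println("Port 53169 is open - something is still listening there.");
        } catch (IOException e) {
            // "Connection refused" here means nothing is listening on the port,
            // i.e. the region server process has most likely died.
            System.out.println("Connect failed: " + e.getMessage());
        } finally {
            try { socket.close(); } catch (IOException ignored) { }
        }
    }
}

If the probe is refused as well, the next place to look would be the region server log for the reason it went down.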
ZooKeeper log excerpt:
> 2009-10-29 02:45:14,138 INFO
> org.apache.zookeeper.server.persistence.FileTxnSnapLog: Snapshotting: 17b
> 2009-10-29 08:34:04,751 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Connected to /127.0.0.1:56897 lastZxid 0
> 2009-10-29 08:34:04,751 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Creating new session 0x124962e1e530013
> 2009-10-29 08:34:04,776 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Finished init of 0x124962e1e530013 valid:true
> 2009-10-29 08:34:09,689 WARN
> org.apache.zookeeper.server.PrepRequestProcessor: Got exception when
> processing sessionid:0x124962e1e530013 type:create cxid:0x2
> zxid:0xfffffffffff$
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode =
> NodeExists
> at
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:245)
> at
> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>
This also repeats until the end of the log.
HBase became unavailable after we tried to scan a table with 6,000,000 rows
several times.
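For reference, a minimal sketch of the kind of full-table scan that triggers this, written against the 0.20 client API, is below (the table name comes from the client log further down; the caching value is illustrative). As far as I know the default scanner caching in 0.20 is 1, i.e. one RPC per row, so a 6,000,000-row scan keeps the single region server busy for quite a while:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class FullTableScan {
    public static void main(String[] args) throws Exception {
        // 0.20.x style configuration; picks up hbase-site.xml from the classpath.
        HBaseConfiguration conf = new HBaseConfiguration();
        HTable table = new HTable(conf, "channel_products"); // table name from the client log
        Scan scan = new Scan();
        scan.setCaching(1000); // fetch rows in batches instead of one RPC per row
        ResultScanner scanner = table.getScanner(scan);
        try {
            long rows = 0;
            for (Result result : scanner) {
                rows++; // real code would process each Result here
            }
            System.out.println("Scanned " + rows + " rows");
        } finally {
            scanner.close(); // release the server-side scanner lease
        }
    }
}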
HBase Java client log:
> Error during table scan: java.lang.RuntimeException:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> region server 127.0.0.1:53169 for region channel_products,,1256660737751,
> row '', but failed after 10 attempts.
> Exceptions:
> java.lang.NoClassDefFoundError: org/mortbay/log/Log
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
> java.net.ConnectException: Connection refused
>
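The very first failure in the client log above is a NoClassDefFoundError for org/mortbay/log/Log (Jetty's logging class); as far as I understand, this usually means the Jetty jars shipped in hbase/lib are missing from the client classpath, independent of the connection problems. A trivial check from the client JVM (class name taken from the log; everything else is illustrative):

public class ClasspathCheck {
    public static void main(String[] args) {
        try {
            // Class reported missing in the client log above.
            Class.forName("org.mortbay.log.Log");
            System.out.println("org.mortbay.log.Log is on the classpath.");
        } catch (ClassNotFoundException e) {
            System.out.println("org.mortbay.log.Log is missing - "
                    + "check that the jetty jars from hbase/lib are on the client classpath.");
        }
    }
}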
HBase shell log:
> hb...@localhost:/hadoop$ ./hbase/bin/hbase shell
> HBase Shell; enter 'help<RETURN>' for list of supported commands.
> Version: 0.20.1, r822817, Wed Oct 7 11:55:42 PDT 2009
> hbase(main):001:0> status
> 1 servers, 0 dead, 5.0000 average load
> hbase(main):002:0> list
> 09/10/29 08:34:09 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:11 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:13 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:15 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:19 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:19 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:21 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:23 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:25 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:29 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:31 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:33 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:35 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:37 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:41 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:41 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:43 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:45 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:47 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:51 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:53 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:55 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:57 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:34:59 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:03 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:03 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:05 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:07 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:09 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:13 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:15 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:17 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:19 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:21 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:25 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:25 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:27 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:29 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:31 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:35 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:39 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:41 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:43 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:45 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:49 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:49 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:51 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:53 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:55 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> 09/10/29 08:35:59 INFO ipc.HbaseRPC: Server at /127.0.0.1:53169 could not
> be reached after 1 tries, giving up.
> NativeException: org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Trying to contact region server null for region , row '', but failed after 5
> attempts.
> Exceptions:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up
> proxy to /127.0.0.1:53169 after attempts=1
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up
> proxy to /127.0.0.1:53169 after attempts=1
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up
> proxy to /127.0.0.1:53169 after attempts=1
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up
> proxy to /127.0.0.1:53169 after attempts=1
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up
> proxy to /127.0.0.1:53169 after attempts=1
>
> from org/apache/hadoop/hbase/client/HConnectionManager.java:1001:in
> `getRegionServerWithRetries'
> from org/apache/hadoop/hbase/client/MetaScanner.java:55:in `metaScan'
> from org/apache/hadoop/hbase/client/MetaScanner.java:28:in `metaScan'
> from org/apache/hadoop/hbase/client/HConnectionManager.java:432:in
> `listTables'
> from org/apache/hadoop/hbase/client/HBaseAdmin.java:127:in `listTables'
> from sun/reflect/NativeMethodAccessorImpl.java:-2:in `invoke0'
> from sun/reflect/NativeMethodAccessorImpl.java:39:in `invoke'
> from sun/reflect/DelegatingMethodAccessorImpl.java:25:in `invoke'
> from java/lang/reflect/Method.java:597:in `invoke'
> from org/jruby/javasupport/JavaMethod.java:298:in
> `invokeWithExceptionHandling'
> from org/jruby/javasupport/JavaMethod.java:259:in `invoke'
> from org/jruby/java/invokers/InstanceMethodInvoker.java:36:in `call'
> from org/jruby/runtime/callsite/CachingCallSite.java:253:in
> `cacheAndCall'
> from org/jruby/runtime/callsite/CachingCallSite.java:72:in `call'
> from org/jruby/ast/CallNoArgNode.java:61:in `interpret'
> from org/jruby/ast/ForNode.java:104:in `interpret'
> ... 112 levels...
> from hadoop/hbase/bin/$_dot_dot_/bin/hirb#start:-1:in `call'
> from org/jruby/internal/runtime/methods/DynamicMethod.java:226:in
> `call'
> from org/jruby/internal/runtime/methods/CompiledMethod.java:211:in
> `call'
> from org/jruby/internal/runtime/methods/CompiledMethod.java:71:in
> `call'
> from org/jruby/runtime/callsite/CachingCallSite.java:253:in
> `cacheAndCall'
> from org/jruby/runtime/callsite/CachingCallSite.java:72:in `call'
> from hadoop/hbase/bin/$_dot_dot_/bin/hirb.rb:497:in `__file__'
> from hadoop/hbase/bin/$_dot_dot_/bin/hirb.rb:-1:in `load'
> from org/jruby/Ruby.java:577:in `runScript'
> from org/jruby/Ruby.java:480:in `runNormally'
> from org/jruby/Ruby.java:354:in `runFromMain'
> from org/jruby/Main.java:229:in `run'
> from org/jruby/Main.java:110:in `run'
> from org/jruby/Main.java:94:in `main'
> from /hadoop/hbase/bin/../bin/hirb.rb:338:in `list'
> from (hbase):3
> hbase(main):003:0>
>
The HDFS NameNode is still reachable through its web interface:
NameNode 'localhost:9000'
Started: Tue Oct 27 03:12:08 EET 2009
Version: 0.20.1, r810220
Compiled: Tue Sep 1 20:55:56 UTC 2009 by oom
Upgrades: There are no upgrades in progress.

Cluster Summary: 116 files and directories, 98 blocks = 214 total.
Heap Size: 10.94 MB / 963 MB (1%)
Configured Capacity: 229.36 GB
DFS Used: 3.04 GB
Non DFS Used: 14.46 GB
DFS Remaining: 211.87 GB
DFS Used%: 1.32 %
DFS Remaining%: 92.37 %
Live Nodes: 1
Dead Nodes: 0

NameNode Storage:
Storage Directory: /hadoop/hdfs/name | Type: IMAGE_AND_EDITS | State: Active
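Since the web UI being up does not necessarily mean the NameNode is answering RPC calls (which is what HBase actually needs), it can also be verified programmatically. A minimal sketch, assuming the hdfs://localhost:9000 address from the summary above and the default /hbase root directory:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address from the web UI summary above.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Listing the HBase root directory exercises the NameNode RPC path;
        // /hbase is the default hbase.rootdir and may differ in another setup.
        FileStatus[] entries = fs.listStatus(new Path("/hbase"));
        int count = (entries == null) ? 0 : entries.length;
        System.out.println("NameNode RPC is answering; /hbase has " + count + " entries.");
        fs.close();
    }
}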
Could you take a look at this? Perhaps some further ideas will come up.
Thanks a lot for your time.
-------------------------------------------------
Best wishes, Artyom Shvedchikov