[ 
https://issues.apache.org/jira/browse/HBASE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018791#comment-13018791
 ] 

gaojinchao commented on HBASE-3722:
-----------------------------------

In my cluster:
1. HDFS runs with an HA NameNode pair (ANN and BNN).
2. HBase version 0.90.1:
  Active HMaster: C4C1
  Backup HMaster: C4C2
  Region servers: C4C3, C4C4, C4C5, ...

Operation:
1. The ANN crashed and the BNN became active (this takes some time).
2. Some region servers crashed (e.g. C4C3, which was hosting the META table) 
while an HBase client was putting data; other region servers stayed up.
3. The HMaster failed to split the HLog and skipped it.
4. The BNN became active and the HMaster finished processing the shutdown event.
5. A lot of the data on the crashed region servers was lost.


Logs:
At 14:57:58 C4C3 shut itself down because the ANN had crashed.
The HMaster skipped the log split and reassigned the META table.

2011-04-12 14:57:58,782 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs 
for C4C3.site,60020,1302590910433
2011-04-12 14:57:59,790 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
....
2011-04-12 14:58:08,793 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
2011-04-12 14:58:08,795 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: 
Failed splitting hdfs://C4C1:9000/hbase/.logs/C4C3.site,60020,1302590910433
java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection 
exception: java.net.ConnectException: Connection refused
2011-04-12 14:58:08,805 INFO 
org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting ROOT region 
location in ZooKeeper
2011-04-12 14:58:08,880 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: 
Failed verification of .META.,,1 at address=C4C3.site:60020; 
java.net.ConnectException: Connection refused
2011-04-12 14:58:08,880 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: 
Current cached META location is not valid, resetting

The HMaster finished processing the shutdown event once the BNN became active, 
and the META table was reassigned:

2011-04-12 15:00:31,681 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
2011-04-12 15:00:32,682 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
2011-04-12 15:00:40,698 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  .META.,,1.1028785192 state=OPENING, 
ts=1302591600701
2011-04-12 15:00:40,699 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2011-04-12 15:00:40,709 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Successfully transitioned region=.META.,,1.1028785192 into OFFLINE and forcing 
a new assignment
2011-04-12 15:00:40,712 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  -ROOT-,,0.70236052 state=OPENING, 
ts=1302591600718
2011-04-12 15:00:40,712 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been OPENING for too long, reassigning region=-ROOT-,,0.70236052
2011-04-12 15:00:40,725 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Successfully transitioned region=-ROOT-,,0.70236052 into OFFLINE and forcing a 
new assignment
2011-04-12 15:00:40,892 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: 
Detected completed assignment of META, notifying catalog tracker
2011-04-12 15:00:45,870 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0 
region(s) that C4C3.site,60020,1302590910433 was carrying (skipping 0 
regions(s) that are already in transition)
2011-04-12 15:00:45,870 INFO 
org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished 
processing of shutdown of C4C3.site,60020,1302590910433



The data in the skipped HLog stays lost unless the HMaster is restarted after 
the NN recovers.
So I think the HMaster should shut itself down when the NN crashes, 
just as a region server shuts itself down when it catches any IO exception 
while rolling the HLog.
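
To make the proposal concrete, here is a minimal sketch that contrasts the two policies. It is not the attached patch and does not use the real HBase classes: LogSplitter, Abortable, and both handler methods are hypothetical stand-ins, and the name node outage is simulated by throwing a ConnectException.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;

public class SplitFailureSketch {

    /** Hypothetical stand-in for the HDFS-backed log splitter. */
    interface LogSplitter {
        void splitLog(String serverName) throws IOException;
    }

    /** Hypothetical stand-in for the master's abort hook. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /** Behaviour seen in the logs: the failed split is skipped and regions are reassigned anyway. */
    static boolean handleShutdownSkipping(String server, LogSplitter splitter,
                                          List<String> events) {
        try {
            splitter.splitLog(server);
            events.add("split " + server);
        } catch (IOException e) {
            // The unsplit HLog is left behind; its edits are never replayed.
            events.add("split failed, skipped " + server);
        }
        events.add("reassign " + server);
        return true;
    }

    /** Proposed behaviour: abort the master instead of reassigning without the HLog edits. */
    static boolean handleShutdownAborting(String server, LogSplitter splitter,
                                          Abortable master, List<String> events) {
        try {
            splitter.splitLog(server);
            events.add("split " + server);
        } catch (IOException e) {
            master.abort("Failed splitting logs for " + server, e);
            return false; // no reassignment happens
        }
        events.add("reassign " + server);
        return true;
    }

    public static void main(String[] args) {
        // Simulated outage: every split attempt fails like the name node calls in the logs.
        LogSplitter nameNodeDown = server -> {
            throw new ConnectException("Connection refused");
        };

        List<String> skipping = new ArrayList<>();
        handleShutdownSkipping("C4C3", nameNodeDown, skipping);
        System.out.println("skip policy:  " + skipping);

        List<String> aborting = new ArrayList<>();
        handleShutdownAborting("C4C3", nameNodeDown,
                (why, cause) -> aborting.add("abort: " + why), aborting);
        System.out.println("abort policy: " + aborting);
    }
}
```

With the skip policy the regions are reassigned even though the split failed. With the abort policy nothing is reassigned, so the HLog can still be split once a master comes up against a working name node.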

>  A lot of data is lost when name node crashed
> ---------------------------------------------
>
>                 Key: HBASE-3722
>                 URL: https://issues.apache.org/jira/browse/HBASE-3722
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.1
>            Reporter: gaojinchao
>         Attachments: HmasterFilesystem_PatchV1.patch
>
>
> I'm not sure exactly what caused it. The logs show some failed log splits.
> The master should shut itself down when HDFS has crashed.
>  The logs are:
>  2011-03-22 13:21:55,056 WARN 
>  org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the 
>  logs
>  java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on 
> connection exception: java.net.ConnectException: Connection refused
>          at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>          at org.apache.hadoop.ipc.Client.call(Client.java:820)
>          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
>          at $Proxy5.getListing(Unknown Source)
>          at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>          at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>          at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>          at $Proxy5.getListing(Unknown Source)
>          at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:614)
>          at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
>          at 
> org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:121)
>          at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>          at 
>  org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:154)
>  Caused by: java.net.ConnectException: Connection refused
>          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>          at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>          at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>          at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:332)
>          at 
> org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:202)
>          at org.apache.hadoop.ipc.Client.getConnection(Client.java:943)
>          at org.apache.hadoop.ipc.Client.call(Client.java:788)
>          ... 13 more
>  2011-03-22 13:21:56,056 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
>  2011-03-22 13:21:57,057 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
>  2011-03-22 13:21:58,057 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 2 time(s).
>  2011-03-22 13:21:59,057 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 3 time(s).
>  2011-03-22 13:22:00,058 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 4 time(s).
>  2011-03-22 13:22:01,058 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 5 time(s).
>  2011-03-22 13:22:02,059 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 6 time(s).
>  2011-03-22 13:22:03,059 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 7 time(s).
>  2011-03-22 13:22:04,059 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 8 time(s).
>  2011-03-22 13:22:05,060 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
>  2011-03-22 13:22:05,060 ERROR 
>  org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting 
>  hdfs://C4C1:9000/hbase/.logs/C4C9.site,60020,1300767633398
>  java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on 
> connection exception: java.net.ConnectException: Connection refused
>          at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>          at org.apache.hadoop.ipc.Client.call(Client.java:820)
>          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
>          at $Proxy5.getFileInfo(Unknown Source)
>          at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>          at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>          at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>          at $Proxy5.getFileInfo(Unknown Source)
>          at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:623)
>          at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:461)
>          at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:690)
>          at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
>          at 
> org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
>          at 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:95)
>          at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>          at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>          at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>          at java.lang.Thread.run(Thread.java:662)
>  Caused by: java.net.ConnectException: Connection refused
>          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>          at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>          at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>          at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:332)
>          at 
> org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:202)
>          at org.apache.hadoop.ipc.Client.getConnection(Client.java:943)
>          at org.apache.hadoop.ipc.Client.call(Client.java:788)
>          ... 18 more
>  2011-03-22 13:22:45,600 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
>  2011-03-22 13:22:46,600 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
>  2011-03-22 13:22:47,601 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 2 time(s).
>  2011-03-22 13:22:48,601 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 3 time(s).
>  2011-03-22 13:22:49,601 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 4 time(s).
>  2011-03-22 13:22:50,602 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 5 time(s).
>  2011-03-22 13:22:51,602 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 6 time(s).
>  2011-03-22 13:22:52,602 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 7 time(s).
>  2011-03-22 13:22:53,603 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 8 time(s).
>  2011-03-22 13:22:54,603 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
>  2011-03-22 13:22:54,603 WARN 
>  org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the 
>  logs
>  java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on 
> connection exception: java.net.ConnectException: Connection refused
>          at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>          at org.apache.hadoop.ipc.Client.call(Client.java:820)
>          at org.apache.hadoop.ipc.RPC$Invok

