[ https://issues.apache.org/jira/browse/HBASE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018791#comment-13018791 ]
gaojinchao commented on HBASE-3722:
-----------------------------------

In my cluster:
1. The HDFS cluster has an HA NameNode (ANN and BNN).
2. HBase version 0.90.1:
   Active HMaster: C4C1
   Backup HMaster: C4C2
   Region servers: C4C3, C4C4, C4C5, ...

Sequence of events:
1. The ANN crashed and the BNN became active (this takes some time).
2. Some region servers that an HBase client was writing data to crashed (e.g. C4C3, which hosted the meta table); other region servers were fine.
3. The HMaster failed to split the HLog and skipped it.
4. The BNN became active and the HMaster finished processing the shutdown event.
5. A lot of data from the crashed region servers was lost.

Log:

14:57:58 C4C3 shut itself down because the ANN crashed. The HMaster skipped the log split and reassigned the meta table.

2011-04-12 14:57:58,782 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for C4C3.site,60020,1302590910433
2011-04-12 14:57:59,790 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
....
2011-04-12 14:58:08,793 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
2011-04-12 14:58:08,795 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://C4C1:9000/hbase/.logs/C4C3.site,60020,1302590910433
java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection exception: java.net.ConnectException: Connection refused
2011-04-12 14:58:08,805 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting ROOT region location in ZooKeeper
2011-04-12 14:58:08,880 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=C4C3.site:60020; java.net.ConnectException: Connection refused
2011-04-12 14:58:08,880 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Current cached META location is not valid, resetting

The HMaster finished processing the shutdown event once the BNN became active and the meta table was reassigned:

2011-04-12 15:00:31,681 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
2011-04-12 15:00:32,682 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
2011-04-12 15:00:40,698 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: .META.,,1.1028785192 state=OPENING, ts=1302591600701
2011-04-12 15:00:40,699 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=.META.,,1.1028785192
2011-04-12 15:00:40,709 INFO org.apache.hadoop.hbase.master.AssignmentManager: Successfully transitioned region=.META.,,1.1028785192 into OFFLINE and forcing a new assignment
2011-04-12 15:00:40,712 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: -ROOT-,,0.70236052 state=OPENING, ts=1302591600718
2011-04-12 15:00:40,712 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING for too long, reassigning region=-ROOT-,,0.70236052
2011-04-12 15:00:40,725 INFO org.apache.hadoop.hbase.master.AssignmentManager: Successfully transitioned region=-ROOT-,,0.70236052 into OFFLINE and forcing a new assignment
2011-04-12 15:00:40,892 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: Detected completed assignment of META, notifying catalog tracker
2011-04-12 15:00:45,870 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0 region(s) that C4C3.site,60020,1302590910433 was carrying (skipping 0 regions(s) that are already in transition)
2011-04-12 15:00:45,870 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of C4C3.site,60020,1302590910433

The data in the skipped HLog is lost unless the HMaster is restarted after the NN recovers. So I think the HMaster should shut itself down when the NN crashes, just as a region server shuts itself down when it catches any IO exception while rolling the HLog.
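
To make the proposal concrete, here is a minimal, self-contained Java sketch of the behaviour suggested above. It is not the attached patch and does not use HBase classes: `Abortable`, `LogSplitter`, `ShutdownHandlerSketch` and `SplitFailureDemo` are hypothetical names invented for illustration. The assumption is simply that when the log split throws an IOException, the handler aborts the master instead of going on to reassign regions.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical stand-in for the master's abort hook; not an HBase class. */
interface Abortable {
    void abort(String why, Throwable cause);
}

/** Hypothetical stand-in for whatever splits a dead region server's logs. */
interface LogSplitter {
    void splitLog(String serverName) throws IOException;
}

/** Sketch: never reassign regions of a dead server whose log split failed. */
final class ShutdownHandlerSketch {
    private final LogSplitter splitter;
    private final Abortable master;

    ShutdownHandlerSketch(LogSplitter splitter, Abortable master) {
        this.splitter = splitter;
        this.master = master;
    }

    /**
     * Returns true if the split succeeded and regions may be reassigned,
     * false if the split failed and the master was asked to abort.
     */
    boolean process(String serverName) {
        try {
            splitter.splitLog(serverName);
        } catch (IOException e) {
            // Skipping the split would silently drop the unflushed edits,
            // so stop here and let a restarted master retry the split.
            master.abort("Failed splitting logs for " + serverName, e);
            return false;
        }
        return true;
    }
}

final class SplitFailureDemo {
    public static void main(String[] args) {
        // Simulate the NameNode being unreachable during the split.
        LogSplitter unreachableHdfs = server -> {
            throw new ConnectException("Connection refused");
        };
        Abortable master = (why, cause) ->
            System.out.println("ABORT: " + why + " (" + cause.getMessage() + ")");

        ShutdownHandlerSketch handler = new ShutdownHandlerSketch(unreachableHdfs, master);
        boolean reassign = handler.process("C4C3.site,60020,1302590910433");
        System.out.println("reassign regions: " + reassign);
    }
}
```

The point of the design is only the ordering: reassignment is gated on a successful split, so a failed split can never be followed by "Finished processing of shutdown" as in the log above. Whether aborting or retrying until HDFS returns is the better reaction is a separate question that this sketch does not settle.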

> A lot of data is lost when name node crashed
> ---------------------------------------------
>
>                 Key: HBASE-3722
>                 URL: https://issues.apache.org/jira/browse/HBASE-3722
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.1
>            Reporter: gaojinchao
>         Attachments: HmasterFilesystem_PatchV1.patch
>
>
> I'm not sure exactly what caused it. There are some failed log splits.
> The master should shut itself down when HDFS crashes.
> The logs are:
> 2011-03-22 13:21:55,056 WARN org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the logs
> java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection exception: java.net.ConnectException: Connection refused
>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>     at org.apache.hadoop.ipc.Client.call(Client.java:820)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
>     at $Proxy5.getListing(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>     at $Proxy5.getListing(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:614)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
>     at org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:121)
>     at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>     at org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:154)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:332)
>     at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:202)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:943)
>     at org.apache.hadoop.ipc.Client.call(Client.java:788)
>     ... 13 more
> 2011-03-22 13:21:56,056 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
> 2011-03-22 13:21:57,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
> 2011-03-22 13:21:58,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 2 time(s).
> 2011-03-22 13:21:59,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 3 time(s).
> 2011-03-22 13:22:00,058 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 4 time(s).
> 2011-03-22 13:22:01,058 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 5 time(s).
> 2011-03-22 13:22:02,059 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 6 time(s).
> 2011-03-22 13:22:03,059 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 7 time(s).
> 2011-03-22 13:22:04,059 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 8 time(s).
> 2011-03-22 13:22:05,060 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
> 2011-03-22 13:22:05,060 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://C4C1:9000/hbase/.logs/C4C9.site,60020,1300767633398
> java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection exception: java.net.ConnectException: Connection refused
>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>     at org.apache.hadoop.ipc.Client.call(Client.java:820)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
>     at $Proxy5.getFileInfo(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>     at $Proxy5.getFileInfo(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:623)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:461)
>     at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:690)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
>     at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
>     at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:95)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:332)
>     at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:202)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:943)
>     at org.apache.hadoop.ipc.Client.call(Client.java:788)
>     ... 18 more
> 2011-03-22 13:22:45,600 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 0 time(s).
> 2011-03-22 13:22:46,600 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 1 time(s).
> 2011-03-22 13:22:47,601 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 2 time(s).
> 2011-03-22 13:22:48,601 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 3 time(s).
> 2011-03-22 13:22:49,601 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 4 time(s).
> 2011-03-22 13:22:50,602 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 5 time(s).
> 2011-03-22 13:22:51,602 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 6 time(s).
> 2011-03-22 13:22:52,602 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 7 time(s).
> 2011-03-22 13:22:53,603 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 8 time(s).
> 2011-03-22 13:22:54,603 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000. Already tried 9 time(s).
> 2011-03-22 13:22:54,603 WARN org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the logs
> java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection exception: java.net.ConnectException: Connection refused
>     at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>     at org.apache.hadoop.ipc.Client.call(Client.java:820)
>     at org.apache.hadoop.ipc.RPC$Invok

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira