Forwarding to hdfs-dev@ since the logs pertain to the namenode. Cheers
---------- Forwarded message ----------
From: louis.hust <louis.h...@gmail.com>
Date: Fri, Dec 5, 2014 at 8:52 PM
Subject: Re: Slow waitForAckedSeqno took too long time
To: u...@hbase.apache.org

Hi Ted,

Thanks for your reply. Actually I shut down l-hbase1.dev.dba.cn0 at 11:47:51, and HBase recovered at about 12:03. Below is the zkfc log from l-hbase2:

{log}
2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector: Already in election. Not re-connecting.
2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020 standby (unable to connect)
java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1413)
        at org.apache.hadoop.ipc.Client.call(Client.java:1362)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
        at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection timed out
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
        at org.apache.hadoop.ipc.Client.call(Client.java:1380)
        ... 14 more
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer: Launched fencing command '/bin/true' with pid 24513
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 active...
2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 to active state
2014-12-05 12:29:56,906 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access from hadoop (auth:SIMPLE) at 10.86.36.217
2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController: Requested by hadoop (auth:SIMPLE) at 10.86.36.217 to cede active role.
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully ensured local node is in standby mode
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Deleting bread-crumb of active node…
{log}

The namenode log is a bit large, so I have attached it.
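For reference, the fencing lines above ("Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)") imply a configuration roughly like the sketch below. The cluster's actual hdfs-site.xml and core-site.xml were not posted, so the values here are assumptions reconstructed from the log, not the poster's real settings:

<!-- hdfs-site.xml (sketch, assumed from the zkfc log) -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- shell(/bin/true) always exits 0, so fencing "succeeds" without doing
       anything to the old active; the ZKFC can then promote the standby even
       though the graceful transitionToStandby RPC timed out. -->
  <value>shell(/bin/true)</value>
</property>

<!-- core-site.xml (assumed Hadoop defaults that match the timings in the log) -->
<property>
  <name>ha.failover-controller.graceful-fence.connection.retries</name>
  <!-- matches "RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)" -->
  <value>1</value>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <!-- ~20s per connect attempt; two attempts plus sleeps roughly account for the
       ~31s between "Should fence" (11:48:16) and the ConnectException (11:48:47) -->
  <value>20000</value>
</property>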
On Dec 5, 2014, at 23:01, Ted Yu <yuzhih...@gmail.com> wrote:

The warning was logged by DFSOutputStream.

What was the load on hdfs around 2014-12-05 12:03? Have you checked the namenode log?

Cheers

On Thu, Dec 4, 2014 at 9:01 PM, mail list <louis.hust...@gmail.com> wrote:

Hi all,

I deployed HBase 0.98.6-cdh5.2.0 on 3 machines:

l-hbase1.dev.dba.cn0 (hadoop namenode active, HMaster active)
l-hbase2.dev.dba.cn0 (hadoop namenode standby, HMaster standby, hadoop datanode)
l-hbase3.dev.dba.cn0 (regionserver, hadoop datanode)

Then I shut down l-hbase1.dev.dba.cn0, but HBase could not work until about 15 minutes later.
I checked the logs and found the following in the region server's log:

{log}
2014-12-05 12:03:19,169 WARN [regionserver60020-WAL.AsyncSyncer0] hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
2014-12-05 12:03:19,186 INFO [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [10.86.36.219:50010]
2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller] regionserver.LogRoller: HLog roll requested
2014-12-05 12:03:19,236 WARN [regionserver60020-WAL.AsyncSyncer1] hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
{log}

It seems the WAL async syncer took too long, which delayed the region server's recovery? I don't know whether this log matters. Can anybody explain the reason, and how to reduce the recovery time?
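A note on the threshold in those warnings: "threshold=30000ms" is the DFSClient's slow-I/O warning threshold which, assuming the stock Hadoop 2.5/CDH 5.2 defaults, comes from the client-side settings sketched below. The warning only reports the stall; it does not cause it. Values shown are the assumed defaults, not taken from the poster's cluster:

<!-- hdfs-site.xml, client side (sketch with assumed default values) -->
<property>
  <name>dfs.client.slow.io.warning.threshold.ms</name>
  <!-- "Slow waitForAckedSeqno took ...ms (threshold=30000ms)" is logged once a
       sync waits longer than this for pipeline acks -->
  <value>30000</value>
</property>
<property>
  <name>dfs.client.socket-timeout</name>
  <!-- socket timeout the client uses when talking to datanodes; a dead write
       pipeline cannot fail (and be rebuilt) faster than the timeouts that detect it -->
  <value>60000</value>
</property>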