Forwarding to hdfs-dev@ since the logs pertain to the namenode. Cheers
---------- Forwarded message ----------
From: louis.hust <louis.h...@gmail.com>
Date: Fri, Dec 5, 2014 at 8:52 PM
Subject: Re: Slow waitForAckedSeqno took too long time
To: u...@hbase.apache.org

Hi Ted,

Thanks for your reply. Actually I shut down l-hbase1.dev.dba.cn0 at 11:47:51, and HBase recovered at about 12:03. Below is the zkfc log from l-hbase2:

{log}
2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector: Already in election. Not re-connecting.
2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020 standby (unable to connect)
java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1413)
        at org.apache.hadoop.ipc.Client.call(Client.java:1362)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
        at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
        at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
        at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
        at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
        at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
        at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection timed out
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
        at org.apache.hadoop.ipc.Client.call(Client.java:1380)
        ... 14 more
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer: Launched fencing command '/bin/true' with pid 24513
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 active...
2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 to active state
2014-12-05 12:29:56,906 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access from hadoop (auth:SIMPLE) at 10.86.36.217
2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController: Requested by hadoop (auth:SIMPLE) at 10.86.36.217 to cede active role.
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully ensured local node is in standby mode
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Deleting bread-crumb of active node…
{log}

The namenode log is a bit large, so I have attached it.
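For reference, the fencing lines above ("Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)") imply a configuration roughly like the sketch below. The cluster's actual hdfs-site.xml and core-site.xml were not posted, so the values here are assumptions reconstructed from the log, not the poster's real settings:

<!-- hdfs-site.xml (sketch, assumed from the zkfc log) -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- shell(/bin/true) always exits 0, so fencing "succeeds" without doing
       anything to the old active; the ZKFC can then promote the standby even
       though the graceful transitionToStandby RPC timed out. -->
  <value>shell(/bin/true)</value>
</property>

<!-- core-site.xml (assumed Hadoop defaults that match the timings in the log) -->
<property>
  <name>ha.failover-controller.graceful-fence.connection.retries</name>
  <!-- matches "RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)" -->
  <value>1</value>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <!-- ~20s per connect attempt; two attempts plus sleeps roughly account for the
       ~31s between "Should fence" (11:48:16) and the ConnectException (11:48:47) -->
  <value>20000</value>
</property>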
On Dec 5, 2014, at 23:01, Ted Yu <yuzhih...@gmail.com> wrote:

The warning was logged by DFSOutputStream.

What was the load on hdfs around 2014-12-05 12:03? Have you checked the namenode log?

Cheers

On Thu, Dec 4, 2014 at 9:01 PM, mail list <louis.hust...@gmail.com> wrote:

Hi all,

I deployed HBase 0.98.6-cdh5.2.0 on 3 machines:

l-hbase1.dev.dba.cn0 (hadoop namenode active, HMaster active)
l-hbase2.dev.dba.cn0 (hadoop namenode standby, HMaster standby, hadoop datanode)
l-hbase3.dev.dba.cn0 (regionserver, hadoop datanode)

Then I shut down l-hbase1.dev.dba.cn0, but HBase could not work until about 15 minutes later.
I checked the logs and found the following in the region server's log:

{log}
2014-12-05 12:03:19,169 WARN [regionserver60020-WAL.AsyncSyncer0] hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
2014-12-05 12:03:19,186 INFO [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [10.86.36.219:50010]
2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller] regionserver.LogRoller: HLog roll requested
2014-12-05 12:03:19,236 WARN [regionserver60020-WAL.AsyncSyncer1] hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
{log}

It seems the WAL async syncer took too long, which delayed the region server's recovery? I don't know whether this log matters. Can anybody explain the reason, and how to reduce the recovery time?
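A note on the threshold in those warnings: "threshold=30000ms" is the DFSClient's slow-I/O warning threshold which, assuming the stock Hadoop 2.5/CDH 5.2 defaults, comes from the client-side settings sketched below. The warning only reports the stall; it does not cause it. Values shown are the assumed defaults, not taken from the poster's cluster:

<!-- hdfs-site.xml, client side (sketch with assumed default values) -->
<property>
  <name>dfs.client.slow.io.warning.threshold.ms</name>
  <!-- "Slow waitForAckedSeqno took ...ms (threshold=30000ms)" is logged once a
       sync waits longer than this for pipeline acks -->
  <value>30000</value>
</property>
<property>
  <name>dfs.client.socket-timeout</name>
  <!-- socket timeout the client uses when talking to datanodes; a dead write
       pipeline cannot fail (and be rebuilt) faster than the timeouts that detect it -->
  <value>60000</value>
</property>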