[ https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Toshihiro Suzuki updated HDFS-14123:
------------------------------------
    Attachment: HDFS-14123.01.patch

> NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14123
>                 URL: https://issues.apache.org/jira/browse/HDFS-14123
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>            Reporter: Toshihiro Suzuki
>            Assignee: Toshihiro Suzuki
>            Priority: Major
>         Attachments: HDFS-14123.01.patch
>
>
> I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for test purposes, but NameNode failover didn't happen:
> {code}
> fsfreeze -f /mnt
> {code}
> /mnt is a filesystem partition separate from /, and the NameNode dir "dfs.namenode.name.dir" is /mnt/hadoop/hdfs/namenode.
> I checked the source code and found that the monitorHealth RPC from ZKFC doesn't fail even when the NameNode dir is frozen. I think that's why the failover doesn't happen.
> Also, if the NameNode dir is frozen, it looks like FSImage.rollEditLog() gets stuck as in the following stack trace, holding the FSNamesystem write lock the whole time, which brings the HDFS service down:
> {code}
> "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 tid=0x00007f56b96e2000 nid=0x5042 in Object.wait() [0x00007f56937bb000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317)
> 	- locked <0x00000000c58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422)
> 	- locked <0x00000000c58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
> 	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316)
> 	- locked <0x00000000c58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307)
> 	at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148)
> 	at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727)
>    Locked ownable synchronizers:
> 	- <0x00000000c5f4ca10> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
> {code}
> I believe NameNode failover should happen in this case.
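> To show why the current health check passes, here is a small standalone Java sketch (my own illustration, not Hadoop code; the name dir path is a placeholder). A free-space query, which is roughly what a space-based resource check amounts to, keeps succeeding on a frozen filesystem because fsfreeze only blocks writes, while any actual write blocks:
> {code}
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class FrozenDirProbe {
>   public static void main(String[] args) throws Exception {
>     // Placeholder path; substitute your own dfs.namenode.name.dir.
>     Path nameDir = Paths.get(args.length > 0 ? args[0] : "/mnt/hadoop/hdfs/namenode");
>
>     // Space-based check: still succeeds while /mnt is frozen,
>     // so nothing looks unhealthy.
>     System.out.println("usable bytes: " + nameDir.toFile().getUsableSpace());
>
>     // Write-based probe: creating a file on a frozen filesystem blocks
>     // indefinitely, so a health check built on a write would catch it.
>     Path probe = nameDir.resolve(".health_probe");
>     Files.write(probe, new byte[0]);  // blocks while /mnt is frozen
>     Files.deleteIfExists(probe);
>     System.out.println("write probe OK");
>   }
> }
> {code}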
> One idea is to check whether the NameNode dir is working when the NameNode receives the monitorHealth RPC from ZKFC, along the lines of the sketch below. I will attach a patch for this idea.
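> As a rough illustration of that idea (hypothetical names, illustrative only, not the attached patch): on each monitorHealth call, attempt a tiny write in every configured name dir and fail the health check if it cannot complete:
> {code}
> import java.io.File;
> import java.io.IOException;
> import java.util.Collection;
>
> // Hypothetical helper, illustrative only.
> public final class NameDirHealthCheck {
>   /**
>    * Throws IOException if any name dir cannot complete a small write,
>    * e.g. because its filesystem is read-only or gone; on a frozen
>    * filesystem the write blocks instead of failing.
>    */
>   public static void checkNameDirsWritable(Collection<File> nameDirs)
>       throws IOException {
>     for (File dir : nameDirs) {
>       File probe = new File(dir, ".namenode_health_probe");
>       try {
>         probe.delete();  // clear any leftover probe file
>         if (!probe.createNewFile()) {  // blocks while the fs is frozen
>           throw new IOException("Could not create probe file in " + dir);
>         }
>       } finally {
>         probe.delete();
>       }
>     }
>   }
> }
> {code}
> Since a frozen filesystem makes the probe block rather than fail, the monitorHealth RPC itself would hang; ZKFC's RPC timeout (ha.health-monitor.rpc-timeout.ms) should then expire and the NameNode would be treated as not responding, so the failover can still happen.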