[ https://issues.apache.org/jira/browse/HDFS-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708390#comment-16708390 ]

Toshihiro Suzuki commented on HDFS-14123:
-----------------------------------------

I attached a patch that checks whether the NameNode dir is working when the 
NameNode receives the monitorHealth RPC from ZKFC. I tested it in my 
environment, and NameNode failover happened when running fsfreeze on the 
NameNode dir. Could someone please review the patch?
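
At a high level, the patch performs a real, time-bounded write into each 
NameNode dir rather than relying on the disk-space check alone, so a frozen 
filesystem turns into a health-check failure that ZKFC can act on. A 
simplified sketch of the idea (not the patch itself; the class, method name, 
and timeout handling here are illustrative):

{code}
import java.io.File;
import java.io.FileOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.ha.HealthCheckFailedException;

// Illustrative helper, not the actual patch: probe a NameNode dir with a
// time-bounded write. On a frozen filesystem the write blocks, so the
// timeout converts the hang into a HealthCheckFailedException, which in
// turn fails the monitorHealth RPC and lets ZKFC trigger a failover.
class NameDirWriteProbe {
  static void checkNameDirWritable(File dir, long timeoutMs)
      throws HealthCheckFailedException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> probe = executor.submit(() -> {
      File tmp = new File(dir, "health.probe");
      try (FileOutputStream out = new FileOutputStream(tmp)) {
        out.write(0);
        out.getFD().sync();  // force the write through to the device
      } finally {
        tmp.delete();
      }
      return null;
    });
    try {
      probe.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      throw new HealthCheckFailedException(
          "NameNode dir " + dir + " is not writable: " + e);
    } finally {
      // A thread blocked in disk I/O is not interruptible, so the probe
      // thread may linger until the filesystem is thawed.
      executor.shutdownNow();
    }
  }
}
{code}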

> NameNode failover doesn't happen when running fsfreeze for the NameNode dir (dfs.namenode.name.dir)
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14123
>                 URL: https://issues.apache.org/jira/browse/HDFS-14123
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>            Reporter: Toshihiro Suzuki
>            Assignee: Toshihiro Suzuki
>            Priority: Major
>         Attachments: HDFS-14123.01.patch
>
>
> I ran fsfreeze for the NameNode dir (dfs.namenode.name.dir) in my cluster for 
> testing purposes, but NameNode failover didn't happen.
> {code}
> fsfreeze -f /mnt
> {code}
> /mnt is a filesystem partition separate from /, and the NameNode dir 
> (dfs.namenode.name.dir) is /mnt/hadoop/hdfs/namenode.
> I checked the source code and found that the monitorHealth RPC from ZKFC 
> doesn't fail even when the NameNode dir is frozen. I think that's why the 
> failover doesn't happen.
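> For illustration, this behavior can be reproduced outside the NameNode: 
> fsfreeze only blocks writes, while metadata-only calls such as querying free 
> disk space still succeed, so a health check based on available disk space 
> alone keeps passing. A standalone snippet (illustrative only, not NameNode 
> code):
> {code}
> import java.io.File;
> import java.io.FileOutputStream;
>
> public class FrozenDirProbe {
>   public static void main(String[] args) throws Exception {
>     File dir = new File("/mnt/hadoop/hdfs/namenode");
>     // Metadata-only query: returns normally even while /mnt is frozen.
>     System.out.println("usable space: " + dir.getUsableSpace());
>     // An actual write blocks here until the filesystem is thawed.
>     File probe = new File(dir, "probe.tmp");
>     try (FileOutputStream out = new FileOutputStream(probe)) {
>       out.write(1);
>       out.getFD().sync();
>     }
>     probe.delete();
>     System.out.println("write succeeded (filesystem is not frozen)");
>   }
> }
> {code}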
> Also, when the NameNode dir is frozen, FSImage.rollEditLog() gets stuck as 
> shown below while holding the FSNamesystem write lock, which effectively 
> takes the HDFS service down:
> {code}
> "IPC Server handler 5 on default port 8020" #53 daemon prio=5 os_prio=0 tid=0x00007f56b96e2000 nid=0x5042 in Object.wait() [0x00007f56937bb000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$SyncEdit.logSyncWait(FSEditLogAsync.java:317)
>         - locked <0x00000000c58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.logSyncAll(FSEditLogAsync.java:147)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.endCurrentLogSegment(FSEditLog.java:1422)
>         - locked <0x00000000c58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1316)
>         - locked <0x00000000c58ca268> (a org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1322)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4740)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1307)
>         at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148)
>         at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:898)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:844)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2727)
>    Locked ownable synchronizers:
>         - <0x00000000c5f4ca10> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
> {code}
> I believe NameNode failover should happen in this case. One idea is to check 
> whether the NameNode dir is working when the NameNode receives the 
> monitorHealth RPC from ZKFC, as sketched below.
> I will attach a patch for this idea.
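> Concretely, the check could hook into the health-check path that ZKFC 
> already drives, roughly like this (a simplified sketch of the idea, not the 
> actual patch; the added helper is illustrative):
> {code}
> // NameNode-side handling of the HAServiceProtocol.monitorHealth RPC
> // (simplified sketch). The existing resource check is based on available
> // disk space, which still passes while the filesystem is frozen; the
> // added probe performs a real, time-bounded write into each name dir.
> synchronized void monitorHealth() throws HealthCheckFailedException {
>   if (!haEnabled) {
>     return;
>   }
>   if (!getNamesystem().nameNodeHasResourcesAvailable()) {
>     throw new HealthCheckFailedException(
>         "The NameNode has no resources available");
>   }
>   checkNameDirsWritable();  // illustrative helper; throws on a stuck write
> }
> {code}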


