[ https://issues.apache.org/jira/browse/HDFS-9659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105226#comment-15105226 ]
Surendra Singh Lilhore commented on HDFS-9659:
----------------------------------------------

Attached initial patch. Please review.

> EditLogTailerThread to Active Namenode RPC should timeout
> ---------------------------------------------------------
>
>                 Key: HDFS-9659
>                 URL: https://issues.apache.org/jira/browse/HDFS-9659
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha, namenode
>    Affects Versions: 3.0.0
>            Reporter: Surendra Singh Lilhore
>            Assignee: Surendra Singh Lilhore
>            Priority: Critical
>         Attachments: HDFS-9659.patch
>
>
> The {{EditLogTailerThread}} RPC to the active {{Namenode}} has no timeout; it was removed in HDFS-6440.
> When a slow disk is injected and system IO is consumed on the active NameNode, the nameservice cannot switch, because the SNN is unable to stop {{EditLogTailerThread}}.
> *Thread dump from SNN*
> {noformat}
> "IPC Server handler 33 on 25000" #118 daemon prio=5 os_prio=0 tid=0x00007f2384409800 nid=0x26c89 in Object.wait() [0x00007f2376ac7000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1245)
>         - locked <0x00000006d517f538> (a org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread)
>         at java.lang.Thread.join(Thread.java:1319)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.stop(EditLogTailer.java:183)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopStandbyServices(FSNamesystem.java:1284)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopStandbyServices(NameNode.java:1852)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.exitState(StandbyState.java:72)
>         at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:62)
>         at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1684)
> {noformat}
>
> *Thread dump for {{EditLogTailerThread}}*: it is stuck in the {{NamenodeProtocolTranslatorPB.rollEditLog()}} RPC call.
> {noformat}
> "Edit log tailer" #150 prio=5 os_prio=0 tid=0x00007f2395569800 nid=0x26cac in Object.wait() [0x00007f2374aa7000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1503)
>         - locked <0x00000006d581bb90> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy16.rollEditLog(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:148)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:301)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:420)
> {noformat}
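
For illustration only: the two dumps show a caller joining a thread that is itself blocked in an RPC with no deadline. One generic way to bound such a call is to run it on a worker thread and wait on the result with a timeout. The sketch below is plain Java and makes several assumptions: {{BoundedCall}} and {{callWithTimeout}} are hypothetical names, the sleeping lambda merely stands in for a stuck {{rollEditLog()}} call, and nothing here is taken from the attached patch or from Hadoop's API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of Hadoop. */
class BoundedCall {

    /**
     * Runs a blocking call on a daemon worker thread and waits at most
     * timeoutMs for it. Returns true if the call completed in time,
     * false if the wait timed out (the worker is then interrupted).
     */
    static boolean callWithTimeout(Callable<Void> call, long timeoutMs)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-call");
            t.setDaemon(true); // never keep the JVM alive for a stuck call
            return t;
        });
        Future<Void> future = executor.submit(call);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an RPC that never returns (assumption, not real Hadoop code).
        boolean finished = callWithTimeout(() -> {
            Thread.sleep(60_000);
            return null;
        }, 200);
        System.out.println("finished in time: " + finished);
    }
}
```

With this shape the caller regains control after the deadline even if the remote side never answers, so a later {{stop()}} that joins the calling thread would not wait indefinitely. Whether the worker actually unblocks on interrupt depends on the blocking call; a client-side RPC timeout would be a different approach to the same problem.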