[ https://issues.apache.org/jira/browse/HDFS-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377862#comment-17377862 ]
Denis Serduik commented on HDFS-10719:
--------------------------------------

[~kpalanisamy] I've been bitten by this fix/issue. Consider the scenario where the active NN VM goes down (temporarily due to a network issue, or permanently due to a hardware failure). The standby NN cannot become active for exactly this reason: the failed host is part of the qjournal connection string and can no longer be resolved. This causes the whole NN to go down. See the logs below:

{noformat}
2021-07-09 06:34:53,932 INFO namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://XX-XXXXX-XXX.internal.cloudapp.net:8480/getJournal?jid=conduit&segmentTxId=449&storageInfo=-65%3A326724551%3A1625810812563%3ACID-b9e44cb5-91e4-4a55-85dc-6442fa9e44e4&inProgressOk=true' to transaction ID 449
2021-07-09 06:34:53,932 INFO namenode.RedundantEditLogInputStream: Fast-forwarding stream 'http://XX-XXXXX-XXX.internal.cloudapp.net:8480/getJournal?jid=conduit&segmentTxId=449&storageInfo=-65%3A326724551%3A1625810812563%3ACID-b9e44cb5-91e4-4a55-85dc-6442fa9e44e4&inProgressOk=true' to transaction ID 449
2021-07-09 06:34:53,970 INFO namenode.FSImage: Edits file http://XX-XXXXX-XXX.internal.cloudapp.net:8480/getJournal?jid=conduit&segmentTxId=449&storageInfo=-65%3A326724551%3A1625810812563%3ACID-b9e44cb5-91e4-4a55-85dc-6442fa9e44e4&inProgressOk=true, http://YYY-YYYYY-YYYY.internal.cloudapp.net:8480/getJournal?jid=conduit&segmentTxId=449&storageInfo=-65%3A326724551%3A1625810812563%3ACID-b9e44cb5-91e4-4a55-85dc-6442fa9e44e4&inProgressOk=true of size 42 edits # 2 loaded in 0 seconds
2021-07-09 06:36:22,127 INFO namenode.FSNamesystem: Stopping services started for standby state
2021-07-09 06:36:22,128 WARN ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:469)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:399)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:416)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:412)
2021-07-09 06:36:22,130 INFO namenode.FSNamesystem: Starting services required for active state
2021-07-09 06:36:22,131 ERROR namenode.NameNode: Error encountered requiring NN shutdown. Shutting down immediately.
java.lang.IllegalArgumentException: Unable to construct journal, qjournal://XX-XXXXX-XXX.internal.cloudapp.net:8485;YYY-YYYYY-YYYY.internal.cloudapp.net:8485;ZZ-ZZZZZ-ZZZZ.internal.cloudapp.net:8485/conduit
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1824)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournals(FSEditLog.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournalsForWrite(FSEditLog.java:259)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1223)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1890)
        at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
        at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1749)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1742)
        at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
        at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1811)
        ... 19 more
Caused by: java.net.UnknownHostException: XX-XXXXX-XXX.internal.cloudapp.net:8485
        at org.apache.hadoop.hdfs.server.common.Util.getAddressesList(Util.java:378)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createLoggers(QuorumJournalManager.java:388)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createLoggers(QuorumJournalManager.java:170)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.<init>(QuorumJournalManager.java:126)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.<init>(QuorumJournalManager.java:105)
        ... 24 more
2021-07-09 06:36:22,132 INFO util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: Unable to construct journal, qjournal://XX-XXXXX-XXX.internal.cloudapp.net:8485;YYY-YYYYY-YYYY.internal.cloudapp.net:8485;ZZ-ZZZZZ-ZZZZ.internal.cloudapp.net:8485/conduit
2021-07-09 06:36:22,133 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at YYY-YYYYY-YYYY/10.9.0.5
************************************************************/
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
2021-07-09 06:36:22,837 INFO namenode.NameNode: STARTUP_MSG:
{noformat}

> In HA, Namenode is failed to start If any of the Quorum hostname is unresolved.
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-10719
>                 URL: https://issues.apache.org/jira/browse/HDFS-10719
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node, namenode
>    Affects Versions: 2.7.1
>         Environment: HDP-2.4
>            Reporter: Karthik Palanisamy
>            Assignee: Karthik Palanisamy
>            Priority: Major
>              Labels: patch
>         Attachments: HDFS-10719-1.patch, HDFS-10719-2.patch, HDFS-10719-3.patch, HDFS-10719-4.patch
>
>
> 2016-08-03 02:53:53,760 ERROR namenode.NameNode (NameNode.java:main(1712)) - Failed to start namenode.
> java.lang.IllegalArgumentException: Unable to construct journal, qjournal://xxxx1:8485;xxxx2:8485;xxxx3:8485/shva
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1637)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournals(FSEditLog.java:282)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initSharedJournalsForRead(FSEditLog.java:260)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.initEditLog(FSImage.java:789)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:634)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:294)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:983)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:688)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:662)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:726)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:951)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:935)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1641)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1707)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1635)
>         ... 13 more
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannelMetrics.getName(IPCLoggerChannelMetrics.java:107)
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannelMetrics.create(IPCLoggerChannelMetrics.java:91)
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.<init>(IPCLoggerChannel.java:178)
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$1.createLogger(IPCLoggerChannel.java:156)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createLoggers(QuorumJournalManager.java:367)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createLoggers(QuorumJournalManager.java:149)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.<init>(QuorumJournalManager.java:116)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.<init>(QuorumJournalManager.java:105)
>         ... 18 more
> 2016-08-03 02:53:53,765 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
> 2016-08-03 02:53:53,768 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
> *and the failover is not successful*
> I have attached a patch. It allows the Namenode to start if a majority of the quorum hostnames are resolvable.
> It emits a warning if a quorum hostname is unresolvable.
> It throws an UnknownHostException if a majority of the journal hostnames are unresolvable.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
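
For illustration only, here is a minimal Java sketch of the majority rule the issue description states: warn for each journal hostname that does not resolve, and fail with UnknownHostException only when fewer than a majority resolve. This is not the attached patch and not Hadoop's Util.getAddressesList; the class QuorumResolveSketch, the method resolveMajority, and the injected resolver function are all hypothetical names, and the resolver is injected only so the example can run without real DNS.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of majority-based journal host resolution; not Hadoop code. */
public class QuorumResolveSketch {

    /** Resolves "host:port" through DNS; the result is marked unresolved on failure. */
    public static InetSocketAddress dnsResolver(String hostPort) {
        int i = hostPort.lastIndexOf(':');
        return new InetSocketAddress(hostPort.substring(0, i),
                Integer.parseInt(hostPort.substring(i + 1)));
    }

    /**
     * Returns the resolvable addresses. Warns about each unresolvable host and
     * throws only if fewer than a majority (n / 2 + 1) of the hosts resolve.
     */
    public static List<InetSocketAddress> resolveMajority(
            List<String> hostPorts,
            Function<String, InetSocketAddress> resolver) throws UnknownHostException {
        List<InetSocketAddress> resolved = new ArrayList<>();
        List<String> unresolved = new ArrayList<>();
        for (String hostPort : hostPorts) {
            InetSocketAddress addr = resolver.apply(hostPort);
            if (addr.isUnresolved()) {
                unresolved.add(hostPort);
                System.err.println("WARN: unable to resolve journal host " + hostPort);
            } else {
                resolved.add(addr);
            }
        }
        int majority = hostPorts.size() / 2 + 1;
        if (resolved.size() < majority) {
            throw new UnknownHostException("Only " + resolved.size() + " of "
                    + hostPorts.size() + " journal hosts resolved; unresolved: " + unresolved);
        }
        return resolved;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Stub resolver (assumption for the demo): names starting with "down" do not resolve.
        Function<String, InetSocketAddress> stub = hp -> hp.startsWith("down")
                ? InetSocketAddress.createUnresolved(hp, 8485)
                : new InetSocketAddress(InetAddress.getLoopbackAddress(), 8485);
        List<InetSocketAddress> ok =
                resolveMajority(Arrays.asList("jn1:8485", "jn2:8485", "down3:8485"), stub);
        System.out.println("Resolved journal hosts: " + ok.size());
    }
}
```

With three journal hosts, one unresolvable name only produces a warning, while two unresolvable names raise the exception. Passing QuorumResolveSketch::dnsResolver instead of the stub would use real DNS lookups.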