[ https://issues.apache.org/jira/browse/HDFS-16849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697582#comment-17697582 ]
Karthik Palanisamy commented on HDFS-16849:
-------------------------------------------

Yes Arpit, the SNN keeps retrying but always fails until we restart the NameNode:

local exception: org.apache.hadoop.security.KerberosAuthException: Login failure for user: hdfs/xxxx javax.security.auth.login.LoginException: Client not found in Kerberos database (6)]

The real problem is that the checkpoint did not run. Customers assume checkpointing is fine because the SNN process is up, but in reality the SNN is in a dead state.

> Terminate SNN when failing to perform EditLogTailing
> ----------------------------------------------------
>
>                 Key: HDFS-16849
>                 URL: https://issues.apache.org/jira/browse/HDFS-16849
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Karthik Palanisamy
>            Priority: Major
>
> We should terminate the SNN if edit log tailing fails to reach a sufficient
> number of JNs. We found this after a Kerberos error.
> {code:java}
> 2022-10-14 10:53:16,796 INFO
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms
> (timeout=20000 ms) for a response for selectStreamingInputStreams. Exceptions
> so far: [xxxx:8485: DestHost:destPort xxxx:8485 , LocalHost:localPort
> xxxx/xxxx:0. Failed on local exception:
> org.apache.hadoop.security.KerberosAuthException: Login failure for user:
> hdfs/xxxx javax.security.auth.login.LoginException: Client not found in
> Kerberos database (6)]
> 2022-10-14 10:53:30,796 WARN
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input
> streams from QJM to [xxxx:8485, yyyy:8485, zzzz:8485]. Skipping.
> java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
> respond.
>       at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:138)
>       at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectStreamingInputStreams(QuorumJournalManager.java:605)
>       at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:523)
>       at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
>       at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673)
>       at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:311)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:464)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:414)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:431)
>       at java.base/java.security.AccessController.doPrivileged(Native Method)
>       at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>       at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:427)
> {code}
>
> We have no check for whether a sufficient number of JNs responded:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L280]
> So we should implement a check similar to this one:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java#L395]
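
For illustration only, here is a minimal sketch in plain Java of the kind of check the issue asks for. It is not Hadoop code and not the proposed patch: `TailQuorumCheck`, `minJournals`, `maxConsecutiveFailures`, and `recordAttempt` are hypothetical names. The idea of tolerating a fixed number of consecutive failed attempts before giving up is also an assumption, since the issue does not say how many failures should trigger termination.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper (not part of Hadoop): tracks whether enough journals
 * answered each tail attempt and tells the caller when to give up.
 */
final class TailQuorumCheck {
    // Assumption: number of journals that must respond for an attempt to count as healthy.
    private final int minJournals;
    // Assumption: consecutive unhealthy attempts tolerated before signalling termination.
    private final int maxConsecutiveFailures;
    private int consecutiveFailures = 0;

    TailQuorumCheck(int minJournals, int maxConsecutiveFailures) {
        if (minJournals < 1 || maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("both thresholds must be >= 1");
        }
        this.minJournals = minJournals;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /**
     * Records one tail attempt.
     *
     * @param journalResponded one entry per journal, true if it responded
     * @return true when the caller should stop retrying and terminate
     */
    boolean recordAttempt(List<Boolean> journalResponded) {
        long responded = journalResponded.stream().filter(Boolean::booleanValue).count();
        if (responded >= minJournals) {
            consecutiveFailures = 0; // a healthy attempt resets the counter
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= maxConsecutiveFailures;
    }

    public static void main(String[] args) {
        // Three journals, two required, give up after three unhealthy attempts in a row.
        TailQuorumCheck check = new TailQuorumCheck(2, 3);
        List<Boolean> allDown = Arrays.asList(false, false, false);
        for (int attempt = 1; attempt <= 3; attempt++) {
            boolean terminate = check.recordAttempt(allDown);
            System.out.println("attempt " + attempt + " -> terminate=" + terminate);
        }
    }
}
```

The sketch only returns a boolean and leaves the shutdown decision to the caller. The point is that a standby stuck in the state described above would surface the failure instead of silently skipping every attempt.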