[jira] [Commented] (HDFS-16950) Gap in edits after -initializeSharedEdits
[ https://issues.apache.org/jira/browse/HDFS-16950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17780001#comment-17780001 ]

Wei-Chiu Chuang commented on HDFS-16950:
----------------------------------------

Karthik said the missing edit logs caused data loss, and the problem is reproducible. A workaround is to put the NN in safe mode and take a checkpoint before proceeding with the migration.

> Gap in edits after -initializeSharedEdits
> -----------------------------------------
>
> Key: HDFS-16950
> URL: https://issues.apache.org/jira/browse/HDFS-16950
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: journal-node, namenode
> Reporter: Karthik Palanisamy
> Priority: Critical
>
> Namenode failed in the production cluster when the JN role was migrated.
> {code:java}
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.io.IOException: There appears to be a gap in the edit log. We expected txid xx, but got txid xx.
> {code}
> InitializeSharedEdits was issued as part of the role migration step. Note that no
> checkpoint had been performed in the past few hours.
> InitializeSharedEdits created a new log segment from the edits_inprogress
> transaction and deleted all old transactions.
> My ask here is to delete only edit transactions older than the fsimage
> transaction. Currently, it deletes all transactions, and no check is
> enforced in JNStorage#format().

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
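The workaround mentioned in the comment above (safe mode, then a checkpoint, before the migration) could be scripted roughly as follows. This is only a sketch: it assumes the `hdfs` CLI is on the PATH and is run by an HDFS superuser, and the function name `checkpoint_before_migration` and its `dry_run` parameter are hypothetical. The migration step itself is deliberately left out.

```python
import subprocess

# Assumed pre-migration steps, in order: enter safe mode, then save a
# checkpoint so the fsimage covers every transaction written so far.
PRE_MIGRATION_COMMANDS = [
    ["hdfs", "dfsadmin", "-safemode", "enter"],
    ["hdfs", "dfsadmin", "-saveNamespace"],
]


def checkpoint_before_migration(dry_run=True):
    """Return the command lines; execute them only when dry_run is False."""
    for argv in PRE_MIGRATION_COMMANDS:
        if not dry_run:
            # check=True stops the sequence if a command fails.
            subprocess.run(argv, check=True)
    return [" ".join(argv) for argv in PRE_MIGRATION_COMMANDS]


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for line in checkpoint_before_migration():
        print(line)
```

With the default `dry_run=True` nothing is executed, which makes it possible to review the sequence before running it against a production NameNode.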
[jira] [Commented] (HDFS-16950) Gap in edits after -initializeSharedEdits
[ https://issues.apache.org/jira/browse/HDFS-16950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739929#comment-17739929 ]

Srinivasu Majeti commented on HDFS-16950:
-----------------------------------------

Hi [~kpalanisamy], could we make this a bug instead of an improvement?

> Gap in edits after -initializeSharedEdits
> -----------------------------------------
>
> Key: HDFS-16950
> URL: https://issues.apache.org/jira/browse/HDFS-16950
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: journal-node, namenode
> Reporter: Karthik Palanisamy
> Priority: Major
>
> Namenode failed in the production cluster when the JN role was migrated.
> {code:java}
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.io.IOException: There appears to be a gap in the edit log. We expected txid xx, but got txid xx.
> {code}
> InitializeSharedEdits was issued as part of the role migration step. Note that no
> checkpoint had been performed in the past few hours.
> InitializeSharedEdits created a new log segment from the edits_inprogress
> transaction and deleted all old transactions.
> My ask here is to delete only edit transactions older than the fsimage
> transaction. Currently, it deletes all transactions, and no check is
> enforced in JNStorage#format().
[jira] [Commented] (HDFS-16950) Gap in edits after -initializeSharedEdits
[ https://issues.apache.org/jira/browse/HDFS-16950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700860#comment-17700860 ]

Karthik Palanisamy commented on HDFS-16950:
-------------------------------------------

For example, the NN meta dir:
{code:java}
-rw-r--r-- 1 hdfs hdfs  18K Mar 14 23:51 fsimage_0003493
-rw-r--r-- 1 hdfs hdfs   62 Mar 14 23:51 fsimage_0003493.md5
-rw-r--r-- 1 hdfs hdfs  193 Mar 14 23:51 VERSION
-rw-r--r-- 1 hdfs hdfs 1.0M Mar 15 00:06 edits_0003494-0003670
-rw-r--r-- 1 hdfs hdfs 2.3K Mar 15 00:13 edits_0003671-0003689
-rw-r--r-- 1 hdfs hdfs 1.0M Mar 15 00:14 edits_0003690-0003696
-rw-r--r-- 1 hdfs hdfs 2.3K Mar 15 00:18 edits_0003697-0003718
-rw-r--r-- 1 hdfs hdfs    5 Mar 15 00:18 seen_txid
-rw-r--r-- 1 hdfs hdfs 1.0M Mar 15 00:18 edits_inprogress_0003719
{code}
A JN format was then issued, which removed all the edits in the JN meta dir:
{code:java}
2023-03-15 00:22:02,321 INFO [main] common.Storage (Storage.java:clearDirectory(442)) - Will remove files: [/data/dfs/jn/current/edits_0003337-0003487, /data/dfs/jn/current/seen_txid, /data/dfs/jn/current/edits_0003488-0003489, /data/dfs/jn/current/VERSION, /data/dfs/jn/current/edits_0003490-0003491, /data/dfs/jn/current/edits_0003492-0003493, /data/dfs/jn/current/edits_0003494-0003670, /data/dfs/jn/current/edits_0003697-0003718, /data/dfs/jn/current/edits_inprogress_0003719]
{code}
In the end, it created a new log segment from edits_inprogress:
{code:java}
(FileJournalManager.java:finalizeLogSegment(145)) - Finalizing edits file /data/dfs/jn/current/edits_inprogress_0003719 -> /data/dfs/jn/current/edits_0003719-0003736
{code}
So we lost the transactions between the fsimage and edits_inprogress, resulting in an edit gap.

> Gap in edits after -initializeSharedEdits
> -----------------------------------------
>
> Key: HDFS-16950
> URL: https://issues.apache.org/jira/browse/HDFS-16950
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: journal-node, namenode
> Reporter: Karthik Palanisamy
> Priority: Major
>
> Namenode failed in the production cluster when the JN role was migrated.
> {code:java}
> ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.io.IOException: There appears to be a gap in the edit log. We expected txid xx, but got txid xx.
> {code}
> InitializeSharedEdits was issued as part of the role migration step. Note that no
> checkpoint had been performed in the past few hours.
> InitializeSharedEdits created a new log segment from the edits_inprogress
> transaction and deleted all old transactions.
> My ask here is to delete only edit transactions older than the fsimage
> transaction. Currently, it deletes all transactions, and no check is
> enforced in JNStorage#format().
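The gap described in this comment, and the check requested in the issue, can be illustrated with a small standalone sketch. It is not Hadoop code and does not reflect how JNStorage#format() is implemented; the helper names (`parse_segments`, `find_gap`, `segments_safe_to_purge`) are hypothetical, and the segment names are the shortened ones quoted in the comment.

```python
import re

# Finalized segments are named edits_<first txid>-<last txid>.
SEGMENT_RE = re.compile(r"^edits_(\d+)-(\d+)$")


def parse_segments(names):
    """Return sorted (first_txid, last_txid, name) tuples for finalized segments."""
    segments = []
    for name in names:
        match = SEGMENT_RE.match(name)
        if match:
            segments.append((int(match.group(1)), int(match.group(2)), name))
    return sorted(segments)


def find_gap(fsimage_txid, names):
    """Return (expected_txid, found_txid) for the first gap after the fsimage, else None."""
    expected = fsimage_txid + 1
    for first, last, _ in parse_segments(names):
        if last < expected:
            continue  # segment is already covered by the fsimage
        if first > expected:
            return (expected, first)
        expected = last + 1
    return None


def segments_safe_to_purge(fsimage_txid, names):
    """Return only the segments whose last txid is covered by the fsimage."""
    return [name for _, last, name in parse_segments(names) if last <= fsimage_txid]


if __name__ == "__main__":
    # State after the format: only the newly finalized segment remains.
    print(find_gap(3493, ["edits_0003719-0003736"]))
```

With the numbers from the comment, the fsimage ends at txid 3493 and the only surviving segment starts at 3719, so the sketch reports that txid 3494 was expected but 3719 was found. Purging only segments ending at or before 3493 would have kept the later segments.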