[ https://issues.apache.org/jira/browse/HDFS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849052#comment-15849052 ]
Jing Zhao commented on HDFS-4025:
---------------------------------

Thanks for updating the patch, [~hanishakoneru]. The latest patch looks pretty good to me. Some minor comments:
# In hdfs-default.xml, "i" --> "if"
{code}
+  <name>dfs.journalnode.enable.sync</name>
+  <value>true</value>
+  <description>
+    If true, the journal nodes wil sync with each other. The journal nodes
+    will periodically gossip with other journal nodes to compare edit log
+    manifests and i they detect any missing log segment, they will download
+    it from the other journal nodes.
+  </description>
+</property>
{code}
# In JournalNodeSyncer.java, the following code will throw an {{UnsupportedOperationException}} since thisJournalEditLogs is an immutable list. In fact, this add op can be skipped.
{code}
if (success) {
  thisJournalEditLogs.add(missingLog);
}
{code}
# Maybe "Transferring" can be changed to "Downloading"?
{code}
LOG.info("Transferring Missing Edit Log from " + url + " to " + jnStorage
    .getRoot());
{code}
# {{finalEditsFile}} should be {{tmpEditsFile}}.
{code}
LOG.info("Downloaded file " + tmpEditsFile.getName() + " size " +
    finalEditsFile.length() + " bytes.");
{code}
# In {{TestJournalNodeSync}}, {{jid}} can be declared as final, and {{editLogExists}} can be private.
# For {{deleteEditLog}}, we can either change the while loop to an if, or refresh the logFile instance within the while loop.
{code}
+    while (logFile.isInProgress()) {
+      dfsCluster.getNameNode(0).getRpcServer().rollEditLog();
{code}
# The following code can be simplified as "Assert.assertTrue("Couldn't delete edit log file", deleteFile.delete());"
{code}
+    if (!deleteFile.delete()) {
+      assert false: "Couldn't delete edit log file";
+      return null;
+    }
{code}
# In {{generateEditLog}}, let's also check the result of {{doAnEdit}}.
I.e., we do "Assert.assertTrue(doAnEdit());"

> QJM: Sychronize past log segments to JNs that missed them
> ---------------------------------------------------------
>
>                 Key: HDFS-4025
>                 URL: https://issues.apache.org/jira/browse/HDFS-4025
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha
>    Affects Versions: QuorumJournalManager (HDFS-3077)
>            Reporter: Todd Lipcon
>            Assignee: Hanisha Koneru
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: HDFS-4025.000.patch, HDFS-4025.001.patch,
> HDFS-4025.002.patch, HDFS-4025.003.patch, HDFS-4025.004.patch,
> HDFS-4025.005.patch, HDFS-4025.006.patch, HDFS-4025.007.patch,
> HDFS-4025.008.patch, HDFS-4025.009.patch
>
>
> Currently, if a JournalManager crashes and misses some segment of logs, and
> then comes back, it will be re-added as a valid part of the quorum on the
> next log roll. However, it will not have a complete history of log segments
> (i.e., any individual JN may have gaps in its transaction history). This
> mirrors the behavior of the NameNode when there are multiple local
> directories specified.
> However, it would be better if a background thread noticed these gaps and
> "filled them in" by grabbing the segments from other JournalNodes. This
> increases the resilience of the system when JournalNodes get reformatted or
> otherwise lose their local disk.
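The "notice the gaps and fill them in" behavior described in the issue, and in the dfs.journalnode.enable.sync description in the review, comes down to a set difference over edit log manifests. The following is a minimal Java sketch, assuming each finalized segment is identified by its starting transaction ID. SegmentGapSketch, Segment, and findMissingSegments are hypothetical names, not the JournalNodeSyncer API, and the actual download step is omitted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch of edit-log gap detection between two JournalNodes.
 * This is NOT Hadoop's JournalNodeSyncer; all names here are illustrative.
 */
public class SegmentGapSketch {

  /** A finalized edit log segment covering transactions [startTxId, endTxId]. */
  static final class Segment {
    final long startTxId;
    final long endTxId;

    Segment(long startTxId, long endTxId) {
      this.startTxId = startTxId;
      this.endTxId = endTxId;
    }

    @Override
    public String toString() {
      return "[" + startTxId + "-" + endTxId + "]";
    }
  }

  /**
   * Returns the segments in the remote manifest whose start txid does not
   * appear in the local manifest. Neither input is modified, so callers
   * may safely pass unmodifiable lists.
   */
  static List<Segment> findMissingSegments(List<Segment> local, List<Segment> remote) {
    // Index the local manifest by starting transaction ID (assumption).
    Set<Long> localStarts = new HashSet<>();
    for (Segment s : local) {
      localStarts.add(s.startTxId);
    }
    // Collect remote segments that the local JN does not have.
    List<Segment> missing = new ArrayList<>();
    for (Segment s : remote) {
      if (!localStarts.contains(s.startTxId)) {
        missing.add(s);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // The local JN missed segment 101-200 while it was down.
    List<Segment> local = List.of(new Segment(1, 100), new Segment(201, 300));
    List<Segment> remote =
        List.of(new Segment(1, 100), new Segment(101, 200), new Segment(201, 300));
    System.out.println(findMissingSegments(local, remote)); // prints [[101-200]]
  }
}
```

Building a fresh ArrayList instead of appending to an input manifest also sidesteps the problem raised in comment #2: lists such as List.of(...) or Collections.unmodifiableList(...) throw UnsupportedOperationException on add. A real syncer would also have to handle in-progress segments and partially overlapping ranges, which this sketch ignores.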