[ https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron T. Myers updated HDFS-2709:
---------------------------------

    Attachment: HDFS-2709-HDFS-1623.patch

I was indeed able to reproduce this error, usually within about 20 iterations. Here's an updated patch which fixes that occasional test failure.

I don't think this will still be an issue once HDFS-2738 goes in, but the fix to get this test to pass without that was this change in {{EditLogFileInputStream}}:

{code}
@@ -185,8 +203,9 @@ class EditLogFileInputStream extends EditLogInputStream {
           + logVersion + ". Current version = "
           + HdfsConstants.LAYOUT_VERSION + ".");
     }
-    assert logVersion <= Storage.LAST_UPGRADABLE_LAYOUT_VERSION :
-        "Unsupported version " + logVersion;
+    if (logVersion > Storage.LAST_UPGRADABLE_LAYOUT_VERSION) {
+      throw new IOException("Unsupported version " + logVersion);
+    }
     return logVersion;
   }
{code}

I then ran 70 iterations of {{TestHASafeMode}} with this patch, and never saw an error.

> HA: Appropriately handle error conditions in EditLogTailer
> ----------------------------------------------------------
>
>                 Key: HDFS-2709
>                 URL: https://issues.apache.org/jira/browse/HDFS-2709
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Aaron T. Myers
>            Priority: Critical
>         Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch,
> HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch,
> HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch,
> HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the
> middle of a file, it will go back to retrying from the beginning of the file
> on the next tailing iteration. This is incorrect since many of the edits will
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized
> file (ie skip those edits already applied), or (b) abort the standby if it
> hits an error while tailing. If "a" isn't simple, let's do "b" for now and
> come back to 'a' later since this is a rare circumstance and better to abort
> than be incorrect.
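
For illustration only, here is a minimal sketch of the failure mode the issue description above refers to, and of what option (a) might look like. Every name in it ({{TailerSketch}}, {{Edit}}, {{replay}}, {{runWithRetry}}) is hypothetical and none of it is taken from the HDFS code base or from the attached patch. It assumes each edit carries a monotonically increasing transaction id and that applying an edit is not idempotent (here, adding a delta to a counter).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

/** Hypothetical sketch; none of these names exist in HDFS. */
public class TailerSketch {
    /** One logged edit: a transaction id plus a non-idempotent delta. */
    static final class Edit {
        final long txId;
        final int delta;
        Edit(long txId, int delta) { this.txId = txId; this.delta = delta; }
    }

    private final boolean skipApplied; // true = option (a), false = naive replay
    private long lastAppliedTxId = 0;  // highest txid replayed so far
    private int state = 0;             // stand-in for the namespace state

    TailerSketch(boolean skipApplied) { this.skipApplied = skipApplied; }

    /** Replays edits in order; fails with an IOException when it reaches failAtTxId. */
    void replay(List<Edit> edits, long failAtTxId) throws IOException {
        for (Edit e : edits) {
            if (skipApplied && e.txId <= lastAppliedTxId) {
                continue; // already applied on an earlier pass
            }
            if (e.txId == failAtTxId) {
                throw new IOException("simulated error at txid " + e.txId);
            }
            state += e.delta; // applying this twice would corrupt the state
            lastAppliedTxId = e.txId;
        }
    }

    /** Fails once at txid 3, then retries from the start of the same file. */
    public static int runWithRetry(boolean skipApplied) {
        List<Edit> edits = List.of(new Edit(1, 5), new Edit(2, 5), new Edit(3, 5));
        TailerSketch tailer = new TailerSketch(skipApplied);
        try {
            tailer.replay(edits, 3);
        } catch (IOException firstPassFailure) {
            // Option (b) would abort the standby here instead of retrying.
        }
        try {
            tailer.replay(edits, -1); // -1 matches no txid, so no failure is injected
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tailer.state;
    }

    public static void main(String[] args) {
        System.out.println("naive retry:  " + runWithRetry(false)); // 25
        System.out.println("skip applied: " + runWithRetry(true));  // 15
    }
}
```

With three edits of +5 each, the correct final state is 15. The naive retry reapplies edits 1 and 2 and ends at 25, while the variant that skips already-applied transaction ids ends at 15. Option (b) corresponds to stopping at the first caught exception rather than retrying.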