[ https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177483#comment-13177483 ]
Aaron T. Myers commented on HDFS-2709: -------------------------------------- Thanks a lot for the thorough review, Eli. Comments inline. I also found and fixed another little bug involving a potential race between the edits log tailer thread and rolling edits logs. I'll post an updated patch in a moment. bq. This change handles errors reading an edit from the log (the common case) but not when there's a failure to apply an edit (eg if there was a bug, or a silent corruption somehow went unnoticed). While loadEdits won't ignore (will throw) this exception it does get propagated up to the catch of Throwable in EditLogTailer#run so we effectively retry endlessly in this case. Need to replace the TODO(HA) comment there with code to shutdown the SBN. Feel free to punt to another jira. Indeed, I had originally intended to do this as part of a separate JIRA, but I'm rethinking that decision. I've added some code to shutdown the SBN, and amended the tests to verify this behavior. bq. How about adding a test that uses multiple shared edits dirs, and shows that a failure to read from one of them will cause the tailer to not catch up, can file a jira for a future change that is OK with faulty shared dirs as long as one is working. Multiple shared edits dirs isn't currently supported or tested. It's certainly an obvious improvement worth doing, but there are currently no tests for it. We should probably file a JIRA to test that. bq. In FileJournalManager#getNumberOfTransactions, not that the we loosen the check to elf.containsTxId(fromTxid) isn't the last else case dead code? Yes indeed, not sure how I missed that. Removed. bq. I think we can remove the "TODO(HA): Should this happen when called by the tailer?" comment in loadEdits right since we always create new streams when we select them? Yes indeed. Removed. bq. Would it be simpler in LimitedEditLogAnswer#answer to spy on each stream and stub readOp rather than introduce LimitedEditLogInputStream? Different? Yes. Simpler? Maybe. I did it this way because I thought creating spies within spies was kind of gross. I switched it to use a spy in this latest patch, which is at least less code. :) bq. How about introducing DFSHATestUtil and put waitForStandbyToCatchUp and CouldNotCatchUpException there? Seems like the methods you pointed out in the HDFS-2692 review could go there as well). Good idea. Let's do it in a separate JIRA though, along the lines of "consolidate generic HA test helper methods." bq. Nit: "IOException e", s/e/ioe/ Done. bq. testFailuretoReadEdits needs a javadoc Done. bq. waitForStandbyToCatchUp needs a javadoc indicating it waits for NN_LAG_TIMEOUT then throws CouldNotCatchUp Done. > HA: Appropriately handle error conditions in EditLogTailer > ---------------------------------------------------------- > > Key: HDFS-2709 > URL: https://issues.apache.org/jira/browse/HDFS-2709 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ha, name-node > Affects Versions: HA branch (HDFS-1623) > Reporter: Todd Lipcon > Assignee: Aaron T. Myers > Priority: Critical > Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, > HDFS-2709-HDFS-1623.patch > > > Currently if the edit log tailer experiences an error replaying edits in the > middle of a file, it will go back to retrying from the beginning of the file > on the next tailing iteration. This is incorrect since many of the edits will > have already been replayed, and not all edits are idempotent. > Instead, we either need to (a) support reading from the middle of a finalized > file (ie skip those edits already applied), or (b) abort the standby if it > hits an error while tailing. If "a" isn't simple, let's do "b" for now and > come back to 'a' later since this is a rare circumstance and better to abort > than be incorrect. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira