[jira] [Commented] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

Aaron T. Myers (Commented) (JIRA) Thu, 29 Dec 2011 17:02:08 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177483#comment-13177483
 ]


Aaron T. Myers commented on HDFS-2709:
--------------------------------------

Thanks a lot for the thorough review, Eli. Comments inline. I also found and 
fixed another little bug involving a potential race between the edits log 
tailer thread and rolling edits logs. I'll post an updated patch in a moment.

bq. This change handles errors reading an edit from the log (the common case) 
but not when there's a failure to apply an edit (eg if there was a bug, or a 
silent corruption somehow went unnoticed). While loadEdits won't ignore (will 
throw) this exception it does get propagated up to the catch of Throwable in 
EditLogTailer#run so we effectively retry endlessly in this case. Need to 
replace the TODO(HA) comment there with code to shutdown the SBN. Feel free to 
punt to another jira.

Indeed, I had originally intended to do this as part of a separate JIRA, but 
I'm rethinking that decision. I've added some code to shutdown the SBN, and 
amended the tests to verify this behavior.

bq. How about adding a test that uses multiple shared edits dirs, and shows 
that a failure to read from one of them will cause the tailer to not catch up, 
can file a jira for a future change that is OK with faulty shared dirs as long 
as one is working.

Multiple shared edits dirs isn't currently supported or tested. It's certainly 
an obvious improvement worth doing, but there are currently no tests for it. We 
should probably file a JIRA to test that.

bq. In FileJournalManager#getNumberOfTransactions, not that the we loosen the 
check to elf.containsTxId(fromTxid) isn't the last else case dead code?

Yes indeed, not sure how I missed that. Removed.

bq. I think we can remove the "TODO(HA): Should this happen when called by the 
tailer?" comment in loadEdits right since we always create new streams when we 
select them?

Yes indeed. Removed.

bq. Would it be simpler in LimitedEditLogAnswer#answer to spy on each stream 
and stub readOp rather than introduce LimitedEditLogInputStream?

Different? Yes. Simpler? Maybe. I did it this way because I thought creating 
spies within spies was kind of gross. I switched it to use a spy in this latest 
patch, which is at least less code. :)

bq. How about introducing DFSHATestUtil and put waitForStandbyToCatchUp and 
CouldNotCatchUpException there? Seems like the methods you pointed out in the 
HDFS-2692 review could go there as well).

Good idea. Let's do it in a separate JIRA though, along the lines of 
"consolidate generic HA test helper methods."

bq. Nit: "IOException e", s/e/ioe/

Done.

bq. testFailuretoReadEdits needs a javadoc

Done.

bq. waitForStandbyToCatchUp needs a javadoc indicating it waits for 
NN_LAG_TIMEOUT then throws CouldNotCatchUp

Done.
                
> HA: Appropriately handle error conditions in EditLogTailer
> ----------------------------------------------------------
>
>                 Key: HDFS-2709
>                 URL: https://issues.apache.org/jira/browse/HDFS-2709
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Aaron T. Myers
>            Priority: Critical
>         Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If "a" isn't simple, let's do "b" for now and 
> come back to 'a' later since this is a rare circumstance and better to abort 
> than be incorrect.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

Reply via email to