[ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177735#comment-13177735
 ] 

Aaron T. Myers commented on HDFS-2709:
--------------------------------------

bq. Rather than modify EditLogFileInputStream to take a startTxId, why not do 
the "skipping" (what you call setInitialPosition) from the caller? ie modify 
FSEditLogLoader to skip the transactions that have already been replayed? The 
skipping code doesn't seem specific to the input stream itself.

What I did seems cleaner to me. We're necessarily changing the code which 
selects streams to allow a request for a starting txid in the middle of an ELF, 
so why should that return an ELFIS which starts at a lower txid?
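
To illustrate the shape of that approach, here is a minimal sketch in which the stream positions itself at the requested txid, so the caller never sees ops it has already replayed. The names {{Op}} and {{SkippingOpStream}} are hypothetical stand-ins for illustration only; this is not the code in the patch and not the real {{EditLogFileInputStream}} API.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for an edit log op; only the txid matters here.
final class Op {
    final long txId;

    Op(long txId) {
        this.txId = txId;
    }
}

// Hypothetical wrapper: positions itself at startTxId on construction,
// so callers never see ops below the txid they asked for.
final class SkippingOpStream implements Iterator<Op> {
    private final Iterator<Op> underlying;
    private Op next;

    SkippingOpStream(Iterator<Op> underlying, long startTxId) {
        this.underlying = underlying;
        // Discard ops that were already replayed.
        while (underlying.hasNext()) {
            Op candidate = underlying.next();
            if (candidate.txId >= startTxId) {
                next = candidate;
                break;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public Op next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        Op result = next;
        next = underlying.hasNext() ? underlying.next() : null;
        return result;
    }
}
```

With the skipping done at construction time, whoever selects the stream for a given starting txid gets back a stream that really does start there.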

bq. I'm not convinced why we need to have the partialLoadOk flag in 
FSEditLogLoader. IMO if the log is truncated, it's still an error as far as the 
loader is concerned - we just want to let the caller continue from where the 
error occurred. The only trick is how to go about getting the last successfully 
loaded txid out of the FSEditLogLoader in the error case – I guess a member 
variable and a getter would work there? Do you think this ends up messier than 
the way you've done it?

I considered that. I also considered throwing a custom {{Exception}} which 
includes the last successfully loaded txid. Both of those seemed messier 
than the way I did it, but I could probably be convinced otherwise.
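
For comparison, the custom-exception alternative could look roughly like the sketch below. {{PartialLoadException}} and {{SketchLoader}} are hypothetical names invented for illustration, and a negative txid is used purely to simulate a truncated or unreadable op; none of this exists in the patch or in HDFS.

```java
import java.io.IOException;

// Hypothetical exception type that carries the last txid applied before the failure.
class PartialLoadException extends IOException {
    private final long lastLoadedTxId;

    PartialLoadException(String message, long lastLoadedTxId, Throwable cause) {
        super(message, cause);
        this.lastLoadedTxId = lastLoadedTxId;
    }

    long getLastLoadedTxId() {
        return lastLoadedTxId;
    }
}

// Hypothetical loader used only to show where the exception would be thrown.
final class SketchLoader {
    // Applies txids in order and returns the last one applied.
    // A negative value simulates a truncated or unreadable op.
    static long load(long[] txIds, long lastAppliedTxId) throws PartialLoadException {
        long last = lastAppliedTxId;
        for (long txId : txIds) {
            if (txId < 0) {
                throw new PartialLoadException(
                    "load failed after txid " + last, last, null);
            }
            last = txId;
        }
        return last;
    }
}
```

The caller would catch the exception and resume from {{getLastLoadedTxId() + 1}}, which means every call site has to handle both the normal return value and the exceptional one.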

Note that the way I did it in this patch does not preclude the 
{{EditLogTailer}} from detecting that not all expected transactions were 
loaded. The {{EditLogTailer}} already knows both how many are available from 
the files in the shared dir, and how many transactions were in fact loaded. 
This would allow one to implement Eli's suggestion of "retry a read failure X 
times and then exit," though this patch does not currently do that.
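
As a rough sketch of how such a follow-up could work, the class below compares the last available txid with the last loaded txid on each tailing pass and gives up after a bounded number of consecutive shortfalls. {{TailerSketch}}, {{tailOnce}}, and the loader callback are hypothetical and do not reflect the actual {{EditLogTailer}} code; throwing {{IllegalStateException}} stands in for whatever "exit" would mean in practice.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical tailer that tracks how far it has loaded and bounds its retries.
final class TailerSketch {
    private final int maxRetries;
    private int consecutiveFailures = 0;
    private long lastLoadedTxId;

    TailerSketch(long lastLoadedTxId, int maxRetries) {
        this.lastLoadedTxId = lastLoadedTxId;
        this.maxRetries = maxRetries;
    }

    // The loader takes the first txid to load and returns the last txid it
    // actually loaded. Returns true once the tailer has caught up with
    // lastAvailableTxId; throws after more than maxRetries consecutive shortfalls.
    boolean tailOnce(long lastAvailableTxId, LongUnaryOperator loader) {
        if (lastLoadedTxId < lastAvailableTxId) {
            // Resume after the last loaded txid rather than from the start of the file.
            lastLoadedTxId = loader.applyAsLong(lastLoadedTxId + 1);
        }
        if (lastLoadedTxId >= lastAvailableTxId) {
            consecutiveFailures = 0;
            return true;
        }
        consecutiveFailures++;
        if (consecutiveFailures > maxRetries) {
            throw new IllegalStateException(
                "still short of txid " + lastAvailableTxId + " after "
                    + maxRetries + " retries; last loaded txid " + lastLoadedTxId);
        }
        return false;
    }

    long getLastLoadedTxId() {
        return lastLoadedTxId;
    }
}
```

Because each pass resumes from the last loaded txid plus one, a retry never replays an edit that was already applied.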

bq. Can we add some non-HA tests that exercise 
FileJournalManager/FSEditLogLoader's ability to start mid-stream? Not sure if 
that's feasible.

I'm not quite sure what you mean by this. The way the code is currently 
structured, the code for continuing from the middle of an ELF will only be 
reached in an HA context. That's the point of the {{partialLoadOk}} option, 
which is only passed as true when HA is enabled.
                
> HA: Appropriately handle error conditions in EditLogTailer
> ----------------------------------------------------------
>
>                 Key: HDFS-2709
>                 URL: https://issues.apache.org/jira/browse/HDFS-2709
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Aaron T. Myers
>            Priority: Critical
>         Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch, 
> HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If (a) isn't simple, let's do (b) for now and 
> come back to (a) later, since this is a rare circumstance and it is better to 
> abort than be incorrect.


