[ 
https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026675#comment-13026675
 ] 

Todd Lipcon commented on HDFS-1580:
-----------------------------------

Hi Jitendra. Here are some thoughts on your latest document:

- While I appreciate that this work will probably make snapshots a little 
easier down the road, it is far from the most difficult part of supporting 
snapshots, nor is it really the goal we're trying to address. So I think it's 
premature to mention snapshots in the design.
- I think the concept of "layout version" has been overloaded far too much. We 
currently use a single version number to indicate (a) the file and 
serialization format for image dumps, (b) the file and serialization format for 
edit logs, and (c) the actual layout of files within the {{current/}} 
directory. I would like to advocate splitting this out into 
IMAGE_FORMAT_VERSION, EDITS_FORMAT_VERSION, and LAYOUT_VERSION. To be clear, 
this jira is mostly concerned with what I would call EDITS_FORMAT_VERSION (e.g. 
the way in which we turn a mkdirs into bytes). Do you agree with this 
interpretation?
- The idea of a {{purgeTransactions}} call makes sense -- after a checkpoint 
has been uploaded for txid N, we don't need edits prior to N. However, there 
are some policies that make sense to me like "keep edits for at least a week". 
Would you assume these retention policies would be the responsibility of the 
edit log implementation? That is, even if told to purge transactions older than 
txid N, it might keep them around for some time, or take care of archiving them 
to a NAS/HDFS?
- For the {{getInputStream}} call, must {{sinceTxId}} fall on any particular 
kind of boundary? E.g., must it correspond to a "mark" call? See more about 
this below regarding the idea of "log segments".
- I don't entirely understand the usage of the {{setVersion}} call. When would 
the version of a log change mid-stream?
- I'm not entirely clear on "mark" either. The semantics described in the 
"Discussion" section are what I would normally call {{sync}}, but in other 
parts of the document it's described as a {{roll}} equivalent. If it's not 
sync, then we're missing sync altogether, and that implies that each {{write}} 
call will have to sync on its own, thus breaking group commit. I think we 
should maintain the existing buffering/syncing calls {{write}}, 
{{setReadyToFlush}}, and {{flushAndSync}}.
- The {{EditLogInputStream}} interface is strange - it's called InputStream but 
doesn't follow the normal InputStream API. It behaves somewhat like an 
Iterator, but doesn't implement that interface either. Could we add a wrapper 
class {{EditTransaction}}, and make EditLogInputStream an 
Iterable<EditTransaction>? EditTransaction would then carry the {{getTxnId}} 
call.
- The API {{getTxn}} shouldn't return {{byte[]}} since that implies an extra 
buffer copy to get a transaction into its own array. Instead it should be able 
to point into an existing byte array. Alternatively, the input stream could 
continue to implement InputStream so we can use the existing editlog loading 
code.
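
To make the last two points concrete, here is a rough sketch of an iterable 
stream whose transactions are views into the underlying array rather than 
copies. Everything here is an assumption for illustration: the class 
{{InMemoryEditLogInputStream}}, the {{getPayload}} and {{frame}} methods, and 
the record framing (8-byte txid, 4-byte length, payload) are made up and are 
not the HDFS edit log format.

```java
import java.nio.ByteBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical wrapper: one transaction, viewed in place inside a shared array. */
final class EditTransaction {
    private final long txId;
    private final ByteBuffer payload; // read-only slice of the log, not a copy

    EditTransaction(long txId, ByteBuffer payload) {
        this.txId = txId;
        this.payload = payload;
    }

    long getTxnId() { return txId; }

    /** Returns an independent read-only view, so callers cannot disturb each other. */
    ByteBuffer getPayload() { return payload.duplicate(); }
}

/** Illustrative stand-in for EditLogInputStream, backed by an in-memory array. */
final class InMemoryEditLogInputStream implements Iterable<EditTransaction> {
    private static final int HEADER_BYTES = 12; // assumed framing: long txid + int length
    private final byte[] log;

    InMemoryEditLogInputStream(byte[] log) { this.log = log; }

    @Override
    public Iterator<EditTransaction> iterator() {
        final ByteBuffer buf = ByteBuffer.wrap(log).asReadOnlyBuffer();
        return new Iterator<EditTransaction>() {
            @Override public boolean hasNext() { return buf.remaining() >= HEADER_BYTES; }

            @Override public EditTransaction next() {
                if (!hasNext()) throw new NoSuchElementException();
                long txId = buf.getLong();
                int len = buf.getInt();
                ByteBuffer view = buf.slice();      // shares the backing array
                view.limit(len);                    // restrict the view to this record
                buf.position(buf.position() + len); // skip to the next record
                return new EditTransaction(txId, view);
            }
        };
    }

    /** Example-only helper: frames payloads with consecutive txids into one array. */
    static byte[] frame(long firstTxId, byte[]... payloads) {
        int size = 0;
        for (byte[] p : payloads) size += HEADER_BYTES + p.length;
        ByteBuffer out = ByteBuffer.allocate(size);
        long txId = firstTxId;
        for (byte[] p : payloads) out.putLong(txId++).putInt(p.length).put(p);
        return out.array();
    }
}
```

With this shape, the loader can use a plain for-each loop over the stream, and 
no per-transaction array is allocated.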

As I've proposed over in some other JIRAs, I think we should do away with the 
{{roll}} call, and instead make the concept of _log segments_ a first class 
citizen. In the file-based storage case, a log segment is an individual file. 
In the BK case, it may be that a log segment is a ledger (I don't know BK's API 
well).

Thus, rolling the logs becomes a sequence like:
{code}
    endCurrentLogSegment();
    long nextTxId = getLastWrittenTxId() + 1;
    LOG.info("Rolling edit logs. Next txid after roll will be " + nextTxId);
    startLogSegment(nextTxId);
{code}
where {{endCurrentLogSegment}} closes off the current segment across all 
journals, and {{startLogSegment}} starts a new output stream across all 
journals.
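
As a rough illustration of how that sequence could fan out across journals, 
here is a self-contained sketch. The {{SegmentedJournal}} interface, the 
{{InMemoryJournal}} and {{JournalSet}} classes, and the {{write}}, {{logEdit}} 
and {{rollEditLog}} signatures are assumptions made up for this example; they 
are not taken from the working tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-journal API built around explicit log segments. */
interface SegmentedJournal {
    void startLogSegment(long firstTxId);
    void write(long txId, String op);
    void endCurrentLogSegment();
}

/** Toy journal that only records the txid range of each finalized segment. */
final class InMemoryJournal implements SegmentedJournal {
    final List<String> finalizedSegments = new ArrayList<>();
    private long firstTxId;
    private long lastTxId;
    private boolean open;

    @Override public void startLogSegment(long firstTxId) {
        if (open) throw new IllegalStateException("segment already open");
        this.firstTxId = firstTxId;
        this.lastTxId = firstTxId - 1;
        open = true;
    }

    @Override public void write(long txId, String op) {
        if (!open) throw new IllegalStateException("no open segment");
        if (txId != lastTxId + 1) throw new IllegalArgumentException("txid gap at " + txId);
        lastTxId = txId;
    }

    @Override public void endCurrentLogSegment() {
        if (!open) throw new IllegalStateException("no open segment");
        finalizedSegments.add("[" + firstTxId + "," + lastTxId + "]");
        open = false;
    }
}

/** Applies every operation to all journals so their segments stay parallel. */
final class JournalSet {
    private final List<SegmentedJournal> journals;
    private long lastWrittenTxId;

    JournalSet(List<SegmentedJournal> journals, long lastWrittenTxId) {
        this.journals = journals;
        this.lastWrittenTxId = lastWrittenTxId;
    }

    void startLogSegment(long firstTxId) {
        for (SegmentedJournal j : journals) j.startLogSegment(firstTxId);
    }

    void endCurrentLogSegment() {
        for (SegmentedJournal j : journals) j.endCurrentLogSegment();
    }

    long logEdit(String op) {
        long txId = ++lastWrittenTxId;
        for (SegmentedJournal j : journals) j.write(txId, op);
        return txId;
    }

    long getLastWrittenTxId() { return lastWrittenTxId; }

    /** The roll sequence: close the segment everywhere, then open the next one. */
    long rollEditLog() {
        endCurrentLogSegment();
        long nextTxId = getLastWrittenTxId() + 1;
        startLogSegment(nextTxId);
        return nextTxId;
    }
}
```

Because every journal sees the same start and end calls, each one ends up with 
the same list of segment boundaries.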

The advantages I see of this approach are:
- Elsewhere we have discussed that we want to keep the property that logs 
always roll together across all parts of the system, and thus that the storage 
directories have parallel contents with identical names and identical file 
contents. It's possible to achieve this with just the roll API, but it becomes 
more obvious how to do it with the segment concept. As one example, consider 
what happens when one journal fails (e.g. due to an NFS mount going down 
temporarily). While it's down, we don't write txns to this journal. But, after 
some time we may notice that the mount is available again. Rather than just 
calling {{roll}} here, it makes sense to be explicit that we're starting a new 
segment, and be explicit about the starting txid of that new segment.

- We generally want the property that, while saving a namespace or in safe 
mode, we don't accept edits. Thus, it would be nice to have the edit log 
actually be closed during this operation. Splitting {{roll}} into an 
{{endCurrent}} and a {{startNext}} allows us to put the namespace dump between 
the two and make sure that no edits could possibly be written while saving.
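
A minimal sketch of that ordering, assuming a single hypothetical 
{{CheckpointingEditLog}} class whose {{saveNamespace}} takes the image dump as 
a {{Runnable}} (both are invented for this example, as is the choice to reopen 
the log even if the dump fails):

```java
/** Toy edit log that is closed for the whole duration of a namespace save. */
final class CheckpointingEditLog {
    private boolean segmentOpen;
    private long lastWrittenTxId;

    void startLogSegment(long firstTxId) {
        if (segmentOpen) throw new IllegalStateException("segment already open");
        if (firstTxId != lastWrittenTxId + 1) {
            throw new IllegalArgumentException("segment must start at " + (lastWrittenTxId + 1));
        }
        segmentOpen = true;
    }

    void endCurrentLogSegment() {
        if (!segmentOpen) throw new IllegalStateException("no open segment");
        segmentOpen = false;
    }

    /** Rejects the edit outright when no segment is open. */
    long logEdit(String op) {
        if (!segmentOpen) throw new IllegalStateException("edit log is closed: " + op);
        return ++lastWrittenTxId;
    }

    /** Ends the segment, dumps the image, then starts the next segment. */
    long saveNamespace(Runnable dumpImage) {
        endCurrentLogSegment();
        long checkpointTxId = lastWrittenTxId;
        try {
            dumpImage.run(); // any logEdit attempted in here throws
        } finally {
            // Assumption: reopen the log whether or not the dump succeeded.
            startLogSegment(checkpointTxId + 1);
        }
        return checkpointTxId;
    }
}
```

The point is that the "no edits while saving" property is enforced by the log 
being closed, not by callers remembering to check a flag.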

What do you think about these suggestions? You can see a working tree with the 
"log segment" concept at 
https://github.com/toddlipcon/hadoop-hdfs/tree/hdfs-1073-march/src/java/org/apache/hadoop/hdfs/server/namenode/

> Add interface for generic Write Ahead Logging mechanisms
> --------------------------------------------------------
>
>                 Key: HDFS-1580
>                 URL: https://issues.apache.org/jira/browse/HDFS-1580
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ivan Kelly
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: EditlogInterface.1.pdf, HDFS-1580+1521.diff, 
> HDFS-1580.diff, HDFS-1580.diff, HDFS-1580.diff, generic_wal_iface.pdf, 
> generic_wal_iface.pdf, generic_wal_iface.pdf, generic_wal_iface.txt
>
>


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
