[jira] [Commented] (HDFS-1580) Add interface for generic Write Ahead Logging mechanisms

Todd Lipcon (JIRA) Thu, 28 Apr 2011 16:28:44 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026772#comment-13026772
 ]


Todd Lipcon commented on HDFS-1580:
-----------------------------------

bq. which is relevant for understanding the edit logs. I agree that version is 
a little overloaded but that can be addressed in a different JIRA

Agreed that's a separate JIRA -- I just wanted to clarify that the version 
you're talking about here is the "edits log serialization format version" 
rather than something about actual layout.

bq. If namenode has called purgeTransactions it should never ask for older 
transaction ids

Fair enough.

bq. Apart from that, sinceTxnId doesn't assume any boundary

I think that will really complicate things like edits transfer in the 2NN. In 
the file-based storage there's no clean way to seek to a particular transaction 
ID, so we'd have to add that facility to EditLogInputStream, etc. 
That's a lot of complexity for little benefit that I can see.
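A minimal sketch of why segment boundaries keep this simple, assuming finalized edits are tracked as inclusive [firstTxId, lastTxId] ranges. `SegmentSelector`, `Segment` and `segmentsSince` are hypothetical names, not code from the HDFS-1073 branch. A request that starts on a boundary is served as whole files; one that starts mid-segment is rejected, because serving it would need exactly the in-file seek discussed above:

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentSelector {

    /** Hypothetical finalized edits segment covering transactions [firstTxId, lastTxId]. */
    static final class Segment {
        final long firstTxId;
        final long lastTxId;

        Segment(long firstTxId, long lastTxId) {
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }

        @Override
        public String toString() {
            return "[" + firstTxId + ", " + lastTxId + "]";
        }
    }

    /**
     * Returns the whole segments a reader must fetch to obtain every
     * transaction >= sinceTxId. Rejects a sinceTxId strictly inside a
     * segment, since serving it would require seeking within a file.
     */
    static List<Segment> segmentsSince(List<Segment> segments, long sinceTxId) {
        List<Segment> result = new ArrayList<>();
        for (Segment s : segments) {
            if (s.firstTxId < sinceTxId && sinceTxId <= s.lastTxId) {
                throw new IllegalArgumentException(
                        "txid " + sinceTxId + " is inside segment " + s);
            }
            if (s.firstTxId >= sinceTxId) {
                result.add(s); // transferred as one complete file
            }
        }
        return result;
    }
}
```

With segments [1, 100], [101, 250] and [251, 300], asking for everything since txid 101 yields the last two files unchanged, while asking since txid 150 fails instead of forcing a partial read.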

bq. The motivation for the "mark" method was that BK has a limitation that open 
ledgers cannot be read; "mark" gives a BK implementation a cue that the 
current ledger should be made available for reading

This seems like a somewhat serious flaw. If we anticipate using BK for HA, I 
was under the impression that the "hot backup" would be following along on the 
edits as they're written into BK. What you're saying here implies that the 
primary NN would have to be rolling its logs every few seconds if you want the 
standby to be truly "hot".

bq. If an implementation doesn't have this limitation it can just ignore mark, 
that is why I didn't call it roll

Another way of doing this is to say that, if an implementation _does_ have this 
limitation, it can choose to "mark" whenever it likes. No?
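A minimal sketch of that alternative, assuming a ledger-style backend whose open ledger cannot be read. The namenode only ever calls `write`; the implementation closes ("marks") the current ledger on its own every `markEvery` transactions. `SelfMarkingLog` and its in-memory ledgers are hypothetical stand-ins, not the BookKeeper API, and a real implementation might equally mark on a timer:

```java
import java.util.ArrayList;
import java.util.List;

public class SelfMarkingLog {
    // Simulated ledgers: closed ones are readable, the open one is not.
    private final List<List<byte[]>> closedLedgers = new ArrayList<>();
    private List<byte[]> openLedger = new ArrayList<>();
    private final int markEvery;

    public SelfMarkingLog(int markEvery) {
        if (markEvery <= 0) {
            throw new IllegalArgumentException("markEvery must be positive");
        }
        this.markEvery = markEvery;
    }

    /** The only call the writer makes; marking policy stays internal. */
    public void write(byte[] txn) {
        openLedger.add(txn);
        if (openLedger.size() >= markEvery) {
            mark();
        }
    }

    /** Closes the open ledger so readers can see it; invisible to the caller. */
    private void mark() {
        closedLedgers.add(openLedger);
        openLedger = new ArrayList<>();
    }

    /** Transactions a reader (e.g. a standby) can currently see. */
    public int readableTxnCount() {
        int n = 0;
        for (List<byte[]> ledger : closedLedgers) {
            n += ledger.size();
        }
        return n;
    }
}
```

With `markEvery = 3`, after seven writes a reader sees six transactions and the seventh waits in the open ledger. It also makes the cost concrete: how "hot" a standby can be is bounded by how often the implementation marks.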

bq. I assumed that a write also syncs, because in most operations we sync 
immediately after writing the log, and in this design we are writing the entire 
transaction as a unit. 

In fact this is not at all how the current design works. Most operations write 
the edit to the log while holding the FSN lock (to ensure serialized order 
between ops) and then drop the FSN lock to sync. This allows group commit and 
is crucial for reasonable throughput.

bq. Management of buffers and flush, should be the responsibility of the 
implementation.

But flush needs to be coordinated as a separate action from writing in order to 
achieve lock release and group commit.
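A simplified sketch of the write-under-lock, sync-outside-lock pattern described above. `GroupCommitLog`, `fsnLock`, `logEdit` and `logSync` are illustrative names, not the actual FSEditLog implementation, and the in-memory `durable` buffer stands in for a real write plus fsync:

```java
public class GroupCommitLog {
    private final Object fsnLock = new Object();  // stands in for the FSN lock
    private final Object syncLock = new Object(); // serializes physical syncs

    // Guarded by fsnLock.
    private final StringBuilder pending = new StringBuilder();
    private long lastWrittenTxId = 0;

    // Guarded by syncLock.
    private final StringBuilder durable = new StringBuilder(); // simulated disk
    private long syncedTxId = 0;
    private int syncCount = 0;

    /** Cheap step done under the namespace lock; fixes the edit's order. */
    public long logEdit(String op) {
        synchronized (fsnLock) {
            pending.append(op).append('\n');
            return ++lastWrittenTxId;
        }
    }

    /** Expensive step; must be called after fsnLock has been released. */
    public void logSync(long myTxId) {
        synchronized (syncLock) {
            if (myTxId <= syncedTxId) {
                return; // an earlier sync already made this edit durable
            }
            String batch;
            long batchEnd;
            synchronized (fsnLock) { // brief: take everything written so far
                batch = pending.toString();
                pending.setLength(0);
                batchEnd = lastWrittenTxId;
            }
            durable.append(batch); // slow write + fsync, done without fsnLock
            syncedTxId = batchEnd;
            syncCount++;
        }
    }

    public int syncCount() {
        synchronized (syncLock) { return syncCount; }
    }

    public String durableContents() {
        synchronized (syncLock) { return durable.toString(); }
    }
}
```

Three edits written back to back are made durable by the first `logSync`; the other two calls return immediately, so one physical sync is paid for three operations. If `write` also synced, each edit would pay for its own sync while the lock was still held.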

bq. readNext() reads the version and txnId and synchronizes the underlying 
inputstream to the beginning of the transaction record, and then getTxn can directly 
return the underlying inputstream for reading the transaction bytes

Yep, that makes sense.
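A minimal sketch of that reader, assuming a hypothetical per-record layout of `int version`, `long txId`, `int length`, then `length` body bytes. The length field is an assumption added so the caller knows where the record ends; `TxnReader` is not an existing HDFS class:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class TxnReader {
    private final DataInputStream in;
    private int version;
    private long txId;
    private int length;

    public TxnReader(InputStream raw) {
        this.in = new DataInputStream(raw);
    }

    /**
     * Reads the next record header, leaving the stream positioned at the
     * first body byte. Returns false at end of stream. (A real reader would
     * also distinguish a torn, partially written header from a clean end.)
     */
    public boolean readNext() throws IOException {
        try {
            version = in.readInt();
        } catch (EOFException e) {
            return false;
        }
        txId = in.readLong();
        length = in.readInt();
        return true;
    }

    public int getVersion() { return version; }
    public long getTxId() { return txId; }
    public int getLength() { return length; }

    /**
     * Returns the underlying stream directly, with no copy. The caller must
     * consume exactly getLength() bytes before calling readNext() again.
     */
    public InputStream getTxn() { return in; }
}
```

Handing back the raw stream avoids copying each transaction, at the price of a contract the caller has to honor.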


bq. LogSegments gets rid of the roll method but exposes the underlying units of 
storage to the namenode, which I don't think is required

It's not absolutely required in the theoretical sense, but in the sense that 
we'd like to keep the code as simple as possible, I think it helps that goal. 
For example, edit log transfer right now is based around the concept of 
discrete files which can be entirely fetched, with an associated md5sum. If we 
have to support fetching arbitrary ranges of transactions, these safety checks 
become more difficult to implement. And we'd need to split the "file transfer" 
code into two different code paths, one for files (fsimage) and another for 
edits (arbitrary transaction ranges).

bq. Do we really want this property? Isn't it better that we don't expose any 
boundaries between transactions to the namenode?

Yes, this property is very useful for operations. Refer to the discussion on 
HDFS-1073 about this property. The fact that I can run "md5sum 
/data/{1..4}/dfs/name/current/*" and verify that the files are all identical 
gives me great peace of mind.
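A rough Java equivalent of that md5sum check, assuming every storage directory holds a byte-identical copy of each finalized segment. `EditsReplicaCheck` is a hypothetical helper, not something in the branch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

public class EditsReplicaCheck {

    /** MD5 of a whole file, the same digest md5sum computes. */
    static byte[] md5(Path file) throws IOException, NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
    }

    /** True if every copy of a finalized segment has the same MD5. */
    static boolean allCopiesIdentical(List<Path> copies)
            throws IOException, NoSuchAlgorithmException {
        if (copies.isEmpty()) {
            return true;
        }
        byte[] first = md5(copies.get(0));
        for (Path p : copies.subList(1, copies.size())) {
            if (!Arrays.equals(first, md5(p))) {
                return false; // this copy diverged from the first one
            }
        }
        return true;
    }
}
```

This only works because segments are finalized at the same transaction boundaries in every directory; if each backend chose its own boundaries, equal contents would no longer imply equal files.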




> Add interface for generic Write Ahead Logging mechanisms
> --------------------------------------------------------
>
>                 Key: HDFS-1580
>                 URL: https://issues.apache.org/jira/browse/HDFS-1580
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ivan Kelly
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: EditlogInterface.1.pdf, HDFS-1580+1521.diff, 
> HDFS-1580.diff, HDFS-1580.diff, HDFS-1580.diff, generic_wal_iface.pdf, 
> generic_wal_iface.pdf, generic_wal_iface.pdf, generic_wal_iface.txt
>
>


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
