[jira] [Commented] (HDFS-2601) Proposal to store edits and checkpoints inside HDFS itself for namenode

Karthik Ranganathan (Commented) (JIRA) Fri, 20 Jan 2012 08:16:06 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189868#comment-13189868
 ]


Karthik Ranganathan commented on HDFS-2601:
-------------------------------------------

@Sanjay - done, removed the HA from the title.

@Hari - a little confused about the write latency on sync versus async. By 
definition, I think we cant go the async route right - because the nn has to 
complete a transaction (which involved writing to fsedits) before ack-ing to 
the client.

Comparing this pipeline solution with bookkeeper, I am trying to understand how 
the BK based solution would do any better - because the nn would still have to 
wait for the BK system to ack that it has written to the logical fsedits file 
on 3 different machines.

Another data point is that HBase today writes the its Write-Ahead Log (similar 
in function to fsedits) - to HDFS as a file, and the latencies do not seem too 
bad. The 2 things that HBase does are:
- group commit a bunch of transactions instead of making one RPC per edit
- once the replication pipeline drops to below 3, it immediately closes the 
file so that nn will re-replicate.

Wondering if a similar issue would work for the HDFS case - the only perf 
related enhancement we might need is parallel writes from the client. 
                
> Proposal to store edits and checkpoints inside HDFS itself for namenode
> -----------------------------------------------------------------------
>
>                 Key: HDFS-2601
>                 URL: https://issues.apache.org/jira/browse/HDFS-2601
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: name-node
>            Reporter: Karthik Ranganathan
>
> Would have liked to make this a "brainstorming" JIRA but couldn't find the 
> option for some reason.
> I have talked to a quite a few people about this proposal at Facebook 
> internally (HDFS folks like Hairong and Dhruba, as well as HBase folks 
> interested in this feature), and wanted to broaden the audience.
> At the core of the HA feature, we need 2 things:
> A. the secondary NN (or avatar stand-by or whatever we call it) needs to read 
> all the fsedits and fsimage data written by the primary NN
> B. Once the stand-by has taken over, the old NN should not be allowed to make 
> any edits
> The basic idea is as follows (there are some variants, we can hone in on the 
> details if we like the general approach):
> 1. The write path for fsedits and fsimage: 
> 1.1 The NN uses a dfs client to write fsedits and fsimage. These will be 
> regular hdfs files written using the write pipeline.
> 1.2 Let us say the fsimage and fsedits files are written to a well-known 
> location in the local HDFS itself (say /.META or some such location)
> 1.3 The create files and add blocks to files in this path are not written to 
> fsimage or fsedits. The location of the blocks for the files in this location 
> are known to all namenodes - primary and standby - somehow (some 
> possibilities here - write these block ids to zk or use reserved block ids or 
> write some meta-data into the blocks itself and store the blocks in a well 
> known location on all the datanodes)
> 1.4 If the replication factor on the write pipeline decreases, we close the 
> block immediately and allow NN to re-replicate to bring up the replication 
> factor. We continue writing to a new block
> 2. The read path on a NN failure
> 2.1 Since the new NN "knows" the location of the blocks for the fsedits and 
> fsimage (again the same possibilities as mentioned above), there is nothing 
> to do to determine this
> 2.2 It can read the files it needs using the HDFS client itself
> 3. Fencing - if a NN is unresponsive, a new NN takes over, old NN should not 
> be allowed to perform any action
> 3.1 Use HDFS lease recovery for the fsedits and fsimage files - the new NN 
> will close all these files baing written to by the old NN (and hence all the 
> blocks)
> 3.2 The new NN (avatar NN) will write its address into ZK to let everyone 
> know its the master
> 3.3 The new NN now gets the lease for these files and starts writing into the 
> fsedits and fsimage
> 3.4 The old NN cannot write into the file as the block it was writing to was 
> closed and it does not have the lease. If it needs to re-open these files, it 
> needs to check zk to see it is indeed the current master, if not it should 
> exit.
> 4. Misc considerations:
> 4.1 If needed, we can specify favored nodes to place the blocks for this data 
> in specific set of nodes (say we want to use a different set of RAIDed nodes, 
> etc). 
> 4.2 Since we wont record the entries for /.META in fsedits and fsimage, a 
> "hadoop dfs -ls /" command wont show the files. This is probably ok, and can 
> be fixed if not.
> 4.3 If we have 256MB block sizes, then 20GB fsimage file would need 80 block 
> ids - the NN would need only these 80 block ids to read all the fsedits data. 
> The fsimage data is even lesser. This is very tractable using a variety of 
> the techniques (the possibilities mentioned above).
> The advantage is that we are re-using the existing HDFS client (with some 
> enhancements of course), and making the solution self-sufficient on the 
> existing HDFS. Also, the operational complexity is greatly reduced.
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HDFS-2601) Proposal to store edits and checkpoints inside HDFS itself for namenode

Reply via email to