[ https://issues.apache.org/jira/browse/HDFS-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189868#comment-13189868 ]
Karthik Ranganathan commented on HDFS-2601: ------------------------------------------- @Sanjay - done, removed the HA from the title. @Hari - a little confused about the write latency on sync versus async. By definition, I think we cant go the async route right - because the nn has to complete a transaction (which involved writing to fsedits) before ack-ing to the client. Comparing this pipeline solution with bookkeeper, I am trying to understand how the BK based solution would do any better - because the nn would still have to wait for the BK system to ack that it has written to the logical fsedits file on 3 different machines. Another data point is that HBase today writes the its Write-Ahead Log (similar in function to fsedits) - to HDFS as a file, and the latencies do not seem too bad. The 2 things that HBase does are: - group commit a bunch of transactions instead of making one RPC per edit - once the replication pipeline drops to below 3, it immediately closes the file so that nn will re-replicate. Wondering if a similar issue would work for the HDFS case - the only perf related enhancement we might need is parallel writes from the client. > Proposal to store edits and checkpoints inside HDFS itself for namenode > ----------------------------------------------------------------------- > > Key: HDFS-2601 > URL: https://issues.apache.org/jira/browse/HDFS-2601 > Project: Hadoop HDFS > Issue Type: New Feature > Components: name-node > Reporter: Karthik Ranganathan > > Would have liked to make this a "brainstorming" JIRA but couldn't find the > option for some reason. > I have talked to a quite a few people about this proposal at Facebook > internally (HDFS folks like Hairong and Dhruba, as well as HBase folks > interested in this feature), and wanted to broaden the audience. > At the core of the HA feature, we need 2 things: > A. the secondary NN (or avatar stand-by or whatever we call it) needs to read > all the fsedits and fsimage data written by the primary NN > B. Once the stand-by has taken over, the old NN should not be allowed to make > any edits > The basic idea is as follows (there are some variants, we can hone in on the > details if we like the general approach): > 1. The write path for fsedits and fsimage: > 1.1 The NN uses a dfs client to write fsedits and fsimage. These will be > regular hdfs files written using the write pipeline. > 1.2 Let us say the fsimage and fsedits files are written to a well-known > location in the local HDFS itself (say /.META or some such location) > 1.3 The create files and add blocks to files in this path are not written to > fsimage or fsedits. The location of the blocks for the files in this location > are known to all namenodes - primary and standby - somehow (some > possibilities here - write these block ids to zk or use reserved block ids or > write some meta-data into the blocks itself and store the blocks in a well > known location on all the datanodes) > 1.4 If the replication factor on the write pipeline decreases, we close the > block immediately and allow NN to re-replicate to bring up the replication > factor. We continue writing to a new block > 2. The read path on a NN failure > 2.1 Since the new NN "knows" the location of the blocks for the fsedits and > fsimage (again the same possibilities as mentioned above), there is nothing > to do to determine this > 2.2 It can read the files it needs using the HDFS client itself > 3. Fencing - if a NN is unresponsive, a new NN takes over, old NN should not > be allowed to perform any action > 3.1 Use HDFS lease recovery for the fsedits and fsimage files - the new NN > will close all these files baing written to by the old NN (and hence all the > blocks) > 3.2 The new NN (avatar NN) will write its address into ZK to let everyone > know its the master > 3.3 The new NN now gets the lease for these files and starts writing into the > fsedits and fsimage > 3.4 The old NN cannot write into the file as the block it was writing to was > closed and it does not have the lease. If it needs to re-open these files, it > needs to check zk to see it is indeed the current master, if not it should > exit. > 4. Misc considerations: > 4.1 If needed, we can specify favored nodes to place the blocks for this data > in specific set of nodes (say we want to use a different set of RAIDed nodes, > etc). > 4.2 Since we wont record the entries for /.META in fsedits and fsimage, a > "hadoop dfs -ls /" command wont show the files. This is probably ok, and can > be fixed if not. > 4.3 If we have 256MB block sizes, then 20GB fsimage file would need 80 block > ids - the NN would need only these 80 block ids to read all the fsedits data. > The fsimage data is even lesser. This is very tractable using a variety of > the techniques (the possibilities mentioned above). > The advantage is that we are re-using the existing HDFS client (with some > enhancements of course), and making the solution self-sufficient on the > existing HDFS. Also, the operational complexity is greatly reduced. > Thoughts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira