[ 
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468816#comment-13468816
 ] 

Todd Lipcon commented on HDFS-3077:
-----------------------------------

bq. You list 4 steps and have comments that steps 2 and 4 correspond with the 
the two phases of paxos. However you add step 3 in the middle which is not part 
of paxos

Not sure I follow. Here's the mapping, using "Paxos Made Simple" as a reference 
(http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf)

Step 2's request (PrepareRecovery) corresponds to Prepare (Phase 1a)
Step 2's response corresponds to "Promise" (Phase 1b)
Step 3's request (AcceptRecovery) corresponds to "Accept!" (Phase 2a)
Step 3's response corresponds to "Accepted" (Phase 2b)

After this point, Paxos itself is actually complete - the final consensus value 
is learned once a quorum of nodes have completed Phase 2 for a given proposal.

Step 4's request (Finalize) corresponds to "Learned" or "Commit" which are 
generally sent to "Learners" after the consensus phase completes. This is 
described in Section 2.3 of the above-referenced paper:
{quote}
More generally, the acceptors could respond with their acceptances to
some set of distinguished learners, each of which can then inform all the
learners when a value has been chosen.
{quote}
In our case, the client doubles as the "distinguished learner", and the JNs all 
double as the other Learners.

bq. Another concern is that this design has more steps than paxos which is 
generally considered complicated to get right

Per above, I don't think there are any "extra" steps. It is a fairly faithful 
implementation of Paxos except that we use a side-channel (HTTP from 
node-to-node) instead of the main message passing channel to actually 
communicate the value to be decided.

I am also confident that the implementation is correct, after many CPU-years of 
randomized fault testing. If you can find any counter-example I would be really 
glad to hear about it, and will immediately drop everything to investigate.


bq. > This is the same behavior that a NameNode takes at startup today in 
branch-2 – if there is an entirely empty edit log file, it is removed at 
startup.
bq. Why did we do this in normal local disk journal? Doesn't the no-op 
transaction handle the "empty" case. We had added the non-op transaction to 
deal with repeated restarts and also repeated rolls.
bq.

The no-op transaction isn't added atomically when the file is created. The file 
is created empty, then the header is appended, then the noop 
(START_LOG_SEGMENT) txn is appended. The NN could crash between any of these 
steps. If we open a log and find that it has no transactions (ie not even the 
"noop"), then we know we crashed right after opening the log but before writing 
anything (maybe not even the header). So, we can safely remove it and pretend 
the log was never started.

                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, 
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, 
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.tex, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to