On Thu, Sep 20, 2012 at 1:38 PM, Suresh Srinivas <sur...@hortonworks.com> wrote: > I need a week or so to go over the design and review the code changes. I > will post my comments to the jira directly. Meanwhile any updates made to > the design document would help.
The latest rev on the design doc is fully up to date as of yesterday. Thanks -Todd > On Wed, Sep 19, 2012 at 12:53 PM, Todd Lipcon <t...@cloudera.com> wrote: > >> Hi all, >> >> Work has been progressing steadily for the last several months on the >> HDFS-3077 (QuorumJournalManager) branch. The branch is now "feature >> complete", and I believe ready for a merge soon now. Here is a brief >> overview of some salient points: >> >> * The QJM can be configured to act as a shared journal directory much >> the same way that the BKJM can. It supports all of the same operations >> as the other JMs in trunk, and fulfills all of the stated design goals >> from the design document. >> * Documentation has been added to the HA guide to explain how to set >> up the QJM as a shared edits mechanism. >> * Metrics have been added to both the NameNode (client side) and >> JournalNode (server side). These metrics should be sufficient to >> detect potential issues both in terms of availability and in operation >> latency. >> * Security is fully implemented with the normal mechanisms >> * The branch has been swept for findbugs, javac warnings, license >> headers, etc, and should be "good to go" by those standards. It is >> also fully up-to-date with trunk. >> * The design doc on HDFS-3077 has been updated to reflect the final >> state of the implementation. >> >> As this is a critical part of NN metadata storage, I imagine people >> will be interested to hear about the test status as well: >> >> * Unit/functional test coverage is pretty high. As a rough measure, >> there are ~2300 lines of test code (via sloccount, ie not counting >> comments etc) vs 3300 lines of non-test code. The test coverage of >> the new package is as follows: 92% coverage of client code, 86% >> coverage of server code. The uncovered areas are mostly >> assertion/sanity checks that never fail, and security code which can't >> be tested by unit tests. >> * In terms of state-space coverage, there is one randomized stress >> test which injects faults randomly. This test has caught almost every >> bug in the design or implementation, and alone covers ~80% of the >> code. Since it is randomized, running it repeatedly covers more areas >> of the state-space -- I've written a small MR job which runs it in >> parallel on a Hadoop cluster, and have run it for many slot-years. >> (Actually, this test case uncovered an unrelated Jetty bug that has >> been hitting us for a couple years now!) >> * In terms of actual cluster testing: >> ** We have done cluster testing of a small secure HA cluster using QJM >> for shared edits. This covers the security code which cannot be >> covered by the automated tests. >> ** We have been running a QJM-based HA setup on a 100-node test >> cluster for several weeks with no new issues in quite some time. This >> cluster runs a mixed QA workload - eg hive benchmarks, >> teragen/terasort, gridmix, etc. >> ** We have tested failover/failback in both small and large clusters. >> >> In terms of performance: >> >> I collected the logs from the above-mentioned 100-node test cluster, >> and looked at the "SyncTimes" reported by the periodic FSEditLog >> statistics printout. The active NN in this cluster is configured to >> write to two local disks, and write shared edits to a >> QuorumJournalManager. The QJM is configured with three JournalNodes, >> all in the same rack as the active NN, and one on the same machine as >> the active NN. The average sync times are: QJM: 6.6ms, Local disk 1: >> 7.1ms, Local disk 2: 5.7ms. The maximum times seen in the log are: >> QJM: 19.8ms, Local 1: 30.8ms, Local 2: 22.8ms. This shows that the >> quorum behavior achieves the goal of smoothing out latencies, since it >> can proceed even if one of the underlying disks is temporarily slow. >> >> I also ran some manual tests by running a loop like: while true ; do >> sleep 0.1 ; kill -STOP $PID_OF_JN ; sleep 0.5 ; kill -CONT $PID_OF_JN >> ; done. This allows the JN to only run 100ms out of every 600ms. I >> applied a client load to the NN and verified that the NN's operation >> latency was unaffected even though one of the JNs slowly was falling >> behind. >> >> In terms of risk assessment for the merge: >> >> - This feature is entirely optional. The changes to existing code in >> the branch are fairly minimal improvements to the support for >> pluggable JournalManagers. We have been merging these changes into our >> CDH nightly tree and performed all of our usual daily QA and >> integration against these builds, so I feel confident that they do not >> introduce any new risk. >> - Even when the feature is enabled for shared edits, users will likely >> continue to log edits to locally mounted FileJournalManagers in >> addition to the shared QJM. So, if there is any bug in the QJM, the >> system metadata is not at risk. >> - Overall, we have taken a conservative approach and favored >> durability/correctness over availability. For example, we have many >> extra sanity checks and assertions to check our assumptions, and if >> any fail, we will abort rather than continue on with a risk of >> data-loss. >> >> So, in summary, I think the branch is very nearly ready to be merged >> into trunk. I will continue to perform stress tests, but at this point >> I am not aware of any deficiencies which would be considered a >> blocker. If anyone has any questions about the code, design, or tests, >> please feel free to chime in on HDFS-3077 or the relevant subtasks. >> >> Thanks >> Todd >> -- >> Todd Lipcon >> Software Engineer, Cloudera >> > > > > -- > http://hortonworks.com/download/ -- Todd Lipcon Software Engineer, Cloudera