I need a week or so to go over the design and review the code changes. I will post my comments to the jira directly. Meanwhile any updates made to the design document would help.
Regards, Suresh On Wed, Sep 19, 2012 at 12:53 PM, Todd Lipcon <t...@cloudera.com> wrote: > Hi all, > > Work has been progressing steadily for the last several months on the > HDFS-3077 (QuorumJournalManager) branch. The branch is now "feature > complete", and I believe ready for a merge soon now. Here is a brief > overview of some salient points: > > * The QJM can be configured to act as a shared journal directory much > the same way that the BKJM can. It supports all of the same operations > as the other JMs in trunk, and fulfills all of the stated design goals > from the design document. > * Documentation has been added to the HA guide to explain how to set > up the QJM as a shared edits mechanism. > * Metrics have been added to both the NameNode (client side) and > JournalNode (server side). These metrics should be sufficient to > detect potential issues both in terms of availability and in operation > latency. > * Security is fully implemented with the normal mechanisms > * The branch has been swept for findbugs, javac warnings, license > headers, etc, and should be "good to go" by those standards. It is > also fully up-to-date with trunk. > * The design doc on HDFS-3077 has been updated to reflect the final > state of the implementation. > > As this is a critical part of NN metadata storage, I imagine people > will be interested to hear about the test status as well: > > * Unit/functional test coverage is pretty high. As a rough measure, > there are ~2300 lines of test code (via sloccount, ie not counting > comments etc) vs 3300 lines of non-test code. The test coverage of > the new package is as follows: 92% coverage of client code, 86% > coverage of server code. The uncovered areas are mostly > assertion/sanity checks that never fail, and security code which can't > be tested by unit tests. > * In terms of state-space coverage, there is one randomized stress > test which injects faults randomly. This test has caught almost every > bug in the design or implementation, and alone covers ~80% of the > code. Since it is randomized, running it repeatedly covers more areas > of the state-space -- I've written a small MR job which runs it in > parallel on a Hadoop cluster, and have run it for many slot-years. > (Actually, this test case uncovered an unrelated Jetty bug that has > been hitting us for a couple years now!) > * In terms of actual cluster testing: > ** We have done cluster testing of a small secure HA cluster using QJM > for shared edits. This covers the security code which cannot be > covered by the automated tests. > ** We have been running a QJM-based HA setup on a 100-node test > cluster for several weeks with no new issues in quite some time. This > cluster runs a mixed QA workload - eg hive benchmarks, > teragen/terasort, gridmix, etc. > ** We have tested failover/failback in both small and large clusters. > > In terms of performance: > > I collected the logs from the above-mentioned 100-node test cluster, > and looked at the "SyncTimes" reported by the periodic FSEditLog > statistics printout. The active NN in this cluster is configured to > write to two local disks, and write shared edits to a > QuorumJournalManager. The QJM is configured with three JournalNodes, > all in the same rack as the active NN, and one on the same machine as > the active NN. The average sync times are: QJM: 6.6ms, Local disk 1: > 7.1ms, Local disk 2: 5.7ms. The maximum times seen in the log are: > QJM: 19.8ms, Local 1: 30.8ms, Local 2: 22.8ms. This shows that the > quorum behavior achieves the goal of smoothing out latencies, since it > can proceed even if one of the underlying disks is temporarily slow. > > I also ran some manual tests by running a loop like: while true ; do > sleep 0.1 ; kill -STOP $PID_OF_JN ; sleep 0.5 ; kill -CONT $PID_OF_JN > ; done. This allows the JN to only run 100ms out of every 600ms. I > applied a client load to the NN and verified that the NN's operation > latency was unaffected even though one of the JNs slowly was falling > behind. > > In terms of risk assessment for the merge: > > - This feature is entirely optional. The changes to existing code in > the branch are fairly minimal improvements to the support for > pluggable JournalManagers. We have been merging these changes into our > CDH nightly tree and performed all of our usual daily QA and > integration against these builds, so I feel confident that they do not > introduce any new risk. > - Even when the feature is enabled for shared edits, users will likely > continue to log edits to locally mounted FileJournalManagers in > addition to the shared QJM. So, if there is any bug in the QJM, the > system metadata is not at risk. > - Overall, we have taken a conservative approach and favored > durability/correctness over availability. For example, we have many > extra sanity checks and assertions to check our assumptions, and if > any fail, we will abort rather than continue on with a risk of > data-loss. > > So, in summary, I think the branch is very nearly ready to be merged > into trunk. I will continue to perform stress tests, but at this point > I am not aware of any deficiencies which would be considered a > blocker. If anyone has any questions about the code, design, or tests, > please feel free to chime in on HDFS-3077 or the relevant subtasks. > > Thanks > Todd > -- > Todd Lipcon > Software Engineer, Cloudera > -- http://hortonworks.com/download/