Hi all, Work has been progressing steadily for the last several months on the HDFS-3077 (QuorumJournalManager) branch. The branch is now "feature complete", and I believe ready for a merge soon now. Here is a brief overview of some salient points:
* The QJM can be configured to act as a shared journal directory much the same way that the BKJM can. It supports all of the same operations as the other JMs in trunk, and fulfills all of the stated design goals from the design document. * Documentation has been added to the HA guide to explain how to set up the QJM as a shared edits mechanism. * Metrics have been added to both the NameNode (client side) and JournalNode (server side). These metrics should be sufficient to detect potential issues both in terms of availability and in operation latency. * Security is fully implemented with the normal mechanisms * The branch has been swept for findbugs, javac warnings, license headers, etc, and should be "good to go" by those standards. It is also fully up-to-date with trunk. * The design doc on HDFS-3077 has been updated to reflect the final state of the implementation. As this is a critical part of NN metadata storage, I imagine people will be interested to hear about the test status as well: * Unit/functional test coverage is pretty high. As a rough measure, there are ~2300 lines of test code (via sloccount, ie not counting comments etc) vs 3300 lines of non-test code. The test coverage of the new package is as follows: 92% coverage of client code, 86% coverage of server code. The uncovered areas are mostly assertion/sanity checks that never fail, and security code which can't be tested by unit tests. * In terms of state-space coverage, there is one randomized stress test which injects faults randomly. This test has caught almost every bug in the design or implementation, and alone covers ~80% of the code. Since it is randomized, running it repeatedly covers more areas of the state-space -- I've written a small MR job which runs it in parallel on a Hadoop cluster, and have run it for many slot-years. (Actually, this test case uncovered an unrelated Jetty bug that has been hitting us for a couple years now!) * In terms of actual cluster testing: ** We have done cluster testing of a small secure HA cluster using QJM for shared edits. This covers the security code which cannot be covered by the automated tests. ** We have been running a QJM-based HA setup on a 100-node test cluster for several weeks with no new issues in quite some time. This cluster runs a mixed QA workload - eg hive benchmarks, teragen/terasort, gridmix, etc. ** We have tested failover/failback in both small and large clusters. In terms of performance: I collected the logs from the above-mentioned 100-node test cluster, and looked at the "SyncTimes" reported by the periodic FSEditLog statistics printout. The active NN in this cluster is configured to write to two local disks, and write shared edits to a QuorumJournalManager. The QJM is configured with three JournalNodes, all in the same rack as the active NN, and one on the same machine as the active NN. The average sync times are: QJM: 6.6ms, Local disk 1: 7.1ms, Local disk 2: 5.7ms. The maximum times seen in the log are: QJM: 19.8ms, Local 1: 30.8ms, Local 2: 22.8ms. This shows that the quorum behavior achieves the goal of smoothing out latencies, since it can proceed even if one of the underlying disks is temporarily slow. I also ran some manual tests by running a loop like: while true ; do sleep 0.1 ; kill -STOP $PID_OF_JN ; sleep 0.5 ; kill -CONT $PID_OF_JN ; done. This allows the JN to only run 100ms out of every 600ms. I applied a client load to the NN and verified that the NN's operation latency was unaffected even though one of the JNs slowly was falling behind. In terms of risk assessment for the merge: - This feature is entirely optional. The changes to existing code in the branch are fairly minimal improvements to the support for pluggable JournalManagers. We have been merging these changes into our CDH nightly tree and performed all of our usual daily QA and integration against these builds, so I feel confident that they do not introduce any new risk. - Even when the feature is enabled for shared edits, users will likely continue to log edits to locally mounted FileJournalManagers in addition to the shared QJM. So, if there is any bug in the QJM, the system metadata is not at risk. - Overall, we have taken a conservative approach and favored durability/correctness over availability. For example, we have many extra sanity checks and assertions to check our assumptions, and if any fail, we will abort rather than continue on with a risk of data-loss. So, in summary, I think the branch is very nearly ready to be merged into trunk. I will continue to perform stress tests, but at this point I am not aware of any deficiencies which would be considered a blocker. If anyone has any questions about the code, design, or tests, please feel free to chime in on HDFS-3077 or the relevant subtasks. Thanks Todd -- Todd Lipcon Software Engineer, Cloudera