We have been backporting Todd's HDFS-3077 branch changes on top of
branch-2 for a while now, and testing the result in small clusters
(5-10 nodes). Although we certainly have not had the test coverage
Todd describes below (that is, their internal testing), we can add the
data point that the QJM, in black-box testing, functions as advertised
and is resilient to both single-JN and single-NN fault scenarios. I'd
add the caveat that support for running with security enabled was
added only recently, so our experience with it is limited, though
successful.

On Wed, Sep 19, 2012 at 12:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi all,
>
> Work has been progressing steadily for the last several months on the
> HDFS-3077 (QuorumJournalManager) branch. The branch is now "feature
> complete", and I believe ready for a merge soon now. Here is a brief
> overview of some salient points:
>
> * The QJM can be configured to act as a shared journal directory much
> the same way that the BKJM can. It supports all of the same operations
> as the other JMs in trunk, and fulfills all of the stated design goals
> from the design document.
> * Documentation has been added to the HA guide to explain how to set
> up the QJM as a shared edits mechanism (see the configuration sketch
> after this list).
> * Metrics have been added to both the NameNode (client side) and
> JournalNode (server side). These metrics should be sufficient to
> detect potential issues in terms of both availability and operation
> latency.
> * Security is fully implemented with the normal mechanisms.
> * The branch has been swept for findbugs, javac warnings, license
> headers, etc., and should be "good to go" by those standards. It is
> also fully up-to-date with trunk.
> * The design doc on HDFS-3077 has been updated to reflect the final
> state of the implementation.
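>
> For concreteness, a minimal shared-edits configuration in
> hdfs-site.xml looks roughly like the following (the hostnames, the
> journal ID, and the local path are placeholders, not recommended
> values):
>
>   <!-- NameNode side: write shared edits to a quorum of JournalNodes -->
>   <property>
>     <name>dfs.namenode.shared.edits.dir</name>
>     <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
>   </property>
>
>   <!-- JournalNode side: local directory where each JN stores its copy of the edits -->
>   <property>
>     <name>dfs.journalnode.edits.dir</name>
>     <value>/data/1/dfs/jn</value>
>   </property>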
>
> As this is a critical part of NN metadata storage, I imagine people
> will be interested to hear about the test status as well:
>
> * Unit/functional test coverage is pretty high. As a rough measure,
> there are ~2300 lines of test code (via sloccount, i.e. not counting
> comments, etc.) vs 3300 lines of non-test code. The test coverage of
> the new package is as follows: 92% coverage of client code, 86%
> coverage of server code. The uncovered areas are mostly
> assertion/sanity checks that never fail, and security code which can't
> be tested by unit tests.
> * In terms of state-space coverage, there is one randomized stress
> test which injects faults randomly. This test has caught almost every
> bug in the design or implementation, and alone covers ~80% of the
> code. Since it is randomized, running it repeatedly covers more areas
> of the state-space -- I've written a small MR job which runs it in
> parallel on a Hadoop cluster, and have run it for many slot-years.
> (Actually, this test case uncovered an unrelated Jetty bug that has
> been hitting us for a couple of years now!)
> * In terms of actual cluster testing:
> ** We have done cluster testing of a small secure HA cluster using QJM
> for shared edits. This covers the security code which cannot be
> covered by the automated tests.
> ** We have been running a QJM-based HA setup on a 100-node test
> cluster for several weeks with no new issues in quite some time. This
> cluster runs a mixed QA workload, e.g. Hive benchmarks,
> teragen/terasort, gridmix, etc.
> ** We have tested failover/failback in both small and large clusters.
>
> In terms of performance:
>
> I collected the logs from the above-mentioned 100-node test cluster,
> and looked at the "SyncTimes" reported by the periodic FSEditLog
> statistics printout. The active NN in this cluster is configured to
> write to two local disks, and write shared edits to a
> QuorumJournalManager. The QJM is configured with three JournalNodes,
> all in the same rack as the active NN, and one on the same machine as
> the active NN. The average sync times are: QJM: 6.6ms, Local disk 1:
> 7.1ms, Local disk 2: 5.7ms. The maximum times seen in the log are:
> QJM: 19.8ms, Local 1: 30.8ms, Local 2: 22.8ms. This shows that the
> quorum behavior achieves the goal of smoothing out latencies: since a
> sync only has to be acknowledged by a majority of the JNs (two of the
> three here), it can proceed even if one of the underlying disks is
> temporarily slow.
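>
> For reference, the "two local disks" part of that setup is just the
> usual dfs.namenode.edits.dir setting listing two local directories,
> used alongside the dfs.namenode.shared.edits.dir setting sketched
> earlier. The paths below are placeholders, not the actual directories
> on the test cluster:
>
>   <property>
>     <name>dfs.namenode.edits.dir</name>
>     <value>file:///data/1/dfs/nn,file:///data/2/dfs/nn</value>
>   </property>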
>
> I also ran some manual tests by running a loop like:
>
>   while true ; do
>     sleep 0.1
>     kill -STOP $PID_OF_JN
>     sleep 0.5
>     kill -CONT $PID_OF_JN
>   done
>
> This allows the JN to run for only 100ms out of every 600ms. I
> applied a client load to the NN and verified that the NN's operation
> latency was unaffected even though one of the JNs was slowly falling
> behind.
>
> In terms of risk assessment for the merge:
>
> - This feature is entirely optional. The changes to existing code in
> the branch are fairly minimal improvements to the support for
> pluggable JournalManagers. We have been merging these changes into our
> CDH nightly tree and performing all of our usual daily QA and
> integration against these builds, so I feel confident that they do not
> introduce any new risk.
> - Even when the feature is enabled for shared edits, users will likely
> continue to log edits to locally mounted FileJournalManagers in
> addition to the shared QJM. So, if there is any bug in the QJM, the
> system metadata is not at risk.
> - Overall, we have taken a conservative approach and favored
> durability/correctness over availability. For example, we have many
> extra sanity checks and assertions to check our assumptions, and if
> any fail, we will abort rather than continue with a risk of
> data loss.
>
> So, in summary, I think the branch is very nearly ready to be merged
> into trunk. I will continue to perform stress tests, but at this point
> I am not aware of any deficiencies which would be considered a
> blocker. If anyone has any questions about the code, design, or tests,
> please feel free to chime in on HDFS-3077 or the relevant subtasks.
>
> Thanks
> Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)
