Hi all,

Work has been progressing steadily for the last several months on the
HDFS-3077 (QuorumJournalManager) branch. The branch is now "feature
complete", and I believe ready for a merge soon now. Here is a brief
overview of some salient points:

* The QJM can be configured to act as a shared journal directory much
the same way that the BKJM can. It supports all of the same operations
as the other JMs in trunk, and fulfills all of the stated design goals
from the design document.
* Documentation has been added to the HA guide to explain how to set
up the QJM as a shared edits mechanism.
* Metrics have been added to both the NameNode (client side) and
JournalNode (server side). These metrics should be sufficient to
detect potential issues both in terms of availability and in operation
latency.
* Security is fully implemented with the normal mechanisms
* The branch has been swept for findbugs, javac warnings, license
headers, etc, and should be "good to go" by those standards. It is
also fully up-to-date with trunk.
* The design doc on HDFS-3077 has been updated to reflect the final
state of the implementation.

As this is a critical part of NN metadata storage, I imagine people
will be interested to hear about the test status as well:

* Unit/functional test coverage is pretty high. As a rough measure,
there are ~2300 lines of test code (via sloccount, ie not counting
comments etc) vs 3300 lines of non-test code.  The test coverage of
the new package is as follows: 92% coverage of client code, 86%
coverage of server code. The uncovered areas are mostly
assertion/sanity checks that never fail, and security code which can't
be tested by unit tests.
* In terms of state-space coverage, there is one randomized stress
test which injects faults randomly. This test has caught almost every
bug in the design or implementation, and alone covers ~80% of the
code. Since it is randomized, running it repeatedly covers more areas
of the state-space  -- I've written a small MR job which runs it in
parallel on a Hadoop cluster, and have run it for many slot-years.
(Actually, this test case uncovered an unrelated Jetty bug that has
been hitting us for a couple years now!)
* In terms of actual cluster testing:
** We have done cluster testing of a small secure HA cluster using QJM
for shared edits. This covers the security code which cannot be
covered by the automated tests.
** We have been running a QJM-based HA setup on a 100-node test
cluster for several weeks with no new issues in quite some time. This
cluster runs a mixed QA workload - eg hive benchmarks,
teragen/terasort, gridmix, etc.
** We have tested failover/failback in both small and large clusters.

In terms of performance:

I collected the logs from the above-mentioned 100-node test cluster,
and looked at the "SyncTimes" reported by the periodic FSEditLog
statistics printout. The active NN in this cluster is configured to
write to two local disks, and write shared edits to a
QuorumJournalManager. The QJM is configured with three JournalNodes,
all in the same rack as the active NN, and one on the same machine as
the active NN. The average sync times are: QJM: 6.6ms, Local disk 1:
7.1ms, Local disk 2: 5.7ms. The maximum times seen in the log are:
QJM: 19.8ms, Local 1: 30.8ms, Local 2: 22.8ms. This shows that the
quorum behavior achieves the goal of smoothing out latencies, since it
can proceed even if one of the underlying disks is temporarily slow.

I also ran some manual tests by running a loop like: while true ; do
sleep 0.1 ; kill -STOP $PID_OF_JN ; sleep 0.5 ; kill -CONT $PID_OF_JN
; done. This allows the JN to only run 100ms out of every 600ms. I
applied a client load to the NN and verified that the NN's operation
latency was unaffected even though one of the JNs slowly was falling
behind.

In terms of risk assessment for the merge:

- This feature is entirely optional. The changes to existing code in
the branch are fairly minimal improvements to the support for
pluggable JournalManagers. We have been merging these changes into our
CDH nightly tree and performed all of our usual daily QA and
integration against these builds, so I feel confident that they do not
introduce any new risk.
- Even when the feature is enabled for shared edits, users will likely
continue to log edits to locally mounted FileJournalManagers in
addition to the shared QJM. So, if there is any bug in the QJM, the
system metadata is not at risk.
- Overall, we have taken a conservative approach and favored
durability/correctness over availability. For example, we have many
extra sanity checks and assertions to check our assumptions, and if
any fail, we will abort rather than continue on with a risk of
data-loss.

So, in summary, I think the branch is very nearly ready to be merged
into trunk. I will continue to perform stress tests, but at this point
I am not aware of any deficiencies which would be considered a
blocker. If anyone has any questions about the code, design, or tests,
please feel free to chime in on HDFS-3077 or the relevant subtasks.

Thanks
Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Reply via email to