After a bit of work, there are currently no flakies in branch-2.3, and all tests have passed over the last ten nightlies (a nightly is a comprehensive build that runs the full test suite once for jdk8+hadoop2, again for jdk8+hadoop3, and again for jdk11+hadoop3). You can see this on our flakies dashboard for branch-2.3 [1][2]. Branch-2 is not far behind, w/ one flaky test and a recent nightly failure [3].
This 'cleanliness' is a little noteworthy, IMO. Other branches have not had the same focus, so their state varies w/ the attention paid.

Attempts were also made recently at speeding up the jenkins test builds: playing w/ the maven forkcount, shrinking test resource usage, and using maven -T, which sets the level of maven module build/test parallelism (HBASE-24150, HBASE-24072, etc.). There was little yield to be had here... perhaps a 20% improvement. Complications included: jenkins build slaves allow two executors/builds to run at the same time, so when an hbase build runs it is sharing the machine w/ another (often another hbase build); host and docker resource constraints; and our module inter-dependencies, which limit how much parallelism is possible.

As part of the above work on branch-2/branch-2.3, tests were run locally on various hardware. It should come as no surprise that the experience varied w/ environment (less so as flakies were addressed). On better hardware, tests can be run more aggressively so they use the whole machine and complete faster.

The settings we ship as defaults are configured to suit the Apache Jenkins build environment, which is usually 16CPUs/48G. As said above, jenkins slaves allow two builds per machine, so halve these resources when an HBase build runs on Apache Infrastructure. So as to be considerate of our companion Apache projects, the defaults are relatively 'mild': our forkcount is set to 0.25C, a quarter of the CPUs in the machine. On Apache Jenkins, 0.25 * 16CPUs == 4 CPUs for the hbase build. We also set -T2, which means up to two modules building in parallel where possible (each w/ the above forkcount). Our test suites on Jenkins continue to take hours.
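For the curious, the forkcount arithmetic is simple: a surefire forkCount value ending in 'C' is a multiplier on the number of cores, otherwise it is an absolute count. A minimal sketch of that arithmetic (illustrative only; this is my own toy function, not surefire's actual implementation, and I am assuming floor rounding):

```python
import math

def effective_forks(fork_count: str, cpus: int) -> int:
    """Interpret a surefire-style forkCount value.

    A value ending in 'C' multiplies the CPU count
    (so "0.25C" on 16 CPUs -> 4 forks per module);
    otherwise it is an absolute fork count.
    Illustrative sketch only; rounding behavior assumed.
    """
    if fork_count.endswith("C"):
        return max(1, math.floor(float(fork_count[:-1]) * cpus))
    return int(fork_count)

# The Apache Jenkins numbers from above: 16 CPUs at 0.25C
print(effective_forks("0.25C", 16))  # -> 4
```

With -T2 on top, up to two modules run at once, each spawning this many test JVMs, which is why the forkcount is kept mild on shared infrastructure.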
On a 40CPU linux machine with the below arguments, using half the CPUs in the machine (and ulimit -u 40960), all tests run in just under an hour:

 $ x="0.50C" ; nohup mvn -T2 -Dsurefire.firstPartForkCount=$x -Dsurefire.secondPartForkCount=$x test -PrunAllTests

Upping the forkcount on this machine beyond 0.50C tended to bring a rush of tests exiting... (to be investigated). On this machine, tests currently pass about 80% of the time. To be improved.

On an anemic 4CPU VM, I can run the below and it will pass 60% of the time. It takes ~5 hours:

 $ x="1.0C" ; nohup mvn -T2 -Dsurefire.firstPartForkCount=$x -Dsurefire.secondPartForkCount=$x test -PrunAllTests

On a mac w/ 12CPUs, I can run the same command as above. It passes w/ about the same frequency and takes just over 1.5 hours. On my laptop it is less reliable, passing about 1/3rd of the time in about 2.5 hours. If I use fewer resources, i.e. a lower forkcount, the tests complete more often (but take correspondingly longer).

Going forward, we will continue to watch branch-2/branch-2.3. Regards speedup, there is a bunch to do. A large win is to be had improving the HDFS minicluster: adding configuration (lots of resources such as thread pool counts are hard-coded at numbers too large for a small test run) and speeding up startup times.

oao,
S

1. https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/dashboard.html
2. Unfortunately, the nightly list shows reds even though all tests passed, because of report assemblage issues being addressed by infra: https://issues.apache.org/jira/browse/INFRA-20025
3. https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2/lastSuccessfulBuild/artifact/dashboard.html