After a bit of work, there are currently no flakies in branch-2.3, and all tests have passed over the last ten nightlies (a nightly is a comprehensive build that runs the full test suite once for jdk8+hadoop2, again for jdk8+hadoop3, and again for jdk11+hadoop3). You can see this on our flakies dashboard for branch-2.3 [1][2]. Branch-2 is not far behind, w/ one flaky test and a recent nightly failure [3].
This 'cleanliness' is a little noteworthy, IMO. Other branches have not had the same focus, so their state varies w/ the attention paid.

Attempts were also made recently at speeding up the jenkins test builds: playing w/ the maven forkcount, shrinking test resource usage, and using maven -T, which sets the level of maven module build/test parallelism (HBASE-24150, HBASE-24072, etc.). There was little yield to be had here... perhaps a 20% improvement. Complications included: jenkins build slaves allow two executors/builds to run at the same time, so when an hbase build runs it is sharing the machine w/ another (often another hbase build); host and docker resource constraints; and our module inter-dependencies, which limit how much parallelism is possible.

As part of the above work on branch-2/branch-2.3, tests were run locally on various hardware. It should come as no surprise that the experience varied w/ environment (less so as flakies were addressed). On better hardware, tests can be run more aggressively so they use the whole machine and complete faster.

The settings we ship as defaults are configured to suit the Apache Jenkins build environment, which is usually 16CPUs/48G. As said above, jenkins slaves allow two builds per machine, so halve these resources when an HBase build runs on Apache Infrastructure. So as to be considerate of our companion Apache projects, the defaults are relatively 'mild': our forkcount is set to 0.25C, a quarter of the CPUs in the machine. On Apache Jenkins, 0.25 * 16CPUs == 4 CPUs for the hbase build. We also set -T2, which means up to two modules building in parallel where possible (each w/ the above forkcount). Our test suites on Jenkins continue to take hours.
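For the curious, the forkcount arithmetic is simple: a surefire forkCount value ending in 'C' is a multiplier on the number of cores, otherwise it is an absolute count. A minimal sketch of that arithmetic (illustrative only; this is my own toy function, not surefire's actual implementation, and I am assuming floor rounding):

```python
import math

def effective_forks(fork_count: str, cpus: int) -> int:
    """Interpret a surefire-style forkCount value.

    A value ending in 'C' multiplies the CPU count
    (so "0.25C" on 16 CPUs -> 4 forks per module);
    otherwise it is an absolute fork count.
    Illustrative sketch only; rounding behavior assumed.
    """
    if fork_count.endswith("C"):
        return max(1, math.floor(float(fork_count[:-1]) * cpus))
    return int(fork_count)

# The Apache Jenkins numbers from above: 16 CPUs at 0.25C
print(effective_forks("0.25C", 16))  # -> 4
```

With -T2 on top, up to two modules run at once, each spawning this many test JVMs, which is why the forkcount is kept mild on shared infrastructure.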
On a 40CPU linux machine with the below arguments, using half the CPUs in the machine (and ulimit -u 40960), all tests run in just under an hour:

 $ x="0.50C" ; nohup mvn -T2 -Dsurefire.firstPartForkCount=$x -Dsurefire.secondPartForkCount=$x test -PrunAllTests

Upping the forkcount on this machine beyond 0.50C tended to bring a rush of tests exiting... (to be investigated). On this machine, tests currently pass about 80% of the time. To be improved.

On an anemic 4CPU VM, I can run the below and it will pass 60% of the time. It takes ~5 hours:

 $ x="1.0C" ; nohup mvn -T2 -Dsurefire.firstPartForkCount=$x -Dsurefire.secondPartForkCount=$x test -PrunAllTests

On a mac w/ 12CPUs, I can run the same command as above. It passes w/ about the same frequency and takes just over 1.5 hours. On my laptop it is less reliable, passing about 1/3rd of the time in about 2.5 hours. If I use fewer resources, i.e. a lower forkcount, the tests complete more often (but take correspondingly longer).

Going forward, we will continue to watch branch-2/branch-2.3. Regards speedup, there is a bunch to do. A large win is to be had improving the HDFS minicluster: adding configuration (lots of resources such as thread pool counts are hard-coded at numbers too large for a small test run) and speeding up startup times.

oao,
S

1. https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/dashboard.html
2. Unfortunately, the nightly list shows reds even though all tests passed, because of report assemblage issues being addressed by infra: https://issues.apache.org/jira/browse/INFRA-20025
3. https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2/lastSuccessfulBuild/artifact/dashboard.html