Thank you for sending these detailed notes. I fear I will be duplicating these efforts on branch-1 when (eventually) preparing for 1.7.0.
On Fri, May 22, 2020 at 1:32 PM Stack <st...@duboce.net> wrote:

> After a bit of work, there are currently no flakies in branch-2.3 and all
> tests passed over the last ten nightlies (a nightly is a comprehensive
> build that runs the full test suite once for jdk8+hadoop2, again for
> jdk8+hadoop3, and again for jdk11+hadoop3). You can see this by looking at
> our flakies dashboard for branch-2.3 [1][2]. Branch-2 is not too far behind
> with one flakey and a recent nightly test failure [3].
>
> This 'cleanliness' is a little noteworthy, IMO.
>
> Other branches have not had the same focus so their state varies w/
> attention paid.
>
> Attempts were also recently made at speeding up the Jenkins test builds
> by playing w/ maven forkcount, shrinking test resource usage, and with the
> maven -T option, which allows manipulating levels of maven module
> build/test parallelism (HBASE-24150, HBASE-24072, etc.). There was little
> yield to be had here...perhaps a 20% improvement. Complications included:
> Jenkins build slaves allow two executors/builds to run at the same time,
> so when an hbase build runs, it is sharing the machine w/ another (often
> another hbase build); host and docker resource constraints; and that our
> module inter-dependency constrains how much parallelism is allowed.
>
> As part of the above work in branch-2/branch-2.3, tests were run locally on
> various hardware. It should come as no surprise that the experience varied
> w/ environment (less so as flakies were addressed). On better hardware,
> tests can be made to run more furiously so they use all the machine and
> complete faster.
>
> The settings we have as our defaults are configured to suit the Apache
> Jenkins build environment, which is usually 16CPUs/48G. As said above,
> Jenkins slaves allow two builds per machine, so halve these resources when
> an HBase build runs on Apache Infrastructure. So as to be considerate of
> our companion Apache projects, defaults are relatively 'mild': our
> forkcount is set to 0.25 of all the CPUs in the machine. On Apache Jenkins,
> 0.25*16CPU == 4 CPUs for an hbase build. We also set -T2, which means up to
> two modules building in parallel where possible (each with the above
> configured forkcount). Our test suites on Jenkins continue to take hours.
>
> On a 40CPU Linux machine with the below arguments, where we use half the
> CPUs in the machine (and ulimit -u 40960), all tests run in just under an
> hour:
>
> $ x="0.50C" ; nohup mvn -T2 -Dsurefire.firstPartForkCount=$x -Dsurefire.secondPartForkCount=$x test -PrunAllTests
>
> Upping the forkcount on this machine beyond 0.50C tended to bring a rush of
> tests exiting... (To be investigated). On this machine, tests currently
> pass about 80% of the time. To be improved.
>
> On an anemic 4CPU VM, I can run the below and it will pass 60% of the time.
> It takes ~5 hours:
>
> $ x="1.0C" ; nohup mvn -T2 -Dsurefire.firstPartForkCount=$x -Dsurefire.secondPartForkCount=$x test -PrunAllTests
>
> On a mac w/ 12CPUs, I can run the same command as above. It passes with
> about the same frequency and takes just over 1 1/2 hours.
>
> On my laptop it is less reliable, passing about 1/3rd of the time in about
> 2 1/2 hours.
>
> If I use fewer resources, a lesser forkcount, the tests complete more often
> (but take correspondingly longer).
>
> Going forward, we will continue to watch branch-2/branch-2.3. Regards
> speedup, there is a bunch to do. A large win is to be had improving the
> HDFS mini cluster by adding configuration (lots of resources such as pool
> thread counts are hard coded, with numbers that are large for a small test
> run) and working on speeding startup times.
>
> oao,
> S
>
> 1.
> https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/dashboard.html
> 2. Unfortunately, the nightly list shows reds though all tests passed
> because of report assemblage issues being addressed by infra:
> https://issues.apache.org/jira/browse/INFRA-20025
> 3.
> https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2/lastSuccessfulBuild/artifact/dashboard.html
>

--
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk