I spent some time this weekend working on setting up upstream Jenkins so we can move from the Cloudera-internal precommit builds to something visible to all developers.
At the same time, I've been working on switching us over to fully distributed test running. The idea is this:

*Builders*

We have a small pool of builder Jenkins slaves. These slaves are fairly well-provisioned (lots of cores and fast IO) so that they can compile quickly, and are long-running so they can keep a hot ccache. I'm thinking it might even make sense to do something like run 4 slaves in docker containers on a single 'n1-highcpu-32' GCE instance, so each slave can burst to very fast speeds rather than being statically partitioned. Because these slaves are fast and have hot caches, a typical precommit job (where most files don't change) can get all the way built in 20-30 seconds. If thirdparty changed, it might add 10 minutes or so, but we could probably work on improving the parallelism of the thirdparty build as well.

*Distributed test running*

Once the builder has built all the tests, it uses the 'dist_test' script to submit the tests to be run on a cluster. The test slaves are preemptible n1-standard-4 VMs on GCE at the moment, set up with autoscaling. I've been testing with 30 instances running at a time, but it's easy to scale to arbitrary amounts of parallelism. These instances are cheap (6c/hr), so 30 slaves is quite affordable. The test runners download a test and its dependencies, run the test, and upload the results to S3. (It probably makes sense to switch to GCS at some point, but that's not a big deal.)

*Result collection*

The builder collects the results and then runs the non-distributed tests (python/java) locally. The results are moved back into the normal test log directory so that Jenkins can parse them as usual.

*Run-times*

With the above setup, I'm getting runtimes of 6-12 minutes depending on the build type. I think there is still some room for improvement, but this is still about 6x faster than what we're getting today.
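To make the runner side concrete, here's a rough sketch of the per-task loop a test slave might execute: download the test binary and its dependencies, run it, and ship the log back for the builder to collect. This is illustrative only, with assumed names (`TestTask`, `fetch`, `upload`), not the actual dist_test interface:

```python
import os
import subprocess
import tempfile
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TestTask:
    test_binary: str          # path of the test executable in the archive
    dependencies: List[str]   # extra files the test needs (data files, .so's)

def run_task(task: TestTask,
             fetch: Callable[[str, str], None],   # e.g. S3/GCS download
             upload: Callable[[str, bytes], None] # e.g. S3/GCS upload
             ) -> int:
    """Download a test and its deps, run it, and upload the combined log.

    Returns the test's exit code so the caller can decide on retries.
    """
    workdir = tempfile.mkdtemp(prefix="dist-test-")
    for path in [task.test_binary] + task.dependencies:
        fetch(path, os.path.join(workdir, os.path.basename(path)))
    local_bin = os.path.join(workdir, os.path.basename(task.test_binary))
    os.chmod(local_bin, 0o755)
    proc = subprocess.run([local_bin], cwd=workdir,
                          capture_output=True, timeout=900)
    # Upload the log so the builder can pull it back into the normal
    # test log directory for Jenkins to parse.
    upload(task.test_binary + ".log", proc.stdout + proc.stderr)
    return proc.returncode
```

The exit code coming back makes it easy for the framework to decide whether a task should be retried, which matters for the flaky-test plumbing mentioned below.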
The lowest-hanging fruit is probably to run the local python/java tests on the builder box while the distributed tests are running on the other machines, rather than sequentially. That should bring the times down another 2-3 minutes.

*TODO*

A few things need to be done before this can take the place of our existing internal gerrit jobs:
- the auto-scaling behavior isn't working great at the moment. I think we need to switch away from Google's autoscaling to something custom (but simple)
- need to integrate flaky test retries into the dist test framework - not hard, since the framework already supports retrying, but we need to do the plumbing
- I had to make a bunch of changes to our tooling to get this to work: https://github.com/toddlipcon/kudu/commits/upstream-gerrit -- these will have to get integrated. The most controversial one is changing RELEASE precommit builds to use dynamic linking -- otherwise the test binaries take 20GB of space and take forever to distribute to the test slaves.

*What this means for you*

Hopefully the switchover will be pretty smooth, and the result will be (a) faster tests, and (b) less worry that adding more tests will increase precommit times. Perhaps more importantly, this change will open up precommit testing to developers outside of Cloudera!

Let me know if you have any questions. I'll send another email before making any changes to the "production" setup.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
