I spent some time this weekend setting up an upstream Jenkins so we
can move from the Cloudera-internal precommit builds to something visible
to all developers.

At the same time, I've been working on switching us over to fully
distributed test running. The idea is this:

*Builders*
We have a small pool of builder Jenkins slaves. These slaves are fairly
well-provisioned (lots of cores and fast IO) so that they can compile
quickly, and are long-running so they can keep a hot ccache. I'm thinking
it might even make sense to run 4 slaves in docker containers on a single
'n1-highcpu-32' GCE instance, so each slave can burst to super fast speeds
rather than being statically partitioned.

Because these slaves are fast and have hot caches, a typical precommit job
(where most files don't change) can get all the way built in 20-30 seconds.
If thirdparty changed, it might add 10 minutes or so, but we could probably
work on improving parallelism of the thirdparty build as well.

*Distributed test running*
Once the builder has built all the tests, it uses the 'dist_test' script to
submit the tests to be run on a cluster. The test slaves are preemptible
n1-standard-4 VMs on GCE at the moment, set up with autoscaling. I've been
testing with 30 instances running at a time, but it's easy to scale to
arbitrary amounts of parallelism. These instances are cheap (6c/hr) so 30
slaves is quite affordable.

The test runners download a test and its dependencies, run the test, and
upload the results to S3. (It probably makes sense to switch to GCS at some
point, but that's not a big deal.)

*Result collection*
The builder collects the results, and then runs the non-distributed tests
(python/java) locally. The results are moved back into the normal test log
directory so that Jenkins can parse them as usual.
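The "move back into the normal log directory" step is conceptually just a
copy into the place Jenkins already scans. A minimal sketch (paths and the
`*.xml` glob are assumptions, not the actual Kudu layout):

```python
import shutil
from pathlib import Path

def collect_results(download_dir, test_log_dir):
    """Copy fetched per-test result files into the usual test log
    directory so Jenkins parses them exactly like a local run.

    The directory layout and file pattern here are illustrative.
    """
    dest = Path(test_log_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for result in Path(download_dir).glob("**/*.xml"):
        target = dest / result.name
        shutil.copy2(result, target)
        copied.append(target)
    return copied
```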

*Run-times*
With the above setup, I'm getting runtimes from 6-12 minutes depending on
the build type. I think there is still some room for improvement, but this
is still about 6x faster than what we're getting today. The lowest-hanging
fruit is probably to run the local python/java tests on the builder box
while the distributed tests are running on the other machines, rather than
sequentially. That should bring the times down by another 2-3 minutes.
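The overlap itself is straightforward -- something like the sketch below,
where both callables are stand-ins for the real "submit distributed tests
and wait" and "run local python/java tests" steps. Wall time then becomes
the max of the two phases instead of their sum.

```python
from concurrent.futures import ThreadPoolExecutor

def run_overlapped(submit_distributed, run_local_tests):
    """Run the local tests while the distributed tests execute on
    the cluster, instead of sequentially.

    Both arguments are callables; the names are illustrative, not
    from the actual tooling.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        dist_future = pool.submit(submit_distributed)
        local_future = pool.submit(run_local_tests)
        # .result() blocks until each phase finishes, so total wall
        # time is max(distributed, local) rather than their sum.
        return dist_future.result(), local_future.result()
```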

*TODO*
A few things need to be done before this can take the place of our existing
internal gerrit jobs:
- the auto-scaling behavior isn't working great at the moment. I think we
need to switch away from Google's autoscaling to something custom (but
simple)
- need to integrate flaky test retries into the dist test framework - not
hard, since the framework already supports retrying, but need to do the
plumbing
- had to make a bunch of changes to our tooling to get this to work:
https://github.com/toddlipcon/kudu/commits/upstream-gerrit  -- will have to
get these integrated. The most controversial one is changing RELEASE
precommit builds to use dynamic linking -- otherwise the test binaries take
20GB of space and take forever to distribute to the tester slaves.
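On the autoscaling point: the "simple but custom" policy I have in mind
could be as small as a function like this, which a poller would call
periodically to resize the instance group. All the parameters here are
illustrative defaults, not decided values.

```python
import math

def desired_slave_count(queued_tasks, tasks_per_slave=4,
                        min_slaves=0, max_slaves=30):
    """A deliberately simple scaling policy: size the slave pool to
    the current queue depth, capped at a fixed maximum.

    Parameter values are illustrative; a small poller loop would
    call this every minute or so and resize the GCE instance group
    to match the returned count.
    """
    wanted = math.ceil(queued_tasks / tasks_per_slave)
    return max(min_slaves, min(max_slaves, wanted))
```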


*What this means for you*
Hopefully the switchover should be pretty smooth, and the result will be
(a) faster tests, and (b) less worry that adding more tests will increase
precommit times. Perhaps more importantly, this change will open up
precommit testing to developers outside of Cloudera!

Let me know if you have any questions. I'll send another email before
making any changes to the "production" setup.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera
