Sounds great. One thing I've noticed comparing local builds on a MacBook Pro vs. remote builds on EC2 machines with 2x the cores and 2x the RAM is that my laptop builds thirdparty in significantly less time. I think this is because I have an SSD locally: we have tens of GB of binaries (mostly LLVM debug symbols) which have to get copied around.
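A quick way to sanity-check the SSD theory would be to time a large sequential write on both machines and compare. Here's a rough sketch (just illustrative, not part of our tooling):

```python
import os
import tempfile
import time

def write_throughput_mb_s(size_mb: int = 256) -> float:
    """Time a large sequential write and return throughput in MB/s."""
    chunk = b"\0" * (1024 * 1024)  # 1 MiB buffer
    with tempfile.NamedTemporaryFile(delete=False) as f:
        start = time.monotonic()
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force data to disk, not just the page cache
        elapsed = time.monotonic() - start
    os.unlink(f.name)
    return size_mb / elapsed

if __name__ == "__main__":
    print(f"sequential write: {write_throughput_mb_s():.0f} MB/s")
```

If the laptop's number is several times the EC2 instance's, that would explain the thirdparty gap better than cores or RAM.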
- Dan

On Mon, Jan 11, 2016 at 1:02 AM, Todd Lipcon <[email protected]> wrote:

> I spent some time this weekend working on setting up upstream Jenkins so
> we can move from the Cloudera-internal precommit builds to something
> visible to all developers.
>
> At the same time, I've been working on switching us over to fully
> distributed test running. The idea is this:
>
> *Builders*
> We have a small pool of builder Jenkins slaves. These slaves are fairly
> well-provisioned (lots of cores and fast IO) so that they can compile
> quickly, and are long-running so they can keep a hot ccache. I'm thinking
> it might even make sense to do something like run 4 slaves in Docker
> containers on a single 'n1-highcpu-32' GCE instance, so each slave can
> burst to super fast speeds, rather than statically partitioning the
> hardware.
>
> Because these slaves are fast and have hot caches, a typical precommit
> job (where most files don't change) can get all the way built in 20-30
> seconds. If thirdparty changed, it might add 10 minutes or so, but we
> could probably work on improving parallelism of the thirdparty build as
> well.
>
> *Distributed test running*
> Once the builder has built all the tests, it uses the 'dist_test' script
> to submit the tests to be run on a cluster. The test slaves are
> preemptible n1-standard-4 VMs on GCE at the moment, set up with
> autoscaling. I've been testing with 30 instances running at a time, but
> it's easy to scale to arbitrary amounts of parallelism. These instances
> are cheap (6c/hr), so 30 slaves is quite affordable.
>
> The test runners download a test and its dependencies, run the test, and
> upload the results to S3. (It probably makes sense to switch to GCS at
> some point, but it's not a big deal.)
>
> *Result collection*
> The builder collects the results, and then runs the non-distributed tests
> (Python/Java) locally. The results are moved back into the normal test
> log directory so that Jenkins can parse them as usual.
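For anyone picturing the submit side of this, here's a rough sketch of sharding built tests across slaves. The names here (`Task`, `shard_tests`) are purely illustrative, not the actual dist_test API:

```python
# Hypothetical sketch: the builder enumerates test binaries and
# round-robins them across N test slaves, so each slave downloads
# and runs a similar-sized shard.
from dataclasses import dataclass

@dataclass
class Task:
    slave: int
    tests: list

def shard_tests(tests, num_slaves):
    """Round-robin tests across slaves so each gets a balanced share."""
    shards = [Task(slave=i, tests=[]) for i in range(num_slaves)]
    for i, name in enumerate(sorted(tests)):
        shards[i % num_slaves].tests.append(name)
    return shards
```

Round-robin over a sorted list keeps shard sizes within one test of each other; a smarter version might weight by historical test runtime.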
> *Run-times*
> With the above setup, I'm getting runtimes of 6-12 minutes depending on
> the build type. I think there is still some room for improvement, but
> this is still about 6x faster than what we're getting today. The
> lowest-hanging fruit is probably to run the local Python/Java tests on
> the builder box while the distributed tests are running on the other
> machines, rather than sequentially. That should bring the times down
> another 2-3 minutes.
>
> *TODO*
> A few things need to be done before this can take the place of our
> existing internal gerrit jobs:
> - The auto-scaling behavior isn't working great at the moment. I think
>   we need to switch away from Google's autoscaling to something custom
>   (but simple).
> - We need to integrate flaky test retries into the dist_test framework.
>   This isn't hard, since the framework already supports retrying, but we
>   need to do the plumbing.
> - I had to make a bunch of changes to our tooling to get this to work:
>   https://github.com/toddlipcon/kudu/commits/upstream-gerrit -- these
>   will have to get integrated. The most controversial one is changing
>   RELEASE precommit builds to use dynamic linking -- otherwise the test
>   binaries take 20GB of space and take forever to distribute to the
>   tester slaves.
>
> *What this means for you*
> Hopefully the switchover will be pretty smooth, and the result will be
> (a) faster tests, and (b) less worry that adding more tests will
> increase precommit times. Perhaps more importantly, this change will
> open up precommit testing to developers outside of Cloudera!
>
> Let me know if you have any questions. I'll send another email before
> making any changes to the "production" setup.
>
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
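On the flaky-retry TODO: the plumbing could be as simple as wrapping each test invocation in a bounded retry loop and reporting tests that pass on a retry as flaky rather than failed. A hypothetical sketch (not the framework's actual interface):

```python
import subprocess

def run_with_retries(cmd, max_attempts=3):
    """Run a test command, retrying on failure up to max_attempts.

    Returns (passed, attempts). A test that fails and then passes on a
    retry (passed=True, attempts>1) would be flagged as flaky instead
    of failing the precommit run.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True, attempt
    return False, max_attempts
```

The attempt count is the useful signal: logging it per test gives a cheap flakiness metric over time.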
