I can look into that. It would be great to get the thirdparty size down. - Dan
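[Editor's note: a first step in that investigation could be a quick size breakdown of the thirdparty tree, to confirm where the bulk (per the thread below, mostly LLVM debug symbols) actually lives. This sketch is illustrative only; the directory layout is assumed, not taken from the Kudu build scripts.]

```python
import os

def dir_size_bytes(path):
    """Recursively sum regular-file sizes under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

def size_report(thirdparty_root):
    """List each top-level component dir with its size, largest first."""
    sizes = {}
    for entry in os.listdir(thirdparty_root):
        full = os.path.join(thirdparty_root, entry)
        if os.path.isdir(full):
            sizes[entry] = dir_size_bytes(full)
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
```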
On Mon, Jan 11, 2016 at 6:12 PM, Todd Lipcon <[email protected]> wrote:

> On Mon, Jan 11, 2016 at 6:07 PM, Dan Burkert <[email protected]> wrote:
>
> > Sounds great. One thing I've noticed with local builds on a MacBook Pro
> > vs. remote builds on EC2 machines with 2x the cores and 2x the RAM is
> > that my laptop will build thirdparty in significantly less time. I
> > think this is because I have an SSD locally. We have tens of GB in
> > binaries (mostly LLVM debug symbols) which have to get copied around.
>
> Maybe we should kill '-g' for LLVM? Or try -gline-tables-only, or
> whatever that clang option is? It's theoretically supposed to reduce the
> size of the debug binaries significantly.
>
> Google does have "Local SSD" storage available, so we could get the
> builds happening on such a mount point to speed things up as well.
>
> -Todd
>
> > - Dan
> >
> > On Mon, Jan 11, 2016 at 1:02 AM, Todd Lipcon <[email protected]> wrote:
> >
> > > I spent some time this weekend working on setting up upstream Jenkins
> > > so we can move from the Cloudera-internal precommit builds to
> > > something visible to all developers.
> > >
> > > At the same time, I've been working on switching us over to fully
> > > distributed test running. The idea is this:
> > >
> > > *Builders*
> > > We have a small pool of builder Jenkins slaves. These slaves are
> > > fairly well-provisioned (lots of cores and fast IO) so that they can
> > > compile quickly, and are long-running so they can keep a hot ccache.
> > > I'm thinking it might even make sense to do something like run 4
> > > slaves in Docker containers on a single 'n1-highcpu-32' GCE instance,
> > > so each slave can burst to super fast speeds, rather than using
> > > static partitioning.
> > >
> > > Because these slaves are fast and have hot caches, a typical
> > > precommit job (where most files don't change) can get all the way
> > > built in 20-30 seconds.
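[Editor's note: the 20-30 second figure hinges on the hot ccache — most object files come back as cache hits rather than recompiles. The mechanism can be sketched with a toy content-hash cache; this is a simplified model for illustration, not ccache's actual implementation.]

```python
import hashlib

def cache_key(source_text, flags):
    """ccache-style key: hash of the source contents plus compiler flags."""
    h = hashlib.sha256()
    h.update(source_text.encode("utf-8"))
    h.update(b"\0".join(f.encode("utf-8") for f in flags))
    return h.hexdigest()

class CompileCache:
    """Returns a cached object for unchanged (source, flags) pairs."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn   # the real (slow) compiler
        self.store = {}                # key -> object code
        self.hits = 0
        self.misses = 0

    def compile(self, source_text, flags):
        key = cache_key(source_text, flags)
        if key in self.store:
            self.hits += 1             # unchanged file: no recompile
        else:
            self.misses += 1
            self.store[key] = self.compile_fn(source_text, flags)
        return self.store[key]
```

In a precommit build where most files are untouched, nearly every lookup is a hit, which is why a long-running slave with a warm cache finishes so much faster than a freshly provisioned one.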
> > > If thirdparty changed, it might add 10 minutes or so, but we could
> > > probably work on improving the parallelism of the thirdparty build
> > > as well.
> > >
> > > *Distributed test running*
> > > Once the builder has built all the tests, it uses the 'dist_test'
> > > script to submit the tests to be run on a cluster. The test slaves
> > > are preemptible n1-standard-4 VMs on GCE at the moment, set up with
> > > autoscaling. I've been testing with 30 instances running at a time,
> > > but it's easy to scale to arbitrary amounts of parallelism. These
> > > instances are cheap (6c/hr), so 30 slaves is quite affordable.
> > >
> > > The test runners download a test and its dependencies, run the test,
> > > and upload the results to S3. (It probably makes sense to switch to
> > > GCS at some point, but that's not a big deal.)
> > >
> > > *Result collection*
> > > The builder collects the results, and then runs the non-distributed
> > > tests (Python/Java) locally. The results are moved back into the
> > > normal test log directory so that Jenkins can parse them as usual.
> > >
> > > *Run-times*
> > > With the above setup, I'm getting runtimes of 6-12 minutes depending
> > > on the build type. I think there is still some room for improvement,
> > > but this is still about 6x faster than what we're getting today. The
> > > lowest-hanging fruit is probably to run the local Python/Java tests
> > > on the builder box while the distributed tests are running on the
> > > other machines, rather than sequentially. That should bring the
> > > times down another 2-3 minutes.
> > >
> > > *TODO*
> > > A few things need to be done before this can take the place of our
> > > existing internal gerrit jobs:
> > > - The auto-scaling behavior isn't working great at the moment. I
> > >   think we need to switch away from Google's autoscaling to
> > >   something custom (but simple).
> > > - We need to integrate flaky test retries into the dist test
> > >   framework. Not hard, since the framework already supports
> > >   retrying, but we need to do the plumbing.
> > > - I had to make a bunch of changes to our tooling to get this to
> > >   work: https://github.com/toddlipcon/kudu/commits/upstream-gerrit
> > >   -- these will have to get integrated. The most controversial one
> > >   is changing RELEASE precommit builds to use dynamic linking --
> > >   otherwise the test binaries take 20GB of space and take forever to
> > >   distribute to the tester slaves.
> > >
> > > *What this means for you*
> > > Hopefully the switchover should be pretty smooth, and the result
> > > will be (a) faster tests, and (b) less worry that adding more tests
> > > will increase precommit times. Perhaps more importantly, this change
> > > will open up precommit testing to developers outside of Cloudera!
> > >
> > > Let me know if you have any questions. I'll send another email
> > > before making any changes to the "production" setup.
> > >
> > > -Todd
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
