Sounds great. One thing I've noticed comparing local builds on a MacBook Pro vs. remote builds on EC2 machines with 2x the cores and 2x the RAM is that my laptop builds thirdparty in significantly less time. I think this is because I have an SSD locally: we have tens of GB of binaries (mostly LLVM debug symbols) which have to get copied around.
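A quick way to sanity-check the SSD theory would be to time a large sequential write on both machines and compare. Here's a rough sketch (just illustrative, not part of our tooling):

```python
import os
import tempfile
import time

def write_throughput_mb_s(size_mb: int = 256) -> float:
    """Time a large sequential write and return throughput in MB/s."""
    chunk = b"\0" * (1024 * 1024)  # 1 MiB buffer
    with tempfile.NamedTemporaryFile(delete=False) as f:
        start = time.monotonic()
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force data to disk, not just the page cache
        elapsed = time.monotonic() - start
    os.unlink(f.name)
    return size_mb / elapsed

if __name__ == "__main__":
    print(f"sequential write: {write_throughput_mb_s():.0f} MB/s")
```

If the laptop's number is several times the EC2 instance's, that would explain the thirdparty gap better than cores or RAM.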
- Dan

On Mon, Jan 11, 2016 at 1:02 AM, Todd Lipcon <[email protected]> wrote:

> I spent some time this weekend working on setting up upstream Jenkins so
> we can move from the Cloudera-internal precommit builds to something
> visible to all developers.
>
> At the same time, I've been working on switching us over to fully
> distributed test running. The idea is this:
>
> *Builders*
> We have a small pool of builder Jenkins slaves. These slaves are fairly
> well-provisioned (lots of cores and fast IO) so that they can compile
> quickly, and are long-running so they can keep a hot ccache. I'm thinking
> it might even make sense to do something like run 4 slaves in Docker
> containers on a single 'n1-highcpu-32' GCE instance, so each slave can
> burst to super fast speeds, rather than statically partitioning the
> hardware.
>
> Because these slaves are fast and have hot caches, a typical precommit
> job (where most files don't change) can get all the way built in 20-30
> seconds. If thirdparty changed, it might add 10 minutes or so, but we
> could probably work on improving parallelism of the thirdparty build as
> well.
>
> *Distributed test running*
> Once the builder has built all the tests, it uses the 'dist_test' script
> to submit the tests to be run on a cluster. The test slaves are
> preemptible n1-standard-4 VMs on GCE at the moment, set up with
> autoscaling. I've been testing with 30 instances running at a time, but
> it's easy to scale to arbitrary amounts of parallelism. These instances
> are cheap (6c/hr), so 30 slaves is quite affordable.
>
> The test runners download a test and its dependencies, run the test, and
> upload the results to S3. (It probably makes sense to switch to GCS at
> some point, but it's not a big deal.)
>
> *Result collection*
> The builder collects the results, and then runs the non-distributed tests
> (Python/Java) locally. The results are moved back into the normal test
> log directory so that Jenkins can parse them as usual.
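For anyone picturing the submit side of this, here's a rough sketch of sharding built tests across slaves. The names here (`Task`, `shard_tests`) are purely illustrative, not the actual dist_test API:

```python
# Hypothetical sketch: the builder enumerates test binaries and
# round-robins them across N test slaves, so each slave downloads
# and runs a similar-sized shard.
from dataclasses import dataclass

@dataclass
class Task:
    slave: int
    tests: list

def shard_tests(tests, num_slaves):
    """Round-robin tests across slaves so each gets a balanced share."""
    shards = [Task(slave=i, tests=[]) for i in range(num_slaves)]
    for i, name in enumerate(sorted(tests)):
        shards[i % num_slaves].tests.append(name)
    return shards
```

Round-robin over a sorted list keeps shard sizes within one test of each other; a smarter version might weight by historical test runtime.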
> *Run-times*
> With the above setup, I'm getting runtimes of 6-12 minutes depending on
> the build type. I think there is still some room for improvement, but
> this is still about 6x faster than what we're getting today. The
> lowest-hanging fruit is probably to run the local Python/Java tests on
> the builder box while the distributed tests are running on the other
> machines, rather than sequentially. That should bring the times down
> another 2-3 minutes.
>
> *TODO*
> A few things need to be done before this can take the place of our
> existing internal gerrit jobs:
> - The auto-scaling behavior isn't working great at the moment. I think
>   we need to switch away from Google's autoscaling to something custom
>   (but simple).
> - We need to integrate flaky test retries into the dist_test framework.
>   This isn't hard, since the framework already supports retrying, but we
>   need to do the plumbing.
> - I had to make a bunch of changes to our tooling to get this to work:
>   https://github.com/toddlipcon/kudu/commits/upstream-gerrit -- these
>   will have to get integrated. The most controversial one is changing
>   RELEASE precommit builds to use dynamic linking -- otherwise the test
>   binaries take 20GB of space and take forever to distribute to the
>   tester slaves.
>
> *What this means for you*
> Hopefully the switchover will be pretty smooth, and the result will be
> (a) faster tests, and (b) less worry that adding more tests will
> increase precommit times. Perhaps more importantly, this change will
> open up precommit testing to developers outside of Cloudera!
>
> Let me know if you have any questions. I'll send another email before
> making any changes to the "production" setup.
>
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
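On the flaky-retry TODO: the plumbing could be as simple as wrapping each test invocation in a bounded retry loop and reporting tests that pass on a retry as flaky rather than failed. A hypothetical sketch (not the framework's actual interface):

```python
import subprocess

def run_with_retries(cmd, max_attempts=3):
    """Run a test command, retrying on failure up to max_attempts.

    Returns (passed, attempts). A test that fails and then passes on a
    retry (passed=True, attempts>1) would be flagged as flaky instead
    of failing the precommit run.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True, attempt
    return False, max_attempts
```

The attempt count is the useful signal: logging it per test gives a cheap flakiness metric over time.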
