To follow up on this: we just finished pushing the changes that allow
this to work (thanks Adar and Dan for reviews). I enabled the gerrit
trigger plugin on the Jenkins instance, so you should start to see comments
from 'Kudu Jenkins' on any new patches you upload.

If you see flakiness, issues, or slow builds, feel free to ping me. I think
the turnaround should be faster with this setup (it auto-scales up to
100 VMs running tests), but as with any infrastructure change it might take
some time to iron out issues. You can keep using the existing gerrit
instance as an alternative/addition if this one fails on any given patch.

-Todd

On Mon, Jan 11, 2016 at 1:02 AM, Todd Lipcon <[email protected]> wrote:

> I spent some time this weekend working on setting up upstream Jenkins so we
> can move from the Cloudera-internal precommit builds to something visible
> to all developers.
>
> At the same time, I've been working on switching us over to fully
> distributed test running. The idea is this:
>
> *Builders*
> We have a small pool of builder Jenkins slaves. These slaves are fairly
> well-provisioned (lots of cores and fast IO) so that they can compile
> quickly, and are long-running so they can keep a hot ccache. I'm thinking
> it might even make sense to do something like run 4 slaves in docker
> containers on a single 'n1-highcpu-32' GCE instance, so each slave can
> burst to super fast speeds, rather than static partitioning.
>
> Because these slaves are fast and have hot caches, a typical precommit job
> (where most files don't change) can get all the way built in 20-30 seconds.
> If thirdparty changed, it might add 10 minutes or so, but we could probably
> work on improving parallelism of the thirdparty build as well.
>
> *Distributed test running*
> Once the builder has built all the tests, it uses the 'dist_test' script
> to submit the tests to be run on a cluster. The test slaves are preemptible
> n1-standard-4 VMs on GCE at the moment, set up with autoscaling. I've been
> testing with 30 instances running at a time, but it's easy to scale to
> arbitrary amounts of parallelism. These instances are cheap (6c/hr) so 30
> slaves is quite affordable.
>
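
As a rough sanity check on the cost figure above, here is a minimal sketch of the arithmetic. It assumes the quoted 6c/hr price applies uniformly to every test slave and ignores the builder machines, storage, and network; the function name is made up for illustration.

```python
# Assumption: flat price of 6 cents per instance-hour, as quoted above.
CENTS_PER_INSTANCE_HOUR = 6


def hourly_cost_cents(num_instances):
    """Return the hourly cost, in cents, of running num_instances test slaves."""
    return num_instances * CENTS_PER_INSTANCE_HOUR


cents = hourly_cost_cents(30)
print(f"${cents / 100:.2f}/hr")  # prints "$1.80/hr"
```

Working in integer cents avoids floating-point rounding in the multiplication.
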
> The test runners download a test and its dependencies, run the test, and
> upload the results to S3. (It probably makes sense to switch to GCS at some
> point, but that's not a big deal.)
>
> *Result collection*
> The builder collects the results, and then runs the non-distributed tests
> (python/java) locally. The results are moved back into the normal test log
> directory so that Jenkins can parse them as usual.
>
> *Run-times*
> With the above setup, I'm getting runtimes from 6-12 minutes depending on
> the build type. I think there is still some room for improvement, but this
> is still about 6x faster than what we're getting today. The lowest hanging
> fruit is probably to do the local python/java tests on the builder box
> while the distributed tests are running on the other machines, rather than
> sequentially. That should bring the times down another 2-3 minutes.
>
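
The overlap idea above could look roughly like the following. This is only a sketch: run_distributed_tests and run_local_tests are hypothetical placeholders (simulated with short sleeps), not the real dist_test or Jenkins job steps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_distributed_tests():
    """Hypothetical placeholder: submit tests to the cluster and wait for results."""
    time.sleep(0.2)  # stands in for waiting on the remote test slaves
    return "distributed: ok"


def run_local_tests():
    """Hypothetical placeholder: run the python/java tests on the builder box."""
    time.sleep(0.1)  # stands in for the local test run
    return "local: ok"


def run_overlapped():
    """Start both phases at once and wait for both to finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        dist = pool.submit(run_distributed_tests)
        local = pool.submit(run_local_tests)
        return dist.result(), local.result()


results = run_overlapped()
print(results)
```

With this shape the wall-clock time is governed by the slower of the two phases rather than by their sum, which is where the estimated 2-3 minute saving would come from.
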
> *TODO*
> A few things need to be done before this can take the place of our
> existing internal gerrit jobs:
> - the auto-scaling behavior isn't working great at the moment. I think we
> need to switch away from Google's autoscaling to something custom (but
> simple)
> - need to integrate flaky test retries into the dist test framework - not
> hard, since the framework already supports retrying, but need to do the
> plumbing
> - had to make a bunch of changes to our tooling to get this to work:
> https://github.com/toddlipcon/kudu/commits/upstream-gerrit  -- will have
> to get these integrated. The most controversial one is changing RELEASE
> precommit builds to use dynamic linking -- otherwise the test binaries take
> 20GB of space and take forever to distribute to the tester slaves.
>
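
For the flaky-test retry item in the list above, the behavior I have in mind is roughly the following. This is a generic sketch, not the dist test framework's actual retry interface; run_with_retries and its parameters are made up for illustration.

```python
def run_with_retries(run_test, max_attempts=3):
    """Call run_test() until it returns True or max_attempts is reached.

    Returns a (passed, attempts_used) tuple so that a test which only
    passed on a retry can still be reported as flaky.
    """
    for attempt in range(1, max_attempts + 1):
        if run_test():
            return True, attempt
    return False, max_attempts


# Simulated flaky test: fails twice, then passes.
outcomes = iter([False, False, True])
passed, attempts = run_with_retries(lambda: next(outcomes))
print(passed, attempts)  # prints "True 3"
```

Returning the attempt count alongside the pass/fail result keeps flakiness visible instead of hiding it behind the retry.
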
>
> *What this means for you*
> Hopefully the switchover should be pretty smooth, and the result will be
> (a) faster tests, and (b) less worry that adding more tests will increase
> precommit times. Perhaps more importantly, this change will open up
> precommit testing to developers outside of Cloudera!
>
> Let me know if you have any questions. I'll send another email before
> making any changes to the "production" setup.
>
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
