Re: Update on upstream (and distributed) testing

Todd Lipcon Mon, 11 Jan 2016 18:13:38 -0800

On Mon, Jan 11, 2016 at 6:07 PM, Dan Burkert <[email protected]> wrote:


> Sounds great.  One thing I've noticed with local builds on a MacBook Pro
> vs. remote builds on an EC2 machine with 2x the cores and 2x the RAM is
> that my laptop will build thirdparty in significantly less time.  I think
> this is because I have an SSD locally.  We have tens of GB of binaries
> (mostly LLVM debug symbols) which have to get copied around.
>

Maybe we should drop '-g' for LLVM? Or try clang's -gline-tables-only
option? It's supposed to reduce the size of the debug binaries
significantly.

Google does have "Local SSD" storage available, so we could run the builds
on such a mount point to speed things up as well.

-Todd


> - Dan
>
> On Mon, Jan 11, 2016 at 1:02 AM, Todd Lipcon <[email protected]> wrote:
>
> > I spent some time this weekend working on setting up upstream Jenkins so
> > we can move from the Cloudera-internal precommit builds to something
> > visible to all developers.
> >
> > At the same time, I've been working on switching us over to fully
> > distributed test running. The idea is this:
> >
> > *Builders*
> > We have a small pool of builder Jenkins slaves. These slaves are fairly
> > well-provisioned (lots of cores and fast IO) so that they can compile
> > quickly, and are long-running so they can keep a hot ccache. I'm thinking
> > it might even make sense to do something like run 4 slaves in docker
> > containers on a single 'n1-highcpu-32' GCE instance, so each slave can
> > burst to super fast speeds, rather than static partitioning.
> >
> > Because these slaves are fast and have hot caches, a typical precommit
> > job (where most files don't change) can get all the way built in 20-30
> > seconds. If thirdparty changed, it might add 10 minutes or so, but we
> > could probably work on improving parallelism of the thirdparty build as
> > well.
> >
> > *Distributed test running*
> > Once the builder has built all the tests, it uses the 'dist_test' script
> > to submit the tests to be run on a cluster. The test slaves are
> > preemptible n1-standard-4 VMs on GCE at the moment, set up with
> > autoscaling. I've been testing with 30 instances running at a time, but
> > it's easy to scale to arbitrary amounts of parallelism. These instances
> > are cheap (6c/hr) so 30 slaves is quite affordable.
> >
> > The test runners download a test and its dependencies, run the test, and
> > upload the results to S3. (It probably makes sense to switch to GCS at
> > some point, but it's not a big deal.)
> >
> > *Result collection*
> > The builder collects the results, and then runs the non-distributed tests
> > (python/java) locally. The results are moved back into the normal test
> > log directory so that Jenkins can parse them as usual.
> >
> > *Run-times*
> > With the above setup, I'm getting runtimes from 6-12 minutes depending on
> > the build type. I think there is still some room for improvement, but
> > this is still about 6x faster than what we're getting today. The
> > lowest-hanging fruit is probably to run the local python/java tests on
> > the builder box while the distributed tests are running on the other
> > machines, rather than sequentially. That should bring the times down
> > another 2-3 minutes.
> >
> > *TODO*
> > A few things need to be done before this can take the place of our
> > existing internal gerrit jobs:
> > - the auto-scaling behavior isn't working great at the moment. I think we
> > need to switch away from Google's autoscaling to something custom (but
> > simple)
> > - need to integrate flaky test retries into the dist test framework - not
> > hard, since the framework already supports retrying, but need to do the
> > plumbing
> > - had to make a bunch of changes to our tooling to get this to work:
> > https://github.com/toddlipcon/kudu/commits/upstream-gerrit -- will have
> > to get these integrated. The most controversial one is changing RELEASE
> > precommit builds to use dynamic linking -- otherwise the test binaries
> > take 20GB of space and take forever to distribute to the tester slaves.
> >
> >
> > *What this means for you*
> > Hopefully the switchover should be pretty smooth, and the result will be
> > (a) faster tests, and (b) less worry that adding more tests will increase
> > precommit times. Perhaps more importantly, this change will open up
> > precommit testing to developers outside of Cloudera!
> >
> > Let me know if you have any questions. I'll send another email before
> > making any changes to the "production" setup.
> >
> > -Todd
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
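
The run-time idea in the quoted mail above (run the local python/java tests
on the builder while the distributed tests are in flight, rather than
sequentially) could look roughly like the following. This is only a sketch:
run_distributed_tests, run_local_tests, and the minute values are
hypothetical stand-ins, not the actual dist_test tooling or measured numbers.

```python
from concurrent.futures import ThreadPoolExecutor


def run_distributed_tests():
    # Hypothetical stand-in: submit tests to the cluster and wait for results.
    return {"suite": "distributed", "passed": True}


def run_local_tests():
    # Hypothetical stand-in: run the python/java tests on the builder box.
    return {"suite": "local", "passed": True}


def run_overlapped():
    # Start both phases at once; the builder waits for whichever ends last.
    with ThreadPoolExecutor(max_workers=2) as pool:
        dist = pool.submit(run_distributed_tests)
        local = pool.submit(run_local_tests)
        return [dist.result(), local.result()]


def wall_clock_minutes(dist_minutes, local_minutes, overlapped):
    # Sequential phases add up; overlapped phases cost only the longer one.
    if overlapped:
        return max(dist_minutes, local_minutes)
    return dist_minutes + local_minutes


if __name__ == "__main__":
    # Illustrative numbers only: 7 min distributed, 3 min local.
    print(wall_clock_minutes(7, 3, overlapped=False))  # 10
    print(wall_clock_minutes(7, 3, overlapped=True))   # 7
    print(run_overlapped())
```

With these illustrative numbers, overlapping hides the whole local-test
phase, which is in line with the 2-3 minute saving estimated in the quoted
mail.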



-- 
Todd Lipcon
Software Engineer, Cloudera
