I can look into that. It would be great to get the thirdparty size down. - Dan
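[Editor's note: a first step in that investigation could be a quick size breakdown of the thirdparty tree, to confirm where the bulk (per the thread below, mostly LLVM debug symbols) actually lives. This sketch is illustrative only; the directory layout is assumed, not taken from the Kudu build scripts.]

```python
import os

def dir_size_bytes(path):
    """Recursively sum regular-file sizes under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

def size_report(thirdparty_root):
    """List each top-level component dir with its size, largest first."""
    sizes = {}
    for entry in os.listdir(thirdparty_root):
        full = os.path.join(thirdparty_root, entry)
        if os.path.isdir(full):
            sizes[entry] = dir_size_bytes(full)
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
```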
On Mon, Jan 11, 2016 at 6:12 PM, Todd Lipcon <[email protected]> wrote:

> On Mon, Jan 11, 2016 at 6:07 PM, Dan Burkert <[email protected]> wrote:
>
> > Sounds great. One thing I've noticed with local builds on a MacBook Pro
> > vs. remote builds on EC2 machines with 2x the cores and 2x the RAM is
> > that my laptop will build thirdparty in significantly less time. I
> > think this is because I have an SSD locally. We have tens of GB in
> > binaries (mostly LLVM debug symbols) which have to get copied around.
>
> Maybe we should kill '-g' for LLVM? Or try -gline-tables-only, or
> whatever that clang option is? It's theoretically supposed to reduce the
> size of the debug binaries significantly.
>
> Google does have "Local SSD" storage available, so we could get the
> builds happening on such a mount point to speed things up as well.
>
> -Todd
>
> > - Dan
> >
> > On Mon, Jan 11, 2016 at 1:02 AM, Todd Lipcon <[email protected]> wrote:
> >
> > > I spent some time this weekend working on setting up upstream Jenkins
> > > so we can move from the Cloudera-internal precommit builds to
> > > something visible to all developers.
> > >
> > > At the same time, I've been working on switching us over to fully
> > > distributed test running. The idea is this:
> > >
> > > *Builders*
> > > We have a small pool of builder Jenkins slaves. These slaves are
> > > fairly well-provisioned (lots of cores and fast IO) so that they can
> > > compile quickly, and are long-running so they can keep a hot ccache.
> > > I'm thinking it might even make sense to do something like run 4
> > > slaves in Docker containers on a single 'n1-highcpu-32' GCE instance,
> > > so each slave can burst to super fast speeds, rather than using
> > > static partitioning.
> > >
> > > Because these slaves are fast and have hot caches, a typical
> > > precommit job (where most files don't change) can get all the way
> > > built in 20-30 seconds.
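[Editor's note: the 20-30 second figure hinges on the hot ccache — most object files come back as cache hits rather than recompiles. The mechanism can be sketched with a toy content-hash cache; this is a simplified model for illustration, not ccache's actual implementation.]

```python
import hashlib

def cache_key(source_text, flags):
    """ccache-style key: hash of the source contents plus compiler flags."""
    h = hashlib.sha256()
    h.update(source_text.encode("utf-8"))
    h.update(b"\0".join(f.encode("utf-8") for f in flags))
    return h.hexdigest()

class CompileCache:
    """Returns a cached object for unchanged (source, flags) pairs."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn   # the real (slow) compiler
        self.store = {}                # key -> object code
        self.hits = 0
        self.misses = 0

    def compile(self, source_text, flags):
        key = cache_key(source_text, flags)
        if key in self.store:
            self.hits += 1             # unchanged file: no recompile
        else:
            self.misses += 1
            self.store[key] = self.compile_fn(source_text, flags)
        return self.store[key]
```

In a precommit build where most files are untouched, nearly every lookup is a hit, which is why a long-running slave with a warm cache finishes so much faster than a freshly provisioned one.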
> > > If thirdparty changed, it might add 10 minutes or so, but we could
> > > probably work on improving the parallelism of the thirdparty build
> > > as well.
> > >
> > > *Distributed test running*
> > > Once the builder has built all the tests, it uses the 'dist_test'
> > > script to submit the tests to be run on a cluster. The test slaves
> > > are preemptible n1-standard-4 VMs on GCE at the moment, set up with
> > > autoscaling. I've been testing with 30 instances running at a time,
> > > but it's easy to scale to arbitrary amounts of parallelism. These
> > > instances are cheap (6c/hr), so 30 slaves is quite affordable.
> > >
> > > The test runners download a test and its dependencies, run the test,
> > > and upload the results to S3. (It probably makes sense to switch to
> > > GCS at some point, but that's not a big deal.)
> > >
> > > *Result collection*
> > > The builder collects the results, and then runs the non-distributed
> > > tests (Python/Java) locally. The results are moved back into the
> > > normal test log directory so that Jenkins can parse them as usual.
> > >
> > > *Run-times*
> > > With the above setup, I'm getting runtimes of 6-12 minutes depending
> > > on the build type. I think there is still some room for improvement,
> > > but this is still about 6x faster than what we're getting today. The
> > > lowest-hanging fruit is probably to run the local Python/Java tests
> > > on the builder box while the distributed tests are running on the
> > > other machines, rather than sequentially. That should bring the
> > > times down another 2-3 minutes.
> > >
> > > *TODO*
> > > A few things need to be done before this can take the place of our
> > > existing internal gerrit jobs:
> > > - The auto-scaling behavior isn't working great at the moment. I
> > >   think we need to switch away from Google's autoscaling to
> > >   something custom (but simple).
> > > - We need to integrate flaky test retries into the dist test
> > >   framework. Not hard, since the framework already supports
> > >   retrying, but we need to do the plumbing.
> > > - I had to make a bunch of changes to our tooling to get this to
> > >   work: https://github.com/toddlipcon/kudu/commits/upstream-gerrit
> > >   -- these will have to get integrated. The most controversial one
> > >   is changing RELEASE precommit builds to use dynamic linking --
> > >   otherwise the test binaries take 20GB of space and take forever to
> > >   distribute to the tester slaves.
> > >
> > > *What this means for you*
> > > Hopefully the switchover should be pretty smooth, and the result
> > > will be (a) faster tests, and (b) less worry that adding more tests
> > > will increase precommit times. Perhaps more importantly, this change
> > > will open up precommit testing to developers outside of Cloudera!
> > >
> > > Let me know if you have any questions. I'll send another email
> > > before making any changes to the "production" setup.
> > >
> > > -Todd
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
