Re: Getting automated builds back on track

Adam Kocoloski Fri, 26 Jul 2019 16:24:42 -0700

Great email.

As a tactical step, does it make sense to back out the qemu-based builds from 
the main pipeline while we work on the timeout issues?


Adam

> On Jul 26, 2019, at 5:29 PM, Joan Touzet <[email protected]> wrote:
> 
> Hello again,
> 
> Adam poked me on IRC today asking a few questions about the state of Jenkins, 
> and why we're not generating test binaries for download.
> 
> The reason is simple: the tests are failing.
> 
> I've discussed this topic at length twice before, with little feedback:
> 
> https://lists.apache.org/thread.html/6e2bedbbf5c2b28af4237d0936dc21f056fdafa2ea0c0b457285b9dc@%3Cdev.couchdb.apache.org%3E
> 
> https://lists.apache.org/thread.html/16a310e3342d3f1ca73fb85f62829b76bbfa3759e418386b07e2827f@%3Cdev.couchdb.apache.org%3E
> 
> 
> I have 5 specific proposals to get us back on track:
> 
>  1. Get more targeted build workers for ppc64le and aarch64 platforms.
> 
>     This is critical while we wait for #4 below. By having >1 hardware
>     platform to build on for each of these, we can hopefully pass those
>     architectures regularly, and start building real downloads and Docker
>     images for each of these. I know the user community really wants this.
> 
>     If we get at least 2 of each worker, I'll change the Jenkinsfile to
>     use those tagged workers rather than the qemu emulation we currently
>     have (which is failing).
> 
> 
>  2. Receive and provision the new CouchDB Jenkins build machine. IBM is
>     being very generous in getting this set up, and Paul Davis mentioned
>     the machine should be ready in the very near future.
> 
>     Provisioning will have to include Docker + the qemu support. See
>     https://issues.apache.org/jira/browse/INFRA-18322 for details on that
>     and https://issues.apache.org/jira/browse/INFRA-17404 for the general
>     provisioning approach: we download the Jenkins .jar from the ASF
>     machine, set it up to be `runit`-run on boot, run as many as we can on
>     the machine (I think the HW was selected to run 8 of these at once),
>     install the prerequisites, and request the 8x worker+password info
>     from ASF Infra.
> 
>     We have a choice: do we set this up just as 8x Jenkins workers, or do
>     we also start running our own Jenkins master (potentially on
>     couchdb-vm2)? The motivation to do the latter would be to add
>     credentials that could be used for automatic uploading of binaries to
>     places like bintray and Docker. (I am currently engaged with Infra in
>     trying to solve this for many projects, including Apache OpenWhisk.
>     One of the major limiting factors is that the shared ASF Jenkins
>     master's credentials can be accessed by all users on the server. This
>     is obviously a security nightmare.)
> 
>     At the moment, we are "OK" using the ASF Jenkins master instance. But
>     as soon as we start depending on this service widely (see below) it'll
>     be very disruptive to take it down, even for a day or two. So it may
>     be best to make this decision sooner rather than later.
> 
>     I'll be in touch with Infra next week on the global "automated
>     binary builds" issue, and will ask for guidance at that time.
> 
>  3. Switch our PR gate on GitHub from Travis CI to Jenkins CI. This way,
>     people won't be blocked on PRs waiting forever anymore, since we'll
>     have a lot of compute resources at our disposal. That said,
>     **PEOPLE HAVE TO START FIXING THE INTERMITTENT TEST CASE FAILURES**
>     or we'll be right back to "Hey, it didn't pass...I'll just click
>     Retry" again. 😒 🤢  This will have to be a team effort.
> 
>  4. Get rid of all timeouts in all test cases. A few proposals for this
>     were made in the context of ExUnit. Can we get some more progress
>     here?
> 
>     https://github.com/apache/couchdb/issues/2030
>     https://github.com/apache/couchdb/pull/2039
> 
>  5. Once 4 is done, we can consider moving aarch64/ppc64le/other binary
>     builds to qemu support, meaning we can test all platforms just on
>     simple x86_64 machines. It's not a required move, but if we lose
>     access to the other platforms, or they go down, it's a backup
>     strategy.
> 
> What do people think?
