Great email. As a tactical step, does it make sense to back out the qemu-based builds from the main pipeline while we work on the timeout issues?
Adam

> On Jul 26, 2019, at 5:29 PM, Joan Touzet <woh...@apache.org> wrote:
>
> Hello again,
>
> Adam poked me on IRC today asking a few questions about the state of Jenkins,
> and why we're not generating test binaries for download.
>
> The reason is simple: the tests are failing.
>
> I've discussed this topic before twice at length with little feedback:
>
> https://lists.apache.org/thread.html/6e2bedbbf5c2b28af4237d0936dc21f056fdafa2ea0c0b457285b9dc@%3Cdev.couchdb.apache.org%3E
>
> https://lists.apache.org/thread.html/16a310e3342d3f1ca73fb85f62829b76bbfa3759e418386b07e2827f@%3Cdev.couchdb.apache.org%3E
>
>
> I have 4 specific proposals to get us back on track:
>
> 1. Get more targeted build workers for ppc64le and aarch64 platforms.
>
>    This is critical while we wait for #4 below. By having >1 hardware
>    platform to build on for each of these, we can hopefully pass those
>    architectures regularly, and start building real downloads and Docker
>    images for each of these. I know the user community really wants this.
>
>    If we get at least 2 of each worker, I'll change Jenkinsfile to use
>    those tagged workers rather than the qemu emulation we currently
>    have (and is failing).
>
>
> 2. Receive and provision the new CouchDB Jenkins build machine. IBM is
>    being very generous in getting this set up, and Paul Davis mentioned
>    the machine should be ready in the very near future.
>
>    Provisioning will have to include Docker + the qemu support. See
>    https://issues.apache.org/jira/browse/INFRA-18322 for details on that
>    and https://issues.apache.org/jira/browse/INFRA-17404 for the general
>    provisioning approach (we download Jenkins .jar from the ASF machine,
>    set it up to be `runit`-run on boot, run as many as we can on the
>    machine (I think the HW was selected to run 8 of these at once),
>    install the prerequisites, and request the 8x worker+password infos
>    from ASF Infra).
>
>    We have a choice: do we set this up just as 8x Jenkins workers, or do
>    we also start running our own Jenkins master (potentially on
>    couchdb-vm2)? The motivation to do the latter would be to add
>    credentials that could be used for automatic uploading of binaries to
>    places like bintray and Docker. (I am currently engaged with Infra in
>    trying to solve this for many projects, including Apache OpenWhisk.
>    One of the major limiting factors is that the shared ASF Jenkins
>    master's credentials can be accessed by all users on the server. This
>    is obviously a security nightmare.)
>
>    At the moment, we are "OK" using the ASF Jenkins master instance. But
>    as soon as we start depending on this service widely (see below) it'll
>    be very disruptive to take it down, even for a day or two. So it may
>    be best to make this decision sooner rather than later.
>
>    I'll be in touch with Infra next week on the global "automated
>    binary builds" issue, and will ask for guidance at that time.
>
> 3. Switch our PR gate on GitHub from Travis CI to Jenkins CI. This way,
>    people won't be blocked on PRs waiting forever anymore, since we'll
>    have a lot of compute resources at our disposal. That said,
>    **PEOPLE HAVE TO START FIXING THE INTERMITTENT TEST CASE FAILURES**
>    or we'll be right back to "Hey, it didn't pass...I'll just click
>    Retry" again. 😒 🤢 This will have to be a team effort.
>
> 4. Get rid of all timeouts in all test cases. A few proposals for this
>    were made in the context of ExUnit. Can we get some more progress
>    here?
>
>    https://github.com/apache/couchdb/issues/2030
>    https://github.com/apache/couchdb/pull/2039
>
> 5. Once 4 is done, we can consider moving aarch64/ppc64le/other binary
>    builds to qemu support, meaning we can test all platforms just on
>    simple x86_64 machines. It's not a required move, but if we lose
>    access to the other platforms, or they go down, it's a backup
>    strategy.
>
> What do people think?