Hi Raúl,

Thanks for starting this thread. Flaky CI is a big challenge for us in Hadoop too. We're still struggling with it, but I can speak to a few things we've done that have helped improve the situation.
We recently rewrote our test-patch.sh with a lot of nice new functionality. (Credit goes to Allen Wittenauer.)

https://issues.apache.org/jira/browse/HADOOP-11746

The release notes field in the jira describes the new functionality. There is a lot to take in there, so here is a summary of my personal favorites:

1. It can run against any branch of the codebase, not just trunk, by following a simple naming convention that includes the branch name when uploading your patch.

2. It has some smarts to try to minimize execution time. For example, if the patch only changes shell scripts or documentation, then it assumes there is no need to run JUnit tests.

3. It eliminates race conditions during concurrent test-patch.sh runs that were caused by storing state in shared local files. The ZooKeeper test-patch.sh appears to be a fork of an older version of the Hadoop script, so the ZooKeeper script may be subject to similar race conditions.

4. It has hooks for pluggability, making it easier to add custom checks.

We could explore porting HADOOP-11746 over to ZooKeeper. It won't be as simple as copying it straight into the ZooKeeper repo, because some of the logic is specific to the Hadoop repo.

We've also had trouble with flaky tests. Unfortunately, these often require a ton of engineering time to fix. The typical root causes I've seen are:

1. Tests start servers bound to hard-coded port numbers, so if multiple test runs execute concurrently, one of them will get a bind exception. The solution is always to bind to an ephemeral port in test code.

2. Tests do not do proper resource cleanup. This can manifest as file descriptor leaks that eventually hit the open file descriptor limit, thread leaks, or two background threads trying to do the same job and interfering with one another. File descriptor leaks are particularly nasty for test runs on Windows, where the default file locking behavior can prevent subsequent tests from using a working directory for test data. The solution is to track these down and use try-finally, try-with-resources, JUnit @After, etc. to ensure clean-up.

3. Tests are non-deterministic, such as by hard-coding a sleep time to wait for an asynchronous action to complete. The solutions usually involve providing hooks into lower-layer logic, such as a callback from the asynchronous action, so that the test can be deterministic.

4. Tests hard-code the file path separator as '/', which doesn't work correctly on Windows. The code fixes for this are usually obvious once you spot the problem.

I've pasted a few rough sketches at the end of this mail to illustrate each of these fixes.

It can take a long time to track these down and fix them. To help people iterate faster, I've proposed another test-patch.sh enhancement that would allow the contributor to request that only specific tests are run on a patch:

https://issues.apache.org/jira/browse/HADOOP-11895

This would help engineers get quicker feedback, especially in the event that a test failure only repros on the Jenkins hosts.

In my experience (again primarily on Hadoop), flaky tests are much more common than bad Jenkins hosts. When there is a bad host, it's usually pretty obvious. Each Jenkins run reports the host that ran the job, so we can identify a trend of a particular problem happening on a particular host.
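For cause 1 (hard-coded ports), here is a rough, untested sketch of the ephemeral-port idea. If the server under test can bind to port 0 itself and report back the port it actually got, that is the cleanest fix; otherwise a small helper like this (PortUtil is just an illustrative name, not from either codebase) picks a free port, at the cost of a tiny window between close() and the server's own bind:

    import java.io.IOException;
    import java.net.ServerSocket;

    public final class PortUtil {
      private PortUtil() {}

      /**
       * Binds to port 0 so the OS assigns a free ephemeral port, then
       * releases it so the server under test can bind to it.  Avoids
       * bind exceptions when multiple test runs execute concurrently.
       */
      public static int getFreeEphemeralPort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
          return socket.getLocalPort();
        }
      }
    }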
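For cause 2 (resource cleanup), a minimal JUnit sketch. TestWithCleanup is a made-up test; the point is that @After runs whether the test passes or fails, and try-with-resources closes the reader even if an assertion throws, which is what keeps Windows from holding a lock on the working directory:

    import static org.junit.Assert.assertEquals;

    import java.io.BufferedReader;
    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;

    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class TestWithCleanup {
      private File workDir;

      @Before
      public void setUp() throws Exception {
        workDir = Files.createTempDirectory("test-data").toFile();
      }

      // Runs even when the test fails, so the working directory and its
      // file descriptors never leak into later tests.
      @After
      public void tearDown() {
        if (workDir != null) {
          for (File f : workDir.listFiles()) {
            f.delete();
          }
          workDir.delete();
        }
      }

      @Test
      public void testReadsData() throws Exception {
        File data = new File(workDir, "data.txt");
        Files.write(data.toPath(), "hello".getBytes(StandardCharsets.UTF_8));
        // try-with-resources closes the reader even if the assertion
        // throws, so the file handle is not left open.
        try (BufferedReader reader =
            Files.newBufferedReader(data.toPath(), StandardCharsets.UTF_8)) {
          assertEquals("hello", reader.readLine());
        }
      }
    }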
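For cause 3 (sleep-based waiting), a sketch of replacing Thread.sleep() with a callback plus a CountDownLatch. AsyncWorker and CompletionListener are stand-ins for whatever hook the real lower layer would expose; the timeout on await() is only a safety net against hangs, not a guess at how long the work takes:

    import static org.junit.Assert.assertTrue;

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.junit.Test;

    public class TestAsyncCompletion {

      /** Stand-in for a hook exposed by the lower-layer logic. */
      interface CompletionListener {
        void onComplete();
      }

      /** Stand-in for the production component doing asynchronous work. */
      static class AsyncWorker {
        void start(final CompletionListener listener) {
          new Thread(new Runnable() {
            @Override
            public void run() {
              // ... the real asynchronous work happens here ...
              listener.onComplete();
            }
          }).start();
        }
      }

      @Test
      public void testCompletesDeterministically() throws Exception {
        final CountDownLatch done = new CountDownLatch(1);
        new AsyncWorker().start(new CompletionListener() {
          @Override
          public void onComplete() {
            done.countDown();
          }
        });
        // Blocks until the callback fires instead of sleeping a
        // hard-coded amount of time and hoping the work has finished.
        assertTrue("async action did not complete",
            done.await(30, TimeUnit.SECONDS));
      }
    }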
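For cause 4 (hard-coded '/'), the fix usually just means letting java.io.File (or java.nio.file.Paths) compose the path. A trivial before/after with made-up paths:

    import java.io.File;

    public class PathSeparatorExample {
      public static void main(String[] args) {
        // Brittle: assumes '/' is the separator, which breaks on Windows.
        String brittle = "target" + "/" + "test-data" + "/" + "snapshot.log";

        // Portable: java.io.File inserts the platform-specific separator.
        File portable = new File(new File("target", "test-data"), "snapshot.log");

        System.out.println(brittle);
        System.out.println(portable.getPath());
      }
    }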
--Chris Nauroth

On 5/3/15, 10:28 AM, "Raúl Gutiérrez Segalés" <[email protected]> wrote:

>Hi all,
>
>This has probably come up before but do we have any thoughts on making CI
>better? Is the problem jenkins? Is it flaky tests? Bad CI workers? All of
>the above?
>
>I see we waste loads of time with trivial (or unrelated to the actual
>failures) patches triggering failed builds all the time. I'd like to spend
>some time improving our experience here, but would love some
>pointers/thoughts.
>
>Ideas?
>
>
>Cheers,
>-rgs
