Faster builds on CI + increased stability + easier to reproduce CI problems

Jarek Potiuk Sat, 22 Aug 2020 08:12:56 -0700

Hello everyone,

Just wanted to let you know that last week we merged quite an overhaul of
the CI architecture we have in GitHub Actions.


TL;DR: It should be faster, more stable, and it should be super-easy to
reproduce any CI failure locally.

CI builds should now be quite a bit faster, much more stable and, as a side
effect, easy to diagnose. There are a few PRs left to merge that solve
some teething problems and add some optimizations, and we might need to
implement one workaround for a missing GitHub API, but it looks pretty good
after a few days of watching.

The gist of the change is that we can now use the new "workflow_run"
feature of GitHub Actions, which allows us to build each image only once
and reuse it for all the jobs. Previously those images were built (using
the latest sources) for every single job.

Some stats for average runs (the gains are way bigger when Python releases
a new patch-level version):

   - Prepare image job: 5 minutes 30 seconds -> 1 minute 7 seconds (~80%
   improvement)
   - Longest job time: 34 minutes -> 29 minutes 30 seconds (~13%
   improvement in longest job)
   - Build time saved per build (!): 27 jobs * 4.5 minutes ~ 2h of machine
   build time saved for each build (!)
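
For anyone who wants to check the arithmetic, here is a small Python sketch
that recomputes the percentages and the saved machine time from the figures
quoted in the list. The function and variable names are only illustrative.

```python
def percent_improvement(before_s: float, after_s: float) -> float:
    """Relative reduction in duration, as a percentage of the original."""
    return (before_s - after_s) / before_s * 100


# Figures quoted in the list above, converted to seconds.
prepare_before = 5 * 60 + 30   # 5 minutes 30 seconds
prepare_after = 1 * 60 + 7     # 1 minute 7 seconds
longest_before = 34 * 60       # 34 minutes
longest_after = 29 * 60 + 30   # 29 minutes 30 seconds

prepare_gain = percent_improvement(prepare_before, prepare_after)
longest_gain = percent_improvement(longest_before, longest_after)

# 27 jobs, each saving roughly 4.5 minutes of image building.
saved_minutes = 27 * 4.5

print(f"Prepare image job: {prepare_gain:.1f}% faster")
print(f"Longest job: {longest_gain:.1f}% faster")
print(f"Machine time saved per build: {saved_minutes:.1f} min "
      f"(~{saved_minutes / 60:.1f} h)")
```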

This change should also improve overall stability. There were a number of
problems where building the image failed. This should now be ~10x less
likely to happen, as we build images only 3 times instead of ~30.
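
A rough sketch of where the ~10x comes from, assuming every image build
fails independently with the same small probability. That independence and
the failure rate `p` used here are simplifying assumptions for
illustration, not measured values.

```python
def failure_probability(per_build_failure: float, builds: int) -> float:
    """Chance that at least one of `builds` independent image builds fails."""
    return 1 - (1 - per_build_failure) ** builds


# Hypothetical per-build failure rate, chosen only for illustration.
p = 0.005

before = failure_probability(p, 30)  # image built in ~30 jobs
after = failure_probability(p, 3)    # image built only 3 times

print(f"~30 builds: {before:.2%} chance of a failed image build")
print(f"3 builds:   {after:.2%} chance of a failed image build")
print(f"ratio: {before / after:.1f}x")
```

For small failure rates the ratio stays close to the ratio of build counts
(30 / 3), which is why fewer image builds translate almost directly into
fewer failed runs.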

As a result, we are better citizens, and it also means we should have much
less queuing time when several PRs start in quick succession.

Also, as a side effect but an important one, we now have a super-easy way
to reproduce any failure in CI. This is the final setup I had in mind when
I implemented Breeze. Now anyone can just log in to the GitHub registry and
run this:

`breeze --github-image-id <RUN_ID> --backend <BACKEND> --python <X.Y>`

Then you should be dropped into the EXACT same environment that was used
for a particular failed "run" in GitHub Actions, including the Airflow
sources used for that run. You do not have to check out the code etc.
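
If you end up reproducing failures often, the command can be assembled
from a script. The sketch below is only an illustration: the helper name
`breeze_repro_command` is made up, and it uses just the three flags shown
in the command above, with the same placeholders.

```python
import shlex
from typing import List


def breeze_repro_command(run_id: str, backend: str, python_version: str) -> List[str]:
    """Build the argument list for the breeze invocation shown above."""
    return [
        "breeze",
        "--github-image-id", run_id,
        "--backend", backend,
        "--python", python_version,
    ]


# Placeholder values: take the real ones from the failed run you want to
# reproduce.
cmd = breeze_repro_command("<RUN_ID>", "<BACKEND>", "<X.Y>")

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch it
directly, assuming `breeze` can be found on your PATH and you are already
logged in to the GitHub registry.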

This means that you (or anyone else trying to help) should be able to
re-run most of the failed tests locally and reproduce the failures (and try
to fix them).

Documentation with all the details and commands you can use is coming in
https://github.com/apache/airflow/pull/10380 - happy to get some reviews.

J.

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
