Hello everyone,

Just wanted to let you know that last week we merged quite an overhaul of the CI architecture we have in GitHub Actions.

TL;DR: CI builds should be quite a bit faster and much more stable, and, as a side effect, it should be super-easy to reproduce and diagnose any CI failure locally.

There are a few PRs left to merge (solving some teething problems and adding some optimizations), and we might need to implement one workaround for a missing GitHub API, but it looks pretty good after a few days of watching.

The gist of the change is that we were able to start using the new "workflow_run" feature of GitHub Actions, which allows us to build each image only once and reuse it for all the jobs. Previously those images were built (using the latest sources) for every single job.

Some stats for average runs (the gains are way bigger in situations where Python has released a new patch-level version):

- Prepare image job: 5 minutes 30 seconds -> 1 minute 7 seconds (~80% improvement)
- Longest job time: 34 minutes -> 29 minutes 30 seconds (~13% improvement in the longest job)
- Build time saved per build (!): 27 jobs * 4.5 minutes ~ 2h of machine build time saved for each build (!)

This change should also improve overall stability. There were a number of problems where building the image failed; this should now be ~10x less likely to happen, as we build images only 3 times instead of ~30. As a result, we are better citizens, and it also means we should have far less queuing time when several PRs start in quick succession.

Also, as a side effect but an important one, we now have a super-easy way to reproduce any failure in CI. This is the final setup I had in mind when I implemented Breeze. Now anyone can just log in to the GitHub registry and run this:

`breeze --github-image-id <RUN_ID> --backend <BACKEND> --python <X.Y>`

You should then be dropped into the EXACT same environment that was used for a particular failed "run" in GitHub Actions, including the Airflow sources used for it. You do not have to check out the code etc.
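
For anyone who wants to script this, here is a minimal Python sketch that only assembles the command quoted above from its three placeholders. The helper name `breeze_repro_command` and the example values are hypothetical, chosen purely for illustration. It uses just the three flags shown in this email, does not run Breeze, and does not handle the registry login.

```python
import re
import shlex
from typing import List


def breeze_repro_command(run_id: str, backend: str, python_version: str) -> List[str]:
    """Return the argv for the reproduce command quoted in this email.

    Hypothetical helper: it only fills in <RUN_ID>, <BACKEND> and <X.Y>;
    nothing is executed here.
    """
    if not run_id.strip():
        raise ValueError("run_id must not be empty")
    if not backend.strip():
        raise ValueError("backend must not be empty")
    # The email writes the Python version as <X.Y>, so require that shape.
    if not re.fullmatch(r"\d+\.\d+", python_version):
        raise ValueError(
            "python_version must look like X.Y, got {!r}".format(python_version)
        )
    return [
        "breeze",
        "--github-image-id", run_id,
        "--backend", backend,
        "--python", python_version,
    ]


if __name__ == "__main__":
    # Illustrative values only: substitute the id of the failed run and the
    # backend / Python version of the failed job.
    argv = breeze_repro_command("123456789", "postgres", "3.6")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the command as a list of arguments (rather than one string) means it can be handed to `subprocess.run` later without any shell quoting concerns, if you decide to actually launch it from a script.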

Having that exact environment locally means that you (or anyone else trying to help) should be able to re-run most of the failed tests and reproduce the failures (and try to fix them).

Documentation with all the details and the commands you can use is coming in https://github.com/apache/airflow/pull/10380 - happy to get some reviews.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129