Hello everyone,

I have a proposal, very much COVID-19-inspired, for how to fix our CI tests...

After the recent problems with CI, Daniel, Tomek and I
decided to make an emergency migration to GitHub Actions. So we did.

I think overall it was a good move, but we had some problems with it.
It turns out that while we were blaming Travis for everything wrong
that happened in our builds, it was not always Travis' fault. We have
some tests that are also failing in GitHub Actions and I think it's
high time we fixed them.

I spent the better part of the weekend trying different things and
bringing numerous optimizations back to our CI configuration (a
lot of those were lost during the emergency move).

While running it I hit many issues, and I think I found a good way to
handle our flaky tests. I would love to hear what others think about it.

Those interested - please take a look at the PR "Bring back CI
optimisations" https://github.com/apache/airflow/pull/8393
The corresponding GitHub Actions run is here:
https://github.com/apache/airflow/actions/runs/82410109

I implemented a lot of optimizations in this PR (some of them will
only take effect after we merge to master) but most of all I wanted to
introduce a concept of "quarantined tests" (good name, isn't it? :) ).

Here is the idea:

 - tests that are marked as @pytest.mark.quarantined are skipped in
regular runs (I identified 58 potential candidates - not all of them
are flaky but I wanted to be safe)
 - there is one dedicated "Quarantine" job that runs only quarantined
tests (it's Postgres 9.6 with Python 3.6 for now)
 - those "quarantined" tests are run with a 90 s timeout each and are
rerun up to 3 times if they fail
 - failure of any of the Quarantine tests does not fail the whole CI
 - I plan to create GitHub issues for groups of those tests
(MoveOutOfQuarantine NNNN)
 - I think it's best if we split them between committers
 - the job of the committers will be to observe the stability of those tests
 - once we fix the tests and observe that they are "stable", we move them
out of Quarantine back to regular tests (by removing
@pytest.mark.quarantined)
 - the goal is to move all our tests out of Quarantine
 - in the future we can move any flaky test to Quarantine (by adding
@pytest.mark.quarantined) and it will give us time to observe it and
fix any flakiness.
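
To make the idea concrete, here is a minimal sketch of how the marker
handling could look in a conftest.py. It is an illustration only, not the
code from the PR: the --run-quarantined flag and the split_quarantined
helper are hypothetical names, and the sketch deselects quarantined tests
in regular runs rather than reporting them as skipped.

```python
# conftest.py (illustrative sketch, not the code from the PR)

QUARANTINED = "quarantined"


def pytest_addoption(parser):
    # Hypothetical flag: only the dedicated Quarantine job would pass it.
    parser.addoption(
        "--run-quarantined",
        action="store_true",
        default=False,
        help="run only tests marked as quarantined",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", QUARANTINED + ": flaky test kept out of regular runs"
    )


def split_quarantined(items):
    """Return (regular, quarantined) lists of collected test items."""
    regular, quarantined = [], []
    for item in items:
        if item.get_closest_marker(QUARANTINED) is not None:
            quarantined.append(item)
        else:
            regular.append(item)
    return regular, quarantined


def pytest_collection_modifyitems(config, items):
    regular, quarantined = split_quarantined(items)
    if config.getoption("--run-quarantined"):
        # Quarantine job: keep only the quarantined tests.
        selected, deselected = quarantined, regular
    else:
        # Regular jobs: leave the quarantined tests out.
        selected, deselected = regular, quarantined
    if deselected:
        # Report the dropped tests as "deselected" in the pytest summary.
        config.hook.pytest_deselected(items=deselected)
    items[:] = selected
```

With something like this in place, regular jobs would run plain pytest and
the Quarantine job would add --run-quarantined. The 90 s timeout and the
reruns could come from the pytest-timeout and pytest-rerunfailures plugins
(--timeout=90 --reruns 3), assuming both are installed; this is one
possible setup, not necessarily what the PR does.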

Let me know what you think.

J.

-- 
Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129
