*TL;DR: I have a proposal for how we can remedy the frequently failing CI
tests, and a kind request to other committers to help me fix this properly.*

As we have all noticed, tests in Travis have recently been failing often
(far too often). The situation is not good and we have to remedy it fairly
quickly.

I think we can do it without compromising quality and without temporarily
disabling some of the tests.

*Root cause of the problem*

The root cause of the problem seems to be the memory used during tests.
After adding instafail we know that the tests often fail because there is
not enough memory on the Travis machines. This is a combination of the way
our virtual machines are allocated in the Travis infrastructure, the
growing number of tests we have, the fact that our tests require a lot of
"integrations" (running as separate images - cassandra, rabbitmq,
postgres/mysql databases etc.) and the fact that we run them with pytest
(pytest apparently uses more memory).

*Proposal*

One of the proposals on Slack was to get rid of the Cassandra tests and
disable Cassandra temporarily - but I think we can do better: I can get
this merged in a day or two and have it sorted out for now (and in good
shape for the future).

I recently wrote an integration test proposal
<https://lists.apache.org/thread.html/120af497f4adf482162be9583a93651fd206b71db213255f52ad8b7a%40%3Cdev.airflow.apache.org%3E>
(I will resurrect that thread now) on how we can split our integration
tests into separate integrations using pytest markers, with support from
Breeze and our CI testing framework. I already have working code for that
(it is a result of my resumed work on the Production Image); most of the
code is already green in Travis and needs a review from other committers.
The updated docs describing how this will work are linked below.

Once we have that in, we will have a very easy (and maintainable) way to
reduce CI resource usage and also make Breeze far more usable (and less
resource-hungry):

   - add pytest markers so that we know which tests are "integration" ones
   - start Breeze locally without any external integrations (right now only
   Kerberos is needed) - most of the tests work there, with far less
   resource usage
   - start Breeze easily with *--integration mongo --integration cassandra*
   etc. whenever we need to run tests for that integration
   - run all the "non-integration" tests in CI without the integrations
   started
   - run only the "integration-related" tests in CI with the integrations
   started
   - we will have more jobs in CI but they should run much more reliably
   and faster in general
   - one of the changes also improves the way we build kind/kubernetes
   tests in order to unblock the migration to GitHub Actions that Tomek is
   working on - that might be our ultimate speedup/stabilisation
   - for those who are curious, the updated documentation is in my PR:
   "Launching Breeze integrations"
   <https://github.com/PolideaInternal/airflow/blob/separate-integrations/BREEZE.rst#launching-breeze-integrations>
   and "Running tests with Kubernetes in Breeze"
   <https://github.com/PolideaInternal/airflow/blob/separate-integrations/BREEZE.rst#running-tests-with-kubernetes-in-breeze>
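
To make the marker idea above more concrete, here is a minimal sketch of
how a conftest.py could skip integration tests whose integration is not
started. It is only an illustration, not the code from the PR: the marker
name "integration", the pytest-level --integration option and the helper
missing_integrations() are all assumptions made for this example.

```python
# conftest.py - illustrative sketch only. The "integration" marker, the
# pytest-level --integration option and missing_integrations() are
# assumptions for this example, not the code from the PR.


def missing_integrations(required, enabled):
    """Return the required integrations that are not enabled, in order."""
    enabled = set(enabled)
    return [name for name in required if name not in enabled]


def pytest_addoption(parser):
    # Repeatable option, e.g. --integration mongo --integration cassandra
    parser.addoption(
        "--integration",
        action="append",
        default=[],
        help="Integration to enable; may be given several times.",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers",
        "integration(name): test needs the named integration running",
    )


def pytest_runtest_setup(item):
    # Imported here so the helper above stays usable without pytest.
    import pytest

    required = [m.args[0] for m in item.iter_markers(name="integration")]
    enabled = item.config.getoption("--integration")
    missing = missing_integrations(required, enabled)
    if missing:
        pytest.skip("integration(s) not enabled: " + ", ".join(missing))
```

With something like this in place, a test decorated with
@pytest.mark.integration("cassandra") would be skipped unless pytest is
run with --integration cassandra. The standard -m filter can then split
the CI jobs: -m "not integration" for the job without integrations, and
-m integration for the jobs that start them.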

*PRs - kind request to other committers*

I have a series of PRs that already implement almost all of it (I needed
that in order to implement Production Image support). They depend on each
other - I added unit test support for Bash scripts plus several
improvements and simplifications:

   - [AIRFLOW-6489] Add BATS support for Bash unit testing
   <https://github.com/apache/airflow/pull/7081> [ready for review] -
   needed to get more control over the other changes.
   - [AIRFLOW-6475] Remove duplication of volume mount specs in Breeze
   <https://github.com/apache/airflow/pull/7065> [ready for review] -
   improves the consistency of how we run Breeze/CI.
   - [AIRFLOW-6491] improve parameter handling in breeze
   <https://github.com/apache/airflow/pull/7084> [ready for review] -
   tests and improves the way we handle --options in Breeze (needed for
   the Kubernetes improvements).
   - [AIRFLOW-5704] Improve Kind Kubernetes scripts for local testing
   <https://github.com/apache/airflow/pull/6516> [testing] - improves
   handling of Kubernetes Kind testing (fixes issues with loading images,
   upgrades kind to the latest version).
   - [AIRFLOW-6489] Separate integrations
   <https://github.com/apache/airflow/pull/7091> [WIP] - this is the PR
   introducing the different integrations. It already works for Breeze and
   has support for deciding which integrations should be started - I just
   need to separate out the "Integration tests" into separate jobs.

I have a kind request to other committers - can you please take a look at
those and help to merge them quickly?

I also have follow-up PRs for the production image, but the above PRs
should help us to solve the CI crisis in a good way and let me continue
with the prod image ones.

J.


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>