Re: [PROPOSAL] (likely) significantly shorten CI test time: splitting to db/non-db tests

Vincent Beck Tue, 24 Oct 2023 10:31:21 -0700

+1 I like this one. I think it is totally worth it adding this decorator, 
mostly because I think the effort is not huge. I also think, as a contributor, 
it is a good exercise to, when writing tests, figure out whether the tests I am 
writing is using the DB.


On 2023/10/24 16:31:57 Jarek Potiuk wrote:
> Hello everyone,
> 
> *TL;DR; *I have a proposal on how we can probably significantly decrease
> time needed to run our CI tests. All that at the expense of  small - I
> think-  effort by the need to mark tests as "db-tests" by contributors
> enforced by our CI (in most cases that won't even be needed)..
> 
> *A bit more context:*
> 
> I have started to focus on AIP-44 work (internal API) and part of this work
> is to make it possible to run subset of our tests (related to
> untrusted components and providers) in a db-isolation mode to see if we
> have not forgotten anything (and to keep db isolation mode in the future).
> Part of this task is to see how many tests we have that actually use the
> DB. I had some interesting findings and I thought I would explore an idea
> to split our tests to DB and NON-DB tests.
> 
> After some initial exploration I believe we can get at least 30-50%
> reduction in our CI test execution time (and build test time) at the
> expense of the necessity of marking tests as "db_tests" when they need
> the DB.
> 
> This topic has been brought up occasionally in the past - why do we run our
> tests on multiple versions of multiple databases. And the problem is that
> many of our tests are using the DB and we need to test them on all of them.
> But also - turns out we have quite a lot of tests that do not use the DB.
> This - for example - does not allow us to use "pytest-xdist" to run the
> test in parallel freely, we have to use the "custom" docker based
> parallelism in CI.
> 
> So the idea is that If we could separate DB from NON-DB tests we could only
> run the DB tests on multiple databases, but then all the "non-db tests"
> could only be run once. It would only make sense though if we have a
> significant number of non-DB tests, if we could clearly separate them and
> if it was easy in the future to keep them separated (so only mark tests as
> db_tests when they actually need DB).
> 
> *Findings:*
> 
> I tried it out and I found out that:
> 
> a) Turns out that we have quite a significant number of tests that are
> non-db tests. Likely between 30% to 50% by my estimate - both in core and
> providers and they are scattered among our code - mostly whole "modules"
> are either db or non-db tests.
> 
> b) It is rather easy to separate db from non-db tests without any
> structural change to our test structure (just by using a
> custom @pytest.mark.db_test - on test, class, or even module level). So we
> can easily handle cases where the whole module is "mostly non-db" and only
> few test cases are "db" - there are some cases like that, but the custom
> markers are rather nice to help with that.
> 
> c) It should be easy to keep them separated in the future with only very
> little overhead. Basically when your tests will start failing in CI and
> tell you "mark your test with @pytest.mark.db_test" you will have to do it
> - but in most cases you won't even have to do it because your test will be
> added to the module that is already marked as db_test module
> 
> d) We can use pytest-xdist for parallel execution of non-db tests which
> will make the test suites of "non-db" tests much more straightforward -
> basically `pytest tests --skip-db-tests -n auto` should do the trick). And
> you could do the same locally as well.
> 
> *Impact on contributors: *
> 
> There is absolutely no impact on running the tests locally. Non-db tests
> will run when DB is there, no problem with that. I think in a vast number
> of cases it will be mostly no change for regular contributors. Only
> occasionally (when new test module is added for example) their tests will
> fail in CI with very clear instructions from the CI to mark their tests as
> "@pytest.mark.db_test" or marking the whole module as "db_test module" by
> `pytestmark = pytest.mark.db_test`. This will happen when their tests will
> be treated as "non-db" tests and attempted to run in CI.
> 
> *Summarising:*
> 
> I think we can achieve some 30%-50% reduction in time/CPU needed if we
> implement it. Of course it means again complicating our CI builds quite a
> bit. But with those gains I think it's totally worth it.
> 
> I have a draft (still needs quite some work) -  PR showing how it could
> look like  https://github.com/apache/airflow/pull/35160. I think in 1-2
> days I should have a ready-to-merge change with all the tests passing and
> with 30%-50% reduction of the resources needed and time.
> 
> Would love your comments - is it worth it? Should I proceed with it? Any
> other comments?
> 
> J.
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org

Re: [PROPOSAL] (likely) significantly shorten CI test time: splitting to db/non-db tests

Reply via email to