+1 I like this one. I think it is totally worth it adding this decorator, mostly because I think the effort is not huge. I also think, as a contributor, it is a good exercise to, when writing tests, figure out whether the tests I am writing is using the DB.
On 2023/10/24 16:31:57 Jarek Potiuk wrote: > Hello everyone, > > *TL;DR; *I have a proposal on how we can probably significantly decrease > time needed to run our CI tests. All that at the expense of small - I > think- effort by the need to mark tests as "db-tests" by contributors > enforced by our CI (in most cases that won't even be needed).. > > *A bit more context:* > > I have started to focus on AIP-44 work (internal API) and part of this work > is to make it possible to run subset of our tests (related to > untrusted components and providers) in a db-isolation mode to see if we > have not forgotten anything (and to keep db isolation mode in the future). > Part of this task is to see how many tests we have that actually use the > DB. I had some interesting findings and I thought I would explore an idea > to split our tests to DB and NON-DB tests. > > After some initial exploration I believe we can get at least 30-50% > reduction in our CI test execution time (and build test time) at the > expense of the necessity of marking tests as "db_tests" when they need > the DB. > > This topic has been brought up occasionally in the past - why do we run our > tests on multiple versions of multiple databases. And the problem is that > many of our tests are using the DB and we need to test them on all of them. > But also - turns out we have quite a lot of tests that do not use the DB. > This - for example - does not allow us to use "pytest-xdist" to run the > test in parallel freely, we have to use the "custom" docker based > parallelism in CI. > > So the idea is that If we could separate DB from NON-DB tests we could only > run the DB tests on multiple databases, but then all the "non-db tests" > could only be run once. It would only make sense though if we have a > significant number of non-DB tests, if we could clearly separate them and > if it was easy in the future to keep them separated (so only mark tests as > db_tests when they actually need DB). > > *Findings:* > > I tried it out and I found out that: > > a) Turns out that we have quite a significant number of tests that are > non-db tests. Likely between 30% to 50% by my estimate - both in core and > providers and they are scattered among our code - mostly whole "modules" > are either db or non-db tests. > > b) It is rather easy to separate db from non-db tests without any > structural change to our test structure (just by using a > custom @pytest.mark.db_test - on test, class, or even module level). So we > can easily handle cases where the whole module is "mostly non-db" and only > few test cases are "db" - there are some cases like that, but the custom > markers are rather nice to help with that. > > c) It should be easy to keep them separated in the future with only very > little overhead. Basically when your tests will start failing in CI and > tell you "mark your test with @pytest.mark.db_test" you will have to do it > - but in most cases you won't even have to do it because your test will be > added to the module that is already marked as db_test module > > d) We can use pytest-xdist for parallel execution of non-db tests which > will make the test suites of "non-db" tests much more straightforward - > basically `pytest tests --skip-db-tests -n auto` should do the trick). And > you could do the same locally as well. > > *Impact on contributors: * > > There is absolutely no impact on running the tests locally. Non-db tests > will run when DB is there, no problem with that. I think in a vast number > of cases it will be mostly no change for regular contributors. Only > occasionally (when new test module is added for example) their tests will > fail in CI with very clear instructions from the CI to mark their tests as > "@pytest.mark.db_test" or marking the whole module as "db_test module" by > `pytestmark = pytest.mark.db_test`. This will happen when their tests will > be treated as "non-db" tests and attempted to run in CI. > > *Summarising:* > > I think we can achieve some 30%-50% reduction in time/CPU needed if we > implement it. Of course it means again complicating our CI builds quite a > bit. But with those gains I think it's totally worth it. > > I have a draft (still needs quite some work) - PR showing how it could > look like https://github.com/apache/airflow/pull/35160. I think in 1-2 > days I should have a ready-to-merge change with all the tests passing and > with 30%-50% reduction of the resources needed and time. > > Would love your comments - is it worth it? Should I proceed with it? Any > other comments? > > J. > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org For additional commands, e-mail: dev-h...@airflow.apache.org