Re: [PROPOSAL] (likely) significantly shorten CI test time: splitting to db/non-db tests

Aritra Basu Tue, 24 Oct 2023 10:47:44 -0700

+1
Definitely a +1 from me, seems like a relatively small effort to get good
returns


--
Regards,
Aritra Basu

On Tue, Oct 24, 2023, 11:01 PM Vincent Beck <[email protected]> wrote:

> +1 I like this one. I think it is totally worth it adding this decorator,
> mostly because I think the effort is not huge. I also think, as a
> contributor, it is a good exercise to, when writing tests, figure out
> whether the tests I am writing is using the DB.
>
> On 2023/10/24 16:31:57 Jarek Potiuk wrote:
> > Hello everyone,
> >
> > *TL;DR; *I have a proposal on how we can probably significantly decrease
> > time needed to run our CI tests. All that at the expense of  small - I
> > think-  effort by the need to mark tests as "db-tests" by contributors
> > enforced by our CI (in most cases that won't even be needed)..
> >
> > *A bit more context:*
> >
> > I have started to focus on AIP-44 work (internal API) and part of this
> work
> > is to make it possible to run subset of our tests (related to
> > untrusted components and providers) in a db-isolation mode to see if we
> > have not forgotten anything (and to keep db isolation mode in the
> future).
> > Part of this task is to see how many tests we have that actually use the
> > DB. I had some interesting findings and I thought I would explore an idea
> > to split our tests to DB and NON-DB tests.
> >
> > After some initial exploration I believe we can get at least 30-50%
> > reduction in our CI test execution time (and build test time) at the
> > expense of the necessity of marking tests as "db_tests" when they need
> > the DB.
> >
> > This topic has been brought up occasionally in the past - why do we run
> our
> > tests on multiple versions of multiple databases. And the problem is that
> > many of our tests are using the DB and we need to test them on all of
> them.
> > But also - turns out we have quite a lot of tests that do not use the DB.
> > This - for example - does not allow us to use "pytest-xdist" to run the
> > test in parallel freely, we have to use the "custom" docker based
> > parallelism in CI.
> >
> > So the idea is that If we could separate DB from NON-DB tests we could
> only
> > run the DB tests on multiple databases, but then all the "non-db tests"
> > could only be run once. It would only make sense though if we have a
> > significant number of non-DB tests, if we could clearly separate them and
> > if it was easy in the future to keep them separated (so only mark tests
> as
> > db_tests when they actually need DB).
> >
> > *Findings:*
> >
> > I tried it out and I found out that:
> >
> > a) Turns out that we have quite a significant number of tests that are
> > non-db tests. Likely between 30% to 50% by my estimate - both in core and
> > providers and they are scattered among our code - mostly whole "modules"
> > are either db or non-db tests.
> >
> > b) It is rather easy to separate db from non-db tests without any
> > structural change to our test structure (just by using a
> > custom @pytest.mark.db_test - on test, class, or even module level). So
> we
> > can easily handle cases where the whole module is "mostly non-db" and
> only
> > few test cases are "db" - there are some cases like that, but the custom
> > markers are rather nice to help with that.
> >
> > c) It should be easy to keep them separated in the future with only very
> > little overhead. Basically when your tests will start failing in CI and
> > tell you "mark your test with @pytest.mark.db_test" you will have to do
> it
> > - but in most cases you won't even have to do it because your test will
> be
> > added to the module that is already marked as db_test module
> >
> > d) We can use pytest-xdist for parallel execution of non-db tests which
> > will make the test suites of "non-db" tests much more straightforward -
> > basically `pytest tests --skip-db-tests -n auto` should do the trick).
> And
> > you could do the same locally as well.
> >
> > *Impact on contributors: *
> >
> > There is absolutely no impact on running the tests locally. Non-db tests
> > will run when DB is there, no problem with that. I think in a vast number
> > of cases it will be mostly no change for regular contributors. Only
> > occasionally (when new test module is added for example) their tests will
> > fail in CI with very clear instructions from the CI to mark their tests
> as
> > "@pytest.mark.db_test" or marking the whole module as "db_test module" by
> > `pytestmark = pytest.mark.db_test`. This will happen when their tests
> will
> > be treated as "non-db" tests and attempted to run in CI.
> >
> > *Summarising:*
> >
> > I think we can achieve some 30%-50% reduction in time/CPU needed if we
> > implement it. Of course it means again complicating our CI builds quite a
> > bit. But with those gains I think it's totally worth it.
> >
> > I have a draft (still needs quite some work) -  PR showing how it could
> > look like  https://github.com/apache/airflow/pull/35160. I think in 1-2
> > days I should have a ready-to-merge change with all the tests passing and
> > with 30%-50% reduction of the resources needed and time.
> >
> > Would love your comments - is it worth it? Should I proceed with it? Any
> > other comments?
> >
> > J.
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [PROPOSAL] (likely) significantly shorten CI test time: splitting to db/non-db tests

Reply via email to