Great job Jarek! 🚀

T.
On Sun, Jan 26, 2020 at 6:09 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> While fixing the v1-10-test branch in preparation for 1.10.8 we've managed to fix a number of flaky tests, including kerberos-related flakiness, so I expect a lot more stability - we still had quite a few flaky tests last week, but I hope as of today it will be a LOT better.
>
> We have now got rid of the hardware entropy dependencies, fixed some "random" tests (literally, they predictably failed 1 in 10 runs because of randomness) and we have a robust mechanism to make sure that all the integrations are up and running before tests are started. This should really help with CI stability.
>
> Ah - and we've also sped up the CI tests as well. We split out the pylint tests, which were the longest-running part of the static tests, moved doc building to the "test" phase, and we now have more - but smaller - jobs. It seems that we also have more parallel workers available on the Apache side, so by utilising parallel runs we shaved at least 10 minutes of elapsed time off the average CI pipeline execution.
>
> More improvements will come after we move to GitHub Actions (which is next in line).
>
> I think this thread can be closed :).
>
> J.

On Wed, Jan 15, 2020 at 11:38 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Merged now - please rebase onto the latest master; this should work around the intermittent failures.
>
> J.

On Wed, Jan 15, 2020 at 11:15 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Hello everyone,
>
> I think that, thanks to the diagnostics we added, I have found the root cause.
>
> The most probable reason for the problem is that something changed on Travis CI regarding entropy sources. Our integrations (cassandra, mysql, kerberos) need enough entropy (a source of random data) on the servers to generate certificates/keys etc. For security reasons you usually need a reliable (hardware) source of random data for applications that use TLS/SSL and generate their own certificates. It seems that right now on Travis the source of entropy is shared between multiple running jobs (Docker containers), which slows down startup a lot (tens of seconds rather than hundreds of milliseconds). So if a lot of parallel jobs that consume entropy run on the same hardware, their startup can be very slow.
>
> I've opened an issue in the community section of Travis for that:
> https://travis-ci.community/t/not-enough-entropy-during-builds-likely/6878
>
> In the meantime we have a workaround (waiting until all integrations/DBs start before we run tests) that I will merge soon: https://github.com/apache/airflow/pull/7172 (waiting for the Travis build to complete).
>
> Later we can possibly speed it up by using a software source of entropy (we do not need hardware entropy for CI tests), but this might take a bit more time.
>
> J.

On Tue, Jan 14, 2020 at 5:24 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> We have other kerberos-related failures. I have temporarily disabled the "kerberos-specific" build until I add some more diagnostics and test it.
>
> Please rebase onto the latest master.
>
> J.

On Tue, Jan 14, 2020 at 7:24 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> It seems the tests are stable - but the kerberos problem is happening often enough to take a look. I will see what I can do to make it stable - it seems that might be a race between kerberos initialising and the tests starting to run.
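[Editor's sketch] The workaround in PR #7172 - waiting until every integration and database is actually up before the test suite starts - can be sketched roughly as below. This is only an illustration: the service names, ports and timeout are made-up values, not the ones the Airflow CI scripts actually use.

    import socket
    import time

    # Hypothetical service endpoints; the real CI uses its own service names and ports.
    INTEGRATIONS = {
        "mysql": ("mysql", 3306),
        "cassandra": ("cassandra", 9042),
        "kerberos": ("kerberos", 88),
    }


    def wait_for_port(host: str, port: int, timeout: float = 120.0) -> None:
        """Block until a TCP connection to host:port succeeds or the timeout expires."""
        deadline = time.monotonic() + timeout
        while True:
            try:
                with socket.create_connection((host, port), timeout=2.0):
                    return  # the service accepted a connection, assume it is up
            except OSError:
                if time.monotonic() > deadline:
                    raise TimeoutError(f"{host}:{port} did not become available in {timeout}s")
                time.sleep(2.0)


    def wait_for_integrations() -> None:
        for name, (host, port) in INTEGRATIONS.items():
            print(f"Waiting for {name} at {host}:{port} ...")
            wait_for_port(host, port)
            print(f"{name} is up")


    if __name__ == "__main__":
        wait_for_integrations()

Note that an open TCP port is a weaker signal than "fully initialised" (a service can accept connections while it is still generating its certificates), which is why the timeout has to be generous enough to absorb the slow-entropy startups described above.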
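[Editor's sketch] The "random" tests mentioned in the Jan 26 mail are a classic flakiness source: a test draws from an unseeded PRNG and therefore fails on a predictable fraction of runs. A generic sketch of the usual remedy - not the actual Airflow change - is to pin the seed, for example in an autouse pytest fixture:

    import random

    import pytest


    @pytest.fixture(autouse=True)
    def _pinned_random_seed():
        """Make every test see the same pseudo-random sequence, then restore the state."""
        state = random.getstate()
        random.seed(42)  # arbitrary fixed seed; the value itself does not matter
        yield
        random.setstate(state)


    def test_sampling_is_reproducible():
        # Because the seed is pinned, both draws below are identical, so an assertion
        # about the sampled values can never fail "1 run in 10" the way an unseeded
        # draw can.
        first = random.sample(range(100), 5)
        random.seed(42)
        second = random.sample(range(100), 5)
        assert first == second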
On Mon, Jan 13, 2020 at 8:58 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Just merged the change with the integration separation / slimming down of the tests on CI: https://github.com/apache/airflow/pull/7091
>
> It looks far more stable; I just had one failure with kerberos not starting (which also happened sometimes with the old tests). In the future we will look at some of the "xfailed/xpassed" tests - those that we know are problematic. We have 8 of them now.
>
> Also, Breeze is now much more enjoyable to use. Please take a look at the docs.
>
> J.

On Wed, Jan 8, 2020 at 2:23 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

>> I like what you've done with the separate integrations, and that coupled with pytest markers and better "import error" handling in the tests would make it easier to run a subset of the tests without having to install everything (for instance not having to install the mysql client libs). Admittedly less of a worry with breeze/docker, but it would still be nice to skip/deselect tests when deps aren't there.
>
> Cool. That's exactly what I am working on in https://github.com/apache/airflow/pull/7091 - I want to run all the tests in integration-less CI, select all those that failed and treat them appropriately.
>
> Yeah. For me it's the same. We recently had a few discussions with first-time users who have difficulty contributing because they do not know how to reliably reproduce failing CI locally. I think the resource requirements of the Breeze environment for simple tests were a big blocker/difficulty for some users, so slimming it down and making it integration-less by default will be really helpful. I will also make it the "default" way of reproducing tests - I will remove the separate bash scripts, which were an intermediate step. This is part of the same work, especially since I use the same mechanism, and ... well, it will be far easier for me to get integration-specific cases working in CI if I also have Breeze to support it (eating my own dog food).
>
>> Most of these PRs are merged now, I've glanced over #7091 and like the look of it, good work! You'll let us know when we should take a deeper look?
>
> Yep, I will. I hope today/tomorrow - most of it is ready. I also managed to VASTLY simplify running Kubernetes kind (one less Docker image; everything runs in the same Docker engine as airflow-testing itself) in https://github.com/apache/airflow/pull/6516, which is a prerequisite for #7091 - so both will need to be reviewed.
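[Editor's sketch] The "pytest markers and better import-error handling" idea discussed above could look roughly like this. The `integration` marker name and the `ENABLED_INTEGRATIONS` environment variable are invented for illustration; they are not Airflow's actual conventions.

    import os

    import pytest

    # Fail-soft import: if the optional client library is not installed, the tests in
    # this module are skipped instead of erroring out at collection time.
    cassandra = pytest.importorskip("cassandra")

    # Hypothetical convention: an environment variable lists which backing services
    # the current environment actually provides.
    ENABLED_INTEGRATIONS = set(os.environ.get("ENABLED_INTEGRATIONS", "").split(","))


    @pytest.mark.integration  # made-up marker name; it would be registered in pytest's config
    @pytest.mark.skipif(
        "cassandra" not in ENABLED_INTEGRATIONS,
        reason="Cassandra integration is not running in this environment",
    )
    def test_record_exists_against_real_cassandra():
        ...  # talks to the real server only when the integration is enabled

With a setup like this, `pytest -m "not integration"` runs the integration-less subset and `pytest -m integration` runs only the tests that need real services.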
>> For cassandra tests specifically, I'm not sure there is a huge amount of value in actually running the tests against cassandra -- we are using the official python module for it, and the test is basically running these queries - DROP TABLE IF EXISTS, CREATE TABLE, INSERT INTO TABLE - and then running hook.record_exists -- that seems like it's testing cassandra itself, when I think all we should do is test that hook.record_exists calls the execute method on the connection with the right string. I'll knock up a PR for this.
>>
>> Do we think it's worth keeping the non-mocked/integration tests too?
>
> I would not remove them just yet. Let's see how it works when I separate it out. I have a feeling that we have a very small number of those integration tests overall, so maybe they will be stable and fast enough when we run them only in a separate job. I think it's good to have different levels of tests (unit/integration/system), as they find different types of problems. As long as we can keep integration/system tests clearly separated, stable and easy to disable/enable, I am all for having different types of tests. There is the old and well-established concept of the Test Pyramid, https://martinfowler.com/bliki/TestPyramid.html, which applies very accurately to our case. By adding markers, categorising the tests and seeing how many of those tests we have, how stable they are, how long they take and (eventually) how much they cost us, we can make better decisions.
>
> J.

--
Tomasz Urbaszek
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 505 628 493
E: tomasz.urbas...@polidea.com
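[Editor's sketch] The mocked variant of the Cassandra test suggested above - asserting that record_exists sends the expected CQL to the driver instead of exercising a real server - has roughly the following shape. To avoid guessing the real hook's internals, the example uses a minimal stand-in class; an actual PR would patch the connection of the real CassandraHook instead.

    from unittest import mock


    class FakeCassandraHook:
        """Minimal stand-in for the real hook, just to illustrate the shape of the test."""

        def __init__(self, session):
            self.session = session

        def record_exists(self, table, keys):
            where = " AND ".join(f"{k}=%({k})s" for k in keys)
            cql = f"SELECT * FROM {table} WHERE {where}"
            result = self.session.execute(cql, keys)
            return result.one() is not None


    def test_record_exists_sends_expected_cql():
        session = mock.MagicMock()
        # Pretend the query matched one row.
        session.execute.return_value.one.return_value = {"pk": "foo"}

        hook = FakeCassandraHook(session)
        assert hook.record_exists("mykeyspace.mytable", {"pk": "foo"})

        # The point of the mocked test: the right CQL string reached the driver,
        # no running Cassandra required.
        cql_sent = session.execute.call_args[0][0]
        assert cql_sent == "SELECT * FROM mykeyspace.mytable WHERE pk=%(pk)s"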