Merged now - please rebase to the latest master; this should work around the intermittent failures.
J.

On Wed, Jan 15, 2020 at 11:15 PM Jarek Potiuk wrote:

> Hello everyone,
>
> I think I found the root cause, thanks to the diagnostics we added.
>
> The most probable reason for the problem is that something changed on
> Travis CI regarding entropy sources. Our integrations (cassandra, mysql,
> kerberos) need enough entropy (a source of random data) on the servers to
> generate certificates, keys, etc. For security reasons, applications that
> use TLS/SSL and generate their own certificates usually need a reliable
> (hardware) source of random data. It seems that right now on Travis the
> source of entropy is shared between multiple running jobs (docker
> containers), and this slows down startup time a lot (tens of seconds
> rather than hundreds of milliseconds). So if a lot of parallel jobs using
> entropy run on the same hardware, startup time for those might be very
> slow.
>
> I've opened an issue in the community section of Travis for that:
> https://travis-ci.community/t/not-enough-entropy-during-builds-likely/6878
>
> In the meantime we have a workaround (waiting until all integrations/DBs
> start before we run tests) that I will merge soon:
> https://github.com/apache/airflow/pull/7172 (waiting for the Travis build
> to complete).
>
> Possibly we will speed it up later by using a software source of entropy
> (we do not need hardware entropy for CI tests), but this might take a bit
> more time.
>
> J.
>
> On Tue, Jan 14, 2020 at 5:24 PM Jarek Potiuk wrote:
>
>> We have other kerberos-related failures. I temporarily disabled the
>> "kerberos-specific" build until I add some more diagnostics and test it.
>>
>> Please rebase to latest master.
>>
>> J.
>>
>> On Tue, Jan 14, 2020 at 7:24 AM Jarek Potiuk wrote:
>>
>>> The tests seem stable, but the kerberos problem is happening often
>>> enough to take a look. I will see what I can do to make it stable. It
>>> seems that there might be a race between kerberos initialising and the
>>> tests starting to run.
>>>
>>> On Mon, Jan 13, 2020 at 8:58 PM Jarek Potiuk wrote:
>>>
>>>> Just merged the change with integration separation/slimming down the
>>>> tests on CI: https://github.com/apache/airflow/pull/7091
>>>>
>>>> It looks far more stable. I had just one failure with kerberos not
>>>> starting (which also happened sometimes with the old tests). In the
>>>> future we will look at some of the "xfailed/xpassed" tests - those
>>>> that we know are problematic. We have 8 of them now.
>>>>
>>>> Also, Breeze is now much more enjoyable to use. Please take a look at
>>>> the docs.
>>>>
>>>> J.
>>>>
>>>> On Wed, Jan 8, 2020 at 2:23 PM Jarek Potiuk wrote:
>>>>
>>>>>> I like what you've done with the separate integrations, and that
>>>>>> coupled with pytest markers and better "import error" handling in
>>>>>> the tests would make it easier to run a sub-set of the tests
>>>>>> without having to install everything (for instance not having to
>>>>>> install mysql client libs.
>>>>>
>>>>> Cool. That's exactly what I am working on in
>>>>> https://github.com/apache/airflow/pull/7091 -> I want to get all the
>>>>> tests run in integration-less CI, select all those that failed and
>>>>> treat them appropriately.
>>>>>
>>>>>> Admittedly less of a worry with breeze/docker, but still would be
>>>>>> nice to skip/deselect tests when deps aren't there)
>>>>>
>>>>> Yeah. For me it's the same. I think we recently had a few
>>>>> discussions with first-time users who have difficulty contributing
>>>>> because they do not know how to reliably reproduce failing CI
>>>>> locally. I think the resource usage of the Breeze environment for
>>>>> simple tests was a big blocker/difficulty for some users, so slimming
>>>>> it down and making it integration-less by default will be really
>>>>> helpful. I will also make it the "default" way of reproducing tests -
>>>>> I will remove the separate bash scripts which were an intermediate
>>>>> step. This is the same work, especially since I use the same
>>>>> mechanism and ... well - it will be far easier for me to have
>>>>> integration-specific cases working in CI if I also have Breeze to
>>>>> support it (eating my own dog food).
>>>>>
>>>>>> Most of these PRs are merged now, I've glanced over #7091 and like
>>>>>> the look of it, good work! You'll let us know when we should take a
>>>>>> deeper look?
>>>>>
>>>>> Yep I will. I hope today/tomorrow - most of it is ready. I also
>>>>> managed to VASTLY simplify running kubernetes kind (one less docker
>>>>> image, everything runs in the same docker engine as the
>>>>> airflow-testing itself) in
>>>>> https://github.com/apache/airflow/pull/6516 which is a prerequisite
>>>>> for #7091 - so both will need to be reviewed. I marke
>>>>>
>>>>>> For cassandra tests specifically I'm not sure there is a huge
>>>>>> amount of value in actually running the tests against cassandra --
>>>>>> we are using the official python module for it, and the test is
>>>>>> basically running these queries - DROP TABLE IF EXISTS, CREATE
>>>>>> TABLE, INSERT INTO TABLE, and then running hook.record_exists --
>>>>>> that seems like it's testing cassandra itself, when I think all we
>>>>>> should do is test that hook.record_exists calls the execute method
>>>>>> on the connection with the right string. I'll knock up a PR for
>>>>>> this.
>>>>>> Do we think it's worth keeping the non-mocked/integration tests
>>>>>> too?
>>>>>
>>>>> I would not remove them just yet. Let's see how it works when I
>>>>> separate it out. I have a feeling that we have very few of those
>>>>> integration tests overall, so maybe it will be stable and fast
>>>>> enough when we only run those in a separate job. I think it's good
>>>>> to have different levels of tests (unit/integration/system) as they
>>>>> find different types of problems. As long as we can have
>>>>> integration/system tests clearly separated, stable and easy to
>>>>> disable/enable - I am all for having different types of tests. There
>>>>> is this old and well-established concept of the Test Pyramid
>>>>> https://martinfowler.com/bliki/TestPyramid.html which applies very
>>>>> accurately to our case. By adding markers/categorising the tests and
>>>>> seeing how many of those tests we have, how stable they are, how
>>>>> long they take and (eventually) how much they cost us - we can make
>>>>> better decisions.
>>>>>
>>>>> J.
>>>>
>>>> --
>>>> Jarek Potiuk
>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>> M: +48 660 796 129

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129
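
For illustration only, here is a minimal sketch of the kind of workaround mentioned in the thread (waiting until the integrations accept connections before the tests start). It is not the code from PR #7172. The function names, host names and ports are assumptions made up for this example, and an open TCP port only shows that the service is listening, not that it has finished initialising.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while the service is not listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


def wait_for_all(services: dict, timeout: float = 60.0) -> dict:
    """Check each (host, port) pair in turn and report which services came up."""
    return {
        name: wait_for_port(host, port, timeout)
        for name, (host, port) in services.items()
    }


# Hypothetical usage: host names and ports below are assumptions, not the
# actual values used by the Airflow CI scripts.
# status = wait_for_all({
#     "mysql": ("mysql", 3306),
#     "cassandra": ("cassandra", 9042),
#     "kerberos": ("kerberos", 88),
# })
```

A generous timeout matters here because, per the thread, startup can take tens of seconds when entropy is scarce.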

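
As a rough sketch of the mocked approach suggested in the quoted message (checking that record_exists calls execute on the connection with the right string, without a running cassandra), something like the following could work. FakeCassandraHook, its constructor and the query format are hypothetical stand-ins written for this example; they are not Airflow's actual CassandraHook.

```python
from unittest import mock


class FakeCassandraHook:
    """Hypothetical stand-in for a hook; NOT Airflow's actual CassandraHook."""

    def __init__(self, session):
        self.session = session

    def record_exists(self, table: str, keys: dict) -> bool:
        # Build a parameterised query from the key names (assumed format).
        conditions = " AND ".join(f"{k}=%({k})s" for k in keys)
        cql = f"SELECT * FROM {table} WHERE {conditions}"
        result = self.session.execute(cql, keys)
        return result.one() is not None


def test_record_exists_builds_expected_query():
    # The session is a mock, so no cassandra server is needed.
    session = mock.Mock()
    session.execute.return_value.one.return_value = {"pk1": "foo"}

    hook = FakeCassandraHook(session)

    assert hook.record_exists("ks.t", {"pk1": "foo", "pk2": "bar"}) is True
    # Only the hook's own logic is verified: the exact string and parameters
    # passed to execute().
    session.execute.assert_called_once_with(
        "SELECT * FROM ks.t WHERE pk1=%(pk1)s AND pk2=%(pk2)s",
        {"pk1": "foo", "pk2": "bar"},
    )
```

A test like this covers the hook's own logic, while the integration tests discussed above still cover the interaction with a real server.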