Great job Jarek! 🚀

T.
On Sun, Jan 26, 2020 at 6:09 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> While fixing the v1-10-test branch in preparation for 1.10.8 we've managed to fix a number of flaky tests, including kerberos-related flakiness, so I expect a lot more stability - we still had quite a few flaky tests last week, but I hope as of today it will be a LOT better.
>
> We have now got rid of the hardware entropy dependencies, fixed some "random" tests (literally, they predictably failed 1 in 10 runs because of randomness) and we have a robust mechanism to make sure that all the integrations are up and running before tests are started. This should really help with CI stability.
>
> Ah - and we've also sped up the CI tests as well. We split out the pylint tests, which were the longest-running part of the static tests, moved doc building to the "test" phase, and we now have more - but smaller - jobs. It seems that we also have more parallel workers available on the Apache side, so by utilising parallel runs we shaved at least 10 minutes of elapsed time off the average CI pipeline execution.
>
> More improvements will come after we move to GitHub Actions (which is next in line).
>
> I think this thread can be closed :).
>
> J.

On Wed, Jan 15, 2020 at 11:38 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Merged now - please rebase onto the latest master; this should work around the intermittent failures.
>
> J.

On Wed, Jan 15, 2020 at 11:15 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Hello everyone,
>
> I think that, thanks to the diagnostics we added, I have found the root cause.
>
> The most probable reason for the problem is that something changed on Travis CI regarding entropy sources. Our integrations (cassandra, mysql, kerberos) need enough entropy (a source of random data) on the servers to generate certificates/keys etc. For security reasons you usually need a reliable (hardware) source of random data for applications that use TLS/SSL and generate their own certificates. It seems that right now on Travis the source of entropy is shared between multiple running jobs (Docker containers), which slows down startup a lot (tens of seconds rather than hundreds of milliseconds). So if a lot of parallel jobs that consume entropy run on the same hardware, their startup can be very slow.
>
> I've opened an issue in the community section of Travis for that:
> https://travis-ci.community/t/not-enough-entropy-during-builds-likely/6878
>
> In the meantime we have a workaround (waiting until all integrations/DBs start before we run tests) that I will merge soon: https://github.com/apache/airflow/pull/7172 (waiting for the Travis build to complete).
>
> Later we can possibly speed it up by using a software source of entropy (we do not need hardware entropy for CI tests), but this might take a bit more time.
>
> J.

On Tue, Jan 14, 2020 at 5:24 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> We have other kerberos-related failures. I have temporarily disabled the "kerberos-specific" build until I add some more diagnostics and test it.
>
> Please rebase onto the latest master.
>
> J.

On Tue, Jan 14, 2020 at 7:24 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> It seems the tests are stable - but the kerberos problem is happening often enough to take a look. I will see what I can do to make it stable - it seems that might be a race between kerberos initialising and the tests starting to run.
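[Editor's sketch] The workaround in PR #7172 - waiting until every integration and database is actually up before the test suite starts - can be sketched roughly as below. This is only an illustration: the service names, ports and timeout are made-up values, not the ones the Airflow CI scripts actually use.

    import socket
    import time

    # Hypothetical service endpoints; the real CI uses its own service names and ports.
    INTEGRATIONS = {
        "mysql": ("mysql", 3306),
        "cassandra": ("cassandra", 9042),
        "kerberos": ("kerberos", 88),
    }


    def wait_for_port(host: str, port: int, timeout: float = 120.0) -> None:
        """Block until a TCP connection to host:port succeeds or the timeout expires."""
        deadline = time.monotonic() + timeout
        while True:
            try:
                with socket.create_connection((host, port), timeout=2.0):
                    return  # the service accepted a connection, assume it is up
            except OSError:
                if time.monotonic() > deadline:
                    raise TimeoutError(f"{host}:{port} did not become available in {timeout}s")
                time.sleep(2.0)


    def wait_for_integrations() -> None:
        for name, (host, port) in INTEGRATIONS.items():
            print(f"Waiting for {name} at {host}:{port} ...")
            wait_for_port(host, port)
            print(f"{name} is up")


    if __name__ == "__main__":
        wait_for_integrations()

Note that an open TCP port is a weaker signal than "fully initialised" (a service can accept connections while it is still generating its certificates), which is why the timeout has to be generous enough to absorb the slow-entropy startups described above.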
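[Editor's sketch] The "random" tests mentioned in the Jan 26 mail are a classic flakiness source: a test draws from an unseeded PRNG and therefore fails on a predictable fraction of runs. A generic sketch of the usual remedy - not the actual Airflow change - is to pin the seed, for example in an autouse pytest fixture:

    import random

    import pytest


    @pytest.fixture(autouse=True)
    def _pinned_random_seed():
        """Make every test see the same pseudo-random sequence, then restore the state."""
        state = random.getstate()
        random.seed(42)  # arbitrary fixed seed; the value itself does not matter
        yield
        random.setstate(state)


    def test_sampling_is_reproducible():
        # Because the seed is pinned, both draws below are identical, so an assertion
        # about the sampled values can never fail "1 run in 10" the way an unseeded
        # draw can.
        first = random.sample(range(100), 5)
        random.seed(42)
        second = random.sample(range(100), 5)
        assert first == second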
On Mon, Jan 13, 2020 at 8:58 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Just merged the change with the integration separation / slimming down of the tests on CI: https://github.com/apache/airflow/pull/7091
>
> It looks far more stable; I just had one failure with kerberos not starting (which also happened sometimes with the old tests). In the future we will look at some of the "xfailed/xpassed" tests - those that we know are problematic. We have 8 of them now.
>
> Also, Breeze is now much more enjoyable to use. Please take a look at the docs.
>
> J.

On Wed, Jan 8, 2020 at 2:23 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

>> I like what you've done with the separate integrations, and that coupled with pytest markers and better "import error" handling in the tests would make it easier to run a subset of the tests without having to install everything (for instance not having to install the mysql client libs). Admittedly less of a worry with breeze/docker, but it would still be nice to skip/deselect tests when deps aren't there.
>
> Cool. That's exactly what I am working on in https://github.com/apache/airflow/pull/7091 - I want to run all the tests in integration-less CI, select all those that failed and treat them appropriately.
>
> Yeah. For me it's the same. We recently had a few discussions with first-time users who have difficulty contributing because they do not know how to reliably reproduce failing CI locally. I think the resource requirements of the Breeze environment for simple tests were a big blocker/difficulty for some users, so slimming it down and making it integration-less by default will be really helpful. I will also make it the "default" way of reproducing tests - I will remove the separate bash scripts, which were an intermediate step. This is part of the same work, especially since I use the same mechanism, and ... well, it will be far easier for me to get integration-specific cases working in CI if I also have Breeze to support it (eating my own dog food).
>
>> Most of these PRs are merged now, I've glanced over #7091 and like the look of it, good work! You'll let us know when we should take a deeper look?
>
> Yep, I will. I hope today/tomorrow - most of it is ready. I also managed to VASTLY simplify running Kubernetes kind (one less Docker image; everything runs in the same Docker engine as airflow-testing itself) in https://github.com/apache/airflow/pull/6516, which is a prerequisite for #7091 - so both will need to be reviewed.
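[Editor's sketch] The "pytest markers and better import-error handling" idea discussed above could look roughly like this. The `integration` marker name and the `ENABLED_INTEGRATIONS` environment variable are invented for illustration; they are not Airflow's actual conventions.

    import os

    import pytest

    # Fail-soft import: if the optional client library is not installed, the tests in
    # this module are skipped instead of erroring out at collection time.
    cassandra = pytest.importorskip("cassandra")

    # Hypothetical convention: an environment variable lists which backing services
    # the current environment actually provides.
    ENABLED_INTEGRATIONS = set(os.environ.get("ENABLED_INTEGRATIONS", "").split(","))


    @pytest.mark.integration  # made-up marker name; it would be registered in pytest's config
    @pytest.mark.skipif(
        "cassandra" not in ENABLED_INTEGRATIONS,
        reason="Cassandra integration is not running in this environment",
    )
    def test_record_exists_against_real_cassandra():
        ...  # talks to the real server only when the integration is enabled

With a setup like this, `pytest -m "not integration"` runs the integration-less subset and `pytest -m integration` runs only the tests that need real services.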
>> For cassandra tests specifically, I'm not sure there is a huge amount of value in actually running the tests against cassandra -- we are using the official python module for it, and the test is basically running these queries - DROP TABLE IF EXISTS, CREATE TABLE, INSERT INTO TABLE - and then running hook.record_exists -- that seems like it's testing cassandra itself, when I think all we should do is test that hook.record_exists calls the execute method on the connection with the right string. I'll knock up a PR for this.
>>
>> Do we think it's worth keeping the non-mocked/integration tests too?
>
> I would not remove them just yet. Let's see how it works when I separate it out. I have a feeling that we have a very small number of those integration tests overall, so maybe they will be stable and fast enough when we run them only in a separate job. I think it's good to have different levels of tests (unit/integration/system), as they find different types of problems. As long as we can keep integration/system tests clearly separated, stable and easy to disable/enable, I am all for having different types of tests. There is the old and well-established concept of the Test Pyramid, https://martinfowler.com/bliki/TestPyramid.html, which applies very accurately to our case. By adding markers, categorising the tests and seeing how many of those tests we have, how stable they are, how long they take and (eventually) how much they cost us, we can make better decisions.
>
> J.

--
Tomasz Urbaszek
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 505 628 493
E: tomasz.urbas...@polidea.com
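[Editor's sketch] The mocked variant of the Cassandra test suggested above - asserting that record_exists sends the expected CQL to the driver instead of exercising a real server - has roughly the following shape. To avoid guessing the real hook's internals, the example uses a minimal stand-in class; an actual PR would patch the connection of the real CassandraHook instead.

    from unittest import mock


    class FakeCassandraHook:
        """Minimal stand-in for the real hook, just to illustrate the shape of the test."""

        def __init__(self, session):
            self.session = session

        def record_exists(self, table, keys):
            where = " AND ".join(f"{k}=%({k})s" for k in keys)
            cql = f"SELECT * FROM {table} WHERE {where}"
            result = self.session.execute(cql, keys)
            return result.one() is not None


    def test_record_exists_sends_expected_cql():
        session = mock.MagicMock()
        # Pretend the query matched one row.
        session.execute.return_value.one.return_value = {"pk": "foo"}

        hook = FakeCassandraHook(session)
        assert hook.record_exists("mykeyspace.mytable", {"pk": "foo"})

        # The point of the mocked test: the right CQL string reached the driver,
        # no running Cassandra required.
        cql_sent = session.execute.call_args[0][0]
        assert cql_sent == "SELECT * FROM mykeyspace.mytable WHERE pk=%(pk)s"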