Re: Often failing tests in CI (and a way to fix them quickly and future-proof)

Jarek Potiuk Wed, 15 Jan 2020 14:39:21 -0800

Merged now. Please rebase onto the latest master; this should work around the
intermittent failures.


J.

On Wed, Jan 15, 2020 at 11:15 PM Jarek Potiuk <[email protected]>
wrote:

> Hello everyone,
>
> I think that, thanks to the diagnostics we added, I have found the root cause.
>
> The most probable reason for the problem is that something changed on
> Travis CI regarding entropy sources. Our integrations (cassandra, mysql,
> kerberos) need enough entropy (a source of random data) on the servers to
> generate certificates, keys, etc. For security reasons, applications that
> use TLS/SSL and generate their own certificates usually need a reliable
> (hardware) source of random data. It seems that on Travis the source of
> entropy is now shared between multiple running jobs (Docker containers),
> which slows down startup a lot (tens of seconds rather than hundreds of
> milliseconds). So if many parallel jobs that use entropy run on the same
> hardware, their startup might be very slow.
>
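
A minimal sketch of how the entropy level could be checked from inside a job, assuming a Linux host that exposes the kernel's estimate at /proc/sys/kernel/random/entropy_avail. The helper names and the threshold of 200 bits are illustrative assumptions, not part of the diagnostics mentioned above.

```python
from pathlib import Path

# Standard Linux procfs file holding the kernel's entropy estimate (in bits).
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def available_entropy(path=ENTROPY_AVAIL):
    """Return the entropy estimate in bits, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux host) or unexpected content.
        return None


def entropy_is_low(bits, threshold=200):
    """Report whether a reading is below the (assumed) threshold."""
    # An unreadable value (None) is treated as "unknown", not as "low".
    return bits is not None and bits < threshold
```

Logging `available_entropy()` just before the integrations start would show whether slow startups coincide with low readings.
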
> I've opened an issue in the community section of Travis for that:
> https://travis-ci.community/t/not-enough-entropy-during-builds-likely/6878
>
> In the meantime we have a workaround (waiting until all integrations/DBs
> have started before we run the tests) that I will merge soon:
> https://github.com/apache/airflow/pull/7172 (waiting for the Travis build to
> complete).
>
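
The idea of waiting for the integrations before running tests can be sketched as a simple TCP readiness poll. This is only an illustration under stated assumptions, not the code from the pull request: the hostnames in SERVICES are hypothetical, the ports are the usual defaults for each service, and an open port does not guarantee that the service is fully initialised.

```python
import socket
import time

# Hypothetical service map: hostnames are assumptions, ports are common defaults.
SERVICES = {
    "cassandra": ("cassandra", 9042),
    "mysql": ("mysql", 3306),
    "kerberos": ("kerberos", 88),
}


def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, timed out, or name not resolvable yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def wait_for_all(services, timeout=120.0, interval=1.0):
    """Check services one after another; return the names that never came up."""
    return [
        name
        for name, (host, port) in services.items()
        if not wait_for_port(host, port, timeout, interval)
    ]
```

Calling `wait_for_all(SERVICES)` before the test run and aborting when the returned list is non-empty turns a slow startup into a clear error instead of a flaky test.
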
> Later we might speed it up by using a software source of entropy (we do
> not need hardware entropy for CI tests), but this might take a bit more time.
>
> J.
>
> On Tue, Jan 14, 2020 at 5:24 PM Jarek Potiuk <[email protected]>
> wrote:
>
>> We have other kerberos-related failures. I have temporarily disabled the
>> "kerberos-specific" build until I add some more diagnostics and test it.
>>
>> Please rebase to latest master.
>>
>> J.
>>
>> On Tue, Jan 14, 2020 at 7:24 AM Jarek Potiuk <[email protected]>
>> wrote:
>>
>>> The tests seem stable, but the kerberos problem happens often enough to
>>> warrant a look. I will see what I can do to make it stable. It seems there
>>> might be a race between kerberos initialising and the tests starting to
>>> run.
>>>
>>> On Mon, Jan 13, 2020 at 8:58 PM Jarek Potiuk <[email protected]>
>>> wrote:
>>>
>>>> Just merged the change with integration separation/slimming down the
>>>> tests on CI. https://github.com/apache/airflow/pull/7091
>>>>
>>>> It looks like it is far more stable; I just had one failure with
>>>> kerberos not starting (which also happened sometimes with the old tests).
>>>> In the future we will look at some of the "xfailed/xpassed" tests, the
>>>> ones we know are problematic. We have 8 of them now.
>>>>
>>>> Also, Breeze is now much more enjoyable to use. Please take a look at the
>>>> docs.
>>>>
>>>> J.
>>>>
>>>> On Wed, Jan 8, 2020 at 2:23 PM Jarek Potiuk <[email protected]>
>>>> wrote:
>>>>
>>>>>> I like what you've done with the separate integrations, and that
>>>>>> coupled with pytest markers and better "import error" handling in the
>>>>>> tests would make it easier to run a sub-set of the tests without having
>>>>>> to install everything (for instance not having to install mysql client
>>>>>> libs.
>>>>>
>>>>>
>>>>> Cool. That's exactly what I am working on in
>>>>> https://github.com/apache/airflow/pull/7091 -> I want to get all the
>>>>> tests run in integration-less CI, select all those that failed and treat
>>>>> them appropriately.
>>>>>
>>>>>
>>>>>> Admittedly less of a worry with breeze/docker, but still would be
>>>>>> nice to skip/deselect tests when deps aren't there)
>>>>>>
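
Skipping tests when a dependency is not installed can be sketched with the standard library alone. The `has_module` helper and the test class below are hypothetical; with pytest, `pytest.importorskip("cassandra")` serves a similar purpose.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# The whole class is reported as skipped when the driver is not installed.
@unittest.skipUnless(has_module("cassandra"), "cassandra driver is not installed")
class CassandraDriverTest(unittest.TestCase):
    def test_driver_importable(self):
        import cassandra  # imported lazily, only when the test actually runs

        self.assertIsNotNone(cassandra)
```

Running this with `python -m unittest` on a machine without the driver reports the test as skipped rather than as an import error.
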
>>>>>
>>>>> Yeah. For me it's the same. We recently had a few discussions with
>>>>> first-time users who had difficulty contributing because they did not
>>>>> know how to reliably reproduce failing CI locally. I think the resource
>>>>> usage of the Breeze environment for simple tests was a big
>>>>> blocker/difficulty for some users, so slimming it down and making it
>>>>> integration-less by default will be really helpful. I will also make it
>>>>> the "default" way of reproducing tests: I will remove the separate bash
>>>>> scripts, which were an intermediate step. It is all the same work,
>>>>> especially since I use the same mechanism and ... well, it will be far
>>>>> easier for me to get integration-specific cases working in CI if I also
>>>>> have Breeze to support it (eating my own dog food).
>>>>>
>>>>>
>>>>>> Most of these PRs are merged now, I've glanced over #7091 and like
>>>>>> the look of it, good work! You'll let us know when we should take a
>>>>>> deeper look?
>>>>>>
>>>>>
>>>>> Yep, I will. I hope today/tomorrow; most of it is ready. I also
>>>>> managed to VASTLY simplify running kubernetes kind (one less docker
>>>>> image, everything runs in the same docker engine as the airflow-testing
>>>>> itself) in https://github.com/apache/airflow/pull/6516 which is a
>>>>> prerequisite for #7091, so both will need to be reviewed. I marke
>>>>>
>>>>>
>>>>>> For cassandra tests specifically I'm not sure there is a huge amount
>>>>>> of value in actually running the tests against cassandra -- we are using
>>>>>> the official python module for it, and the test is basically running
>>>>>> these queries - DROP TABLE IF EXISTS, CREATE TABLE, INSERT INTO TABLE,
>>>>>> and then running hook.record_exists -- that seems like it's testing
>>>>>> cassandra itself, when I think all we should do is test that
>>>>>> hook.record_exists calls the execute method on the connection with the
>>>>>> right string. I'll knock up a PR for this.
>>>>>> Do we think it's worth keeping the non-mocked/integration tests too?
>>>>>>
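
The mocked variant described above could look roughly like this. `ToyHook` is a hypothetical stand-in written for this sketch, not Airflow's actual hook, and the query format is an assumption; the point is only that the test asserts on the string passed to `execute` instead of talking to a real cassandra.

```python
from unittest import mock


class ToyHook:
    """Hypothetical stand-in for a database hook (not Airflow's CassandraHook)."""

    def __init__(self, session):
        self.session = session

    def record_exists(self, table, keys):
        # Build a parameterised query from the key names (assumed format).
        conditions = " AND ".join(f"{k}=%({k})s" for k in keys)
        cql = f"SELECT * FROM {table} WHERE {conditions}"
        rows = self.session.execute(cql, keys)
        return rows.one() is not None


def test_record_exists_builds_expected_query():
    session = mock.Mock()
    # Pretend the database returned one matching row.
    session.execute.return_value.one.return_value = {"pk": "foo"}
    hook = ToyHook(session)

    assert hook.record_exists("ks.t", {"pk": "foo"}) is True
    session.execute.assert_called_once_with(
        "SELECT * FROM ks.t WHERE pk=%(pk)s", {"pk": "foo"}
    )
```

A test like this runs without any integration, while the non-mocked test still checks that the query is accepted by a real server.
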
>>>>>
>>>>> I would not remove them just yet. Let's see how it works when I
>>>>> separate it out. I have a feeling that we have very few of those
>>>>> integration tests overall, so maybe it will be stable and fast enough
>>>>> when we only run those in a separate job. I think it's good to have
>>>>> different levels of tests (unit/integration/system) as they find
>>>>> different types of problems. As long as we can have integration/system
>>>>> tests clearly separated, stable and easy to disable/enable, I am all for
>>>>> having different types of tests. There is the old and well-established
>>>>> concept of the Test Pyramid
>>>>> https://martinfowler.com/bliki/TestPyramid.html which applies very
>>>>> accurately to our case. By adding markers/categorising the tests and
>>>>> seeing how many of those tests we have, how stable they are, how long
>>>>> they take and (eventually) how much they cost us, we can make better
>>>>> decisions.
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Jarek Potiuk
>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>
>>>> M: +48 660 796 129 <+48660796129>
>>>> [image: Polidea] <https://www.polidea.com/>
>>>>
>>>>
>>>
>>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
