Hello everyone,

I think that, thanks to the diagnostics we added, I have found the root cause of this.

The most probable reason for the problem is that something changed
on Travis CI regarding entropy sources. Our integrations (cassandra, mysql,
kerberos) need enough entropy (a source of random data) on the servers to
generate certificates/keys etc. For security reasons, applications that use
TLS/SSL and generate their own certificates usually need a reliable
(hardware) source of random data. It seems that right now on Travis the
source of entropy is shared between multiple running jobs (docker
containers), and that slows down startup time a lot (tens of seconds rather
than hundreds of milliseconds). So if we have a lot of parallel jobs
consuming entropy on the same hardware, their startup time might be very slow.
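
If you want to check this on a Linux worker yourself, here is a rough
diagnostic sketch (just an illustration, not part of any fix):

    # Print the kernel's estimate of available entropy. Values well below
    # ~1000 bits usually mean entropy starvation, which makes blocking
    # key/certificate generation very slow.
    with open("/proc/sys/kernel/random/entropy_avail") as entropy_file:
        print("Available entropy (bits):", entropy_file.read().strip())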

I've opened an issue in the community section of Travis for that:
https://travis-ci.community/t/not-enough-entropy-during-builds-likely/6878

In the meantime we have a workaround (waiting until all integrations/DBs
start before we run the tests) that I will merge soon:
https://github.com/apache/airflow/pull/7172 (waiting for the Travis build to
complete).
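
The idea of the workaround is roughly the following (a simplified sketch
only - the actual PR does it in the CI scripts, and the hosts/ports here are
just examples):

    import socket
    import time

    def wait_for_service(host: str, port: int, timeout: float = 120.0) -> None:
        """Block until a TCP port accepts connections or the timeout expires."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with socket.create_connection((host, port), timeout=2):
                    return
            except OSError:
                time.sleep(1)
        raise TimeoutError(f"{host}:{port} did not become available in {timeout}s")

    # Wait for the integrations before the tests start.
    for service, port in [("cassandra", 9042), ("mysql", 3306), ("kerberos", 88)]:
        wait_for_service(service, port)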

Possibly later we can speed it up by using a software source of entropy (we
do not need hardware entropy for CI tests), but this might take a bit more time.
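
To illustrate the difference: /dev/urandom (what os.urandom uses) is the
non-blocking, software-seeded interface and is perfectly fine for CI test
certificates, while /dev/random may block when the kernel's entropy estimate
is low:

    import os

    # Non-blocking CSPRNG output - good enough for CI test certificates/keys.
    data = os.urandom(32)
    print("Got", len(data), "random bytes without blocking")
    # Reading from /dev/random instead can stall for a long time when the
    # shared hardware entropy pool is depleted - which is what we see on
    # Travis.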

J.

On Tue, Jan 14, 2020 at 5:24 PM Jarek Potiuk <jarek.pot...@polidea.com>
wrote:

> We have other kerberos-related failures. I temporarily disabled the
> "kerberos-specific" build until I add some more diagnostics and
> test it.
>
> Please rebase to latest master.
>
> J.
>
> On Tue, Jan 14, 2020 at 7:24 AM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
>> The tests seem stable, but the kerberos problem is happening often
>> enough to warrant a closer look. I will see what I can do to make it
>> stable. It seems it might be a race between kerberos initialising and the
>> tests starting to run.
>>
>> On Mon, Jan 13, 2020 at 8:58 PM Jarek Potiuk <jarek.pot...@polidea.com>
>> wrote:
>>
>>> Just merged the change with integration separation/slimming down the
>>> tests on CI. https://github.com/apache/airflow/pull/7091
>>>
>>> It looks like it is far more stable; I just had one failure with
>>> kerberos not starting (which also happened sometimes with the old tests).
>>> We will look in the future at some of the "xfailed/xpassed" tests - those
>>> that we know are problematic. We have 8 of them now.
>>>
>>> Also, Breeze is now much more enjoyable to use. Please take a look at
>>> the docs.
>>>
>>> J.
>>>
>>> On Wed, Jan 8, 2020 at 2:23 PM Jarek Potiuk <jarek.pot...@polidea.com>
>>> wrote:
>>>
>>>>> I like what you've done with the separate integrations, and that
>>>>> coupled with pytest markers and better "import error" handling in the
>>>>> tests would make it easier to run a sub-set of the tests without having
>>>>> to install everything (for instance not having to install mysql client
>>>>> libs.
>>>>
>>>>
>>>> Cool. That's exactly what I am working on in
>>>> https://github.com/apache/airflow/pull/7091 -> I want to get all the
>>>> tests to run in the integration-less CI, select all those that fail, and
>>>> treat them appropriately.
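>>>>
>>>> As an illustration of the marker-based approach (just a sketch - the
>>>> actual marker names used in #7091 may differ):
>>>>
>>>>     import pytest
>>>>
>>>>     # The custom marker has to be registered, e.g. in pytest.ini:
>>>>     #   markers =
>>>>     #       integration: test needs a real external service
>>>>     @pytest.mark.integration
>>>>     def test_mysql_hook_against_real_server():
>>>>         ...  # talks to a real MySQL instance
>>>>
>>>>     # The integration-less job can then deselect these tests with:
>>>>     #   pytest -m "not integration"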
>>>>
>>>>
>>>>> Admittedly less of a worry with breeze/docker, but still would be nice
>>>>> to skip/deselect tests when deps aren't there)
>>>>>
>>>>
>>>> Yeah. For me it's the same. I think we recently had a few discussions
>>>> with first-time users who have difficulty contributing because they do
>>>> not know how to reliably reproduce failing CI locally. I think the
>>>> resource requirements of the Breeze environment for simple tests were a
>>>> big blocker/difficulty for some users, so slimming it down and making it
>>>> integration-less by default will be really helpful. I will also make it
>>>> the "default" way of reproducing tests - I will remove the separate bash
>>>> scripts, which were an intermediate step. This is the same piece of work,
>>>> especially since I use the same mechanism, and it will be far easier for
>>>> me to have integration-specific cases working in CI if I also have Breeze
>>>> supporting it (eating my own dog food).
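>>>>
>>>> For the "deps aren't there" case, pytest already gives us a simple
>>>> pattern (a sketch, using the mysql client library just as an example):
>>>>
>>>>     import pytest
>>>>
>>>>     # Skip the whole module cleanly when the client library is missing,
>>>>     # instead of failing with an ImportError during collection.
>>>>     MySQLdb = pytest.importorskip("MySQLdb")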
>>>>
>>>>
>>>>> Most of these PRs are merged now; I've glanced over #7091 and like the
>>>>> look of it, good work! You'll let us know when we should take a deeper
>>>>> look?
>>>>>
>>>>
>>>> Yep, I will. I hope today/tomorrow - most of it is ready. I also managed
>>>> to vastly simplify running kubernetes kind (one less docker image,
>>>> everything runs in the same docker engine as airflow-testing itself) in
>>>> https://github.com/apache/airflow/pull/6516, which is a prerequisite for
>>>> #7091 - so both will need to be reviewed. I marke
>>>>
>>>>
>>>>> For cassandra tests specifically I'm not sure there is a huge amount
>>>>> of value in actually running the tests against cassandra -- we are
>>>>> using the official python module for it, and the test is basically
>>>>> running these queries - DROP TABLE IF EXISTS, CREATE TABLE, INSERT INTO
>>>>> TABLE, and then running hook.record_exists -- that seems like it's
>>>>> testing cassandra itself, when I think all we should do is test that
>>>>> hook.record_exists calls the execute method on the connection with the
>>>>> right string. I'll knock up a PR for this.
>>>>> Do we think it's worth keeping the non-mocked/integration tests too?
>>>>>
>>>>
>>>> I would not remove them just yet. Let's see how it works when I
>>>> separate them out. I have a feeling that we have a very small number of
>>>> those integration tests overall, so maybe it will be stable and fast
>>>> enough when we only run them in a separate job. I think it's good to
>>>> have different levels of tests (unit/integration/system) as they find
>>>> different types of problems. As long as we can have integration/system
>>>> tests clearly separated, stable and easy to disable/enable - I am all
>>>> for having different types of tests. There is the old and
>>>> well-established concept of the Test Pyramid
>>>> https://martinfowler.com/bliki/TestPyramid.html which applies very
>>>> accurately to our case. By adding markers/categorising the tests and
>>>> seeing how many of those tests we have, how stable they are, how long
>>>> they take and (eventually) how much they cost us - we can make better
>>>> decisions.
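>>>>
>>>> For the mocked variant you describe, the pattern would be roughly this
>>>> (a self-contained sketch with a toy hook - the real test would target
>>>> CassandraHook.record_exists and its actual CQL, which I have not checked
>>>> here):
>>>>
>>>>     from unittest import mock
>>>>
>>>>     class ToyCassandraHook:
>>>>         """Stand-in for the real hook, just to show the testing pattern."""
>>>>
>>>>         def get_conn(self):
>>>>             raise NotImplementedError  # would return a cassandra Session
>>>>
>>>>         def record_exists(self, table, keys):
>>>>             query = "SELECT * FROM {} WHERE {}".format(
>>>>                 table, " AND ".join("{}=%({})s".format(k, k) for k in keys))
>>>>             rows = self.get_conn().execute(query, keys)
>>>>             return bool(list(rows))
>>>>
>>>>     def test_record_exists_calls_execute_with_expected_cql():
>>>>         hook = ToyCassandraHook()
>>>>         with mock.patch.object(ToyCassandraHook, "get_conn") as get_conn:
>>>>             get_conn.return_value.execute.return_value = [object()]
>>>>             assert hook.record_exists("ks.table", {"id": 1})
>>>>             get_conn.return_value.execute.assert_called_once_with(
>>>>                 "SELECT * FROM ks.table WHERE id=%(id)s", {"id": 1})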
>>>>
>>>> J.
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Jarek Potiuk
>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>
>>> M: +48 660 796 129 <+48660796129>
>>>
>>>
>>
>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>
>> M: +48 660 796 129 <+48660796129>
>>
>>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
