potiuk commented on PR #35473: URL: https://github.com/apache/airflow/pull/35473#issuecomment-1795408176
> 2min is a long time for testing. And what is the difference with the function above?

It's long, but consider how we run the tests in parallel: we have 8 docker compose instances running heavy DB tests at the same time, each with its own database, on a single machine where I/O and especially the kernel are shared between all the tests and DBs. At the same time, some of the parallel tests might be resetting and wiping their databases, and all of this happens on AWS instances that already have a rather slow-ish filesystem. The I/O contention and competing for it might cause huge delays - tests that normally run for 2 seconds might sometimes (but only sometimes) run for 10s of minutes if the contention hits all the DB operations. Say you have 1000 block writes to disk during the test: a waiting time of just 10 ms for every write adds 10 seconds. This is my guess as to why we sometimes have these much longer running tests.

Generally we are parallelising a lot of those tests, utilising multiple CPUs and a lot of memory. Each of the machines runs up to 16 containers during the test and they generally utilise 80% - 90% of the 8 CPUs they have. Unfortunately (even after my recent quest to separate the DB from non-DB tests) we still have a lot (7000) of tests that require the DB, and those tests must be run in a serialized way if they use the same DB (because they use the DB to store and retrieve state and they usually clean the DB up before and after). And we need to optimize those for "test time", not only CPU - that's why we have a highly optimized setup where we split those tests into several parallel types, each of them running its DB tests sequentially, each with its own DB. This does not scale linearly (due to the I/O and kernel contention mentioned above) but it scales pretty well (a 2 CPU instance is around 3 times slower than an 8 CPU instance when running the full suite of DB tests).
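To make the back-of-the-envelope I/O estimate above concrete, here is a minimal sketch. The function name is hypothetical, and the inputs are the illustrative figures from the comment (1000 block writes, 10 ms of waiting per write, a test that normally takes about 2 seconds), not measurements from CI.

```python
def added_io_delay_seconds(block_writes: int, wait_ms_per_write: float) -> float:
    """Extra wall-clock time spent purely waiting on contended disk writes."""
    return block_writes * wait_ms_per_write / 1000.0


# Illustrative figures only (assumed, not measured).
baseline_seconds = 2.0
delay = added_io_delay_seconds(block_writes=1000, wait_ms_per_write=10)
print(f"extra wait: {delay:.0f}s, total: {baseline_seconds + delay:.0f}s")
```

Even a small per-write wait multiplies across every DB operation, which is why a test that is normally quick can occasionally blow past its timeout.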
Unfortunately, a side effect of this setup is that sometimes, when there is a lot of contention from "noisy neighbours" on the same machine in different container instances, tests that normally take 10 seconds will suddenly take 20 or 30 or even 60 seconds. Those are usually the more complex tests - they are not really unit tests, but end-to-end tests of certain functionality. In this case it is a test that performs a backfill using the Dask Executor, and it utilises the whole mechanism of our backfill. That means it performs all the back-and-forth database reading and writing, runs the scheduler and executor to handle the backfill operation, etc. And this means that it might get a lot of contention from parallel tests doing similar heavy work (if it so happens that the tests run in parallel in different containers are also doing a lot of DB operations at the same time).

And that last part is not really deterministic. It varies from machine to machine and from run to run. It's also a pretty well-known fact that not all Amazon instances are equal - some are "better" than others, in terms of being closer or further in roundtrip time to the DB, or having slower memory, or having more noisy neighbours on the bare metal (those are VMs, not bare-metal machines, so they might run in parallel with other VMs on the same hardware). Layers upon layers of abstraction make "test time" in this particular case pretty unpredictable and wildly varying from run to run.

The timeout increase in this case is our "what is the worst that can happen for the test while we still consider it a success" - and there are cases where we know that a given test will simply sometimes take a long time. In this case I am guessing it's not hanging but simply taking slightly (or even more than slightly) longer than 30 seconds. And there is no way to find out other than ... increasing the timeout.
If it stops happening after the timeout increase, it means that our guess was right. If it continues happening after the timeout has been increased significantly, we will have to look deeper (I will likely quarantine the test and create an issue, and we will look into addressing it before 2.8 - we try to keep the number of quarantined tests low, and we currently have 3 of those). The increase I usually apply is 3x (this is why 100, which is roughly 3x the original 30) - to give enough space. It also requires increasing the per-test "hard" timeout to a bit more than that (120), because we currently have 60s as the default "hard" timeout for each test. This is in order to make such a test fail rather than hang the whole build: if we let the test hang until the whole test job times out (80 minutes now, I believe), then the job will be cancelled and we will not know which test caused it. So both 100 and 120 are values that come from the experience of un-flaking other tests, and from experimenting to see if my guess is right.

Note - it's not a final value. It's an experiment to see if our hypothesis is right. Once we have run our CI builds quite a few times and we see this problem gone (which, as usual, I am planning to check), we will be able to see how long those tests take on average when they succeed - we will see if they sometimes take 30 or 60 or 90 seconds to succeed. I usually take a random sample from the 10-20 longest running tests I can find over the last week or so - our tests have summaries at the end showing the longest running tests, and we can see there what the variation is and what maximum time we can expect for the test. This is what I usually do when I fight with such flaky tests.

I hope the explanation is detailed enough to describe the reasoning behind the change and the root causes of what's happening here. Is that a good enough explanation? Or maybe (knowing the context) do you have some ideas on how we can improve the process and treat those tests differently? I am all ears @bolkedebruin :).
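As a rough illustration of the timeout reasoning above, the sketch below checks that the "hard" per-test timeout sits above the test's own timeout, and summarizes a sample of the longest successful runs. The timeout values (30, 100, 60, 120) come from the comment; the function names and the sample durations are hypothetical and are not real CI data or Airflow code.

```python
from statistics import mean

# Values quoted in the comment above.
ORIGINAL_TIMEOUT_S = 30       # test timeout before the change
NEW_TIMEOUT_S = 100           # roughly 3x the original
DEFAULT_HARD_TIMEOUT_S = 60   # default per-test "hard" timeout
NEW_HARD_TIMEOUT_S = 120      # raised to sit a bit above the new test timeout


def timeouts_are_consistent(test_timeout_s: float, hard_timeout_s: float) -> bool:
    """The hard limit must exceed the test's own timeout, otherwise the
    hard limit would fail the test before its own timeout could fire."""
    return hard_timeout_s > test_timeout_s


def summarize_longest(durations_s: list[float], top_n: int = 20) -> dict[str, float]:
    """Return count, max and mean of the top_n longest durations."""
    if not durations_s:
        raise ValueError("need at least one duration")
    longest = sorted(durations_s, reverse=True)[:top_n]
    return {"count": len(longest), "max": max(longest), "mean": mean(longest)}


# Hypothetical durations (seconds) of successful runs; made up for illustration.
sample = [12.0, 28.5, 31.0, 44.0, 18.0]
summary = summarize_longest(sample, top_n=3)

print(timeouts_are_consistent(NEW_TIMEOUT_S, NEW_HARD_TIMEOUT_S))      # new hard limit
print(timeouts_are_consistent(NEW_TIMEOUT_S, DEFAULT_HARD_TIMEOUT_S))  # old hard limit
print(summary)
```

Comparing the observed maximum against the experimental timeout is what would later justify tightening the value again once enough successful runs have been collected.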
Really, I would love others to attempt to diagnose and fix those flaky tests, so if there are other ideas to help with those - I am all for reviewing and commenting on PRs and attempts to get rid of those flakes :). I would really, really, really love that.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
