potiuk commented on PR #35473:
URL: https://github.com/apache/airflow/pull/35473#issuecomment-1795408176

   > 2min is a long time for testing. And what is the difference with the 
function above?
   
   It's long, but we run the tests in parallel: 8 docker compose instances run 
heavy DB tests at the same time, each with its own database, on a single 
machine where I/O and especially the kernel are shared between all the tests 
and DBs. At the same time some of the parallel tests might be resetting and 
wiping their databases, and all of this happens on AWS instances that already 
have a rather slow-ish filesystem. The resulting I/O contention might cause 
huge delays: tests that normally run for 2 seconds might, and sometimes (but 
only sometimes) will, run for 10s of minutes if the contention hits all the DB 
operations. Say you have 1000 block writes to disk during the test: just a 
10 ms wait on every write adds 10 seconds. 
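   
   To make the arithmetic above concrete, here is a minimal sketch. The 
function name is made up for illustration, and it assumes the waits simply add 
up (no overlap between writes):

```python
def added_delay_seconds(block_writes: int, wait_ms_per_write: float) -> float:
    """Extra wall-clock time if every block write waits `wait_ms_per_write` ms.

    Assumption: writes are serialized, so the waits accumulate linearly.
    """
    return block_writes * wait_ms_per_write / 1000.0


# 1000 block writes, each delayed by 10 ms
print(added_delay_seconds(1000, 10))  # 10.0
```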
   
   This is my guess as to why these tests sometimes run much longer. 
   
   Generally we parallelise a lot of those tests, utilising multiple CPUs and 
a lot of memory. Each of the machines runs up to 16 containers during the 
tests, and they generally utilise 80% - 90% of the 8 CPUs they have. 
Unfortunately (even after my recent quest to separate the DB from non-DB tests) 
we still have a lot (7000) of tests that require the DB, and those tests must 
be run in a serialized way if they use the same DB (because they use the DB to 
store and retrieve state, and they usually clean the DB up before and after). 
We also need to optimize those for "test time", not only CPU. That's why we 
have a highly optimized setup where we split those tests into several parallel 
types, each of them running its DB tests sequentially, each with its own DB. 
This does not scale linearly (due to the I/O and kernel contention mentioned 
above) but it scales pretty well (a 2 CPU instance is around 3 times slower 
than an 8 CPU instance at running the full suite of DB tests).
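   
   As a rough illustration of the "parallel types, each sequential, each with 
its own DB" idea, here is a toy sketch. It is not our actual CI code: the real 
setup uses separate docker compose instances, while here the group names are 
hypothetical, a plain dict stands in for each group's database and threads 
stand in for containers:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical test groups; the names are illustrative only.
TEST_GROUPS = {
    "group_a": ["test_1", "test_2"],
    "group_b": ["test_3"],
}


def run_group(tests):
    """Run one group's tests one after another against the group's own DB."""
    db = {}  # stand-in for a per-group database, never shared across groups
    finished = []
    for test in tests:
        db.clear()            # each test starts from a clean DB
        db[test] = "state"    # the test stores its state...
        assert db[test] == "state"  # ...and reads it back
        finished.append(test)
    return finished


def run_all(groups):
    """Run the groups in parallel; tests inside a group stay sequential."""
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        futures = {name: pool.submit(run_group, tests) for name, tests in groups.items()}
        return {name: future.result() for name, future in futures.items()}


print(run_all(TEST_GROUPS))
```

   Tests in different groups never touch the same DB, so they can overlap in 
time, but they still compete for the same disk and kernel, which is where the 
contention comes from.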
   
   Unfortunately, a side effect of this setup is that sometimes, when there is 
a lot of contention from "noisy neighbours" on the same machine in different 
container instances, tests that normally take 10 seconds will suddenly take 20 
or 30 or even 60 seconds. 
   
   Those are usually the more complex tests. They are not really unit tests 
but end-to-end tests of certain functionality. In this case it is a test that 
performs a backfill using the Dask Executor, and it utilises the whole 
mechanism of our backfill. This means that it performs all the back-and-forth 
database reading and writing, runs the scheduler and executor to handle the 
backfill operation, etc. And this means that it might get a lot of contention 
from parallel tests doing similar heavy work (if it so happens that the tests 
run in parallel in different containers are also doing a lot of DB operations 
at the same time). That last part is not really deterministic. It varies from 
machine to machine and from run to run. It's also a pretty well-known fact 
that not all Amazon instances are equal. Some are "better" than others: closer 
or further in terms of roundtrip time to the DB, or having slower memory, or 
having more noisy neighbours on the bare metal (those are VMs, not bare-metal 
machines, so they might run in parallel with other VMs on the same hardware). 
Layers upon layers of abstraction make "test time" in this particular case 
pretty unpredictable and wildly varying from run to run.
   
   The timeout increase in this case is our "what is the worst that can happen 
to the test while we still consider it a success". There are cases where we 
know that a given test will sometimes simply take a long time. In this case I 
am guessing it's not hanging but simply taking slightly (or even more than 
slightly) longer than 30 seconds. And there is no way to find out other than 
... increasing the timeout. If it stops happening after the timeout increase, 
it means that our guess was right. If it continues happening after the timeout 
has been increased significantly, we will have to look deeper (I will likely 
quarantine the test and create an issue, and we will look into addressing it 
before 2.8 - we try to keep the number of quarantined tests low, and we 
currently have 3 of those). 
   
   The increase I usually apply is 3x (this is why 100, which is roughly 3x 
the original 30) to give enough headroom. It also requires increasing the 
per-test "hard" timeout to a bit more than that (120), because we currently 
have 60s as the default "hard" timeout for each test. This is in order to 
make such a test fail rather than hang the whole build: if we let the test 
hang until the whole test job times out (80 minutes now, I believe), then the 
job will be cancelled and we will not know which test caused it. So both 100 
and 120 are values that come from experience of de-flaking other tests and 
from the experiment to see if my guess is right. Note that these are not final 
values. It's an experiment to see if our hypothesis is right. Once we have run 
our CI builds quite a few times and see this problem gone (which, as usual, I 
am planning to check), we will be able to see how long those tests take on 
average when they succeed, and whether they sometimes take 30 or 60 or 90 
seconds to succeed. I usually take a random sample from the 10-20 longest 
running tests I can find over the last week or so. Our tests have summaries 
at the end showing the longest running tests, and we can see there the 
variation and the maximum time we can expect for the test.
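   
   A minimal sketch of that last check, assuming we have collected the 
durations of successful runs by hand from the test summaries. The function 
name is hypothetical and the sample durations below are made up for 
illustration (they are not measurements); only the 100 and 120 values come 
from this change:

```python
from statistics import mean

SOFT_TIMEOUT = 100  # timeout inside the test, roughly 3x the original 30
HARD_TIMEOUT = 120  # per-test "hard" timeout, a bit more than the soft one


def summarize_durations(durations, soft_timeout):
    """Summarize successful-run durations (seconds) against the soft timeout."""
    longest = max(durations)
    return {
        "mean": mean(durations),
        "max": longest,
        # how far the slowest successful run was from timing out
        "headroom": soft_timeout - longest,
    }


# Made-up sample of successful runs, in seconds.
sample = [30, 60, 90]
print(summarize_durations(sample, SOFT_TIMEOUT))
```

   If the headroom stays comfortably positive over many builds, the timeout 
can be tuned down later. If it is close to zero or the test still times out, 
the guess was probably wrong and we need to look deeper.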
   
   This is what I usually do when I fight with such flaky tests.
   
   I hope the explanation is detailed enough to describe the reasoning of the 
change and root causes of what's happening here.
   
   Is that a good enough explanation? Or maybe (knowing the context) you have 
some ideas for how we can improve the process and treat those tests 
differently? I am all ears @bolkedebruin :). 
   
   Really, I would love others to attempt to diagnose and fix those flaky 
tests, so if there are other ideas to help with those, I am all for reviewing 
and commenting on PRs and attempts to get rid of those flakes :). I would 
really, really, really love that.
   
   

