HA! I FOUND IT! It's not `test_backfill_job.py`. It's `test_kubernetes_executor.py` - and not even that - it's the code coverage plugin that takes a lot of memory while test_kubernetes_executor.py is running.

When you run `test_kubernetes_executor.py` it can take a lot of memory (2-3 GB) and does not free it even after the Kubernetes tests complete - the memory remains taken while the subsequent tests run. It seems that the coverage plugin keeps a loooooot of data in memory about the code coverage resulting from those tests. When I disable code coverage, the memory remaining after test_kubernetes_executor is ~700 MB (!)

I am going to disable coverage for PR builds. It is very rarely looked at, and actually only the coverage from the main branch makes sense, because only there do we have a guarantee of running all tests.

J.
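For illustration, a minimal sketch of what such a PR-only coverage switch could look like - this is not the actual Airflow CI change; it only assumes pytest-cov (which provides the `--no-cov` flag) and the `GITHUB_EVENT_NAME` variable that GitHub Actions sets for every run:

```python
# Hypothetical sketch only - not the real Airflow CI scripts.
# Assumes pytest-cov is installed and that GITHUB_EVENT_NAME is set
# by GitHub Actions ("pull_request" for PR builds).
import os
import subprocess
import sys


def run_tests(test_paths):
    cmd = ["pytest", *test_paths]
    if os.environ.get("GITHUB_EVENT_NAME") == "pull_request":
        # PR builds: skip coverage collection entirely to keep memory down.
        cmd.append("--no-cov")
    else:
        # Canonical (main) builds: keep collecting coverage as before.
        cmd += ["--cov=airflow", "--cov-report=xml"]
    return subprocess.call(cmd)


if __name__ == "__main__":
    sys.exit(run_tests(sys.argv[1:] or ["tests/"]))
```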
On Wed, Nov 10, 2021 at 6:37 PM Jarek Potiuk <[email protected]> wrote:

> It seems to happen much less frequently, but it is still there.
>
> And the culprit is - most likely - `test_backfill_job.py`.
>
> In this case, just before it failed, it took 77% of the memory (5.3 GB):
>
> ########### STATISTICS #################
> CONTAINER ID   NAME                                          CPU %     MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O        PIDS
> 520e4674350d   airflow-core-mysql_airflow_run_37e48532d683   171.93%   5.24GiB / 6.789GiB    77.18%   33.5MB / 20.4MB   144MB / 152kB    152
> eedf0f4b9f1e   airflow-core-mysql_mysql_1                    0.09%     108.3MiB / 6.789GiB   1.56%    20.4MB / 33.5MB   22.7MB / 530MB   32
>
>               total        used        free      shared  buff/cache   available
> Mem:           6951        6754         125           6          72          16
> Swap:             0           0           0
>
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/root        84G   54G   30G  65% /
> /dev/sdb15      105M  5.2M  100M   5% /boot/efi
> /dev/sda1        14G  4.1G  9.0G  32% /mnt
> ########### STATISTICS #################
> ### The last 2 lines for Core process: /tmp/tmp.7v6Wm3jIOT/tests/Core/stdout ###
> tests/executors/test_sequential_executor.py .                          [ 49%]
> tests/jobs/test_backfill_job.py ..
>
> That should likely be enough to investigate, or maybe mitigate it somehow and add some extra cleanups between tests if the memory is not freed between them.
>
> J.
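As a rough illustration of the kind of per-test measurement and extra cleanup hinted at above, a conftest.py-style fixture like the one below could be used. This is a sketch only, not Airflow's actual test setup, and it assumes psutil is available in the test image:

```python
# Illustrative conftest.py snippet - not Airflow's real conftest.
# Assumes psutil is installed; prints the resident-set-size growth after
# every test, so memory-hungry tests (or plugins) stand out in the log.
import gc
import os

import psutil
import pytest

_process = psutil.Process(os.getpid())
_last_rss = _process.memory_info().rss


@pytest.fixture(autouse=True)
def report_rss_growth(request):
    global _last_rss
    yield
    gc.collect()  # an "extra cleanup" between tests, as suggested above
    rss = _process.memory_info().rss
    growth = rss - _last_rss
    if growth > 50 * 1024 * 1024:  # only report jumps above ~50 MB
        print(f"\n[mem] {request.node.nodeid}: +{growth / 1024 / 1024:.0f} MB "
              f"(total {rss / 1024 / 1024:.0f} MB)")
    _last_rss = rss
```

Running pytest with `-s` keeps the printed lines from being swallowed by output capturing.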
> On Wed, Nov 10, 2021 at 6:11 PM Khalid Mammadov <[email protected]> wrote:
>
>> Just to let you know.
>>
>> It looks like the issue is still there:
>>
>> https://github.com/apache/airflow/runs/4167464563?check_suite_focus=true
>>
>> On 10/11/2021 13:40, Jarek Potiuk wrote:
>>
>> Merged! Please rebase (Khalid - you can remove that workaround of yours) and let me know.
>>
>> There is one failure that happened in my tests:
>> https://github.com/apache/airflow/runs/4165358689?check_suite_focus=true
>> - but we can observe the results of this one and try to find the reason separately if it continues to repeat.
>>
>> J.
>>
>> On Wed, Nov 10, 2021 at 12:49 PM Jarek Potiuk <[email protected]> wrote:
>>
>>> Fix being tested in: https://github.com/apache/airflow/pull/19512 (committer PR) and https://github.com/apache/airflow/pull/19514 (regular user PR).
>>>
>>> On Wed, Nov 10, 2021 at 11:25 AM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> OK. I took a look. It looks like the "Core" tests do indeed briefly (and sometimes for a longer time) exceed 50% of the memory available on GitHub runners. I do not think optimizing them now makes much sense - because even if we optimize them now, they will likely soon reach 50-60% of the available memory again, which - when there are other parallel tests running - might easily lead to OOM.
>>>>
>>>> It looks like those are only the "Core" type of tests, so the solution will be (similarly as with the "Integration" tests) to separate them out into a non-parallel run for GitHub runners.
>>>>
>>>> On Tue, Nov 9, 2021 at 9:33 PM Jarek Potiuk <[email protected]> wrote:
>>>>
>>>>> Yep. Apparently one of the recent tests is using too much memory. I had some private errands that made me less available in the last few days - but I will have time to catch up tonight/tomorrow.
>>>>>
>>>>> Thanks for changing the "parallel" level in your PR - that will give me more datapoints. I've just re-run both PRs with the "debug-ci-resources" label. This is our "debug" label to show resource use during the build, and I might be able to find and fix the root cause.
>>>>>
>>>>> For the future - in case any other committer wants to investigate it, setting the "debug-ci-resources" label turns on the debugging mode showing this information periodically alongside the progress of the tests - it can be helpful in determining what caused the OOM:
>>>>>
>>>>> CONTAINER ID   NAME                                            CPU %    MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O         PIDS
>>>>> c46832148ff7   airflow-always-mssql_airflow_run_e59b6039c3d8   99.59%   365.1MiB / 6.789GiB   5.25%    1.62MB / 3.33MB   8.97MB / 20.5kB   8
>>>>> f4d2a192d6fc   airflow-always-mssql_mssqlsetup_1               0.00%    0B / 0B               0.00%    0B / 0B           0B / 0B           0
>>>>> a668cdedc717   airflow-api-mssql_airflow_run_bcc466077ac0      35.07%   431.4MiB / 6.789GiB   6.21%    2.26MB / 4.47MB   73.2MB / 20.5kB   8
>>>>> f306f4221ba1   airflow-api-mssql_mssqlsetup_1                  0.00%    0B / 0B               0.00%    0B / 0B           0B / 0B           0
>>>>> 7f10748e9496   airflow-api-mssql_mssql_1                       30.66%   735.5MiB / 6.789GiB   10.58%   4.47MB / 2.26MB   36.8MB / 124MB    132
>>>>> 8b5ca767ed0c   airflow-always-mssql_mssql_1                    12.59%   716.5MiB / 6.789GiB   10.31%   3.33MB / 1.63MB   36.7MB / 52.7MB   131
>>>>>
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:           6951        2939         200           6        3811        3702
>>>>> Swap:             0           0           0
>>>>>
>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>> /dev/root        84G   51G   33G  61% /
>>>>> /dev/sda15      105M  5.2M  100M   5% /boot/efi
>>>>> /dev/sdb1        14G  4.1G  9.0G  32% /mnt
>>>>>
>>>>> J.
>>>>>
>>>>> On Tue, Nov 9, 2021 at 9:19 PM Oliveira, Niko <[email protected]> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> Just to throw another data point in the ring, I've had a PR <https://github.com/apache/airflow/pull/19410> stuck in the same way as well. Several retries are all failing with the same OOM.
>>>>>>
>>>>>> I've also dug through the GitHub Actions history and found a few others. So it doesn't seem to be just a one-off.
>>>>>>
>>>>>> Cheers,
>>>>>> Niko
>>>>>> ------------------------------
>>>>>> *From:* Khalid Mammadov <[email protected]>
>>>>>> *Sent:* Tuesday, November 9, 2021 6:24 AM
>>>>>> *To:* [email protected]
>>>>>> *Subject:* [EXTERNAL] OOM issue in the CI
>>>>>>
>>>>>> Hi Devs,
>>>>>>
>>>>>> I have been working on the PR below and ran into an OOM issue during testing on GitHub Actions (you can see it in the commit history):
>>>>>>
>>>>>> https://github.com/apache/airflow/pull/19139/files
>>>>>>
>>>>>> The tests for the Postgres, MySQL, etc. databases fail due to OOM and Docker gets killed.
>>>>>>
>>>>>> I have reduced parallelism to 1 "in the code" *temporarily* (the only extra change in the PR) and it passes all the checks, which confirms the issue.
>>>>>>
>>>>>> I was hoping you could advise on the best course of action in this situation - should I force parallelism to 1 to get all checks passing, or is there some other way to solve the OOM?
>>>>>>
>>>>>> Any help would be appreciated.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Khalid
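For reference, the periodic statistics blocks quoted above (docker stats, free, df) can be reproduced with a small watcher along these lines - a sketch of the idea only, not the actual implementation behind the "debug-ci-resources" label:

```python
# Sketch of a resource watcher similar in spirit to what the
# "debug-ci-resources" label prints - not the real Breeze/CI code.
# It only shells out to standard tools: docker stats, free, df.
import subprocess
import time


def dump_resources():
    print("########### STATISTICS #################")
    for cmd in (["docker", "stats", "--no-stream"], ["free", "-m"], ["df", "-h"]):
        subprocess.run(cmd, check=False)
    print("########### STATISTICS #################")


if __name__ == "__main__":
    # Run in the background alongside the tests and print a snapshot
    # once a minute, so the log shows memory use just before an OOM kill.
    while True:
        dump_resources()
        time.sleep(60)
```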
