Cool, thanks!

On Wed, 10 Nov 2021, 18:14 Jarek Potiuk, <[email protected]> wrote:
> HA! I FOUND IT!
>
> It's not the backfill_job. It's `test_kubernetes_executor.py` - and not even that - it's the code coverage plugin, when test_kubernetes_executor.py is running, that takes a lot of memory.
>
> When you run `test_kubernetes_executor.py` it can take a lot of memory (2-3 GB) and does not free it even after the kubernetes tests are completed, and it remains taken for the subsequent tests to run.
>
> It seems that the coverage plugin keeps a loooooot of data in memory about the code coverage resulting from those tests. When I disable code coverage, the memory remaining after test_kubernetes_executor is ~700 MB (!)
>
> I am going to disable the coverage for PRs. It's very rarely looked at, and actually only the main one makes sense, because only there do we have a guarantee of running all tests.
>
> J.
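
A minimal sketch of what "disable the coverage for PRs" could look like in the script that invokes pytest - the RUN_COVERAGE switch is a hypothetical environment variable for illustration, not the actual Airflow CI setting:

    # run_tests.py - illustrative only; RUN_COVERAGE is an assumed switch.
    import os
    import subprocess
    import sys

    pytest_args = ["pytest", "tests/"]

    # The pytest-cov --cov options are what accumulate coverage data in memory,
    # so on PR builds we simply do not pass them at all.
    if os.environ.get("RUN_COVERAGE") == "true":
        pytest_args += ["--cov=airflow", "--cov-report=xml"]

    sys.exit(subprocess.call(pytest_args))
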
> On Wed, Nov 10, 2021 at 6:37 PM Jarek Potiuk <[email protected]> wrote:
>
>> It seems to happen much less frequently, but it is there.
>>
>> And the culprit is - most likely - `test_backfill_job.py`.
>>
>> In this case, just before it failed it took 77% of the memory (5.3 GB):
>>
>> ########### STATISTICS #################
>> CONTAINER ID   NAME                                            CPU %     MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O        PIDS
>> 520e4674350d   airflow-core-mysql_airflow_run_37e48532d683     171.93%   5.24GiB / 6.789GiB    77.18%   33.5MB / 20.4MB   144MB / 152kB    152
>> eedf0f4b9f1e   airflow-core-mysql_mysql_1                      0.09%     108.3MiB / 6.789GiB   1.56%    20.4MB / 33.5MB   22.7MB / 530MB   32
>>
>>               total        used        free      shared  buff/cache   available
>> Mem:           6951        6754         125           6          72          16
>> Swap:             0           0           0
>>
>> Filesystem      Size  Used  Avail  Use%  Mounted on
>> /dev/root        84G   54G    30G   65%  /
>> /dev/sdb15      105M  5.2M   100M    5%  /boot/efi
>> /dev/sda1        14G  4.1G   9.0G   32%  /mnt
>> ########### STATISTICS #################
>> ### The last 2 lines for Core process: /tmp/tmp.7v6Wm3jIOT/tests/Core/stdout ###
>> tests/executors/test_sequential_executor.py .                           [ 49%]
>> tests/jobs/test_backfill_job.py ..
>>
>> That should likely be enough to investigate, or maybe mitigate it somehow and add some extra cleanups between tests if the memory is not freed between them.
>>
>> J.
>>
>> On Wed, Nov 10, 2021 at 6:11 PM Khalid Mammadov <[email protected]> wrote:
>>
>>> Just to let you know.
>>>
>>> Looks like the issue is still there:
>>>
>>> https://github.com/apache/airflow/runs/4167464563?check_suite_focus=true
>>>
>>> On 10/11/2021 13:40, Jarek Potiuk wrote:
>>>
>>> Merged! Please rebase (Khalid - you can remove that workaround of yours) and let me know.
>>>
>>> There is one failure that happened in my tests: https://github.com/apache/airflow/runs/4165358689?check_suite_focus=true - but we can observe the results of this one and try to find the reason separately if it continues to repeat.
>>>
>>> J.
>>>
>>> On Wed, Nov 10, 2021 at 12:49 PM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Fix being tested in: https://github.com/apache/airflow/pull/19512 (committer PR) and https://github.com/apache/airflow/pull/19514 (regular user PR).
>>>>
>>>> On Wed, Nov 10, 2021 at 11:25 AM Jarek Potiuk <[email protected]> wrote:
>>>>
>>>>> OK. I took a look. It looks like the "Core" tests do indeed briefly (and sometimes for a longer time) go over 50% of the memory available on GitHub runners. I do not think optimizing them now makes sense - because even if we optimize them now, they will likely soon reach 50-60% of the available memory again, which - when there are other parallel tests running - might easily lead to OOM.
>>>>>
>>>>> It looks like those are only "Core" type tests, so the solution will be (similarly to the "Integration" tests) to separate them out into a non-parallel run on GitHub runners.
>>>>>
>>>>> On Tue, Nov 9, 2021 at 9:33 PM Jarek Potiuk <[email protected]> wrote:
>>>>>
>>>>>> Yep. Apparently one of the recent tests is using too much memory. I had some private errands that made me less available over the last few days, but I will have time to catch up tonight/tomorrow.
>>>>>>
>>>>>> Thanks for changing the "parallel" level in your PR - that will give me more data points. I've just re-run both PRs with the "debug-ci-resources" label. This is our "debug" label to show resource use during the build, and I might be able to find and fix the root cause.
>>>>>>
>>>>>> For the future - in case any other committer wants to investigate it, setting the "debug-ci-resources" label turns on the debugging mode, showing this information periodically alongside the progress of the tests - it can be helpful in determining what caused the OOM:
>>>>>>
>>>>>> CONTAINER ID   NAME                                            CPU %    MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O         PIDS
>>>>>> c46832148ff7   airflow-always-mssql_airflow_run_e59b6039c3d8   99.59%   365.1MiB / 6.789GiB   5.25%    1.62MB / 3.33MB   8.97MB / 20.5kB   8
>>>>>> f4d2a192d6fc   airflow-always-mssql_mssqlsetup_1               0.00%    0B / 0B               0.00%    0B / 0B           0B / 0B           0
>>>>>> a668cdedc717   airflow-api-mssql_airflow_run_bcc466077ac0      35.07%   431.4MiB / 6.789GiB   6.21%    2.26MB / 4.47MB   73.2MB / 20.5kB   8
>>>>>> f306f4221ba1   airflow-api-mssql_mssqlsetup_1                  0.00%    0B / 0B               0.00%    0B / 0B           0B / 0B           0
>>>>>> 7f10748e9496   airflow-api-mssql_mssql_1                       30.66%   735.5MiB / 6.789GiB   10.58%   4.47MB / 2.26MB   36.8MB / 124MB    132
>>>>>> 8b5ca767ed0c   airflow-always-mssql_mssql_1                    12.59%   716.5MiB / 6.789GiB   10.31%   3.33MB / 1.63MB   36.7MB / 52.7MB   131
>>>>>>
>>>>>>               total        used        free      shared  buff/cache   available
>>>>>> Mem:           6951        2939         200           6        3811        3702
>>>>>> Swap:             0           0           0
>>>>>>
>>>>>> Filesystem      Size  Used  Avail  Use%  Mounted on
>>>>>> /dev/root        84G   51G    33G   61%  /
>>>>>> /dev/sda15      105M  5.2M   100M    5%  /boot/efi
>>>>>> /dev/sdb1        14G  4.1G   9.0G   32%  /mnt
>>>>>>
>>>>>> J.
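
The kind of periodic dump shown above (per-container stats, system memory, disk usage) could be approximated with a small monitor like this - a rough sketch only, not the actual script behind the "debug-ci-resources" label, and the 60-second interval is an assumption:

    # ci_resource_monitor.py - illustrative stand-in for the periodic dump.
    import subprocess
    import time

    def dump_statistics() -> None:
        print("########### STATISTICS #################")
        # The same three views quoted in the thread: container usage,
        # system memory, and disk space.
        subprocess.run(["docker", "stats", "--no-stream"], check=False)
        subprocess.run(["free", "-m"], check=False)
        subprocess.run(["df", "-h"], check=False)

    if __name__ == "__main__":
        while True:
            dump_statistics()
            time.sleep(60)  # interval is a guess; the real one is not stated in the thread
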
>>>>>> On Tue, Nov 9, 2021 at 9:19 PM Oliveira, Niko <[email protected]> wrote:
>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>> Just to throw another data point into the ring, I've had a PR <https://github.com/apache/airflow/pull/19410> stuck in the same way as well. Several retries are all failing with the same OOM.
>>>>>>>
>>>>>>> I've also dug through the GitHub Actions history and found a few others. So it doesn't seem to be just a one-off.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Niko
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* Khalid Mammadov <[email protected]>
>>>>>>> *Sent:* Tuesday, November 9, 2021 6:24 AM
>>>>>>> *To:* [email protected]
>>>>>>> *Subject:* [EXTERNAL] OOM issue in the CI
>>>>>>>
>>>>>>> Hi Devs,
>>>>>>>
>>>>>>> I have been working on the PR below and ran into an OOM issue during testing on GitHub Actions (you can see it in the commit history):
>>>>>>>
>>>>>>> https://github.com/apache/airflow/pull/19139/files
>>>>>>>
>>>>>>> The tests for the databases (Postgres, MySQL, etc.) fail due to OOM and Docker gets killed.
>>>>>>>
>>>>>>> I have reduced parallelism to 1 "in the code" *temporarily* (the only extra change in the PR) and it passes all the checks, which confirms the issue.
>>>>>>>
>>>>>>> I was hoping you could advise on the best course of action in this situation - should I force parallelism to 1 to get all the checks passing, or is there some other way to solve the OOM?
>>>>>>>
>>>>>>> Any help would be appreciated.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Khalid
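
Following up on the suggestion earlier in the thread to "add some extra cleanups between tests if the memory is not freed between them", a per-test memory check could be sketched roughly as below - this assumes psutil is available in the test image, and the 100 MB threshold and the gc.collect() call are illustrative, not the project's actual test configuration:

    # conftest.py sketch - not Airflow's actual fixture; psutil is assumed available.
    import gc

    import psutil
    import pytest

    @pytest.fixture(autouse=True)
    def track_memory_growth(request):
        """Flag tests whose resident memory is not released afterwards."""
        process = psutil.Process()
        before = process.memory_info().rss
        yield
        gc.collect()  # best-effort cleanup between tests
        grown_mb = (process.memory_info().rss - before) / (1024 * 1024)
        if grown_mb > 100:  # arbitrary threshold for illustration
            print(f"\n[memory] {request.node.nodeid} grew RSS by {grown_mb:.0f} MB")
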
