Nice sleuthing.

  Dennis Ferruzzi
  SDE | Amazon MWAA


From: Jarek Potiuk <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, November 10, 2021 at 10:14 AM
To: Khalid Mammadov <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: RE: [EXTERNAL] OOM issue in the CI


HA! I FOUND IT!

It's not the backfill job. It's `test_kubernetes_executor.py` - and not even that: it's the code coverage plugin that takes a lot of memory while test_kubernetes_executor.py is running.

Running `test_kubernetes_executor.py` can take a lot of memory (2-3 GB), and that memory is not freed even after the Kubernetes tests complete - it remains taken while the subsequent tests run.

It seems that the coverage plugin keeps a LOT of data in memory about the coverage resulting from those tests. When I disable code coverage, the memory remaining after test_kubernetes_executor is ~700 MB (!)

I am going to disable coverage for PRs. It is very rarely looked at, and really only the main-branch coverage makes sense, because only there do we have a guarantee that all tests run.
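
Just to illustrate the idea - this is not the actual CI change, and the COVERAGE variable and helper name below are assumptions for the sketch - skipping coverage on PR builds can be as simple as only adding the pytest-cov flags when coverage is explicitly requested:

# Hypothetical sketch - not Airflow's real CI code. The COVERAGE variable
# and the helper name are assumptions for illustration only.
import os
import subprocess


def build_pytest_command(test_paths):
    cmd = ["pytest", *test_paths]
    # Collect coverage only when explicitly enabled (e.g. on main-branch builds);
    # PR builds skip it to avoid the extra memory held by the coverage plugin.
    if os.environ.get("COVERAGE", "false") == "true":
        cmd += ["--cov=airflow", "--cov-report=xml"]
    return cmd


if __name__ == "__main__":
    subprocess.run(build_pytest_command(["tests/"]), check=False)

On main-branch builds the variable would be set, so coverage data is still collected where it is actually looked at.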

J.

On Wed, Nov 10, 2021 at 6:37 PM Jarek Potiuk 
<[email protected]<mailto:[email protected]>> wrote:
It seems to happen much less frequently, but it is still there.

And the culprit is - most likely - `test_backfill_job.py`.

In this case, just before it failed, it took 77% of the memory (5.3 GB):

  ########### STATISTICS #################
  CONTAINER ID   NAME                                          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O        PIDS
  520e4674350d   airflow-core-mysql_airflow_run_37e48532d683   171.93%   5.24GiB / 6.789GiB    77.18%    33.5MB / 20.4MB   144MB / 152kB    152
  eedf0f4b9f1e   airflow-core-mysql_mysql_1                    0.09%     108.3MiB / 6.789GiB   1.56%     20.4MB / 33.5MB   22.7MB / 530MB   32

                total        used        free      shared  buff/cache   available
  Mem:           6951        6754         125           6          72          16
  Swap:             0           0           0

  Filesystem      Size  Used Avail Use% Mounted on
  /dev/root        84G   54G   30G  65% /
  /dev/sdb15      105M  5.2M  100M   5% /boot/efi
  /dev/sda1        14G  4.1G  9.0G  32% /mnt
  ########### STATISTICS #################
  ### The last 2 lines for Core process: /tmp/tmp.7v6Wm3jIOT/tests/Core/stdout ###
  tests/executors/test_sequential_executor.py .                            [ 49%]
  tests/jobs/test_backfill_job.py ..

That should likely be enough to investigate it, or at least to mitigate it somehow - for example by adding some extra cleanup between tests if the memory is not freed between them.
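
For completeness, a minimal sketch of what such an extra cleanup could look like (an illustration only, not a proposed patch - the fixture name is made up):

# conftest.py - hypothetical sketch of an extra cleanup between tests
import gc

import pytest


@pytest.fixture(autouse=True)
def free_memory_between_tests():
    # Run the test first, then force a garbage collection pass so that
    # objects left over by a heavy test are released before the next one.
    yield
    gc.collect()

Whether that actually helps depends on whether the memory is held by collectable Python objects or by something like the coverage plugin's internal state.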

J.



On Wed, Nov 10, 2021 at 6:11 PM Khalid Mammadov 
<[email protected]<mailto:[email protected]>> wrote:
Just to let you know.

It looks like the issue is still there:

https://github.com/apache/airflow/runs/4167464563?check_suite_focus=true




On 10/11/2021 13:40, Jarek Potiuk wrote:
Merged! Please rebase (Khalid - you can remove that workaround of yours) and let me know.

There is one failure that happened in my tests:

https://github.com/apache/airflow/runs/4165358689?check_suite_focus=true - but we can observe the results of this one and try to find the reason separately if it keeps repeating.

J.

On Wed, Nov 10, 2021 at 12:49 PM Jarek Potiuk 
<[email protected]<mailto:[email protected]>> wrote:
Fix being tested in: https://github.com/apache/airflow/pull/19512 (committer 
PR) and https://github.com/apache/airflow/pull/19514 (regular user PR).


On Wed, Nov 10, 2021 at 11:25 AM Jarek Potiuk 
<[email protected]<mailto:[email protected]>> wrote:
OK. I took a look. It looks like the "Core" tests do indeed briefly (and sometimes for a longer time) pass over 50% of the memory available on GitHub runners. I do not think optimizing them now makes much sense - even if we optimize them now, they will likely soon reach 50-60% of the available memory again, which - when other parallel tests are running - might easily lead to OOM.

It looks like only the "Core" type of tests is affected, so the solution will be (similarly to the "Integration" tests) to separate them out into a non-parallel run on GitHub runners.
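
To illustrate the idea only (the real implementation lives in the Breeze/CI scripts, and the test-group-to-path mapping below is an assumption): run the lighter test groups in parallel and give the memory-hungry group the runner to itself.

# Hypothetical sketch, not the actual Breeze scripts: run the memory-hungry
# group sequentially and the remaining groups in parallel.
import subprocess
from concurrent.futures import ThreadPoolExecutor

# The mapping of test groups to paths is an assumption for illustration.
PARALLEL_GROUPS = {"API": "tests/api", "CLI": "tests/cli", "WWW": "tests/www"}
SEQUENTIAL_GROUPS = {"Core": "tests/core"}


def run_group(path: str) -> int:
    # Each group is its own pytest process; the return code signals failure.
    return subprocess.run(["pytest", path]).returncode


def main() -> None:
    # Lighter groups share the runner in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_group, PARALLEL_GROUPS.values()))
    # The heavy group runs alone, so its memory peak is not stacked on others.
    for path in SEQUENTIAL_GROUPS.values():
        results.append(run_group(path))
    raise SystemExit(max(results, default=0))


if __name__ == "__main__":
    main()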

On Tue, Nov 9, 2021 at 9:33 PM Jarek Potiuk 
<[email protected]<mailto:[email protected]>> wrote:
Yep. Apparently one of the recent tests is using too much memory. I had some private errands that made me less available over the last few days - but I will have time to catch up tonight/tomorrow.

Thanks for changing the "parallel" level in your PR - that will give me more data points. I've just re-run both PRs with the "debug-ci-resources" label. This is our "debug" label that shows resource use during the build, and with it I might be able to find and fix the root cause.

For the future - in case any other committer wants to investigate something like this - setting the "debug-ci-resources" label turns on a debugging mode that prints this information periodically alongside the progress of the tests. It can be helpful in determining what caused the OOM:

CONTAINER ID   NAME                                            CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
c46832148ff7   airflow-always-mssql_airflow_run_e59b6039c3d8   99.59%    365.1MiB / 6.789GiB   5.25%     1.62MB / 3.33MB   8.97MB / 20.5kB   8
f4d2a192d6fc   airflow-always-mssql_mssqlsetup_1               0.00%     0B / 0B               0.00%     0B / 0B           0B / 0B           0
a668cdedc717   airflow-api-mssql_airflow_run_bcc466077ac0      35.07%    431.4MiB / 6.789GiB   6.21%     2.26MB / 4.47MB   73.2MB / 20.5kB   8
f306f4221ba1   airflow-api-mssql_mssqlsetup_1                  0.00%     0B / 0B               0.00%     0B / 0B           0B / 0B           0
7f10748e9496   airflow-api-mssql_mssql_1                       30.66%    735.5MiB / 6.789GiB   10.58%    4.47MB / 2.26MB   36.8MB / 124MB    132
8b5ca767ed0c   airflow-always-mssql_mssql_1                    12.59%    716.5MiB / 6.789GiB   10.31%    3.33MB / 1.63MB   36.7MB / 52.7MB   131

              total        used        free      shared  buff/cache   available
Mem:           6951        2939         200           6        3811        3702
Swap:             0           0           0

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        84G   51G   33G  61% /
/dev/sda15      105M  5.2M  100M   5% /boot/efi
/dev/sdb1        14G  4.1G  9.0G  32% /mnt
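
For anyone curious how a dump like the one above can be produced: it is essentially periodic `docker stats`, `free` and `df` output. A rough sketch only, not the actual CI script - the reporting interval is an assumption:

# Hypothetical sketch of periodic resource reporting - not the real CI script.
import subprocess
import time


def print_resource_stats() -> None:
    # docker stats (one snapshot), memory and disk usage, printed as-is.
    for cmd in (["docker", "stats", "--no-stream"], ["free", "-m"], ["df", "-h"]):
        print(subprocess.run(cmd, capture_output=True, text=True).stdout)


if __name__ == "__main__":
    while True:
        print("########### STATISTICS #################")
        print_resource_stats()
        time.sleep(10)  # reporting interval - an assumption for this sketch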

J.


On Tue, Nov 9, 2021 at 9:19 PM Oliveira, Niko 
<[email protected]><mailto:[email protected]> wrote:

Hey all,

Just to throw another data point into the ring, I've had a PR (https://github.com/apache/airflow/pull/19410) stuck in the same way as well. Several retries have all failed with the same OOM.



I've also dug through the GitHub Actions history and found a few other occurrences, so it doesn't seem to be just a one-off.



Cheers,
Niko

________________________________
From: Khalid Mammadov 
<[email protected]<mailto:[email protected]>>
Sent: Tuesday, November 9, 2021 6:24 AM
To: [email protected]<mailto:[email protected]>
Subject: [EXTERNAL] OOM issue in the CI


Hi Devs,

I have been working on the PR below and have run into an OOM issue during testing on GitHub Actions (you can see it in the commit history):

https://github.com/apache/airflow/pull/19139/files

The tests for the Postgres, MySQL, etc. databases fail due to OOM, and Docker gets killed.

I have temporarily reduced parallelism to 1 "in the code" (the only extra change in the PR), and with that it passes all the checks, which confirms the issue.



I was hoping you could advise on the best course of action in this situation - should I force parallelism to 1 to get all the checks to pass, or is there some other way to solve the OOM?

Any help would be appreciated.



Thanks in advance

Khalid
