#32769: Fix for sporadically crashing parallel test runner
-----------------------------------+--------------------------------------
     Reporter:  quinox             |                    Owner:  nobody
         Type:  Bug                |                   Status:  new
    Component:  Testing framework  |                  Version:  2.2
     Severity:  Normal             |               Resolution:
     Keywords:                     |             Triage Stage:  Unreviewed
    Has patch:  0                  |      Needs documentation:  0
  Needs tests:  0                  |  Patch needs improvement:  0
Easy pickings:  0                  |                    UI/UX:  0
-----------------------------------+--------------------------------------
Description changed by quinox:

Old description:

> G'day,
>
> I'd like to report a fix for a bug I found with the parallel test runner.
> I'm filing it against 2.2 since that's what I use myself, but looking at
> the sourcecode 3.2 also has this bug.
>
> Tickets that sound related but also not entirely the same: #27734
>
> I have about 25 active Django projects, all in CI with testcases, some
> going back 10 years. About 6 months ago I started using `--parallel` to
> speed up testing. The speedup is great, obviously I had to fix some tests
> that were not compatible but that has been sorted out by now.
>
> Since using `--parallel` I've also encountered crashes that really can't
> be blamed on my own code. They happen quite infrequent, so up to now I've
> been ignoring them. Today I ran into it again and decided to do some
> digging.
>
> == The crash
>
> They always look like this:
>
> {{{
> #!python
> Traceback (most recent call last):
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/backends/base/base.py", line 218, in
> ensure_connection
>     self.connect()
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/backends/base/base.py", line 196, in connect
>     self.connection = self.get_new_connection(conn_params)
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/backends/postgresql/base.py", line 182, in
> get_new_connection
>     connection = Database.connect(**conn_params)
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/psycopg2/__init__.py", line 127, in connect
>     conn = _connect(dsn, connection_factory=connection_factory,
> **kwasync)
> psycopg2.OperationalError: FATAL:  database "test_3" does not exist
>

> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/unittest/suite.py", line 163, in
> _handleClassSetUp
>     setUpClass()
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/test/testcases.py", line 1126, in setUpClass
>     cls.cls_atomics = cls._enter_atomics()
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/test/testcases.py", line 1107, in _enter_atomics
>     atomics[db_name].__enter__()
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/transaction.py", line 175, in __enter__
>     if not connection.get_autocommit():
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/backends/base/base.py", line 380, in
> get_autocommit
>     self.ensure_connection()
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/backends/base/base.py", line 218, in
> ensure_connection
>     self.connect()
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/utils.py", line 89, in __exit__
>     raise dj_exc_value.with_traceback(traceback) from exc_value
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/backends/base/base.py", line 218, in
> ensure_connection
>     self.connect()
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/backends/base/base.py", line 196, in connect
>     self.connection = self.get_new_connection(conn_params)
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/django/db/backends/postgresql/base.py", line 182, in
> get_new_connection
>     connection = Database.connect(**conn_params)
>   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
> /site-packages/psycopg2/__init__.py", line 127, in connect
>     conn = _connect(dsn, connection_factory=connection_factory,
> **kwasync)
> django.db.utils.OperationalError: FATAL:  database "test_3" does not
> exist
> }}}
>
> == My observations
>
> It has been happening across projects. Today I zoomed in on 1 specific
> project. The facts summarized:
>
> * My CPU has 4 cores (this is important for the story)
> * When I only test 2 separate classes at once I can trigger it once every
> 30 times
> * The crashes always mention the DBs `test_3` or `test_4`, never `test_1`
> * The precise command line: `python test my_app_with_two_testcase_classes
> --parallel`
> * I haven't set `DJANGO_TEST_PROCESSES`
>
> I hacked the Django code with additional `print()` statements. Things I
> noticed:
> * Django starts out with assuming it can do 4 parallel processess, which
> is correct for my CPU
> * When checking the Postgresql DB log 3 DBs are made:
> {{{
> $ grep 'CREATE DATABASE' postgres.log
> 2021-05-19 14:29:11.473 CEST [23952] LOG:  duration: 216.274 ms
> statement: CREATE DATABASE test ENCODING 'UNICODE';
> 2021-05-19 14:29:12.918 CEST [24024] LOG:  duration: 226.477 ms
> statement: CREATE DATABASE "test_1" WITH TEMPLATE "test"
> 2021-05-19 14:29:13.147 CEST [24025] LOG:  duration: 225.881 ms
> statement: CREATE DATABASE "test_2" WITH TEMPLATE "test"
> }}}
> * `db.backends.base.creation`
> `BaseDatabaseCreation.get_test_db_clone_settings` is called four times,
> with suffix `1`, `2`, `3` and `4`
> * `test.runner` `_init_worker` is called four times, with `_worker_id`
> `1`, `2`, `3` and `4`.
> * `db.backends.postgresql.creation` `DatabaseCreation._clone_test_db` is
> called twice, always with `test_1` and `test_2` as DBs
> * `db.backends.postgresql.creation`
> `DatabaseWrapper.get_connection_params` is called twice:
>   * most of time time with `test_1` and `test_2`. In these cases the
> testcases do not crash
>   * once in a while with fe. `test_1` and `test_3`. In this case Django
> will crash
>
> == My fix
>
> Based on the prints I'm assuming Django creates 4 processes, then only
> initializes 2 because it only needs 2, and as long as the first 2
> processes get used everything works out but if, by randomness, either the
> 3rd or 4th process is activated things break?
>
> Because it happens so infrequently I can't 100% guarantee my fix works,
> but it appears to: after applying my fix `_init_worker` is called only
> twice, `get_test_db_clone_settings` is called only twice, and I haven't
> seen it crash after 200 consecutive runs.
>
> {{{
> --- old_runner.py       2021-05-19 15:01:32.756268287 +0200
> +++ new_runner.py       2021-05-19 15:01:18.844150140 +0200
> @@ -541,6 +541,7 @@
>              # basis, there's no need for more processes than TestCases.
>              parallel_units = len(parallel_suite.subsuites)
>              self.parallel = min(self.parallel, parallel_units)
> +            parallel_suite.processes = self.parallel
>
>              # If there's only one TestCase, parallelization isn't
> needed.
>              if self.parallel > 1:
> }}}
>
> I can make a PR in Github if you want (against the main branch then?)

New description:

 G'day,

 I'd like to report a fix for a bug I found with the parallel test runner.
 I'm filing it against 2.2 since that's what I use myself, but looking at
 the sourcecode 3.2 also has this bug.

 Tickets that sound related but also not entirely the same: #27734

 I have about 25 active Django projects, all in CI with testcases, some
 going back 10 years. About 6 months ago I started using `--parallel` to
 speed up testing. The speedup is great, obviously I had to fix some tests
 that were not compatible but that has been sorted out by now.

 Since using `--parallel` I've also encountered crashes that really can't
 be blamed on my own code. They happen quite infrequent, so up to now I've
 been ignoring them. Today I ran into it again and decided to do some
 digging.

 == The crash

 They always look like this:

 {{{
 #!python
 Traceback (most recent call last):
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/backends/base/base.py", line 218, in
 ensure_connection
     self.connect()
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/backends/base/base.py", line 196, in connect
     self.connection = self.get_new_connection(conn_params)
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/backends/postgresql/base.py", line 182, in
 get_new_connection
     connection = Database.connect(**conn_params)
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/psycopg2/__init__.py", line 127, in connect
     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
 psycopg2.OperationalError: FATAL:  database "test_3" does not exist


 The above exception was the direct cause of the following exception:

 Traceback (most recent call last):
   File "/usr/lib/python3.6/unittest/suite.py", line 163, in
 _handleClassSetUp
     setUpClass()
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/test/testcases.py", line 1126, in setUpClass
     cls.cls_atomics = cls._enter_atomics()
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/test/testcases.py", line 1107, in _enter_atomics
     atomics[db_name].__enter__()
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/transaction.py", line 175, in __enter__
     if not connection.get_autocommit():
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/backends/base/base.py", line 380, in
 get_autocommit
     self.ensure_connection()
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/backends/base/base.py", line 218, in
 ensure_connection
     self.connect()
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/utils.py", line 89, in __exit__
     raise dj_exc_value.with_traceback(traceback) from exc_value
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/backends/base/base.py", line 218, in
 ensure_connection
     self.connect()
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/backends/base/base.py", line 196, in connect
     self.connection = self.get_new_connection(conn_params)
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/django/db/backends/postgresql/base.py", line 182, in
 get_new_connection
     connection = Database.connect(**conn_params)
   File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6
 /site-packages/psycopg2/__init__.py", line 127, in connect
     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
 django.db.utils.OperationalError: FATAL:  database "test_3" does not exist
 }}}

 == My observations

 It has been happening across projects. Today I zoomed in on 1 specific
 project. The facts summarized:

 * My CPU has 4 cores (this is important for the story)
 * When I only test 2 separate classes at once I can trigger it once every
 30 times
 * The crashes always mention the DBs `test_3` or `test_4`, never `test_1`
 * The command line used: `python manage.py test
 my_app_with_two_testcase_classes --parallel`
 * I haven't set `DJANGO_TEST_PROCESSES`

 I hacked the Django code with additional `print()` statements. Things I
 noticed:
 * Django starts out with assuming it can do 4 parallel processess, which
 is correct for my CPU
 * When checking the Postgresql DB log 3 DBs are made:
 {{{
 $ grep 'CREATE DATABASE' postgres.log
 2021-05-19 14:29:11.473 CEST [23952] LOG:  duration: 216.274 ms
 statement: CREATE DATABASE test ENCODING 'UNICODE';
 2021-05-19 14:29:12.918 CEST [24024] LOG:  duration: 226.477 ms
 statement: CREATE DATABASE "test_1" WITH TEMPLATE "test"
 2021-05-19 14:29:13.147 CEST [24025] LOG:  duration: 225.881 ms
 statement: CREATE DATABASE "test_2" WITH TEMPLATE "test"
 }}}
 * `db.backends.base.creation`
 `BaseDatabaseCreation.get_test_db_clone_settings` is called four times,
 with suffix `1`, `2`, `3` and `4`
 * `test.runner` `_init_worker` is called four times, with `_worker_id`
 `1`, `2`, `3` and `4`.
 * `db.backends.postgresql.creation` `DatabaseCreation._clone_test_db` is
 called twice, always with `test_1` and `test_2` as DBs
 * `db.backends.postgresql.creation`
 `DatabaseWrapper.get_connection_params` is called twice:
   * most of time time with `test_1` and `test_2`. In these cases the
 testcases do not crash
   * once in a while with fe. `test_1` and `test_3`. In this case Django
 will crash

 == My fix

 Based on the prints I'm assuming Django creates 4 processes, then only
 initializes 2 because it only needs 2, and as long as the first 2
 processes get used everything works out but if, by randomness, either the
 3rd or 4th process is activated things break?

 Because it happens so infrequently I can't 100% guarantee my fix works,
 but it appears to: after applying my fix `_init_worker` is called only
 twice, `get_test_db_clone_settings` is called only twice, and I haven't
 seen it crash after 200 consecutive runs.

 {{{
 --- old_runner.py       2021-05-19 15:01:32.756268287 +0200
 +++ new_runner.py       2021-05-19 15:01:18.844150140 +0200
 @@ -541,6 +541,7 @@
              # basis, there's no need for more processes than TestCases.
              parallel_units = len(parallel_suite.subsuites)
              self.parallel = min(self.parallel, parallel_units)
 +            parallel_suite.processes = self.parallel

              # If there's only one TestCase, parallelization isn't needed.
              if self.parallel > 1:
 }}}

 I can make a PR in Github if you want (against the main branch then?)

--

-- 
Ticket URL: <https://code.djangoproject.com/ticket/32769#comment:1>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-updates+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/064.ba97d1a16777ed19496fdeaa73f06246%40djangoproject.com.

Reply via email to