#32769: Fix for sporadically crashing parallel test runner -----------------------------------+-------------------------------------- Reporter: quinox | Owner: nobody Type: Bug | Status: new Component: Testing framework | Version: 2.2 Severity: Normal | Resolution: Keywords: | Triage Stage: Unreviewed Has patch: 0 | Needs documentation: 0 Needs tests: 0 | Patch needs improvement: 0 Easy pickings: 0 | UI/UX: 0 -----------------------------------+-------------------------------------- Description changed by quinox:
Old description: > G'day, > > I'd like to report a fix for a bug I found with the parallel test runner. > I'm filing it against 2.2 since that's what I use myself, but looking at > the sourcecode 3.2 also has this bug. > > Tickets that sound related but also not entirely the same: #27734 > > I have about 25 active Django projects, all in CI with testcases, some > going back 10 years. About 6 months ago I started using `--parallel` to > speed up testing. The speedup is great, obviously I had to fix some tests > that were not compatible but that has been sorted out by now. > > Since using `--parallel` I've also encountered crashes that really can't > be blamed on my own code. They happen quite infrequent, so up to now I've > been ignoring them. Today I ran into it again and decided to do some > digging. > > == The crash > > They always look like this: > > {{{ > #!python > Traceback (most recent call last): > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/backends/base/base.py", line 218, in > ensure_connection > self.connect() > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/backends/base/base.py", line 196, in connect > self.connection = self.get_new_connection(conn_params) > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/backends/postgresql/base.py", line 182, in > get_new_connection > connection = Database.connect(**conn_params) > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/psycopg2/__init__.py", line 127, in connect > conn = _connect(dsn, connection_factory=connection_factory, > **kwasync) > psycopg2.OperationalError: FATAL: database "test_3" does not exist > > The above exception was the direct cause of the following exception: > > Traceback (most recent call last): > File "/usr/lib/python3.6/unittest/suite.py", line 163, in > _handleClassSetUp > setUpClass() > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/test/testcases.py", line 1126, in setUpClass > cls.cls_atomics = cls._enter_atomics() > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/test/testcases.py", line 1107, in _enter_atomics > atomics[db_name].__enter__() > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/transaction.py", line 175, in __enter__ > if not connection.get_autocommit(): > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/backends/base/base.py", line 380, in > get_autocommit > self.ensure_connection() > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/backends/base/base.py", line 218, in > ensure_connection > self.connect() > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/utils.py", line 89, in __exit__ > raise dj_exc_value.with_traceback(traceback) from exc_value > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/backends/base/base.py", line 218, in > ensure_connection > self.connect() > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/backends/base/base.py", line 196, in connect > self.connection = self.get_new_connection(conn_params) > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/django/db/backends/postgresql/base.py", line 182, in > get_new_connection > connection = Database.connect(**conn_params) > File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 > /site-packages/psycopg2/__init__.py", line 127, in connect > conn = _connect(dsn, connection_factory=connection_factory, > **kwasync) > django.db.utils.OperationalError: FATAL: database "test_3" does not > exist > }}} > > == My observations > > It has been happening across projects. Today I zoomed in on 1 specific > project. The facts summarized: > > * My CPU has 4 cores (this is important for the story) > * When I only test 2 separate classes at once I can trigger it once every > 30 times > * The crashes always mention the DBs `test_3` or `test_4`, never `test_1` > * The precise command line: `python test my_app_with_two_testcase_classes > --parallel` > * I haven't set `DJANGO_TEST_PROCESSES` > > I hacked the Django code with additional `print()` statements. Things I > noticed: > * Django starts out with assuming it can do 4 parallel processess, which > is correct for my CPU > * When checking the Postgresql DB log 3 DBs are made: > {{{ > $ grep 'CREATE DATABASE' postgres.log > 2021-05-19 14:29:11.473 CEST [23952] LOG: duration: 216.274 ms > statement: CREATE DATABASE test ENCODING 'UNICODE'; > 2021-05-19 14:29:12.918 CEST [24024] LOG: duration: 226.477 ms > statement: CREATE DATABASE "test_1" WITH TEMPLATE "test" > 2021-05-19 14:29:13.147 CEST [24025] LOG: duration: 225.881 ms > statement: CREATE DATABASE "test_2" WITH TEMPLATE "test" > }}} > * `db.backends.base.creation` > `BaseDatabaseCreation.get_test_db_clone_settings` is called four times, > with suffix `1`, `2`, `3` and `4` > * `test.runner` `_init_worker` is called four times, with `_worker_id` > `1`, `2`, `3` and `4`. > * `db.backends.postgresql.creation` `DatabaseCreation._clone_test_db` is > called twice, always with `test_1` and `test_2` as DBs > * `db.backends.postgresql.creation` > `DatabaseWrapper.get_connection_params` is called twice: > * most of time time with `test_1` and `test_2`. In these cases the > testcases do not crash > * once in a while with fe. `test_1` and `test_3`. In this case Django > will crash > > == My fix > > Based on the prints I'm assuming Django creates 4 processes, then only > initializes 2 because it only needs 2, and as long as the first 2 > processes get used everything works out but if, by randomness, either the > 3rd or 4th process is activated things break? > > Because it happens so infrequently I can't 100% guarantee my fix works, > but it appears to: after applying my fix `_init_worker` is called only > twice, `get_test_db_clone_settings` is called only twice, and I haven't > seen it crash after 200 consecutive runs. > > {{{ > --- old_runner.py 2021-05-19 15:01:32.756268287 +0200 > +++ new_runner.py 2021-05-19 15:01:18.844150140 +0200 > @@ -541,6 +541,7 @@ > # basis, there's no need for more processes than TestCases. > parallel_units = len(parallel_suite.subsuites) > self.parallel = min(self.parallel, parallel_units) > + parallel_suite.processes = self.parallel > > # If there's only one TestCase, parallelization isn't > needed. > if self.parallel > 1: > }}} > > I can make a PR in Github if you want (against the main branch then?) New description: G'day, I'd like to report a fix for a bug I found with the parallel test runner. I'm filing it against 2.2 since that's what I use myself, but looking at the sourcecode 3.2 also has this bug. Tickets that sound related but also not entirely the same: #27734 I have about 25 active Django projects, all in CI with testcases, some going back 10 years. About 6 months ago I started using `--parallel` to speed up testing. The speedup is great, obviously I had to fix some tests that were not compatible but that has been sorted out by now. Since using `--parallel` I've also encountered crashes that really can't be blamed on my own code. They happen quite infrequent, so up to now I've been ignoring them. Today I ran into it again and decided to do some digging. == The crash They always look like this: {{{ #!python Traceback (most recent call last): File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/backends/base/base.py", line 196, in connect self.connection = self.get_new_connection(conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection connection = Database.connect(**conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/psycopg2/__init__.py", line 127, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) psycopg2.OperationalError: FATAL: database "test_3" does not exist The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/lib/python3.6/unittest/suite.py", line 163, in _handleClassSetUp setUpClass() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/test/testcases.py", line 1126, in setUpClass cls.cls_atomics = cls._enter_atomics() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/test/testcases.py", line 1107, in _enter_atomics atomics[db_name].__enter__() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/transaction.py", line 175, in __enter__ if not connection.get_autocommit(): File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/backends/base/base.py", line 380, in get_autocommit self.ensure_connection() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/utils.py", line 89, in __exit__ raise dj_exc_value.with_traceback(traceback) from exc_value File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/backends/base/base.py", line 218, in ensure_connection self.connect() File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/backends/base/base.py", line 196, in connect self.connection = self.get_new_connection(conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/django/db/backends/postgresql/base.py", line 182, in get_new_connection connection = Database.connect(**conn_params) File "/mnt/drive/projects/ytec-my_awesome_project/virtual/lib/python3.6 /site-packages/psycopg2/__init__.py", line 127, in connect conn = _connect(dsn, connection_factory=connection_factory, **kwasync) django.db.utils.OperationalError: FATAL: database "test_3" does not exist }}} == My observations It has been happening across projects. Today I zoomed in on 1 specific project. The facts summarized: * My CPU has 4 cores (this is important for the story) * When I only test 2 separate classes at once I can trigger it once every 30 times * The crashes always mention the DBs `test_3` or `test_4`, never `test_1` * The command line used: `python manage.py test my_app_with_two_testcase_classes --parallel` * I haven't set `DJANGO_TEST_PROCESSES` I hacked the Django code with additional `print()` statements. Things I noticed: * Django starts out with assuming it can do 4 parallel processess, which is correct for my CPU * When checking the Postgresql DB log 3 DBs are made: {{{ $ grep 'CREATE DATABASE' postgres.log 2021-05-19 14:29:11.473 CEST [23952] LOG: duration: 216.274 ms statement: CREATE DATABASE test ENCODING 'UNICODE'; 2021-05-19 14:29:12.918 CEST [24024] LOG: duration: 226.477 ms statement: CREATE DATABASE "test_1" WITH TEMPLATE "test" 2021-05-19 14:29:13.147 CEST [24025] LOG: duration: 225.881 ms statement: CREATE DATABASE "test_2" WITH TEMPLATE "test" }}} * `db.backends.base.creation` `BaseDatabaseCreation.get_test_db_clone_settings` is called four times, with suffix `1`, `2`, `3` and `4` * `test.runner` `_init_worker` is called four times, with `_worker_id` `1`, `2`, `3` and `4`. * `db.backends.postgresql.creation` `DatabaseCreation._clone_test_db` is called twice, always with `test_1` and `test_2` as DBs * `db.backends.postgresql.creation` `DatabaseWrapper.get_connection_params` is called twice: * most of time time with `test_1` and `test_2`. In these cases the testcases do not crash * once in a while with fe. `test_1` and `test_3`. In this case Django will crash == My fix Based on the prints I'm assuming Django creates 4 processes, then only initializes 2 because it only needs 2, and as long as the first 2 processes get used everything works out but if, by randomness, either the 3rd or 4th process is activated things break? Because it happens so infrequently I can't 100% guarantee my fix works, but it appears to: after applying my fix `_init_worker` is called only twice, `get_test_db_clone_settings` is called only twice, and I haven't seen it crash after 200 consecutive runs. {{{ --- old_runner.py 2021-05-19 15:01:32.756268287 +0200 +++ new_runner.py 2021-05-19 15:01:18.844150140 +0200 @@ -541,6 +541,7 @@ # basis, there's no need for more processes than TestCases. parallel_units = len(parallel_suite.subsuites) self.parallel = min(self.parallel, parallel_units) + parallel_suite.processes = self.parallel # If there's only one TestCase, parallelization isn't needed. if self.parallel > 1: }}} I can make a PR in Github if you want (against the main branch then?) -- -- Ticket URL: <https://code.djangoproject.com/ticket/32769#comment:1> Django <https://code.djangoproject.com/> The Web framework for perfectionists with deadlines. -- You received this message because you are subscribed to the Google Groups "Django updates" group. To unsubscribe from this group and stop receiving emails from it, send an email to django-updates+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/django-updates/064.ba97d1a16777ed19496fdeaa73f06246%40djangoproject.com.