potiuk commented on code in PR #30705:
URL: https://github.com/apache/airflow/pull/30705#discussion_r1170563693


##########
dev/breeze/src/airflow_breeze/utils/selective_checks.py:
##########
@@ -606,7 +606,43 @@ def parallel_test_types(self) -> str:
                     )
                     test_types_to_remove.add(test_type)
             current_test_types = current_test_types - test_types_to_remove
-        return " ".join(sorted(current_test_types))
+        for test_type in tuple(current_test_types):
+            if test_type == "Providers":
+                current_test_types.remove(test_type)
+                current_test_types.update(
+                    ("Providers[amazon]", "Providers[google]", "Providers[-amazon,google]")
+                )
+            elif test_type.startswith("Providers[") and ("amazon" in test_type or "google" in test_type):
+                current_test_types.remove(test_type)
+                if "amazon" in test_type:
+                    current_test_types.add("Providers[amazon]")
+                if "google" in test_type:
+                    current_test_types.add("Providers[google]")
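
For readers skimming the hunk, here is a standalone, runnable sketch of the splitting logic above. The function name `split_providers_test_types` is invented for illustration (the real code lives inside `parallel_test_types` in `selective_checks.py`), and note that the `elif` condition needs the `or` clause parenthesised to bind correctly:

```python
def split_providers_test_types(test_types: set[str]) -> set[str]:
    # Illustrative helper, not the Breeze API.
    result = set(test_types)
    for test_type in tuple(result):
        if test_type == "Providers":
            # The generic bucket becomes three chunks: the two heaviest
            # provider suites plus "everything else".
            result.remove(test_type)
            result.update(
                ("Providers[amazon]", "Providers[google]", "Providers[-amazon,google]")
            )
        elif test_type.startswith("Providers[") and (
            "amazon" in test_type or "google" in test_type
        ):
            result.remove(test_type)
            if "amazon" in test_type:
                result.add("Providers[amazon]")
            if "google" in test_type:
                result.add("Providers[google]")
    return result

print(" ".join(sorted(split_providers_test_types({"Providers", "Core"}))))
# → Core Providers[-amazon,google] Providers[amazon] Providers[google]
```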

Review Comment:
   We cannot run tests in parallel, because far too many of our tests rely on a shared database (for example, connections are not mocked, DagRuns are created, etc.). Simply speaking, a huge percentage of our tests are not pure unit tests with everything mocked; they rely on a shared database being there (they prepare data, use it, and sometimes delete it and sometimes not). We even run all our tests WITH a specific database. New tests that keep using the shared database are added/modified in every PR.
   
   If we run them in parallel, the tests will start overriding each other's data in the database.
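   
   A toy illustration of that collision, using an invented `dag_run` table in an in-memory SQLite database (not Airflow's real schema):
   
```python
import sqlite3

# Two "tests" share one database, as many Airflow tests effectively do.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dag_run (dag_id TEXT)")

def test_a_setup():
    db.execute("DELETE FROM dag_run")
    db.execute("INSERT INTO dag_run VALUES ('dag_a')")

def test_b_setup():
    db.execute("DELETE FROM dag_run")
    db.execute("INSERT INTO dag_run VALUES ('dag_b')")

def dag_ids():
    return [row[0] for row in db.execute("SELECT dag_id FROM dag_run")]

# Run sequentially, each test sees exactly the rows it prepared:
test_a_setup()
assert dag_ids() == ["dag_a"]

# Interleaved, as parallel workers would do it, test A's data is wiped
# before test A ever gets to assert on it:
test_a_setup()
test_b_setup()  # a concurrent test overrides the shared state
assert dag_ids() == ["dag_b"]
```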
   
   So we can do either
   
   1) review our ~12,500 tests and separate out the "real unit tests" from the "DB tests", and add mechanisms to keep the separation - then we would be able to parallelise the "real unit tests". Possibly even rewrite the tests to be "real unit tests" and mock the DB access
   
   2) or do what we are doing - i.e. split the tests into more-or-less equal chunks (in terms of execution time) and run them sequentially, each test type with its own database (this is what we do now).
   
   Option 1) seems to require enormous effort - but if you (or anyone) would like to take on the task, it is a good idea. I would love to have it, but it does not seem feasible (though I would love to be proven wrong).
   
   Option 2) means making a deliberate effort to split the tests and balance-optimize them from time to time (a few hours of effort each time), with a custom parallel-running framework (this is what we have now).
   
   Option 3) ... I do not see a 3rd option.
   
   But maybe there is one? Curious to hear your thoughts :)
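   
   For what it is worth, the balancing in option 2) can be sketched as a greedy longest-processing-time partition: assign each test type, heaviest first, to the currently lightest chunk. All timings below are invented, not real Airflow numbers:
   
```python
def balance_chunks(durations: dict[str, float], n_chunks: int) -> list[list[str]]:
    # Greedy LPT partition: heaviest test type goes to the lightest chunk.
    chunks: list[list[str]] = [[] for _ in range(n_chunks)]
    totals = [0.0] * n_chunks
    for name, minutes in sorted(durations.items(), key=lambda kv: -kv[1]):
        lightest = totals.index(min(totals))
        chunks[lightest].append(name)
        totals[lightest] += minutes
    return chunks

# Invented per-test-type timings, in minutes.
timings = {
    "Core": 30, "Providers[amazon]": 25, "Providers[google]": 28,
    "Providers[-amazon,google]": 40, "WWW": 10, "CLI": 8,
}
for chunk in balance_chunks(timings, 3):
    print(chunk, sum(timings[t] for t in chunk))
```
   
   Re-balancing then amounts to refreshing the timing table every so often, which matches the occasional few-hours effort described for option 2).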
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
