potiuk commented on code in PR #28256:
URL: https://github.com/apache/airflow/pull/28256#discussion_r1112190405


##########
airflow/dag_processing/manager.py:
##########
@@ -782,7 +782,11 @@ def clear_nonexistent_import_errors(file_paths: list[str] 
| None, session=NEW_SE
         """
         query = session.query(errors.ImportError)
         if file_paths:
-            query = query.filter(~errors.ImportError.filename.in_(file_paths))
+            for file_path in file_paths:
+                if file_path.endswith(".zip"):
+                    query = 
query.filter(~(errors.ImportError.filename.startswith(file_path)))
+                else:
+                    query = query.filter(errors.ImportError.filename != 
file_path)

Review Comment:
   Aren't we risking very long queries here @tirkarthi ? Potentially there can 
be many of those files) . I wonder If there are limits in SQLalchemy / 
converting to SQL when there are thousands of AND conditions. Is it possible to 
test it on a fake DAG  folder with say 10.000 DAG files (mysql/postgres)?
   
   For example this one - for SQL Server: 
https://stackoverflow.com/questions/1869753/maximum-size-for-a-sql-server-query-in-clause-is-there-a-better-approach
 suggests that IN (a,b,c) is nothing comparing to chained conditions and that 
the latter might easily fail due to stack size. It also suggests that nowadays 
stack sizes are deep, but I have a hunch that running huge query like that 
might be terribly expensive:
   
   > Other than that, your query is limited by runtime conditions. It will 
usually run out of stack size because x IN (a,b,c) is nothing but x=a OR x=b OR 
x=c which creates an expression tree similar to x=a OR (x=b OR (x=c)), so it 
gets very deep with a large number of OR. SQL 7 would hit a SO [at about 10k 
values in the IN](http://support.microsoft.com/kb/288095), but nowdays stacks 
are much deeper (because of x64), so it can go pretty deep.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to