Arunodoy18 opened a new pull request, #61680: URL: https://github.com/apache/airflow/pull/61680
closes: #61453

## What this PR does

This PR improves scheduler query performance when processing large asset collections by preventing extremely large SQL `IN` clauses from being generated.

Previously, when the number of assets was high, the scheduler could generate very long `IN` queries. In some environments this could lead to:

- Slow query planning
- Database lock contention
- Queries appearing stuck or taking extremely long to complete

## How this is fixed

This PR introduces safe chunked processing for asset lookups to ensure:

- SQL queries remain bounded in size
- Scheduler performance remains stable even with large asset counts
- Database planning and execution time remains predictable

The change keeps database-side filtering (instead of Python in-memory filtering) to maintain optimal performance characteristics.

## Implementation Details

Changes include:

- Chunking large asset lists before building SQL queries
- Maintaining existing behavior for small datasets
- Adding debug logging for easier observability
- Adding unit tests to validate chunking behavior
- Adding newsfragment for performance improvement

## Why not in-memory filtering

Filtering in Python was considered but rejected because:

- Database engines are optimized for filtering operations
- Pulling larger datasets into memory would increase load
- The root problem is query size, not filtering capability

## Testing

Tests added to cover:

- Large asset list chunk handling
- Query correctness across chunks
- No regression for small asset lists
- Scheduler functional behavior remains unchanged

All existing tests pass locally and CI is expected to validate across supported DB backends.

## Performance Impact

Expected improvements:

- Reduced scheduler query planning latency
- Avoidance of extremely long SQL queries
- Better stability in large asset deployments

No functional behavior changes intended.
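For reviewers who want a concrete picture of the chunking idea, here is a minimal sketch. It is not the code in this PR: it uses the standard-library `sqlite3` module rather than Airflow's SQLAlchemy session, and the `asset` table layout, the `chunked` and `fetch_assets_by_id` helpers, and the chunk size are all hypothetical names chosen for illustration.

```python
import sqlite3
from collections.abc import Iterable, Iterator


def chunked(items: list[int], size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_assets_by_id(
    conn: sqlite3.Connection,
    asset_ids: Iterable[int],
    chunk_size: int = 500,  # hypothetical default, not taken from the PR
) -> list[tuple[int, str]]:
    """Look up assets with several bounded IN queries instead of one huge one."""
    # Deduplicate while preserving order so an id cannot be returned twice
    # just because it landed in two different chunks.
    unique_ids = list(dict.fromkeys(asset_ids))
    rows: list[tuple[int, str]] = []
    for chunk in chunked(unique_ids, chunk_size):
        # One bound parameter per id; the clause never exceeds chunk_size items.
        placeholders = ", ".join("?" for _ in chunk)
        query = f"SELECT id, uri FROM asset WHERE id IN ({placeholders})"
        rows.extend(conn.execute(query, chunk).fetchall())
    return rows


# Illustrative data set: a hypothetical `asset` table with 2000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asset (id INTEGER PRIMARY KEY, uri TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO asset (id, uri) VALUES (?, ?)",
    [(i, f"s3://bucket/{i}") for i in range(1, 2001)],
)

wanted = list(range(1, 2001, 2))  # every odd id
rows = fetch_assets_by_id(conn, wanted, chunk_size=300)
```

Filtering still happens in the database; only the number of bound parameters per statement is capped. A list shorter than `chunk_size` produces a single query, which matches the "existing behavior for small datasets" point above.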
