Arunodoy18 opened a new pull request, #61680:
URL: https://github.com/apache/airflow/pull/61680

   closes: #61453
   
   ## What this PR does
   
   This PR improves scheduler query performance when processing large asset collections by preventing extremely large SQL `IN` clauses from being generated.
   
   Previously, when the number of assets was high, the scheduler could generate queries with very long `IN` clauses. In some environments this could lead to:
   - Slow query planning
   - Database lock contention
   - Queries appearing stuck or taking extremely long to complete
   
   ## How this is fixed
   
   This PR introduces safe chunked processing for asset lookups to ensure:
   - SQL queries remain bounded in size
   - Scheduler performance remains stable even with large asset counts
   - Database planning and execution time remains predictable
   
   The change keeps filtering on the database side (rather than filtering in memory in Python) to maintain optimal performance characteristics.
   
   ## Implementation Details
   
   Changes include:
   - Chunking large asset lists before building SQL queries
   - Maintaining existing behavior for small datasets
   - Adding debug logging for easier observability
   - Adding unit tests to validate chunking behavior
   - Adding a newsfragment for the performance improvement
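
The chunking idea listed above can be illustrated with a minimal, standalone sketch. It is not the code from this PR: it uses the standard-library `sqlite3` module rather than Airflow's models, and the `asset` table, the `chunked` and `fetch_assets` helpers, and the `CHUNK_SIZE` value are all hypothetical names chosen for illustration.

```python
import sqlite3
from collections.abc import Iterator, Sequence

# Hypothetical chunk size; the PR description does not state the value it uses.
CHUNK_SIZE = 500


def chunked(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_assets(
    conn: sqlite3.Connection,
    asset_ids: Sequence[int],
    chunk_size: int = CHUNK_SIZE,
) -> list[tuple[int, str]]:
    """Look up assets by id, issuing one bounded IN query per chunk."""
    rows: list[tuple[int, str]] = []
    for chunk in chunked(asset_ids, chunk_size):
        # One placeholder per id, so each IN clause has at most chunk_size entries.
        placeholders = ", ".join("?" for _ in chunk)
        query = f"SELECT id, name FROM asset WHERE id IN ({placeholders})"
        rows.extend(conn.execute(query, list(chunk)).fetchall())
    return rows


def demo() -> list[tuple[int, str]]:
    """Build a throwaway in-memory table and fetch every second asset."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE asset (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO asset VALUES (?, ?)",
        [(i, f"asset_{i}") for i in range(2000)],
    )
    wanted = list(range(0, 2000, 2))
    return fetch_assets(conn, wanted, chunk_size=300)
```

Filtering still happens in the database; only the size of each statement is capped. One caveat of this sketch: if the input list contains the same id in two different chunks, the row is returned twice, so callers would want to deduplicate ids before chunking.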
   
   ## Why not in-memory filtering
   
   Filtering in Python was considered but rejected because:
   - Database engines are optimized for filtering operations
   - Pulling larger datasets into memory would increase load
   - The root problem is query size, not filtering capability
   
   ## Testing
   
   Tests added to cover:
   - Large asset list chunk handling
   - Query correctness across chunks
   - No regression for small asset lists
   - Scheduler functional behavior remains unchanged
   
   All existing tests pass locally; CI is expected to validate the change across supported DB backends.
   
   ## Performance Impact
   
   Expected improvements:
   - Reduced scheduler query planning latency
   - Avoidance of extremely long SQL queries
   - Better stability in large asset deployments
   
   No functional behavior changes are intended.
   
   

