Bishesh-Shahi opened a new pull request, #66932:
URL: https://github.com/apache/airflow/pull/66932

   Closes #66853.
   
   ## Problem
   
   Under high concurrency (80+ simultaneous task completions emitting asset 
events), the API server dies with OOMKill. The root cause is a DB lock 
contention chain:
   
   1. `ti_update_state()` issues `SELECT task_instance ... FOR UPDATE`, 
holding a PostgreSQL row lock.
   2. While holding that lock, `register_asset_changes_in_db()` runs multiple 
slow queries, including `asset_alias_model.asset_events.append(asset_event)`. 
This ORM `.append()` lazy-loads the **entire** `asset_events` collection for the 
alias.
   3. Each slow query leaves the connection idle in transaction while Python 
processes results. New workers needing SELECT task_instance FOR UPDATE on the 
same row queue up, each holding a FastAPI threadpool thread.
   4. With 80+ concurrent completions, thread count grows unbounded until 
OOMKill.
   
   ## Fix
   
   Two changes:
   
   **1. `AssetManager.register_asset_change()` (`assets/manager.py`)**: Replace 
`asset_alias_model.asset_events.append(asset_event)` + 
`session.add(asset_alias_model)` with a direct `INSERT INTO 
asset_alias_asset_event (alias_id, event_id)`. This eliminates the lazy-load of 
the existing events collection (which can be thousands of rows) while the 
`task_instance` row lock is held.
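
   The effect of change 1 can be sketched at the SQL level. This is a minimal 
illustration, not the code in this PR: it uses the standard-library `sqlite3` 
module in place of PostgreSQL and SQLAlchemy, and the helper name 
`link_event_to_alias` is hypothetical. Only the table and column names come from 
the description above. It records every statement sent to the database to show 
that linking an event to an alias needs no read of the existing rows.

```python
import sqlite3


def link_event_to_alias(conn, alias_id, event_id):
    """Hypothetical helper: add one association row without reading existing ones."""
    conn.execute(
        "INSERT INTO asset_alias_asset_event (alias_id, event_id) VALUES (?, ?)",
        (alias_id, event_id),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE asset_alias_asset_event ("
    "alias_id INTEGER NOT NULL, "
    "event_id INTEGER NOT NULL, "
    "PRIMARY KEY (alias_id, event_id))"
)
# Simulate an alias that already has thousands of events attached.
conn.executemany(
    "INSERT INTO asset_alias_asset_event (alias_id, event_id) VALUES (1, ?)",
    [(i,) for i in range(5000)],
)
conn.commit()

# Record every statement executed while the new event is linked.
statements = []
conn.set_trace_callback(statements.append)
link_event_to_alias(conn, alias_id=1, event_id=5000)
conn.commit()
conn.set_trace_callback(None)

select_count = sum(
    1 for s in statements if s.lstrip().upper().startswith("SELECT")
)
row_count = conn.execute(
    "SELECT COUNT(*) FROM asset_alias_asset_event"
).fetchone()[0]
print(select_count, row_count)
```

The cost of the insert does not depend on how many events the alias already has, 
which is the property that matters while another transaction is waiting on the 
row lock.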
   
   **2. `ti_update_state()` (`execution_api/routes/task_instances.py`)**: 
Add `session.commit()` after the TI state UPDATE and Log writes to release the 
`task_instance` row lock before running asset registration. Asset 
registration then runs in a fresh implicit transaction. Registration failures 
are logged and swallowed -- the task state is already durable at that point.
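
   The ordering in change 2 (commit the state first, then treat asset 
registration as best-effort) can be sketched as follows. This is an 
illustration under stated assumptions, not the route handler from this PR: it 
uses `sqlite3`, which has no row-level locks, so it shows only the 
commit-then-swallow control flow. The function names 
`update_state_then_register` and `register_asset_changes` are hypothetical, and 
the registration step is made to fail on purpose.

```python
import logging
import sqlite3

log = logging.getLogger("ti_update_sketch")


def register_asset_changes(conn):
    """Hypothetical stand-in for asset registration; always fails in this sketch."""
    raise RuntimeError("simulated asset registration failure")


def update_state_then_register(conn, ti_id, state):
    # Step 1: write the new state and commit immediately. In PostgreSQL this
    # commit is what would release the row lock taken by SELECT ... FOR UPDATE.
    conn.execute("UPDATE task_instance SET state = ? WHERE id = ?", (state, ti_id))
    conn.commit()

    # Step 2: asset registration runs in a new transaction. A failure here is
    # logged and swallowed because the state change is already durable.
    try:
        register_asset_changes(conn)
        conn.commit()
    except Exception:
        conn.rollback()
        log.exception("Asset registration failed after TI state was committed")
    return 204


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO task_instance (id, state) VALUES (1, 'running')")
conn.commit()

status = update_state_then_register(conn, ti_id=1, state="success")
state = conn.execute("SELECT state FROM task_instance WHERE id = 1").fetchone()[0]
print(status, state)
```

The trade-off is that the state update and the asset events are no longer 
atomic: a registration failure leaves a successful task with no asset event, 
which is why the failure is logged rather than silently dropped.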
   
   ## Testing
   
   - New: `test_register_asset_change_with_alias_no_lazy_load` -- confirms 
that registration issues no SELECT against `asset_alias_asset_event` when 
pre-existing rows exist
   - New: 
`test_ti_update_state_to_success_asset_registration_failure_returns_204` -- 
confirms 204 + TI SUCCESS when asset registration raises after commit

