dheerajturaga commented on PR #64326:
URL: https://github.com/apache/airflow/pull/64326#issuecomment-4158705889

   > Nice. Seems like a real production bug. A few thoughts:
   >
   > 1. Default of 512 may be too low. The scheduler processes all active DAGs
   > every cycle. With 1000+ DAGs, a 512-entry cache means constant eviction and
   > re-fetching from the DB on every loop. The API server's Execution API also
   > serves worker requests for every task state transition, so it can accumulate
   > entries fast too. Consider starting higher (2048+) and letting people tune
   > down: it's easier to reduce a known number than to discover you need to
   > increase one you didn't know existed.
   > 2. A single config for both scheduler and API server may not be ideal. The
   > scheduler's working set is bounded (latest version per active DAG) and
   > performance-sensitive, so it needs a cache big enough to hold all active
   > DAGs. There are no metrics for the cache, which will also cause problems
   > when debugging.
   
   Done! The scheduler is no longer bound by the cache. Only the API server
now has a configurable cache size. I also added metrics to track the cache.
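
For anyone following along, the eviction concern in point 1 can be illustrated with a minimal sketch. It assumes a plain LRU policy and a scheduler-style access pattern that touches every active DAG once per loop. `LRUCache` and `simulate` are hypothetical names used only for illustration; this is not the cache implementation in the PR.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that counts hits and misses (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.data:
            # Mark the entry as most recently used.
            self.data.move_to_end(key)
            self.hits += 1
            return self.data[key]
        # Miss: stand-in for re-fetching the serialized DAG from the DB.
        self.misses += 1
        value = load(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            # Evict the least recently used entry.
            self.data.popitem(last=False)
        return value


def simulate(capacity, num_dags, cycles):
    """Touch every DAG once per cycle and return (hits, misses)."""
    cache = LRUCache(capacity)
    for _ in range(cycles):
        for dag_id in range(num_dags):
            cache.get(dag_id, load=lambda k: f"serialized-dag-{k}")
    return cache.hits, cache.misses
```

Under these assumptions, `simulate(512, 1000, 10)` returns `(0, 10000)`: with 1000 DAGs cycling through 512 slots, each entry is evicted before it is requested again, so every lookup is a miss. `simulate(2048, 1000, 10)` returns `(9000, 1000)`, where only the first pass misses.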


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
