dheerajturaga commented on PR #64326: URL: https://github.com/apache/airflow/pull/64326#issuecomment-4158705889
> Nice. Seems like a real production bug. A few thoughts:
>
> 1. Default of 512 may be too low. The scheduler processes all active DAGs every cycle. With 1000+ DAGs, a 512 cache means constant eviction and re-fetching from the DB on every loop. The API server's Execution API also serves worker requests for every task state transition, so it can accumulate entries fast too. Consider starting higher (2048+) and letting people tune down: it's easier to reduce a known number than to discover you need to increase one you didn't know existed.
> 2. A single config for both scheduler and API server may not be ideal. The scheduler's working set is bounded (latest version per active DAG) and performance-sensitive; it needs a cache big enough to hold all active DAGs. There are no metrics for the cache, which will also cause problems in debugging.

Done! The scheduler is no longer bound by the cache. Only the API server has a configurable cache size. I also added metrics to track the cache.
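For readers following the sizing argument, here is a minimal sketch of why an LRU cache smaller than the working set thrashes, and how hit/miss/eviction counters make that visible. It is not the code from this PR: `CountingLRUCache`, `simulate`, and the loader are hypothetical names, and the access pattern (one pass over every DAG per loop) is an assumption based on the review comment above.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class CountingLRUCache:
    """Hypothetical bounded LRU cache that counts hits, misses, and evictions."""

    def __init__(self, maxsize: int, loader: Callable[[Hashable], Any]) -> None:
        self.maxsize = maxsize
        self._loader = loader  # stands in for a DB fetch on a cache miss
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key: Hashable) -> Any:
        if key in self._data:
            # Mark the entry as most recently used.
            self._data.move_to_end(key)
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = self._loader(key)
        self._data[key] = value
        if len(self._data) > self.maxsize:
            # Drop the least recently used entry.
            self._data.popitem(last=False)
            self.evictions += 1
        return value


def simulate(maxsize: int, n_dags: int = 1000, loops: int = 3) -> CountingLRUCache:
    """Assumed access pattern: every loop touches every DAG once, in order."""
    cache = CountingLRUCache(maxsize, loader=lambda dag_id: f"serialized-{dag_id}")
    for _ in range(loops):
        for dag_id in range(n_dags):
            cache.get(dag_id)
    return cache


if __name__ == "__main__":
    for size in (512, 2048):
        c = simulate(size)
        print(size, "hits:", c.hits, "misses:", c.misses, "evictions:", c.evictions)
```

Under this assumed pattern, a capacity of 512 against 1000 keys produces no hits at all, because each entry is evicted before the next loop reaches it, while a capacity of 2048 misses only on the first pass. Counters like these are the kind of signal that would let an operator notice an undersized cache.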
