KevinYang21 commented on issue #5908: Revert "[AIRFLOW-4797] Improve performance and behaviour of zombie de…
URL: https://github.com/apache/airflow/pull/5908#issuecomment-528699559

Thank you guys for reviewing!

@milton0825 We benchmarked the two approaches during the initial PR 3873 with 4k DAG files and 30k. With the aggregated query, DB CPU usage stayed under 50%, while with the subprocess query the DB was killed instantly. In our production cluster at that time, running ~20k tasks concurrently with 2k DAG files, DB CPU went from 80% to ~40%. In our current production DB, with >23M rows in the task_instance table and >4M rows in the job table, the query takes 0.5 seconds on average (we have a powerful DB, but the PR being reverted also showed an average runtime of 0.5 seconds for that query). So it shouldn't slow down the DAG processor manager too much.

@ashb pg_stat won't get flushed until the DB is restarted, so we don't really see the difference in frequency, but that is pretty important in the evaluation here. Even with the provided data, the summed query time for just 25 DAG files would already exceed that of the joined query, not to mention the overhead of starting/stopping the transaction.

In general I believe it is better to use the aggregated query, and thus leverage the query optimizer, instead of trying to do the querying ourselves. Especially with a large-scale cluster that has a huge number of DAG files to parse, it would be a show stopper if we distributed the query to the subprocesses.
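
To make the comparison concrete, here is a minimal sketch of the two query shapes being discussed: one aggregated query issued once by the manager, versus one query per DAG file issued from each parsing subprocess. It is not Airflow's actual code. The simplified `job` / `task_instance` schema, the integer heartbeat, and the function names `zombies_aggregated` and `zombies_per_file` are assumptions made for illustration, and it runs against an in-memory SQLite database rather than a production DB.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory DB with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT, latest_heartbeat INTEGER);
        CREATE TABLE task_instance (
            task_id TEXT, dag_id TEXT, fileloc TEXT, state TEXT, job_id INTEGER);
        """
    )
    jobs = [(1, "running", 100), (2, "running", 10), (3, "failed", 100)]
    tis = [
        ("t1", "dag_a", "a.py", "running", 1),  # healthy: job running, fresh heartbeat
        ("t2", "dag_a", "a.py", "running", 2),  # zombie: stale heartbeat
        ("t3", "dag_b", "b.py", "running", 3),  # zombie: job no longer running
        ("t4", "dag_b", "b.py", "success", 3),  # ignored: task is not running
    ]
    conn.executemany("INSERT INTO job VALUES (?, ?, ?)", jobs)
    conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?, ?)", tis)
    return conn


# Assumed zombie condition: the task claims to be running, but its job is
# either not running or has not sent a heartbeat since `limit`.
ZOMBIE_SQL = """
    SELECT ti.fileloc, ti.dag_id, ti.task_id
    FROM task_instance ti JOIN job j ON ti.job_id = j.id
    WHERE ti.state = 'running'
      AND (j.state != 'running' OR j.latest_heartbeat < ?)
"""


def zombies_aggregated(conn, limit):
    """One joined query for all files; results are grouped in Python."""
    rows = conn.execute(ZOMBIE_SQL + " ORDER BY ti.task_id", (limit,)).fetchall()
    by_file = {}
    for fileloc, dag_id, task_id in rows:
        by_file.setdefault(fileloc, []).append((dag_id, task_id))
    return by_file, 1  # second value: number of queries issued


def zombies_per_file(conn, limit, filelocs):
    """One query per DAG file, as each parsing subprocess would issue."""
    by_file, queries = {}, 0
    for fileloc in filelocs:
        sql = ZOMBIE_SQL + " AND ti.fileloc = ? ORDER BY ti.task_id"
        rows = conn.execute(sql, (limit, fileloc)).fetchall()
        queries += 1
        if rows:
            by_file[fileloc] = [(dag_id, task_id) for _, dag_id, task_id in rows]
    return by_file, queries


if __name__ == "__main__":
    conn = build_db()
    print(zombies_aggregated(conn, 50))
    print(zombies_per_file(conn, 50, ["a.py", "b.py", "c.py"]))
```

Both functions return the same zombies, but the per-file variant issues one query (and, in a real deployment, one transaction) per DAG file, so its query count grows linearly with the number of files, while the aggregated variant stays at one query per detection pass.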
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services