KevinYang21 commented on issue #5908: Revert "[AIRFLOW-4797] Improve 
performance and behaviour of zombie de…
URL: https://github.com/apache/airflow/pull/5908#issuecomment-528699559
 
 
   Thank you guys for reviewing!
   
   @milton0825 We benchmarked the two approaches during the initial PR 3873
with 4k DAG files and 30k. With the aggregated query, DB CPU usage stayed under
50%, while with the per-subprocess queries the DB was killed almost instantly.
In our production cluster at that time, running ~20k tasks concurrently with 2k
DAG files, DB CPU dropped from 80% to ~40%. In our current production DB, with
>23M rows in the task_instance table and >4M rows in the job table, the query
takes 0.5 seconds on average (we have a powerful DB, but the PR being reverted
also showed an average runtime of 0.5 seconds for that query). So it shouldn't
slow down the DAG processor manager much.
   
   @ashb pg_stat won't get reset until the DB is restarted, so we can't really
see the difference in how often each query runs, yet query frequency is pretty
important in this evaluation. Even with the data provided, the combined time of
the per-file queries for just 25 DAG files would already exceed that of the
joined query, not to mention the overhead of starting/stopping each transaction.
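   The break-even argument above can be written as a small calculation. This is a sketch under stated assumptions. One joined query costs `t_joined` seconds. Each per-file query costs `t_per_file` seconds, plus `t_txn_overhead` seconds for starting and stopping its transaction. The function name and parameters are hypothetical. The timings in the usage lines are illustrative, not measurements from this thread.

```python
import math


def break_even_file_count(t_joined: float, t_per_file: float,
                          t_txn_overhead: float = 0.0) -> int:
    """Return the smallest number of DAG files N whose per-file queries cost
    more in total than one joined query, i.e. the smallest N with
    N * (t_per_file + t_txn_overhead) > t_joined.

    All parameters are hypothetical timings in seconds.
    """
    per_file_cost = t_per_file + t_txn_overhead
    if per_file_cost <= 0:
        raise ValueError("per-file cost must be positive")
    # N * per_file_cost > t_joined  <=>  N > t_joined / per_file_cost
    return math.floor(t_joined / per_file_cost) + 1


if __name__ == "__main__":
    # Illustrative timings only, chosen to be exact in binary floating point.
    print(break_even_file_count(0.5, 0.0625))          # 9
    print(break_even_file_count(0.5, 0.0625, 0.0625))  # 5
```

Any nonzero transaction overhead lowers the break-even point. This is the extra cost of starting and stopping a transaction for every file.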
   
   In general I believe it is better to use the aggregated query, and thus
leverage the query optimizer, instead of issuing the queries ourselves.
Especially for a large-scale cluster with a huge number of DAG files to parse,
distributing the query to the subprocesses would be a show-stopper.
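   The following minimal sketch contrasts one aggregated query with one query per DAG. It uses Python's built-in sqlite3 module. It is loosely modeled on zombie detection: find running task instances whose job has stopped running or stopped heartbeating. Several pieces are hypothetical simplifications, not Airflow's actual models or SQL:

- the two-table schema and its column names
- the heartbeat threshold
- the `find_zombies_*` helpers

```python
import sqlite3

# Hypothetical, heavily simplified schema for illustration only; Airflow's real
# task_instance and job tables have many more columns.
SCHEMA = """
CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT, latest_heartbeat REAL);
CREATE TABLE task_instance (
    dag_id TEXT, task_id TEXT, state TEXT, job_id INTEGER REFERENCES job(id)
);
"""

# Aggregated: one query covering every DAG, so the planner sees the whole join.
AGGREGATED_SQL = """
SELECT ti.dag_id, ti.task_id
FROM task_instance AS ti
JOIN job ON job.id = ti.job_id
WHERE ti.state = 'running'
  AND (job.state != 'running' OR job.latest_heartbeat < ?)
ORDER BY ti.dag_id, ti.task_id
"""

# Per-DAG: the same filter, restricted to a single DAG per query.
PER_DAG_SQL = """
SELECT ti.dag_id, ti.task_id
FROM task_instance AS ti
JOIN job ON job.id = ti.job_id
WHERE ti.dag_id = ?
  AND ti.state = 'running'
  AND (job.state != 'running' OR job.latest_heartbeat < ?)
ORDER BY ti.task_id
"""


def find_zombies_aggregated(conn, heartbeat_limit):
    """Issue a single query for all DAGs."""
    return conn.execute(AGGREGATED_SQL, (heartbeat_limit,)).fetchall()


def find_zombies_per_dag(conn, dag_ids, heartbeat_limit):
    """Issue one query per DAG. Here they share one connection; in the setup
    the comment argues against, each would run in its own subprocess."""
    zombies = []
    for dag_id in sorted(dag_ids):
        rows = conn.execute(PER_DAG_SQL, (dag_id, heartbeat_limit)).fetchall()
        zombies.extend(rows)
    return zombies


def build_demo_db():
    """Create an in-memory DB with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO job VALUES (?, ?, ?)", [
        (1, "running", 100.0),  # fresh heartbeat
        (2, "running", 10.0),   # stale heartbeat
        (3, "failed", 100.0),   # job no longer running
    ])
    conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?)", [
        ("dag_a", "t1", "running", 1),
        ("dag_a", "t2", "running", 2),
        ("dag_b", "t1", "running", 3),
        ("dag_b", "t2", "success", 2),
    ])
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    limit = 50.0  # heartbeats older than this count as stale
    print(find_zombies_aggregated(conn, limit))
    print(find_zombies_per_dag(conn, ["dag_a", "dag_b"], limit))
```

Both helpers return the same rows. The aggregated version issues one query no matter how many DAGs exist. The per-DAG version issues one query per DAG, so its cost grows with the number of DAG files.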

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
