This is a really great improvement! Great job by everybody; we are really excited about this contribution. These changes make it easier for Airflow to support much more complex and larger-scale use cases in the future. Looking forward to more improvements like this one!

*Huge thanks to friends from Polidea!*

Evgeny Shulman
databand.ai | CTO

On Mon, Feb 24, 2020 at 6:44 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Those are all great improvements Kamil! It would be great to have them
> reviewed, tested and merged for 2.0!
>
> J.
>
>
> On Mon, Feb 24, 2020 at 5:35 PM Kamil Breguła <kamil.breg...@polidea.com>
> wrote:
>
> > Hello,
> >
> > Polidea [1] together with Databand [2] has taken steps to optimize
> > scheduler performance.
> > I made many changes last weekend:
> > 1. [AIRFLOW-6856] Bulk fetch paused_dag_ids
> > https://github.com/apache/airflow/pull/7476
> > 2. [AIRFLOW-6857] Bulk sync DAGs
> > https://github.com/apache/airflow/pull/7477
> > 3. [AIRFLOW-6862] Do not check the freshness of fresh DAG
> > https://github.com/apache/airflow/pull/7481
> > 4. [AIRFLOW-6869] Bulk fetch DAGRuns for _process_task_instances
> > https://github.com/apache/airflow/pull/7489
> > 5. [AIRFLOW-6881] Bulk fetch DAGRun for create_dag_run
> > https://github.com/apache/airflow/pull/7502
> > 6. [AIRFLOW-6887] Do not check the state of fresh DAGRun
> > https://github.com/apache/airflow/pull/7510
> > These changes have not yet been merged to allow review by wider
> > audiences. Any feedback is very helpful. The result of the performance
> > benchmark is available in the description of each change.
> >
> > When it comes to the overall changes, it looks as follows.
> >
> > Before:
> > Average time: 8080.246 ms
> > Queries count: 2692
> > After:
> > Average time: 628.801 ms
> > Queries count: 5
> > Diff:
> > Average time: -7452 ms (-92%)
> > Queries count: -2687 (-99%)
> >
> > My changes focused only on DagFileProcessor, but this generates the
> > most database queries and takes a significant amount of scheduler's
> > time.
> >
> > Tomek Urbaszek's change has also been merged in the past to improve
> > performance.
> > 7. [AIRFLOW-6590] Use batch db operations in jobs
> > https://github.com/apache/airflow/pull/7370
> >
> > This is not the last improvement of performance. We still keep working
> > and other changes will appear in the future.
> >
> > Many thanks to friends from Databand [https://databand.ai/] for support.
> >
> > Best regards,
> > Kamil Breguła
> >
> > [1] https://www.polidea.com/services/
> > [2] https://databand.ai/about/
> >
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129
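
For readers unfamiliar with the "bulk fetch" pattern named in several of the changes above (for example AIRFLOW-6856), here is a minimal sketch of the general idea: replace one query per DAG with a single query for the whole batch. This is not Airflow's actual code. The `dag` table, its columns, and the function names are all hypothetical, and the standard-library `sqlite3` module stands in for the scheduler's real database layer.

```python
import sqlite3


def make_db(n_dags):
    """Create an in-memory database with a hypothetical `dag` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY, is_paused INTEGER)")
    # Every odd-numbered DAG is marked as paused.
    conn.executemany(
        "INSERT INTO dag VALUES (?, ?)",
        [(f"dag_{i}", i % 2) for i in range(n_dags)],
    )
    return conn


def paused_one_by_one(conn, dag_ids):
    """Per-DAG lookup: issues one query for every dag_id."""
    paused, queries = set(), 0
    for dag_id in dag_ids:
        row = conn.execute(
            "SELECT is_paused FROM dag WHERE dag_id = ?", (dag_id,)
        ).fetchone()
        queries += 1
        if row and row[0]:
            paused.add(dag_id)
    return paused, queries


def paused_bulk(conn, dag_ids):
    """Bulk lookup: issues a single query for the whole batch."""
    dag_ids = list(dag_ids)
    if not dag_ids:
        return set(), 0
    placeholders = ",".join("?" for _ in dag_ids)
    rows = conn.execute(
        f"SELECT dag_id FROM dag WHERE is_paused = 1 AND dag_id IN ({placeholders})",
        dag_ids,
    ).fetchall()
    return {row[0] for row in rows}, 1


if __name__ == "__main__":
    conn = make_db(100)
    ids = [f"dag_{i}" for i in range(100)]
    slow, slow_queries = paused_one_by_one(conn, ids)
    fast, fast_queries = paused_bulk(conn, ids)
    print(slow == fast, slow_queries, fast_queries)
```

Both functions return the same set of paused DAG ids; only the number of round trips to the database differs, which is the kind of saving the query counts in the benchmark reflect. A real implementation would also need to respect the database's limit on bound parameters for very large batches.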