*TL;DR:* We are gearing up @ Polidea to work on Apache Airflow performance, and I wanted to start a discussion that might lead to creating a new AIP and implementing it :). Here is a high-level summary of the discussions we have had so far; it might be a starting point to get the details worked out and end up with a polished AIP.
*Motivation*

Airflow has a number of areas that require performance testing. Currently, when releasing a new version of Airflow, we are not able to reason about potential performance impacts or discuss performance degradation problems, because we have only anecdotal evidence of Airflow's performance characteristics. There is this fantastic post <https://www.astronomer.io/blog/profiling-the-airflow-scheduler/> from Ash about profiling the scheduler, but that covers only one part of Airflow, and it would be great to turn our performance work into a regular, managed effort.

*Areas*

We are thinking about two streams of performance work:
- Instrumentation of the code
- Synthetic, CI-runnable E2E tests

These can be implemented independently, but instrumentation might be of great help when the synthetic tests are run.

*Instrumentation*

Instrumentation is targeted towards Airflow users and DAG developers. It is mostly about gathering more performance characteristics: numbers related to database queries and performance, latency of DAG scheduling, parsing time, etc. All of this can be exported using the current statsd interface and visualised with one of the standard metric tools (Grafana, Prometheus, etc.), and some of it can be surfaced in the existing UI, where you can see it in the context of actual DAGs (mostly the latency numbers). Most of it should be back-portable to the 1.10.* series, so it can be used to track performance numbers with real in-production DAGs (see the P.S. for a rough sketch of emitting metrics through the existing statsd hook).

Part of the effort should be instructions on how to set up monitoring and dashboards for Airflow, documentation of what each metric means, and making sure the data is actionable - it should be clear to developers how to turn their observations into actions. An important part of this will also be providing a way to export such performance information easily and share it with the community or with service providers, so that they can reason in a more educated way about the performance problems their users experience.

*Synthetic CI-runnable E2E tests*

Synthetic tests are targeted towards Airflow committers/contributors. These are CI-runnable tests that produce generic performance numbers for the core of Airflow. We can prepare synthetic data and run Airflow with NullExecutors, empty tasks, various executors, and different deployments, to focus only on the performance of core Airflow itself (a sketch of a synthetic DAG file is in the P.S. as well). We can also run those performance tests against already released Airflow versions to compare the numbers; this has the added benefit that we can visualise trends, compare versions, and have automated performance testing of new releases. Some of the instrumentation changes described above (especially the parts that are easily back-portable to the 1.10.* series) might also help in gathering these performance characteristics.

*What's my ask?*

I'd love to start a community discussion on this. Please feel free to add your comments and let's start working on it soon. I would love to gather interested people and organise a SIG-performance group soon (@Kevin - I think you mentioned you would be interested, but anyone else is welcome to join as well).

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
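
P.S. To make the two streams a bit more concrete, here are two rough sketches. Both are Python and illustrative only - the metric names and knobs are my assumptions, not a proposed design.

For the instrumentation stream, this is a minimal sketch of emitting custom performance numbers through the statsd facade Airflow already ships. In the 1.10.* series the facade is airflow.settings.Stats, and statsd is enabled via [scheduler] statsd_on = True in airflow.cfg:

    # Sketch: emit custom performance metrics through Airflow's existing
    # statsd facade. Assumes statsd_on = True in airflow.cfg and a statsd
    # server (e.g. statsd-exporter feeding Prometheus/Grafana) listening
    # on the configured statsd_host/statsd_port.
    from airflow.settings import Stats  # facade location may move in 2.0

    # Hypothetical metric names - the real set would be agreed in the AIP.
    Stats.incr("scheduler.dag_file.parses")            # count parse passes
    Stats.timing("scheduler.dag_file.parse_time", 42)  # duration in ms
    Stats.gauge("scheduler.tasks.queued", 17)          # current queue depth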

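For the synthetic E2E stream, a benchmark scenario could be as simple as a generated DAG file dropped into the dags folder of the deployment under test. DAG_COUNT and TASKS_PER_DAG below are made-up knobs for illustration:

    # Sketch: a synthetic DAG file for scheduler/executor benchmarks.
    # Generates DAG_COUNT DAGs of TASKS_PER_DAG chained no-op tasks, so
    # parsing, scheduling and queueing can be measured without any real
    # workload behind the tasks.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    DAG_COUNT = 10       # illustrative values - each benchmark scenario
    TASKS_PER_DAG = 50   # would pin its own DAG shape

    for i in range(DAG_COUNT):
        dag = DAG(
            dag_id="perf_synthetic_{}".format(i),
            start_date=datetime(2019, 1, 1),
            schedule_interval="@daily",
        )
        prev = None
        for j in range(TASKS_PER_DAG):
            task = DummyOperator(task_id="task_{}".format(j), dag=dag)
            if prev is not None:
                prev >> task
            prev = task
        # Airflow picks up DAG objects from module-level globals
        globals()[dag.dag_id] = dag

Running the same file against several released versions (with a NullExecutor or the LocalExecutor) would give us the trend lines mentioned above.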