Sorry for the late reply - Ping. TL;DR; I think the metrics might be useful but I think using triggers is asking for troubles.
While using triggers sounds like a common approach in a number of installations, we do not use triggers so far. Using Triggers moves some logic to the database, and in our case we do not have it at all - all logic is in Airflow, and we keep it there, the database for us is merely "state" storage and "locks". Adding database triggers, extends it to also keep some logic there. And adding triggers has some worrying "implicitness" which goes against the "Explicit is better than Implicit" Zen of Python. One thing that makes me think "coldly" about this is that it might have some undesired side effects - such as synchronizing of changes from multiple schedulers on trying to insert such audit entry (you need to create an index lock when you insert rows to a table which has a primary key/unique indexes). And what's even more worrying is that we are using SQLAlchemy and MySQl/MsSQL/Postgres and we should make sure it works the same in all of them. This is troublesome. Even if we could solve and verify all those problems individually the effect is - Once we open the "gate" of triggers, we will get more "ok we have trigger here so let's also use it for that and this" and this will be hard to say "no" if we already have a precedent, and this might lead to more and more logica and features deferred to a database logic (and my past experience is that it leads to more complexity and implicit behaviours that are difficult to reason about). But this is only about the technical details of this, not the metrics itself. I think the metric you proposed is very useful. I think however (correct me if I am wrong) - that we do not need database triggers for any of those. I have a feeling that this proposal is trying to implement the (useful) metrics with very limited modification to the Airflow code, so I can understand that you might think about it this way when you have your own fork - then it makes sense to piggyback on the existing database and use triggers, because you do not want to modify Airflow code. But here - we are in a completely different situation. We CAN modify Airflow code and add missing features and functionality to capture the necessary metric data in the code, rather than using triggers. We could even define some kind of callbacks for the auditing events that would allow us to gather those metrics in a way that does not even use the database to store the information for the metrics. In fact - this leads me to conclusion that we should implement the metrics you mention as part of our Open-Telemetry effort https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow. This is precisely what it was prepared for, once we have Open-Telemetry integrated we could add more and more such useful metrics more easily, and that could be way more useful, because instead of running external custom-db-reading process for that, we could not only calculate such metrics using the right metrics tooling (each company could use their preferred open-telemetry compliant tool), but that would open up all the features like alerting, connecting it with traces and other metrics etc. etc. Howard - WDYT? J. On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka <[email protected]> wrote: > > HI Ping, > > Apologies for the belated response. > > We have created a set of stress test DAGs where the tasks take almost no time > to execute at all, so that the worker task execution time is small, and the > stress is on the Scheduler and Executor. > > We then calculate "task latency" aka "task lag" as: > ti_lag = ti.start_date - max_upstream_ti_end_date > This is effectively the time between "the downstream task starting" and "the > last dependent upstream task complete" > > We don't use the tasks that don't have any upstream tasks in this metric for > measuring task lag. > And for tasks that have multiple upstream tasks, we use the upstream task for > which the end_date took maximum time as the scheduler waits for completion of > all parent tasks before scheduling any downstream task. > > Vikram > > > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <[email protected]> wrote: >> >> Hi Mehta, >> >> Good point. The primary goal of the metric is for stress testing to catch >> airflow scheduler performance regression for 1) our internal scheduler >> improvement work and 2) airflow version upgrade. >> >> One of the key benefits of this metric definition is it is independent from >> the scheduler implementation and it can be computed/backfilled offline. >> >> Currently, we expose it to the datadog and we (the airflow cluster >> maintainers) are the main users for it. >> >> Thanks, >> >> Ping >> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham <[email protected]> >> wrote: >>> >>> Ping, >>> >>> >>> >>> I’m very interested in this as well. A good metric can help us benchmark >>> and identify potential improvements in the scheduler performance. >>> In order to understand the proposal better, can you please share where and >>> how do you intend to use “Scheduling delay”? Is it meant for benchmarking >>> or stress testing only? Do you plan to expose it to the users in the >>> Airflow UI? >>> >>> >>> >>> Thanks >>> Shubham >>> >>> >>> >>> >>> >>> From: Ping Zhang <[email protected]> >>> Reply-To: "[email protected]" <[email protected]> >>> Date: Wednesday, June 8, 2022 at 11:58 AM >>> To: "[email protected]" <[email protected]>, >>> "[email protected]" <[email protected]> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric Definition >>> >>> >>> >>> CAUTION: This email originated from outside of the organization. Do not >>> click links or open attachments unless you can confirm the sender and know >>> the content is safe. >>> >>> >>> >>> Hi Vikram, >>> >>> >>> >>> Thanks for pointing that out, 'task latency', >>> >>> >>> >>> "we define task latency as the time it takes for a task to begin executing >>> once its dependencies have been met." >>> >>> >>> >>> It will be great if you can elaborate more about "begin executing" and how >>> you calculate "its dependencies have been met.". >>> >>> >>> >>> If the 'begin executing' means the state of ti becomes running, then the >>> 'Scheduling Delay' metric focuses on the overhead introduced by the >>> scheduler. >>> >>> >>> >>> In our prod and stress test, we use the `task_instance_audit` table ( a new >>> row is created whenever there is state change in task_instance table) to >>> compute the time of a ti should be scheduled. >>> >>> >>> >>> Thanks, >>> >>> >>> >>> Ping >>> >>> >>> >>> >>> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka <[email protected]> >>> wrote: >>> >>> Ping, >>> >>> >>> >>> I am quite interested in this topic and trying to understand the difference >>> between the "scheduling delay" metric articulated as compared to the "task >>> latency" aka "task lag" metric which we have been using before. >>> >>> >>> >>> As you may recall, we have been using two specific metrics to benchmark >>> Scheduler performance, specifically "task latency" and "task throughput" >>> since Airflow 2.0. >>> >>> These were described in the 2.0 Scheduler blog post >>> Specifically, within that we defined task tatency as the time it takes for >>> the task to begin executing once it's dependencies are all met. >>> >>> >>> >>> Thanks, >>> >>> Vikram >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <[email protected]> wrote: >>> >>> Hi Airflow Community, >>> >>> >>> >>> Airflow is a scheduling platform for data pipelines, however there is no >>> good metric to measure the scheduling delay in the production and also the >>> stress test environment. This makes it hard to catch regressions in the >>> scheduler during the stress test stage. >>> >>> >>> I would like to propose an airflow scheduling delay metric definition. Here >>> is the detailed design of the metric and its implementation: >>> >>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing >>> >>> Please take a look and any feedback is welcome. >>> >>> >>> >>> Thanks, >>> >>> >>> >>> Ping >>> >>>
