Correction: <the more workers and schedulers you have, the more "centralized overhead" you have>, of course.
On Tue, Jul 12, 2022 at 8:01 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I think if we limit it to stress tests, this could be an "extra" addition - not even necessarily part of the Airflow codebase: adding triggers with a script, on a single database, as some kind of test harness that you always apply after you have installed Airflow in the test environment. For that I have far fewer reservations about using triggers.

> But if we want to measure the delays in production, that's quite a different story (and a different purpose):

> * The stress tests are synthetic, and basically what you get out of them is "are we worse/better in this version than in the previous one?", "by how much?", "which synthetic scenarios are affected most?". Those will be run with a few synthetic kinds of traffic/load/shape.

> * Production is different - you really want to see whether you have problems with particular DAGs, times of the day, week, load etc., and you should be able to take corrective actions (for example increase the number of schedulers or queues, split your DAGs, etc.). So even though the "scheduling delay" metric might sound similar, you might want to use completely different dimensions to look at it (what about this DAG? this time of day? this group of DAGs? this type of workload?).

> I think those two might even be separated and calculated differently (though having a single approach would likely be better). I am not entirely sure, but I have a feeling we do not need the scheduler to calculate "dependency met" while scheduling. For production purposes, I think it would be much better (less overhead) to simply emit "raw" metrics such as the start/end time of each task, plus possibly a simple publication of the - mostly static - task dependency rules; the "dependency met" time can then be calculated offline based on the joined data. That would be roughly equivalent to what you have in the trigger, but without the overhead of triggers - instead of storing the events in the metadata DB we would emit them (for example using OTEL) and let an external system aggregate and process them offline, independently.

> The OTEL integration is rather lightweight - most implementations use in-memory buffers and push the data efficiently (and can even implement scalable forwarding and pre-aggregation of the data). The nice thing about it is that it scales much more easily. I think that (apart from my earlier reservations) the database-trigger approach has the not-nice property that the less workers and schedulers you have, the more "centralized overhead" you have, whereas the distributed OTEL solution scales together with the system, adding a more or less fixed overhead per component (provided that the remote telemetry service is also scalable). This makes the trigger approach far less suitable IMHO, as we are getting dangerously close to Heisen-Monitoring, where the more we observe the system, the more we impact its performance.

> J.
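[For illustration - a minimal sketch of what emitting such raw task events through the OpenTelemetry Python SDK could look like. The meter name, metric name and attributes below are assumptions made up for this example, not an existing Airflow API; a real integration would go through AIP-49.]

# Hypothetical sketch: emit raw task-instance state-change events via OpenTelemetry,
# so "dependency met" time and scheduling delay can be joined and computed offline.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# In a real deployment the exporter would point at an OTLP collector rather than
# the console; the reader batches points in memory and pushes them periodically.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("airflow.stress_test")  # illustrative name

# One counter of raw events; the attributes carry the keys needed for the offline join.
ti_events = meter.create_counter(
    "airflow.task_instance.state_change",
    description="Raw task-instance state transitions, aggregated offline",
)


def emit_ti_event(dag_id: str, task_id: str, run_id: str, new_state: str) -> None:
    """Emit one raw event; all aggregation/joining happens in the external backend."""
    # Note: run_id is a high-cardinality attribute - acceptable for a stress-test
    # harness, but a production setup might emit these as spans/logs instead.
    ti_events.add(
        1,
        attributes={
            "dag_id": dag_id,
            "task_id": task_id,
            "run_id": run_id,
            "state": new_state,
        },
    )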
> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <pin...@umich.edu> wrote:

> > Hi Jarek,

> > Thanks for the insights and for pointing out the potential issues with triggers in prod with the scheduler HA setup.

> > The solution that I proposed is mainly for stress testing the scheduler before each Airflow release. We could make changes in the Airflow codebase to emit this metric, however:

> > 1. It will incur additional overhead for the scheduler to compute the metric, as the scheduler needs to compute the dependency-met time of a task.
> > 2. It couples with the implementation of the scheduler. For example, from 1.10.4 to Airflow 2, the scheduler changed a lot. If the metric is emitted from the scheduler, then whenever the scheduler changes, how the metric is computed and emitted also needs to be updated.

> > Thus, I think keeping it outside the Airflow core makes it easier to compare the scheduling delay across different Airflow versions.

> > Thanks for pointing out OpenTelemetry, let me check it out.

> > Thanks,
> > Ping

> > On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <pot...@apache.org> wrote:

> >> Sorry for the late reply - Ping.

> >> TL;DR: I think the metrics might be useful, but I think using triggers is asking for trouble.

> >> While using triggers sounds like a common approach in a number of installations, we do not use triggers so far. Using triggers moves some logic to the database, and in our case we do not have any logic there at all - all logic is in Airflow, and we keep it there; the database for us is merely "state" storage and "locks". Adding database triggers extends that to also keeping some logic there. And adding triggers has some worrying "implicitness" which goes against the "Explicit is better than implicit" Zen of Python.

> >> One thing that makes me think "coldly" about this is that it might have some undesired side effects - such as synchronizing changes from multiple schedulers when trying to insert such an audit entry (you need to take an index lock when you insert rows into a table which has a primary key/unique indexes).

> >> And what's even more worrying is that we are using SQLAlchemy and MySQL/MSSQL/Postgres, and we would have to make sure it works the same in all of them. This is troublesome.

> >> Even if we could solve and verify all those problems individually, the effect is: once we open the "gate" of triggers, we will get more of "OK, we have a trigger here, so let's also use it for that and this", and it will be hard to say "no" once we have a precedent. This might lead to more and more logic and features being deferred to database logic (and my past experience is that this leads to more complexity and implicit behaviour that is difficult to reason about).

> >> But this is only about the technical details, not the metric itself. I think the metric you proposed is very useful.

> >> I think, however (correct me if I am wrong), that we do not need database triggers for any of those. I have a feeling that this proposal is trying to implement the (useful) metrics with very limited modification to the Airflow code, so I can understand that you might think about it this way when you have your own fork - then it makes sense to piggyback on the existing database and use triggers, because you do not want to modify Airflow code.

> >> But here we are in a completely different situation. We CAN modify Airflow code and add the missing features and functionality to capture the necessary metric data in the code, rather than using triggers.
> >> We could even define some kind of callbacks for the auditing events that would allow us to gather those metrics in a way that does not even use the database to store the information for the metrics.

> >> In fact, this leads me to the conclusion that we should implement the metrics you mention as part of our OpenTelemetry effort: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow. This is precisely what it was prepared for. Once we have OpenTelemetry integrated, we could add more and more such useful metrics more easily, and that could be far more useful: instead of running an external custom DB-reading process, we could not only calculate such metrics with the right metrics tooling (each company could use their preferred OpenTelemetry-compliant tool), but it would also open up features like alerting, connecting the metrics with traces and other metrics, etc.

> >> Howard - WDYT?

> >> J.

> >> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:

> >> > Hi Ping,

> >> > Apologies for the belated response.

> >> > We have created a set of stress test DAGs where the tasks take almost no time to execute at all, so that the worker task execution time is small and the stress is on the scheduler and executor.

> >> > We then calculate "task latency" aka "task lag" as:
> >> > ti_lag = ti.start_date - max_upstream_ti_end_date
> >> > This is effectively the time between "the downstream task starting" and "the last dependent upstream task completing".

> >> > We don't use tasks that have no upstream tasks for measuring task lag. And for tasks that have multiple upstream tasks, we use the upstream task with the latest end_date, as the scheduler waits for the completion of all parent tasks before scheduling any downstream task.

> >> > Vikram
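[For illustration - a rough sketch of computing the task-lag formula above offline from task-instance records, along the lines of the "calculate it offline based on joined data" idea earlier in the thread. The row layout and the upstream-task mapping are assumptions made for this example, not the actual benchmark harness.]

# Hypothetical offline computation of ti_lag = ti.start_date - max_upstream_ti_end_date.
# `rows` would come from the task_instance table (or from exported raw events);
# `upstream` maps each task_id to the task_ids it depends on (from the DAG structure).
from typing import Dict, List, Optional, Tuple


def task_lags(
    rows: List[dict],                # each: {"task_id", "run_id", "start_date", "end_date"}
    upstream: Dict[str, List[str]],  # task_id -> list of upstream task_ids
) -> Dict[Tuple[str, str], Optional[float]]:
    """Return (task_id, run_id) -> task lag in seconds, or None if not computable."""
    # Index end dates so an upstream task's end can be looked up within the same run.
    end_dates = {(r["task_id"], r["run_id"]): r["end_date"] for r in rows}

    lags: Dict[Tuple[str, str], Optional[float]] = {}
    for r in rows:
        ups = upstream.get(r["task_id"], [])
        if not ups:
            continue  # tasks with no upstream tasks are excluded from the metric
        upstream_ends = [end_dates.get((u, r["run_id"])) for u in ups]
        if r["start_date"] is None or any(e is None for e in upstream_ends):
            lags[(r["task_id"], r["run_id"])] = None  # a parent has not finished yet
            continue
        # The scheduler waits for *all* parents, so the latest end_date is the reference.
        dependency_met = max(upstream_ends)
        lags[(r["task_id"], r["run_id"])] = (r["start_date"] - dependency_met).total_seconds()
    return lags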
> >> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <pin...@umich.edu> wrote:

> >> >> Hi Mehta,

> >> >> Good point. The primary goal of the metric is stress testing, to catch Airflow scheduler performance regressions for 1) our internal scheduler improvement work and 2) Airflow version upgrades.

> >> >> One of the key benefits of this metric definition is that it is independent of the scheduler implementation and can be computed/backfilled offline.

> >> >> Currently, we expose it to Datadog, and we (the Airflow cluster maintainers) are its main users.

> >> >> Thanks,
> >> >> Ping

> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham <shu...@amazon.com.invalid> wrote:

> >> >>> Ping,

> >> >>> I'm very interested in this as well. A good metric can help us benchmark and identify potential improvements in scheduler performance. To understand the proposal better, can you please share where and how you intend to use "scheduling delay"? Is it meant for benchmarking or stress testing only? Do you plan to expose it to users in the Airflow UI?

> >> >>> Thanks
> >> >>> Shubham

> >> >>> From: Ping Zhang <pin...@umich.edu>
> >> >>> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM
> >> >>> To: "dev@airflow.apache.org" <dev@airflow.apache.org>, "vik...@astronomer.io" <vik...@astronomer.io>
> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric Definition

> >> >>> Hi Vikram,

> >> >>> Thanks for pointing out "task latency":

> >> >>> "we define task latency as the time it takes for a task to begin executing once its dependencies have been met."

> >> >>> It would be great if you could elaborate on "begin executing" and how you calculate "its dependencies have been met".

> >> >>> If "begin executing" means that the state of the ti becomes running, then the "Scheduling Delay" metric focuses on the overhead introduced by the scheduler.

> >> >>> In our prod and stress test environments, we use the `task_instance_audit` table (a new row is created whenever there is a state change in the task_instance table) to compute the time at which a ti should be scheduled.

> >> >>> Thanks,

> >> >>> Ping

> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka <vik...@astronomer.io.invalid> wrote:

> >> >>> Ping,

> >> >>> I am quite interested in this topic and am trying to understand the difference between the "scheduling delay" metric articulated here and the "task latency" aka "task lag" metric which we have been using before.

> >> >>> As you may recall, we have been using two specific metrics to benchmark scheduler performance since Airflow 2.0, specifically "task latency" and "task throughput". These were described in the 2.0 Scheduler blog post. Specifically, within that we defined task latency as the time it takes for a task to begin executing once its dependencies are all met.

> >> >>> Thanks,

> >> >>> Vikram

> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <pin...@umich.edu> wrote:

> >> >>> Hi Airflow Community,

> >> >>> Airflow is a scheduling platform for data pipelines; however, there is no good metric to measure the scheduling delay in production or in the stress test environment. This makes it hard to catch regressions in the scheduler during the stress test stage.

> >> >>> I would like to propose an Airflow scheduling delay metric definition. Here is the detailed design of the metric and its implementation:
> >> >>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing

> >> >>> Please take a look; any feedback is welcome.
> >> >>> Thanks,

> >> >>> Ping
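[For illustration - a rough sketch of the kind of stand-alone test-harness script discussed earlier in the thread: after Airflow is installed in a test environment, it installs a trigger that records task_instance state changes into an audit table, from which the dependency-met time and scheduling delay can be computed offline. It assumes a Postgres 11+ metadata database and Airflow 2.2+ column names (dag_id/task_id/run_id); the table, function and trigger names are made up for the example and are not part of Airflow.]

# Hypothetical test-harness script (not part of Airflow): create a task_instance_audit
# table and a Postgres trigger that records every task_instance state change.
from sqlalchemy import create_engine, text

DDL = """
CREATE TABLE IF NOT EXISTS task_instance_audit (
    id          BIGSERIAL PRIMARY KEY,
    dag_id      TEXT NOT NULL,
    task_id     TEXT NOT NULL,
    run_id      TEXT NOT NULL,
    old_state   TEXT,
    new_state   TEXT,
    changed_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION audit_ti_state_change() RETURNS trigger AS $$
BEGIN
    IF NEW.state IS DISTINCT FROM OLD.state THEN
        INSERT INTO task_instance_audit (dag_id, task_id, run_id, old_state, new_state)
        VALUES (NEW.dag_id, NEW.task_id, NEW.run_id, OLD.state, NEW.state);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS ti_state_audit ON task_instance;
CREATE TRIGGER ti_state_audit
    AFTER UPDATE OF state ON task_instance
    FOR EACH ROW EXECUTE FUNCTION audit_ti_state_change();
"""

if __name__ == "__main__":
    # Example connection string - point it at the stress-test metadata DB.
    engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/airflow")
    with engine.begin() as conn:
        conn.execute(text(DDL))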