Correction: it should read <the more workers and schedulers you have, the
more "centralized overhead" you have>, of course.


On Tue, Jul 12, 2022 at 8:01 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> I think if we limit it to stress tests, this could be an "extra"
> addition - not even necessarily part of the Airflow codebase: adding
> triggers with a script, on a single database, as some kind of
> test harness that you always apply after you have installed airflow in a
> test environment - for that I have far fewer reservations about using
> triggers.
>
> But if we want to measure the delays in production, that's quite a
> different story (and different purpose):
>
> * The stress tests are synthetic, and basically what you will get out
> of them is "are we worse/better in this version than in the previous
> one?", "how much?", "which synthetic scenarios are affected most?".
> Those will be done with a few synthetic kinds of traffic/load/shape.
> * Production is different - you really want to see if you have
> problems with particular DAGs, times of the day or week, loads etc.,
> and you should be able to take some corrective actions (for example
> increase the number of schedulers or queues, split your dags etc.) - so
> even if the "scheduling delay" metric might sound familiar, you might
> want to use completely different dimensions to look at it (what about
> this DAG? this time of day? this group of dags? this type of workload?).
>
> I think those two might even be separated and calculated differently
> (though having a single approach would likely be better). I am not
> entirely sure, but I have a feeling we do not need the scheduler to
> calculate the "dependency met" time while scheduling. I think for
> production purposes, it would be much better (less overhead) to simply
> emit "raw" metrics such as the start/end time of each task, plus
> possibly simple publishing of the - mostly static - task dependency
> rules - then the "dependency met" time can be calculated offline based
> on the joined data. That would be roughly equivalent to what you have
> in the trigger, but without the overhead of triggers - simply, instead
> of storing the events in the metadata db, we would emit them (for
> example using otel) and let the external system aggregate and process
> them offline independently.
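>
> To illustrate what I mean by "calculated offline" - a rough sketch only
> (the event shape and field names below are made up for illustration,
> they are not an existing Airflow schema):
>
> # Join emitted "raw" events offline to derive the "dependency met" time
> # and the resulting scheduling delay. Purely illustrative names.
> from datetime import datetime
> from typing import Dict, List, Optional
>
> def dependency_met_time(task_id: str,
>                         upstreams: Dict[str, List[str]],
>                         end_times: Dict[str, datetime]) -> Optional[datetime]:
>     """Latest end time among upstream tasks; None for root tasks."""
>     parents = upstreams.get(task_id, [])
>     ends = [end_times[p] for p in parents if p in end_times]
>     return max(ends) if ends else None
>
> def scheduling_delay(task_id: str,
>                      start_times: Dict[str, datetime],
>                      upstreams: Dict[str, List[str]],
>                      end_times: Dict[str, datetime]) -> Optional[float]:
>     met = dependency_met_time(task_id, upstreams, end_times)
>     if met is None or task_id not in start_times:
>         return None
>     return (start_times[task_id] - met).total_seconds()
>
> The point is that none of this has to run inside the scheduler loop -
> the joining and aggregation happen in whatever external system consumes
> the emitted events.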
>
> The OTEL integration is rather lightweight - most implementations use
> in-memory buffers and push the data efficiently (and can even
> implement scalable forwarding of the data and pre-aggregation). The
> nice thing about it is that it can scale much more easily. I think that
> (apart from my earlier reservations) the database-trigger approach has
> this not-nice property that the less workers and schedulers you have,
> the more "centralized overhead" you have, whereas the distributed OTEL
> solution scales together with the system, adding a more or less fixed
> overhead per component (provided that the remote telemetry service is
> also scalable). This makes the trigger approach far less suitable IMHO,
> as we are getting dangerously close to Heisen-Monitoring, where the
> more we observe the system, the more we impact its performance.
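>
> For the emitting side, the shape would be roughly this (a sketch only -
> the metric name and attributes are made up, and the actual wiring is
> exactly what AIP-49 is about):
>
> # In-memory buffering + periodic push is handled by the OTEL SDK reader.
> from opentelemetry import metrics
> from opentelemetry.sdk.metrics import MeterProvider
> from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
> from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
>
> reader = PeriodicExportingMetricReader(OTLPMetricExporter(),
>                                        export_interval_millis=10_000)
> metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
>
> meter = metrics.get_meter("airflow.scheduling")
> delay = meter.create_histogram("airflow.task.scheduling_delay", unit="s")
> delay.record(2.3, attributes={"dag_id": "my_dag", "task_id": "my_task"})
>
> Each component would carry its own small buffer/exporter, which is where
> the "fixed overhead per component" comes from.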
>
> J.
>
> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <pin...@umich.edu> wrote:
> >
> > Hi Jarek,
> >
> > Thanks for the insights and for pointing out the potential issues with
> > triggers in prod with a scheduler HA setup.
> >
> > The solution that I proposed is mainly for stress testing the scheduler
> > before each airflow release. We can make changes in the airflow codebase
> > to emit this metric, however:
> >
> > 1. It will incur additional overhead for the scheduler to compute the
> > metric, as the scheduler needs to compute the dependency-met time of a
> > task.
> > 2. It couples the metric with the implementation of the scheduler. For
> > example, from 1.10.4 to airflow 2, the scheduler changed a lot. If the
> > metric is emitted from the scheduler, then any change to the scheduler
> > also requires updating how the metric is computed and emitted.
> >
> > Thus, I think having it outside of the airflow core makes it easier to
> > compare the scheduling delay across different airflow versions.
> >
> > Thanks for pointing out OpenTelemetry, let me check it out.
> >
> > Thanks,
> >
> > Ping
> >
> >
> > On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <pot...@apache.org> wrote:
> >>
> >> Sorry for the late reply - Ping.
> >>
> >> TL;DR: I think the metrics might be useful, but using triggers is
> >> asking for trouble.
> >>
> >> While using triggers sounds like a common approach in a number of
> >> installations, we have not used triggers so far.
> >> Using triggers moves some logic to the database, and in our case we
> >> have none there at all - all logic is in Airflow, and we keep it there;
> >> the database for us is merely "state" storage and "locks". Adding
> >> database triggers extends it to also keep some logic there. And
> >> adding triggers has some worrying "implicitness", which goes against
> >> the "Explicit is better than Implicit" Zen of Python.
> >>
> >> One thing that makes me think "coldly" about this is that it might
> >> have some undesired side effects - such as synchronization across
> >> multiple schedulers when trying to insert such audit entries (you
> >> need to acquire an index lock when you insert rows into a table which
> >> has primary key/unique indexes).
> >>
> >> And what's even more worrying is that we are using SQLAlchemy with
> >> MySQL/MsSQL/Postgres, and we should make sure it works the same in all
> >> of them. This is troublesome.
> >>
> >> Even if we could solve and verify all those problems individually, the
> >> effect is - once we open the "gate" of triggers, we will get more "ok,
> >> we have a trigger here, so let's also use it for that and this", and it
> >> will be hard to say "no" once we have a precedent; this might lead to
> >> more and more logic and features deferred to the database (and my past
> >> experience is that this leads to more complexity and implicit
> >> behaviours that are difficult to reason about).
> >>
> >> But this is only about the technical details, not the metric
> >> itself. I think the metric you proposed is very useful.
> >>
> >> I think, however (correct me if I am wrong), that we do not need
> >> database triggers for any of those. I have a feeling that this
> >> proposal is trying to implement the (useful) metrics with very limited
> >> modification to the Airflow code, so I can understand that you might
> >> think about it this way when you have your own fork - then it makes
> >> sense to piggyback on the existing database and use triggers, because
> >> you do not want to modify Airflow code.
> >>
> >> But here - we are in a completely different situation. We CAN modify
> >> Airflow code and add missing features and functionality to capture the
> >> necessary metric data in the code, rather than using triggers. We
> >> could even define some kind of callbacks for the auditing events that
> >> would allow us to gather those metrics in a way that does not even use
> >> the database to store the information for the metrics.
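> >>
> >> Just to illustrate the idea (a purely hypothetical shape - nothing
> >> like this exists in Airflow today, the names are made up):
> >>
> >> # Hypothetical audit callback the scheduler could invoke on every task
> >> # instance state transition, instead of writing to an audit table.
> >> from datetime import datetime, timezone
> >>
> >> class SchedulingAuditCallback:
> >>     def on_state_change(self, dag_id: str, task_id: str,
> >>                         old_state: str, new_state: str) -> None:
> >>         # forward the event to StatsD/OTEL/a log sink - no db involved
> >>         ts = datetime.now(timezone.utc).isoformat()
> >>         print(f"{ts} {dag_id}.{task_id}: {old_state} -> {new_state}")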
> >>
> >> In fact - this leads me to the conclusion that we should implement the
> >> metrics you mention as part of our OpenTelemetry effort:
> >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow.
> >> This is precisely what it was prepared for. Once we have
> >> OpenTelemetry integrated, we could add more and more such useful
> >> metrics more easily, and that could be way more useful, because
> >> instead of running an external custom db-reading process for that, we
> >> could not only calculate such metrics using the right metrics tooling
> >> (each company could use their preferred OpenTelemetry-compliant
> >> tool), but it would also open up features like alerting,
> >> connecting it with traces and other metrics etc.
> >>
> >> Howard - WDYT?
> >>
> >> J.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka
> >> <vik...@astronomer.io.invalid> wrote:
> >> >
> >> > HI Ping,
> >> >
> >> > Apologies for the belated response.
> >> >
> >> > We have created a set of stress test DAGs where the tasks take almost no 
> >> > time to execute at all, so that the worker task execution time is small, 
> >> > and the stress is on the Scheduler and Executor.
> >> >
> >> > We then calculate "task latency" aka "task lag" as:
> >> >  ti_lag = ti.start_date - max_upstream_ti_end_date
> >> > This is effectively the time between "the last dependent upstream task
> >> > completing" and "the downstream task starting".
> >> >
> >> > We don't include tasks that don't have any upstream tasks in this
> >> > metric for measuring task lag.
> >> > And for tasks that have multiple upstream tasks, we use the upstream
> >> > task with the latest end_date, as the scheduler waits for completion
> >> > of all parent tasks before scheduling any downstream task.
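> >> >
> >> > In code, the calculation is roughly the following (a simplified
> >> > sketch against the Airflow 2.x models, not the exact benchmarking
> >> > code we run):
> >> >
> >> > # Compute ti_lag per task for one dag run from the metadata db.
> >> > from airflow.models import DagBag, TaskInstance
> >> > from airflow.utils.session import create_session
> >> >
> >> > def task_lags(dag_id: str, run_id: str):
> >> >     dag = DagBag().get_dag(dag_id)
> >> >     with create_session() as session:
> >> >         tis = {ti.task_id: ti for ti in session.query(TaskInstance)
> >> >                .filter(TaskInstance.dag_id == dag_id,
> >> >                        TaskInstance.run_id == run_id)}
> >> >     lags = {}
> >> >     for task in dag.tasks:
> >> >         if not task.upstream_task_ids:   # skip tasks with no upstreams
> >> >             continue
> >> >         ends = [tis[u].end_date for u in task.upstream_task_ids
> >> >                 if u in tis and tis[u].end_date]
> >> >         ti = tis.get(task.task_id)
> >> >         if ti and ti.start_date and ends:
> >> >             lags[task.task_id] = (ti.start_date - max(ends)).total_seconds()
> >> >     return lags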
> >> >
> >> > Vikram
> >> >
> >> >
> >> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <pin...@umich.edu> wrote:
> >> >>
> >> >> Hi Mehta,
> >> >>
> >> >> Good point. The primary goal of the metric is stress testing, to
> >> >> catch airflow scheduler performance regressions for 1) our internal
> >> >> scheduler improvement work and 2) airflow version upgrades.
> >> >>
> >> >> One of the key benefits of this metric definition is that it is
> >> >> independent of the scheduler implementation and can be
> >> >> computed/backfilled offline.
> >> >>
> >> >> Currently, we expose it to Datadog, and we (the airflow cluster
> >> >> maintainers) are its main users.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Ping
> >> >>
> >> >>
> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham 
> >> >> <shu...@amazon.com.invalid> wrote:
> >> >>>
> >> >>> Ping,
> >> >>>
> >> >>>
> >> >>>
> >> >>> I’m very interested in this as well. A good metric can help us
> >> >>> benchmark and identify potential improvements in scheduler
> >> >>> performance.
> >> >>> In order to understand the proposal better, can you please share where
> >> >>> and how you intend to use “Scheduling delay”? Is it meant for
> >> >>> benchmarking or stress testing only? Do you plan to expose it to
> >> >>> users in the Airflow UI?
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks
> >> >>> Shubham
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> From: Ping Zhang <pin...@umich.edu>
> >> >>> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM
> >> >>> To: "dev@airflow.apache.org" <dev@airflow.apache.org>, 
> >> >>> "vik...@astronomer.io" <vik...@astronomer.io>
> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric 
> >> >>> Definition
> >> >>>
> >> >>>
> >> >>>
> >> >>> Hi Vikram,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks for pointing that out, 'task latency',
> >> >>>
> >> >>>
> >> >>>
> >> >>> "we define task latency as the time it takes for a task to begin 
> >> >>> executing once its dependencies have been met."
> >> >>>
> >> >>>
> >> >>>
> >> >>> It would be great if you could elaborate more on "begin executing"
> >> >>> and how you calculate "its dependencies have been met".
> >> >>>
> >> >>>
> >> >>>
> >> >>> If 'begin executing' means the state of the ti becoming running,
> >> >>> then the 'Scheduling Delay' metric focuses on the overhead introduced
> >> >>> by the scheduler.
> >> >>>
> >> >>>
> >> >>>
> >> >>> In our prod and stress test environments, we use the
> >> >>> `task_instance_audit` table (a new row is created whenever there is a
> >> >>> state change in the task_instance table) to compute the time when a ti
> >> >>> should be scheduled.
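> >> >>>
> >> >>> For example (a simplified sketch - the audit rows and state names
> >> >>> below follow our internal setup, not anything in Airflow itself):
> >> >>>
> >> >>> # audit: {task_id: [(state, created_at), ...]} for one dag run.
> >> >>> def scheduling_delay(audit, task_id, upstream_ids):
> >> >>>     # "should be scheduled" = when the last upstream reached success
> >> >>>     ready = max(ts for u in upstream_ids
> >> >>>                 for state, ts in audit.get(u, []) if state == "success")
> >> >>>     # actual start = first time the ti became queued or running
> >> >>>     started = min(ts for state, ts in audit.get(task_id, [])
> >> >>>                   if state in ("queued", "running"))
> >> >>>     return (started - ready).total_seconds()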
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Ping
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka 
> >> >>> <vik...@astronomer.io.invalid> wrote:
> >> >>>
> >> >>> Ping,
> >> >>>
> >> >>>
> >> >>>
> >> >>> I am quite interested in this topic and am trying to understand the
> >> >>> difference between the "scheduling delay" metric you articulated and
> >> >>> the "task latency" aka "task lag" metric which we have been using
> >> >>> before.
> >> >>>
> >> >>>
> >> >>>
> >> >>> As you may recall, we have been using two specific metrics to 
> >> >>> benchmark Scheduler performance, specifically "task latency" and "task 
> >> >>> throughput" since Airflow 2.0.
> >> >>>
> >> >>> These were described in the 2.0 Scheduler blog post.
> >> >>> Specifically, within that we defined task latency as the time it takes
> >> >>> for the task to begin executing once its dependencies are all met.
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>> Vikram
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <pin...@umich.edu> wrote:
> >> >>>
> >> >>> Hi Airflow Community,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Airflow is a scheduling platform for data pipelines, however there is
> >> >>> no good metric to measure the scheduling delay, either in production or
> >> >>> in the stress test environment. This makes it hard to catch scheduler
> >> >>> regressions during the stress test stage.
> >> >>>
> >> >>>
> >> >>> I would like to propose an airflow scheduling delay metric definition. 
> >> >>> Here is the detailed design of the metric and its implementation:
> >> >>>
> >> >>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing
> >> >>>
> >> >>> Please take a look and any feedback is welcome.
> >> >>>
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>>
> >> >>>
> >> >>>
> >> >>> Ping
> >> >>>
> >> >>>
