Hi Jarek, Thanks a lot for the thorough response; those are all legitimate concerns.
How about I prepare a more thorough doc to address your concerns and we can continue the discussion from there? Thanks, Ping On Tue, Jul 26, 2022 at 6:53 AM Jarek Potiuk <[email protected]> wrote: > If I understand correctly, the idea is to run an additional set of stress > tests before releasing a version - without impacting the production version > of Airflow. > > I think if this is something that we want to make part of our release > process, then submitting code somewhere is the last thing to do. Just > submitting the code does not mean that it will be executed. > > Note - this is my personal view on it, I am not sure if this is others' > view as well, but it comes from years of being involved in the release > process and doing it myself - and volunteering to do part of the process > (and improve and perfect it). > > I think the first thing here is to have several answers: > > a) do we want to do it > b) who will do it > c) when this will be done in the release process > d) what infrastructure will be used to run the tests > > The actual "completion" of our release process is described in > https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md > and there is of course a release manager running the release process. This is > usually Jed, recently Ephraim, but it is generally whoever from the PMC > members (or committers, if they have a PMC member ready to sign the > artifacts) raises their hand and says "Hey, I want to be a release > manager". Similarly we have a release process for providers > https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md. > But for "MINOR releases" the community process starts when about 90% of > the testing effort has already been spent. > > The important thing is that the release manager is NOT doing testing (and > the release process from the ASF does not even touch on the subject). The release > manager has the role of executing the "mechanics" to produce > release artifacts, start the voting, and has the power of deciding > single-handedly if the release should be cancelled (fully or in parts) if > some issues are found (more info on the role and release process here: > https://www.apache.org/legal/release-policy.html#management). In case we > release Providers or Airflow (especially the PATCHLEVEL ones) - we delegate > a big part of the testing to whoever was involved in preparing fixes - via > the "Status of testing" issue (quite successfully, I think). For MINOR > "airflow" releases, it's much more complex; usually a lot of testing is done by > stakeholders - mostly Astronomer, who donates a HUGE amount of testing time to > Airflow MINOR releases (and this is one of the reasons why Astronomer is > able to release new Airflow versions much faster than anyone else - because > they run a lot of tests in their own infrastructure - and this is actually a > great contribution to the community :). This has a huge mutual benefit for > both - the community and Astronomer. > > Now - if we are going to do the stress testing before releasing Airflow - > the question is who will be doing that. This is quite an effort (I believe) > and it requires quite a bit of infrastructure. And if we are to donate any kind > of testing harness to the community - it only makes sense if there is a way > the community can use it. For example there are (from what I hear) many > test scenarios and scripts that Astronomer has and follows, but they are not > donated to the community.
It simply makes no sense because we have no > capacity to run those tests, nor a process for how we can follow those test > scenarios. > > So now - the question is - who will run such tests and with what > infrastructure? I am not sure what kind of infrastructure it might > require, but I think the only way to make it part of the community process > is to fully automate it in our CI. > > This is for example what happened with the Docker Image and especially with > the ARM version of it. Building and running it requires - generally > speaking - ARM hardware, and it is a heavy CPU-and-network process - so > until I automated it, it was my personal "commitment" that I would build the > image (with the goal that we would be able to fully automate it). Until then, > releasing the images was not a "community" duty, but "Jarek Potiuk"'s duty > (I did aim to automate it from the very beginning, but it took some time and > effort to implement - but we finally got this nice and simple CI workflow - > https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md#manually-prepare-production-docker-image > - that the release manager (whoever the release manager is) can trigger to > release the images). > > So I think if we want to add such a "test harness before release", > answering the questions above is important. If we are to get it into the > community, we need to know what kind of infrastructure it requires, whether > it is fully automatable (eventually) and ready-to-use by whoever can > trigger it before the release. And when it comes to such testing, there is > one more important question - what do we do with the results? Is there a > trigger that should make the release manager say "OK - those results are > bad enough to not release"? Do we know what the trigger is? Do we know how > to interpret the results? Is it documented, and have we run it already on a > few releases to get some baseline? > > I think there are two basic paths we can follow: > > 1) some stakeholder (ideally the one it came from - AirBnB in this case) > commits to the burden of running it and reporting the results before every > release (similar to Astronomer - if you noticed, in the few days/weeks before > a release there is a flurry of stability issues and fixes coming from > Astronomer, usually as a result of this testing). Similarly, for the Production > Image it could be just "Jarek Potiuk" as a stakeholder, because that was > small enough for me to handle. > > 2) if such a test harness is to be donated to Airflow, then it must be > preceded by a few releases where point 1) is done by someone who commits > to it and makes sure all the "wrinkles" are removed. The release process > should be smooth and tested. It should not introduce any more friction to > the process or delay it, so running it for a number of releases is a must > (this is what I did for the Production Image first and then for the ARM > Image version). That someone needs to volunteer and commit to it (same as I > did for the Prod Image). > > This is how I see it. I think committing code to a repo is likely > somewhere around the 50% point of the project, where it has already been run and at least > semi-automated for a few releases (if we are going to go route 2). Or > it is not needed at all (it can be kept in AirBnB, for example, or with whoever wants to > commit to doing it) if we are going route 1). > > But I am curious if my understanding of it is also what others understand. > > J.
> > > On Tue, Jul 26, 2022 at 1:54 AM Ping Zhang <[email protected]> wrote: > >> Hi Jarek, >> >> A friendly bump on this thread. What are your thoughts on having a scheduler >> perf test before each release and incorporating this metric? >> >> Also, is there a devops git repo to put these files/logic in? >> >> Thanks, >> >> Ping >> >> >> On Wed, Jul 13, 2022 at 9:47 AM Ping Zhang <[email protected]> wrote: >> >>> Hi Jarek, >>> >>> Yep, it is more useful in the stress test stage before releasing a new >>> version, with some extra setup to ensure no scheduler performance >>> degradation due to a release. This can also help to find the scaling limit >>> of the scheduler with a certain SLA, like the upper limit of the number of >>> tasks in a dag, the total number of dag files in a cluster, concurrent running >>> dag runs, etc. >>> >>> Very good point about synthetic dag files in the stress test; our team >>> is working on a stress test framework that can directly use all production >>> dag files to ensure the stress test has the same set of prod dags, but it >>> will skip the task execution. It can also generate different kinds of dags >>> (varying the number of tasks, levels, etc.). >>> >>> Monitoring production issues for particular DAGs, times of the day, etc. is >>> a different issue. I agree that in prod, we should not let the scheduler >>> calculate the `dependency met` time. >>> >>> >>> Thanks, >>> >>> Ping >>> >>> >>> On Tue, Jul 12, 2022 at 11:02 AM Jarek Potiuk <[email protected]> wrote: >>> >>>> I think if we limit it to stress tests, this could be an "extra" >>>> addition - not even necessarily part of the Airflow codebase: adding >>>> triggers with a script, on a single database, some kind of >>>> test harness that you always add after you have installed airflow in a test >>>> environment - for that I have far fewer reservations about using triggers. >>>> >>>> But if we want to measure the delays in production, that's quite a >>>> different story (and a different purpose): >>>> >>>> * The stress tests are synthetic, and basically what you will get out >>>> of them is "are things worse/better in this version than in the previous one?", >>>> "How much?", "Which synthetic scenarios are affected most?". Those will >>>> be done with a few synthetic kinds of traffic/load/shape. >>>> * Production is different - you really want to see if you have >>>> some problems with particular DAGs, times of the day, week, load, etc., >>>> and you should be able to take some corrective actions (for example >>>> increase the number of schedulers or queues, split your dags, etc.) - so >>>> even if the "scheduling delay" metric might sound similar, you might >>>> want to use completely different dimensions to look at it (how about >>>> this DAG? this time of day, this group of dags, this type of workload, >>>> etc.). >>>> >>>> I think those two might even be separated and calculated differently >>>> (though having a single approach would likely be better). I am not >>>> entirely sure, but I have a feeling we do not need the scheduler to >>>> calculate the "dependency met" time while scheduling. I think for >>>> production purposes, it would be much better (less overhead) to simply >>>> emit "raw" metrics such as the start/end time of each task plus >>>> possibly simple publishing of the - mostly static - task dependency rules >>>> - then the "dependency met" time can be calculated offline based on joined >>>> data.
That would be roughly equivalent to what you have in the >>>> trigger, but without the overhead of triggers - simply, instead of >>>> storing the events in the metadata db, we would emit them (for example >>>> using otel) and let the external system aggregate them and process them >>>> offline independently. >>>> >>>> The OTEL integration is rather lightweight - most implementations use >>>> in-memory buffers and efficiently push the data (and can even >>>> implement scalable forwarding of the data and pre-aggregation). The >>>> nice thing about it is that it can scale much more easily. I think that >>>> (apart from my earlier reservations) the database-trigger approach has this >>>> not-nice property that the more workers and schedulers you have, the >>>> more "centralized overhead" you have, whereas the distributed OTEL >>>> solution scales together with the system, adding more or less fixed >>>> overhead per component (provided that the remote telemetry service is >>>> also scalable). This makes the trigger approach far less suitable IMHO, >>>> as we are getting dangerously close to Heisen-Monitoring, where the >>>> more we observe the system, the more we impact its performance. >>>> >>>> J. >>>> >>>> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <[email protected]> wrote: >>>> > >>>> > Hi Jarek, >>>> > >>>> > Thanks for the insights and for pointing out the potential issues with >>>> triggers in prod with a scheduler HA setup. >>>> > >>>> > The solution that I proposed is mainly for stress testing the scheduler >>>> before each airflow release. We can make changes in the airflow codebase to >>>> emit this metric; however: >>>> > >>>> > 1. It will incur additional overhead for the scheduler to compute the >>>> metric, as the scheduler needs to compute the dependency met time of a task. >>>> > 2. It couples with the implementation of the scheduler. For example, >>>> from 1.10.4 to airflow 2, the scheduler has changed a lot. If the metric is >>>> emitted from the scheduler, when making changes in the scheduler, one >>>> also needs to update how the metric is computed and emitted. >>>> > >>>> > Thus, I think having it out of the airflow core makes it easier to >>>> compare the scheduling delay across different airflow versions. >>>> > >>>> > Thanks for pointing out OpenTelemetry, let me check it out. >>>> > >>>> > Thanks, >>>> > >>>> > Ping >>>> > >>>> > >>>> > On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <[email protected]> >>>> wrote: >>>> >> >>>> >> Sorry for the late reply - Ping. >>>> >> >>>> >> TL;DR: I think the metrics might be useful, but I think using triggers >>>> >> is asking for trouble. >>>> >> >>>> >> While using triggers sounds like a common approach in a number of >>>> >> installations, we have not used triggers so far. >>>> >> Using triggers moves some logic to the database, and in our case we do >>>> >> not have it there at all - all logic is in Airflow, and we keep it there; >>>> >> the database for us is merely "state" storage and "locks". Adding >>>> >> database triggers extends it to also keep some logic there. And >>>> >> adding triggers has some worrying "implicitness" which goes against >>>> >> the "Explicit is better than implicit" Zen of Python. >>>> >> >>>> >> One thing that makes me think "coldly" about this is that it might >>>> >> have some undesired side effects - such as synchronization of changes >>>> >> from multiple schedulers when trying to insert such an audit entry (you >>>> >> need to create an index lock when you insert rows into a table which has >>>> >> primary key/unique indexes).
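To make the OTEL idea above concrete: a minimal sketch of emitting one span per task-instance run, so that start/end timestamps plus the (mostly static) dependency rules can be joined offline to compute the "dependency met" time. This is not Airflow's actual integration (AIP-49 is still a proposal); the wrapper function, span name and attribute names here are illustrative assumptions only.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Batching keeps per-event overhead low; swap ConsoleSpanExporter for an OTLP
# exporter pointing at your telemetry backend in a real setup.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("airflow.task_lifecycle.sketch")

def run_task(dag_id: str, task_id: str, run_id: str, upstream_task_ids: list) -> None:
    # Hypothetical wrapper around task execution; in real Airflow this would hook
    # into the task lifecycle rather than wrap user code directly.
    with tracer.start_as_current_span(
        "task_instance_run",
        attributes={
            "dag_id": dag_id,
            "task_id": task_id,
            "run_id": run_id,
            # Publishing the dependency rules lets an external system compute
            # "dependency met" = max(end time of the upstream spans) offline.
            "upstream_task_ids": ",".join(upstream_task_ids),
        },
    ):
        pass  # actual task execution would happen here

The spans carry their own start/end timestamps, so the external system gets the "raw" events described above without the scheduler computing anything extra.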
>>>> >> >>>> >> And what's even more worrying is that we are using SQLAlchemy and >>>> >> MySQL/MSSQL/Postgres, and we should make sure it works the same in all >>>> >> of them. This is troublesome. >>>> >> >>>> >> Even if we could solve and verify all those problems individually, the >>>> >> effect is - once we open the "gate" of triggers, we will get more "ok, >>>> >> we have a trigger here so let's also use it for that and this", and it >>>> >> will be hard to say "no" if we already have a precedent, and this >>>> >> might lead to more and more logic and features deferred to database >>>> >> logic (and my past experience is that it leads to more complexity and >>>> >> implicit behaviours that are difficult to reason about). >>>> >> >>>> >> But this is only about the technical details of this, not the metric >>>> >> itself. I think the metric you proposed is very useful. >>>> >> >>>> >> I think, however (correct me if I am wrong), that we do not need >>>> >> database triggers for any of those. I have a feeling that this >>>> >> proposal is trying to implement the (useful) metrics with very limited >>>> >> modification to the Airflow code, so I can understand that you might >>>> >> think about it this way when you have your own fork - then it makes >>>> >> sense to piggyback on the existing database and use triggers, because >>>> >> you do not want to modify Airflow code. >>>> >> >>>> >> But here - we are in a completely different situation. We CAN modify >>>> >> Airflow code and add missing features and functionality to capture the >>>> >> necessary metric data in the code, rather than using triggers. We >>>> >> could even define some kind of callbacks for the auditing events that >>>> >> would allow us to gather those metrics in a way that does not even use >>>> >> the database to store the information for the metrics. >>>> >> >>>> >> In fact - this leads me to the conclusion that we should implement the >>>> >> metrics you mention as part of our Open-Telemetry effort: >>>> >> >>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow >>>> . >>>> >> This is precisely what it was prepared for; once we have >>>> >> Open-Telemetry integrated, we could add more and more such useful >>>> >> metrics more easily, and that could be way more useful, because >>>> >> instead of running an external custom db-reading process for that, we >>>> >> could not only calculate such metrics using the right metrics tooling >>>> >> (each company could use their preferred open-telemetry compliant >>>> >> tool), but that would open up all the features like alerting, >>>> >> connecting it with traces and other metrics, etc. >>>> >> >>>> >> Howard - WDYT? >>>> >> >>>> >> J. >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka >>>> >> <[email protected]> wrote: >>>> >> > >>>> >> > Hi Ping, >>>> >> > >>>> >> > Apologies for the belated response. >>>> >> > >>>> >> > We have created a set of stress test DAGs where the tasks take >>>> almost no time to execute at all, so that the worker task execution time is >>>> small and the stress is on the Scheduler and Executor.
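The stress-test DAGs referred to above are not shown in the thread, but a minimal sketch of their general shape (a chain of no-op tasks, so any measured latency is scheduler/executor overhead rather than work done in the worker) could look like the following. It assumes Airflow 2.3+ where EmptyOperator is available; the dag_id, chain length and schedule are arbitrary choices for illustration.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="stress_test_linear_chain",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # A linear chain of no-op tasks: worker time is ~0, so task lag is dominated
    # by scheduling/queuing overhead.
    previous = EmptyOperator(task_id="task_0")
    for i in range(1, 50):
        current = EmptyOperator(task_id=f"task_{i}")
        previous >> current
        previous = current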
>>>> >> > >>>> >> > We then calculate "task latency" aka "task lag" as: >>>> >> > ti_lag = ti.start_date - max_upstream_ti_end_date >>>> >> > This is effectively the time between "the downstream task >>>> starting" and "the last dependent upstream task completing". >>>> >> > >>>> >> > We don't include tasks that don't have any upstream tasks in this >>>> metric when measuring task lag. >>>> >> > And for tasks that have multiple upstream tasks, we use the >>>> upstream task with the latest end_date, as the scheduler >>>> waits for completion of all parent tasks before scheduling any downstream >>>> task. >>>> >> > >>>> >> > Vikram >>>> >> > >>>> >> > >>>> >> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <[email protected]> >>>> wrote: >>>> >> >> >>>> >> >> Hi Mehta, >>>> >> >> >>>> >> >> Good point. The primary goal of the metric is for stress testing, >>>> to catch airflow scheduler performance regressions for 1) our internal >>>> scheduler improvement work and 2) airflow version upgrades. >>>> >> >> >>>> >> >> One of the key benefits of this metric definition is that it is >>>> independent of the scheduler implementation and it can be >>>> computed/backfilled offline. >>>> >> >> >>>> >> >> Currently, we expose it to Datadog and we (the airflow >>>> cluster maintainers) are the main users of it. >>>> >> >> >>>> >> >> Thanks, >>>> >> >> >>>> >> >> Ping >>>> >> >> >>>> >> >> >>>> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham >>>> <[email protected]> wrote: >>>> >> >>> >>>> >> >>> Ping, >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> I'm very interested in this as well. A good metric can help us >>>> benchmark and identify potential improvements in scheduler performance. >>>> >> >>> In order to understand the proposal better, can you please share >>>> where and how you intend to use "Scheduling delay"? Is it meant for >>>> benchmarking or stress testing only? Do you plan to expose it to the users >>>> in the Airflow UI? >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks >>>> >> >>> Shubham >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> From: Ping Zhang <[email protected]> >>>> >> >>> Reply-To: "[email protected]" <[email protected]> >>>> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM >>>> >> >>> To: "[email protected]" <[email protected]>, " >>>> [email protected]" <[email protected]> >>>> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric >>>> Definition >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Hi Vikram, >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks for pointing that out, 'task latency': >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> "we define task latency as the time it takes for a task to begin >>>> executing once its dependencies have been met." >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> It would be great if you could elaborate more on "begin >>>> executing" and how you calculate "its dependencies have been met". >>>> >> >>> >>>> >> >>> >>>> >> >>> If the 'begin executing' means the state of the ti becomes running, >>>> then the 'Scheduling Delay' metric focuses on the overhead introduced by >>>> the scheduler.
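As a concrete illustration of the task-lag formula quoted above (ti_lag = ti.start_date - max_upstream_ti_end_date), here is a small sketch of computing it offline from task-instance start/end times plus a static upstream map. The data structures and field names are assumptions for illustration, not Airflow's actual schema or API.

from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

def task_lag(
    ti_start: Dict[Tuple[str, str], datetime],  # (run_id, task_id) -> start_date
    ti_end: Dict[Tuple[str, str], datetime],    # (run_id, task_id) -> end_date
    upstream: Dict[str, List[str]],             # task_id -> upstream task_ids
    run_id: str,
    task_id: str,
) -> Optional[timedelta]:
    # ti_lag = ti.start_date - max(end_date of upstream tis); tasks with no
    # upstream tasks are excluded from the metric, as described above.
    parents = upstream.get(task_id, [])
    if not parents:
        return None
    dependency_met = max(ti_end[(run_id, p)] for p in parents)
    return ti_start[(run_id, task_id)] - dependency_met

Because it only needs start/end timestamps and the dependency map, this can be computed or backfilled offline, independent of the scheduler implementation.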
>>>> >> >>> >>>> >> >>> >>>> >> >>> In our prod and stress test, we use the `task_instance_audit` >>>> table (a new row is created whenever there is a state change in the >>>> task_instance table) to compute the time at which a ti should be scheduled. >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks, >>>> >> >>> >>>> >> >>> >>>> >> >>> Ping >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka >>>> <[email protected]> wrote: >>>> >> >>> >>>> >> >>> Ping, >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> I am quite interested in this topic and am trying to understand the >>>> difference between the "scheduling delay" metric as articulated, compared to >>>> the "task latency" aka "task lag" metric which we have been using before. >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> As you may recall, we have been using two specific metrics to >>>> benchmark Scheduler performance, specifically "task latency" and "task >>>> throughput", since Airflow 2.0. >>>> >> >>> >>>> >> >>> These were described in the 2.0 Scheduler blog post. >>>> >> >>> Specifically, within that we defined task latency as the time it >>>> takes for the task to begin executing once its dependencies are all met. >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks, >>>> >> >>> >>>> >> >>> Vikram >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <[email protected]> >>>> wrote: >>>> >> >>> >>>> >> >>> Hi Airflow Community, >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Airflow is a scheduling platform for data pipelines; however, >>>> there is no good metric to measure the scheduling delay in production >>>> and also in the stress test environment. This makes it hard to catch >>>> regressions in the scheduler during the stress test stage. >>>> >> >>> >>>> >> >>> >>>> >> >>> I would like to propose an airflow scheduling delay metric >>>> definition. Here is the detailed design of the metric and its >>>> implementation: >>>> >> >>> >>>> >> >>> >>>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing >>>> >> >>> >>>> >> >>> Please take a look; any feedback is welcome. >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks, >>>> >> >>> >>>> >> >>> >>>> >> >>> Ping >>>> >> >>> >>>> >> >>> >>>> >>>
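For comparison, a rough sketch of how a scheduling-delay number could be derived from task-instance state-change rows such as those in the `task_instance_audit` table mentioned above. The row shape, state names and the reading of "time at which a ti should be scheduled" are assumptions based on this thread, not the actual AirBnB implementation or the linked design doc.

from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Optional

class StateChange(NamedTuple):
    dag_id: str
    task_id: str
    run_id: str
    state: str           # e.g. "scheduled", "queued", "running"
    changed_at: datetime

def scheduling_delay(events: Iterable[StateChange]) -> Optional[timedelta]:
    # Delay between a task instance first becoming schedulable (approximated here
    # by its first transition to "scheduled") and it actually starting to run.
    ordered: List[StateChange] = sorted(events, key=lambda e: e.changed_at)
    schedulable_at = next((e.changed_at for e in ordered if e.state == "scheduled"), None)
    running_at = next((e.changed_at for e in ordered if e.state == "running"), None)
    if schedulable_at is None or running_at is None:
        return None
    return running_at - schedulable_at

Since the computation runs over recorded state-change rows, it can be done entirely offline, which is the property Ping highlights for comparing scheduling delay across Airflow versions.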
