Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Jarek Potiuk Tue, 26 Jul 2022 12:31:11 -0700

Sure

On Tue, Jul 26, 2022 at 7:41 PM Ping Zhang <[email protected]> wrote:


> Hi Jarek,
>
> Thanks a lot for the thorough response and they are all legitimate
> concerns.
>
> How about that I prepare a more thorough doc to address your concerns and
> we can continue the discussion from there?
>
> Thanks,
>
> Ping
>
>
>
> Thanks,
>
> Ping
>
>
> On Tue, Jul 26, 2022 at 6:53 AM Jarek Potiuk <[email protected]> wrote:
>
>> If I understand correctly, the Idea is to run an additional set of stress
>> tests before releasing a version - without impacting the production version
>> of Airflow.
>>
>> I think if this is something that we want to make part of our release
>> process, then submitting a code somewhere is the last thing to do. Just
>> submitting the code does not mean that it will be executed.
>>
>> Note - this is my personal view on it, I am not sure if this is other's
>> view as well, but it comes from years of being involved in the release
>> process and doing it myself - and volunteering to do part of the process
>> (and improve and perfect it).
>>
>> I think the first thing here is to have several answers:
>>
>> a) do we want to do it
>> b) who will do it
>> c) when this will be done in the release process
>> d) what infrastructure will be used to run the tests
>>
>> The actiual "completion" of the release process of ours is described in
>> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md
>> and there is of course release manager running the release process. This is
>> usually Jed, Ephraim recently but it is generally whoever from the PMC
>> members (or committers if they have a PMC member ready to sign the
>> artifacts) who raises their hand and say "Hey I want to be a release
>> manager". Similarly we have a release process for providers
>> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md.
>> But for "MINOR releases" the community process starts when about 90% of
>> testing effort has already been spent.
>>
>> The important thing is that the release manager is NOT doing testing (and
>> the release process from ASF does not even touch on the subject). Release
>> manager has the role of executing the "mechanics" to produce
>> release artifacts, start voting and has the power of deciding
>> single-handedly if the release should be cancelled (fully or in parts) if
>> there are some issues found (more info on the role and release process here
>> https://www.apache.org/legal/release-policy.html#management). In case we
>> release Providers or Airflow (especially the PATCHLEVEL ones) - we delegate
>> a big part of the testing to whoever was involved in preparing fixes - by
>> the "Status of testing" issue (quite successfully I think). For MINOR
>> "airflow", it's much more complex, usually a lot of testing is done by
>> stakeholders - mostly Astronomer who donates HUGE amount of testing time of
>> Airflow MINOR releases (and this is one of the reasons why Astronomer is
>> able to release new Airflow versions much faster than anyone else because
>> they run a lot of tests in their own infrastructure - and this is actually
>> great contribution to the community :). This has a huge mutual benefit for
>> both - the community and Astronomer.
>>
>> Now - if we are going to do the stress testing before releasing Airflow -
>> the question is who will be doing that. This is quite an effort (I believe)
>> and it requires quite an infrastructure. And if we are to donate any kind
>> of testing harness to the community - it only makes sense if there is a way
>> the community can use it. For example there are (from what I hear) many
>> test scenarios and scripts that Astronomer has and follows, but it's not
>> donated to the community. It simply makes no sense because we have no
>> capacity to run those tests, nor process how we can follow those test
>> scenarios.
>>
>> So now - the question is - who will run such tests and with what
>> infrastructure?  I am not sure what kind of infrastructure it might
>> require, but I think the only way to make it part of the community process
>> is to fully automate it in our CI.
>>
>> This is for example what happened with Docker Image and especially with
>> the ARM version of it. Building and running it requires - generally
>> speaking an ARM hardware and it is a  heavy cpu-and-network process - so
>> until I automated it, it was my personal "commitment" that I will build the
>> image (with the goal that we will be able to fully automate it). Until then
>> releasing of the images was not "community" duty, but "Jarek Potiuk"'s duty
>> (I did automate it from the very beginning, but it took some time and
>> effort to implement - but we finally got this nice and simple CI workflow -
>> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md#manually-prepare-production-docker-image
>> that the release manager (whoever the release manager is) can trigger and
>> release the images.
>>
>> So I think if we want to add such a "test harness before" release,
>> answering the questions above is important. If we are to get it in the
>> community, we need to know what kind of infrastructure it requires, whether
>> it is fully automatable (eventually) and ready-to-use by whoever can
>> trigger it before the release. And when it comes to such testing, there is
>> one more important question - what do we do with the results? Is there a
>> trigger that should make the release manager say "OK - those results are
>> bad enough to not release"? Do we know what the trigger is ? Do we know how
>> to interpret the results? Is it documented and have we run it already on a
>> few releases to get some baseline?
>>
>> I think there are two basic paths we can follow:
>>
>> 1) some stakeholder (Ideally the one it came from - AirBnB in this case)
>> commits to the burden on running it and reporting the results before every
>> release (similar to Astronomer - if you noticed, the few days/weeks before
>> a release there is a flurry of stability issues and fixes coming from
>> Astronomer usually as a result of this testing). Similarly for Production
>> Image it could be just "Jarek Potiuk" as a stakeholder because that was
>> small enough for me to handle
>>
>> 2) if such a test harness is to be donated to Airflow, then it must be
>> preceded with a few releases where point 1) is done by someone who commits
>> to it and makes sure all the "wrinkles" are removed. The release process
>> should be smooth and tested. It should not introduce any more friction to
>> the process and delay it, so running it for a number of releases is a must
>> (this is what I did for the Production Image first and then for the ARM
>> Image version). That someone needs to volunteer and commit to it (same as I
>> did for Prod Image).
>>
>> This is how I see it. I think commiting a code to a repo is likely
>> somewhere around 50% of the project where it is already run and at least
>> semi-automated for a few releases (if we are going to go the route 2). Or
>> is not needed at all (can be kept in AirBnB for example or whoever wants to
>> commit to doing it) if we are going the route 1)/
>>
>> But I am curious if my understanding of it is also what others understand.
>>
>> J.
>>
>>
>> On Tue, Jul 26, 2022 at 1:54 AM Ping Zhang <[email protected]> wrote:
>>
>>> Hi Jarek,
>>>
>>> Friendly bump this thread. What's your thoughts on having a scheduler
>>> perf test before each release and incorporating this metric?
>>>
>>> Also, is there a devops git repo to put these files/logics?
>>>
>>> Thanks,
>>>
>>> Ping
>>>
>>>
>>> On Wed, Jul 13, 2022 at 9:47 AM Ping Zhang <[email protected]> wrote:
>>>
>>>> Hi Jarek,
>>>>
>>>> Yep, it is more useful in the stress test stage before releasing a new
>>>> version with some extra set up to ensure no scheduler performance
>>>> degradation due to a release. This can also help to find the scaling limit
>>>> of the scheduler with a certain SLA, like upper limit of the number of
>>>> tasks in a dag, total number of dag files in a cluster, concurrent running
>>>> dag runs etc.
>>>>
>>>> Very good point about synthetic dag files in the stress test, our team
>>>> is working on a stress test framework that can directly use all production
>>>> dag files to ensure the stress test has the same set of prod dags, but it
>>>> will skip the task execution. It can also generate different kinds of dag
>>>> (including number of tasks, levels etc).
>>>>
>>>> Monitoring the production issues for particular DAGs, time of the day
>>>> is a different issue. I agree that in prod, we should not let the scheduler
>>>> calculate the `dependency met` time.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Ping
>>>>
>>>>
>>>> On Tue, Jul 12, 2022 at 11:02 AM Jarek Potiuk <[email protected]> wrote:
>>>>
>>>>> I think if we limit it to stress tests, this could be an "extra"
>>>>> addition - not even necessarily part of Airflow codebase and adding
>>>>> triggers with a script, on a single database, some kind of
>>>>> test-harness that you always add after you installed airflow in test
>>>>> environment - for that I have far less reservations to use triggers.
>>>>>
>>>>> But if we want to measure the delays in production, that's quite a
>>>>> different story (and different purpose):
>>>>>
>>>>> * The stress tests are synthetic and basically what you will get out
>>>>> of it is "are worse/better in this version than in the previous one"?
>>>>> "How much", "Which synthetic scenarios are affected most" . Those will
>>>>> be done with a few synthetic kinds of traffic/load/shape.
>>>>> * The production is different - you really want to see if you have
>>>>> some problems with particular DAGs, times of the day, week, load etc
>>>>> and you should be able to take some corrective actions ( for example
>>>>> increase number of schedulers, or queues, split your dags etc.) - so
>>>>> even the "scheduling delay" metrics might sound familiar you might
>>>>> want to use completely different dimensions to look at it (how about
>>>>> this DAG? this time of day, this group of dags, this type of workloads
>>>>> etc).
>>>>>
>>>>> I think those two might even be separated and calculated differently
>>>>> (though having a single approach would be likely better). I am not
>>>>> entirely sure but I have a feeling we do not need the scheduler to
>>>>> calculate the "dependency met" while scheduling. I think for
>>>>> production purposes, it would be much better (less overhead) to simply
>>>>> emit "raw" mettrics such as task start/end time of each task plus
>>>>> possibly simple publishing of - mostly static - task dependency rules
>>>>> - then "dependency met" time can be calculated offline based on joined
>>>>> data. That would be roughly equivalent to what you have in the
>>>>> trigger, but without the overhead of triggers- simply instead of
>>>>> storing the events in metadata db we would emit them (for example
>>>>> using otel) and let the external system aggregate them and process it
>>>>> offline independently.
>>>>>
>>>>> The OTEL integration is rather lightweight - most of them use
>>>>> in-memory buffers and efficiently push the data (and even can
>>>>> implement scalable forwarding of the data and pre-aggregation). The
>>>>> nice thing about it is that it can scale much easier. I think that
>>>>> (apart of my earlier reservation) database-trigger approach has this
>>>>> not-nice property that the less workers and schedulers you have, the
>>>>> more "centralized overhead" you have, where the distributed OTEL
>>>>> solution scales together with the system adding more or less fixed
>>>>> overhead per component (providing that the remote telemetry service is
>>>>> also scalable). This makes the trigger approach far less suitable IMHO
>>>>> as we are getting dangerously close to Heisen-Monitoring where the
>>>>> more we observe the system the more we impact its performance.
>>>>>
>>>>> J.
>>>>>
>>>>> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <[email protected]> wrote:
>>>>> >
>>>>> > Hi Jarek,
>>>>> >
>>>>> > Thanks for the insights and pointing out the potential issues with
>>>>> triggers in the prod with scheduler HA setup.
>>>>> >
>>>>> > The solution that I proposed is mainly for the stress test scheduler
>>>>> before each airflow release. We can make changes in the airflow codebase 
>>>>> to
>>>>> emit this metric however:
>>>>> >
>>>>> > 1. It will incur additional overhead for the scheduler to compute
>>>>> the metric as scheduler needs to compute the dependency met time of a 
>>>>> task.
>>>>> > 2. It couples with the implementation of the scheduler. For example,
>>>>> from 1.10.4 to airflow 2, the scheduler has changed a lot. If the metric 
>>>>> is
>>>>> emitted from the scheduler, when making the changes in the scheduler, it
>>>>> also needs to update how the metric is computed and emitted.
>>>>> >
>>>>> > Thus, I think having it out of the airflow core makes it easier to
>>>>> compare the scheduling delay across different airflow versions.
>>>>> >
>>>>> > Thanks for pointing out the OpenTelemetry, let me check it out.
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > Ping
>>>>> >
>>>>> >
>>>>> > On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> Sorry for the late reply - Ping.
>>>>> >>
>>>>> >> TL;DR; I think the metrics might be useful but I think using
>>>>> triggers
>>>>> >> is asking for troubles.
>>>>> >>
>>>>> >> While using triggers sounds like a common approach in a number of
>>>>> >> installations, we do not use triggers so far.
>>>>> >> Using Triggers moves some logic to the database, and in our case we
>>>>> do
>>>>> >> not have it at all - all logic is in Airflow, and we keep it there,
>>>>> >> the database for us is merely "state" storage and "locks". Adding
>>>>> >> database triggers, extends it to also keep some logic there. And
>>>>> >> adding triggers has some worrying "implicitness" which goes against
>>>>> >> the "Explicit is better than Implicit" Zen of Python.
>>>>> >>
>>>>> >> One thing that makes me think "coldly" about this is that it might
>>>>> >> have some undesired side effects - such as synchronizing of changes
>>>>> >> from multiple schedulers on trying to insert such audit entry (you
>>>>> >> need to create an index lock when you insert rows to a table which
>>>>> has
>>>>> >> a primary key/unique indexes).
>>>>> >>
>>>>> >> And what's even more worrying is that we are using SQLAlchemy and
>>>>> >> MySQl/MsSQL/Postgres and we should make sure it works the same in
>>>>> all
>>>>> >> of them. This is troublesome.
>>>>> >>
>>>>> >> Even if we could solve and verify all those problems individually
>>>>> the
>>>>> >> effect is - Once we open the "gate" of triggers, we will get more
>>>>> "ok
>>>>> >> we have trigger here so let's also use it for that and this" and
>>>>> this
>>>>> >> will be hard to say "no" if we already have a precedent, and this
>>>>> >> might lead to more and more logica and features deferred to a
>>>>> database
>>>>> >> logic (and my past experience is that it leads to more complexity
>>>>> and
>>>>> >> implicit behaviours that are difficult to reason about).
>>>>> >>
>>>>> >> But this is only about the technical details of this, not the
>>>>> metrics
>>>>> >> itself. I think the metric you proposed is very useful.
>>>>> >>
>>>>> >> I think however (correct me if I am wrong) - that we do not need
>>>>> >> database triggers for any of those. I have a feeling that this
>>>>> >> proposal is trying to implement the (useful) metrics with very
>>>>> limited
>>>>> >> modification to the Airflow code, so I can understand that you might
>>>>> >> think about it this way when you have your own fork - then it makes
>>>>> >> sense to piggyback on the existing database and use triggers,
>>>>> because
>>>>> >> you do not want to modify Airflow code.
>>>>> >>
>>>>> >> But here - we are in a completely different situation. We CAN modify
>>>>> >> Airflow code and add missing features and functionality to capture
>>>>> the
>>>>> >> necessary metric data in the code,  rather than using triggers. We
>>>>> >> could even define some kind of callbacks for the auditing events
>>>>> that
>>>>> >> would allow us to gather those metrics in a way that does not even
>>>>> use
>>>>> >> the database to store the information for the metrics.
>>>>> >>
>>>>> >> In fact - this leads me to conclusion that we should implement the
>>>>> >> metrics you mention as part of our Open-Telemetry effort
>>>>> >>
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow
>>>>> .
>>>>> >> This is precisely what it was prepared for, once we have
>>>>> >> Open-Telemetry integrated we could add more and more such useful
>>>>> >> metrics more easily, and that could be way more useful, because
>>>>> >> instead of running external custom-db-reading process for that, we
>>>>> >> could not only calculate such metrics using the right metrics
>>>>> tooling
>>>>> >> (each company could use their preferred open-telemetry compliant
>>>>> >> tool), but that would open up all the features like alerting,
>>>>> >> connecting it with traces and other metrics etc. etc.
>>>>> >>
>>>>> >> Howard - WDYT?
>>>>> >>
>>>>> >> J.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka
>>>>> >> <[email protected]> wrote:
>>>>> >> >
>>>>> >> > HI Ping,
>>>>> >> >
>>>>> >> > Apologies for the belated response.
>>>>> >> >
>>>>> >> > We have created a set of stress test DAGs where the tasks take
>>>>> almost no time to execute at all, so that the worker task execution time 
>>>>> is
>>>>> small, and the stress is on the Scheduler and Executor.
>>>>> >> >
>>>>> >> > We then calculate "task latency" aka "task lag" as:
>>>>> >> >  ti_lag = ti.start_date - max_upstream_ti_end_date
>>>>> >> > This is effectively the time between "the downstream task
>>>>> starting" and "the last dependent upstream task complete"
>>>>> >> >
>>>>> >> > We don't use the tasks that don't have any upstream tasks in this
>>>>> metric for measuring task lag.
>>>>> >> > And for tasks that have multiple upstream tasks, we use the
>>>>> upstream task for which the end_date took maximum time as the scheduler
>>>>> waits for completion of all parent tasks before scheduling any downstream
>>>>> task.
>>>>> >> >
>>>>> >> > Vikram
>>>>> >> >
>>>>> >> >
>>>>> >> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <[email protected]>
>>>>> wrote:
>>>>> >> >>
>>>>> >> >> Hi Mehta,
>>>>> >> >>
>>>>> >> >> Good point. The primary goal of the metric is for stress testing
>>>>> to catch airflow scheduler performance regression for 1) our internal
>>>>> scheduler improvement work and 2) airflow version upgrade.
>>>>> >> >>
>>>>> >> >> One of the key benefits of this metric definition is it is
>>>>> independent from the scheduler implementation and it can be
>>>>> computed/backfilled offline.
>>>>> >> >>
>>>>> >> >> Currently, we expose it to the datadog and we (the airflow
>>>>> cluster maintainers) are the main users for it.
>>>>> >> >>
>>>>> >> >> Thanks,
>>>>> >> >>
>>>>> >> >> Ping
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham
>>>>> <[email protected]> wrote:
>>>>> >> >>>
>>>>> >> >>> Ping,
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> I’m very interested in this as well. A good metric can help us
>>>>> benchmark and identify potential improvements in the scheduler 
>>>>> performance.
>>>>> >> >>> In order to understand the proposal better, can you please
>>>>> share where and how do you intend to use “Scheduling delay”? Is it meant
>>>>> for benchmarking or stress testing only? Do you plan to expose it to the
>>>>> users in the Airflow UI?
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Thanks
>>>>> >> >>> Shubham
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> From: Ping Zhang <[email protected]>
>>>>> >> >>> Reply-To: "[email protected]" <[email protected]>
>>>>> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM
>>>>> >> >>> To: "[email protected]" <[email protected]>, "
>>>>> [email protected]" <[email protected]>
>>>>> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay
>>>>> Metric Definition
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> CAUTION: This email originated from outside of the
>>>>> organization. Do not click links or open attachments unless you can 
>>>>> confirm
>>>>> the sender and know the content is safe.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Hi Vikram,
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Thanks for pointing that out, 'task latency',
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> "we define task latency as the time it takes for a task to
>>>>> begin executing once its dependencies have been met."
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> It will be great if you can elaborate more about "begin
>>>>> executing" and how you calculate "its dependencies have been met.".
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> If the 'begin executing' means the state of ti becomes running,
>>>>> then the 'Scheduling Delay' metric focuses on the overhead introduced by
>>>>> the scheduler.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> In our prod and stress test, we use the `task_instance_audit`
>>>>> table ( a new row is created whenever there is state change in
>>>>> task_instance table) to compute the time of a ti should be scheduled.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Thanks,
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Ping
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka
>>>>> <[email protected]> wrote:
>>>>> >> >>>
>>>>> >> >>> Ping,
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> I am quite interested in this topic and trying to understand
>>>>> the difference between the "scheduling delay" metric articulated as
>>>>> compared to the "task latency" aka "task lag" metric which we have been
>>>>> using before.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> As you may recall, we have been using two specific metrics to
>>>>> benchmark Scheduler performance, specifically "task latency" and "task
>>>>> throughput" since Airflow 2.0.
>>>>> >> >>>
>>>>> >> >>> These were described in the 2.0 Scheduler blog post
>>>>> >> >>> Specifically, within that we defined task tatency as the time
>>>>> it takes for the task to begin executing once it's dependencies are all 
>>>>> met.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Thanks,
>>>>> >> >>>
>>>>> >> >>> Vikram
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <[email protected]>
>>>>> wrote:
>>>>> >> >>>
>>>>> >> >>> Hi Airflow Community,
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Airflow is a scheduling platform for data pipelines, however
>>>>> there is no good metric to measure the scheduling delay in the production
>>>>> and also the stress test environment. This makes it hard to catch
>>>>> regressions in the scheduler during the stress test stage.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> I would like to propose an airflow scheduling delay metric
>>>>> definition. Here is the detailed design of the metric and its
>>>>> implementation:
>>>>> >> >>>
>>>>> >> >>>
>>>>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing
>>>>> >> >>>
>>>>> >> >>> Please take a look and any feedback is welcome.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Thanks,
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> Ping
>>>>> >> >>>
>>>>> >> >>>
>>>>>
>>>>

Re: [DISCUSS] Airflow Scheduling Delay Metric Definition

Reply via email to