Hi Jarek, Thanks a lot for the thorough response; those are all legitimate concerns.
How about I prepare a more thorough doc to address your concerns and we can continue the discussion from there? Thanks, Ping On Tue, Jul 26, 2022 at 6:53 AM Jarek Potiuk <[email protected]> wrote: > If I understand correctly, the idea is to run an additional set of stress > tests before releasing a version - without impacting the production version > of Airflow. > > I think if this is something that we want to make part of our release > process, then submitting code somewhere is the last thing to do. Just > submitting the code does not mean that it will be executed. > > Note - this is my personal view on it, I am not sure if this is others' > view as well, but it comes from years of being involved in the release > process and doing it myself - and volunteering to do part of the process > (and improve and perfect it). > > I think the first thing here is to have several answers: > > a) do we want to do it > b) who will do it > c) when this will be done in the release process > d) what infrastructure will be used to run the tests > > The actual "completion" of our release process is described in > https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md > and there is of course a release manager running the release process. This is > usually Jed, recently Ephraim, but it is generally whoever from the PMC > members (or committers, if they have a PMC member ready to sign the > artifacts) raises their hand and says "Hey, I want to be a release > manager". Similarly we have a release process for providers > https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md. > But for "MINOR releases" the community process starts when about 90% of > the testing effort has already been spent. > > The important thing is that the release manager is NOT doing testing (and > the release process from the ASF does not even touch on the subject). The release > manager has the role of executing the "mechanics" to produce > release artifacts, start the voting, and has the power of deciding > single-handedly if the release should be cancelled (fully or in parts) if > some issues are found (more info on the role and release process here: > https://www.apache.org/legal/release-policy.html#management). In case we > release Providers or Airflow (especially the PATCHLEVEL ones) - we delegate > a big part of the testing to whoever was involved in preparing fixes - via > the "Status of testing" issue (quite successfully, I think). For MINOR > "airflow" releases, it's much more complex; usually a lot of testing is done by > stakeholders - mostly Astronomer, who donates a HUGE amount of testing time to > Airflow MINOR releases (and this is one of the reasons why Astronomer is > able to release new Airflow versions much faster than anyone else - because > they run a lot of tests in their own infrastructure - and this is actually a > great contribution to the community :). This has a huge mutual benefit for > both - the community and Astronomer. > > Now - if we are going to do the stress testing before releasing Airflow - > the question is who will be doing that. This is quite an effort (I believe) > and it requires quite a bit of infrastructure. And if we are to donate any kind > of testing harness to the community - it only makes sense if there is a way > the community can use it. For example there are (from what I hear) many > test scenarios and scripts that Astronomer has and follows, but they are not > donated to the community.
It simply makes no sense because we have no > capacity to run those tests, nor a process for how we can follow those test > scenarios. > > So now - the question is - who will run such tests and with what > infrastructure? I am not sure what kind of infrastructure it might > require, but I think the only way to make it part of the community process > is to fully automate it in our CI. > > This is for example what happened with the Docker Image and especially with > the ARM version of it. Building and running it requires - generally > speaking - ARM hardware, and it is a heavy CPU-and-network process - so > until I automated it, it was my personal "commitment" that I would build the > image (with the goal that we would be able to fully automate it). Until then, > releasing the images was not a "community" duty, but "Jarek Potiuk"'s duty > (I did aim to automate it from the very beginning, but it took some time and > effort to implement - but we finally got this nice and simple CI workflow - > https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md#manually-prepare-production-docker-image > - that the release manager (whoever the release manager is) can trigger to > release the images). > > So I think if we want to add such a "test harness before release", > answering the questions above is important. If we are to get it into the > community, we need to know what kind of infrastructure it requires, whether > it is fully automatable (eventually) and ready-to-use by whoever can > trigger it before the release. And when it comes to such testing, there is > one more important question - what do we do with the results? Is there a > trigger that should make the release manager say "OK - those results are > bad enough to not release"? Do we know what the trigger is? Do we know how > to interpret the results? Is it documented, and have we run it already on a > few releases to get some baseline? > > I think there are two basic paths we can follow: > > 1) some stakeholder (ideally the one it came from - AirBnB in this case) > commits to the burden of running it and reporting the results before every > release (similar to Astronomer - if you noticed, in the few days/weeks before > a release there is a flurry of stability issues and fixes coming from > Astronomer, usually as a result of this testing). Similarly, for the Production > Image it could be just "Jarek Potiuk" as a stakeholder, because that was > small enough for me to handle. > > 2) if such a test harness is to be donated to Airflow, then it must be > preceded by a few releases where point 1) is done by someone who commits > to it and makes sure all the "wrinkles" are removed. The release process > should be smooth and tested. It should not introduce any more friction to > the process or delay it, so running it for a number of releases is a must > (this is what I did for the Production Image first and then for the ARM > Image version). That someone needs to volunteer and commit to it (same as I > did for the Prod Image). > > This is how I see it. I think committing code to a repo is likely > somewhere around the 50% point of the project, where it has already been run and at least > semi-automated for a few releases (if we are going to go route 2). Or > it is not needed at all (it can be kept in AirBnB, for example, or with whoever wants to > commit to doing it) if we are going route 1). > > But I am curious if my understanding of it is also what others understand. > > J.
> > > On Tue, Jul 26, 2022 at 1:54 AM Ping Zhang <[email protected]> wrote: > >> Hi Jarek, >> >> A friendly bump on this thread. What are your thoughts on having a scheduler >> perf test before each release and incorporating this metric? >> >> Also, is there a devops git repo to put these files/logic in? >> >> Thanks, >> >> Ping >> >> >> On Wed, Jul 13, 2022 at 9:47 AM Ping Zhang <[email protected]> wrote: >> >>> Hi Jarek, >>> >>> Yep, it is more useful in the stress test stage before releasing a new >>> version, with some extra setup to ensure no scheduler performance >>> degradation due to a release. This can also help to find the scaling limit >>> of the scheduler with a certain SLA, like the upper limit of the number of >>> tasks in a dag, the total number of dag files in a cluster, concurrent running >>> dag runs, etc. >>> >>> Very good point about synthetic dag files in the stress test; our team >>> is working on a stress test framework that can directly use all production >>> dag files to ensure the stress test has the same set of prod dags, but it >>> will skip the task execution. It can also generate different kinds of dags >>> (varying the number of tasks, levels, etc.). >>> >>> Monitoring production issues for particular DAGs, times of the day, etc. is >>> a different issue. I agree that in prod, we should not let the scheduler >>> calculate the `dependency met` time. >>> >>> >>> Thanks, >>> >>> Ping >>> >>> >>> On Tue, Jul 12, 2022 at 11:02 AM Jarek Potiuk <[email protected]> wrote: >>> >>>> I think if we limit it to stress tests, this could be an "extra" >>>> addition - not even necessarily part of the Airflow codebase: adding >>>> triggers with a script, on a single database, some kind of >>>> test harness that you always add after you have installed airflow in a test >>>> environment - for that I have far fewer reservations about using triggers. >>>> >>>> But if we want to measure the delays in production, that's quite a >>>> different story (and a different purpose): >>>> >>>> * The stress tests are synthetic, and basically what you will get out >>>> of them is "are things worse/better in this version than in the previous one?", >>>> "How much?", "Which synthetic scenarios are affected most?". Those will >>>> be done with a few synthetic kinds of traffic/load/shape. >>>> * Production is different - you really want to see if you have >>>> some problems with particular DAGs, times of the day, week, load, etc., >>>> and you should be able to take some corrective actions (for example >>>> increase the number of schedulers or queues, split your dags, etc.) - so >>>> even if the "scheduling delay" metric might sound similar, you might >>>> want to use completely different dimensions to look at it (how about >>>> this DAG? this time of day, this group of dags, this type of workload, >>>> etc.). >>>> >>>> I think those two might even be separated and calculated differently >>>> (though having a single approach would likely be better). I am not >>>> entirely sure, but I have a feeling we do not need the scheduler to >>>> calculate the "dependency met" time while scheduling. I think for >>>> production purposes, it would be much better (less overhead) to simply >>>> emit "raw" metrics such as the start/end time of each task plus >>>> possibly simple publishing of the - mostly static - task dependency rules >>>> - then the "dependency met" time can be calculated offline based on joined >>>> data.
That would be roughly equivalent to what you have in the >>>> trigger, but without the overhead of triggers - simply, instead of >>>> storing the events in the metadata db, we would emit them (for example >>>> using otel) and let the external system aggregate them and process them >>>> offline independently. >>>> >>>> The OTEL integration is rather lightweight - most implementations use >>>> in-memory buffers and efficiently push the data (and can even >>>> implement scalable forwarding of the data and pre-aggregation). The >>>> nice thing about it is that it can scale much more easily. I think that >>>> (apart from my earlier reservations) the database-trigger approach has this >>>> not-nice property that the more workers and schedulers you have, the >>>> more "centralized overhead" you have, whereas the distributed OTEL >>>> solution scales together with the system, adding more or less fixed >>>> overhead per component (provided that the remote telemetry service is >>>> also scalable). This makes the trigger approach far less suitable IMHO, >>>> as we are getting dangerously close to Heisen-Monitoring, where the >>>> more we observe the system, the more we impact its performance. >>>> >>>> J. >>>> >>>> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <[email protected]> wrote: >>>> > >>>> > Hi Jarek, >>>> > >>>> > Thanks for the insights and for pointing out the potential issues with >>>> triggers in prod with a scheduler HA setup. >>>> > >>>> > The solution that I proposed is mainly for stress testing the scheduler >>>> before each airflow release. We can make changes in the airflow codebase to >>>> emit this metric; however: >>>> > >>>> > 1. It will incur additional overhead for the scheduler to compute the >>>> metric, as the scheduler needs to compute the dependency met time of a task. >>>> > 2. It couples with the implementation of the scheduler. For example, >>>> from 1.10.4 to airflow 2, the scheduler has changed a lot. If the metric is >>>> emitted from the scheduler, when making changes in the scheduler, one >>>> also needs to update how the metric is computed and emitted. >>>> > >>>> > Thus, I think having it out of the airflow core makes it easier to >>>> compare the scheduling delay across different airflow versions. >>>> > >>>> > Thanks for pointing out OpenTelemetry, let me check it out. >>>> > >>>> > Thanks, >>>> > >>>> > Ping >>>> > >>>> > >>>> > On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <[email protected]> >>>> wrote: >>>> >> >>>> >> Sorry for the late reply - Ping. >>>> >> >>>> >> TL;DR: I think the metrics might be useful, but I think using triggers >>>> >> is asking for trouble. >>>> >> >>>> >> While using triggers sounds like a common approach in a number of >>>> >> installations, we have not used triggers so far. >>>> >> Using triggers moves some logic to the database, and in our case we do >>>> >> not have it there at all - all logic is in Airflow, and we keep it there; >>>> >> the database for us is merely "state" storage and "locks". Adding >>>> >> database triggers extends it to also keep some logic there. And >>>> >> adding triggers has some worrying "implicitness" which goes against >>>> >> the "Explicit is better than implicit" Zen of Python. >>>> >> >>>> >> One thing that makes me think "coldly" about this is that it might >>>> >> have some undesired side effects - such as synchronization of changes >>>> >> from multiple schedulers when trying to insert such an audit entry (you >>>> >> need to create an index lock when you insert rows into a table which has >>>> >> primary key/unique indexes).
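To make the OTEL idea above concrete: a minimal sketch of emitting one span per task-instance run, so that start/end timestamps plus the (mostly static) dependency rules can be joined offline to compute the "dependency met" time. This is not Airflow's actual integration (AIP-49 is still a proposal); the wrapper function, span name and attribute names here are illustrative assumptions only.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Batching keeps per-event overhead low; swap ConsoleSpanExporter for an OTLP
# exporter pointing at your telemetry backend in a real setup.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("airflow.task_lifecycle.sketch")

def run_task(dag_id: str, task_id: str, run_id: str, upstream_task_ids: list) -> None:
    # Hypothetical wrapper around task execution; in real Airflow this would hook
    # into the task lifecycle rather than wrap user code directly.
    with tracer.start_as_current_span(
        "task_instance_run",
        attributes={
            "dag_id": dag_id,
            "task_id": task_id,
            "run_id": run_id,
            # Publishing the dependency rules lets an external system compute
            # "dependency met" = max(end time of the upstream spans) offline.
            "upstream_task_ids": ",".join(upstream_task_ids),
        },
    ):
        pass  # actual task execution would happen here

The spans carry their own start/end timestamps, so the external system gets the "raw" events described above without the scheduler computing anything extra.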
>>>> >> >>>> >> And what's even more worrying is that we are using SQLAlchemy and >>>> >> MySQL/MSSQL/Postgres, and we should make sure it works the same in all >>>> >> of them. This is troublesome. >>>> >> >>>> >> Even if we could solve and verify all those problems individually, the >>>> >> effect is - once we open the "gate" of triggers, we will get more "ok, >>>> >> we have a trigger here so let's also use it for that and this", and it >>>> >> will be hard to say "no" if we already have a precedent, and this >>>> >> might lead to more and more logic and features deferred to database >>>> >> logic (and my past experience is that it leads to more complexity and >>>> >> implicit behaviours that are difficult to reason about). >>>> >> >>>> >> But this is only about the technical details of this, not the metric >>>> >> itself. I think the metric you proposed is very useful. >>>> >> >>>> >> I think, however (correct me if I am wrong), that we do not need >>>> >> database triggers for any of those. I have a feeling that this >>>> >> proposal is trying to implement the (useful) metrics with very limited >>>> >> modification to the Airflow code, so I can understand that you might >>>> >> think about it this way when you have your own fork - then it makes >>>> >> sense to piggyback on the existing database and use triggers, because >>>> >> you do not want to modify Airflow code. >>>> >> >>>> >> But here - we are in a completely different situation. We CAN modify >>>> >> Airflow code and add missing features and functionality to capture the >>>> >> necessary metric data in the code, rather than using triggers. We >>>> >> could even define some kind of callbacks for the auditing events that >>>> >> would allow us to gather those metrics in a way that does not even use >>>> >> the database to store the information for the metrics. >>>> >> >>>> >> In fact - this leads me to the conclusion that we should implement the >>>> >> metrics you mention as part of our Open-Telemetry effort: >>>> >> >>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow >>>> . >>>> >> This is precisely what it was prepared for; once we have >>>> >> Open-Telemetry integrated, we could add more and more such useful >>>> >> metrics more easily, and that could be way more useful, because >>>> >> instead of running an external custom db-reading process for that, we >>>> >> could not only calculate such metrics using the right metrics tooling >>>> >> (each company could use their preferred open-telemetry compliant >>>> >> tool), but that would open up all the features like alerting, >>>> >> connecting it with traces and other metrics, etc. >>>> >> >>>> >> Howard - WDYT? >>>> >> >>>> >> J. >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka >>>> >> <[email protected]> wrote: >>>> >> > >>>> >> > Hi Ping, >>>> >> > >>>> >> > Apologies for the belated response. >>>> >> > >>>> >> > We have created a set of stress test DAGs where the tasks take >>>> almost no time to execute at all, so that the worker task execution time is >>>> small and the stress is on the Scheduler and Executor.
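The stress-test DAGs referred to above are not shown in the thread, but a minimal sketch of their general shape (a chain of no-op tasks, so any measured latency is scheduler/executor overhead rather than work done in the worker) could look like the following. It assumes Airflow 2.3+ where EmptyOperator is available; the dag_id, chain length and schedule are arbitrary choices for illustration.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="stress_test_linear_chain",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # A linear chain of no-op tasks: worker time is ~0, so task lag is dominated
    # by scheduling/queuing overhead.
    previous = EmptyOperator(task_id="task_0")
    for i in range(1, 50):
        current = EmptyOperator(task_id=f"task_{i}")
        previous >> current
        previous = current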
>>>> >> > >>>> >> > We then calculate "task latency" aka "task lag" as: >>>> >> > ti_lag = ti.start_date - max_upstream_ti_end_date >>>> >> > This is effectively the time between "the downstream task >>>> starting" and "the last dependent upstream task completing". >>>> >> > >>>> >> > We don't include tasks that don't have any upstream tasks in this >>>> metric when measuring task lag. >>>> >> > And for tasks that have multiple upstream tasks, we use the >>>> upstream task with the latest end_date, as the scheduler >>>> waits for completion of all parent tasks before scheduling any downstream >>>> task. >>>> >> > >>>> >> > Vikram >>>> >> > >>>> >> > >>>> >> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <[email protected]> >>>> wrote: >>>> >> >> >>>> >> >> Hi Mehta, >>>> >> >> >>>> >> >> Good point. The primary goal of the metric is for stress testing, >>>> to catch airflow scheduler performance regressions for 1) our internal >>>> scheduler improvement work and 2) airflow version upgrades. >>>> >> >> >>>> >> >> One of the key benefits of this metric definition is that it is >>>> independent of the scheduler implementation and it can be >>>> computed/backfilled offline. >>>> >> >> >>>> >> >> Currently, we expose it to Datadog and we (the airflow >>>> cluster maintainers) are the main users of it. >>>> >> >> >>>> >> >> Thanks, >>>> >> >> >>>> >> >> Ping >>>> >> >> >>>> >> >> >>>> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham >>>> <[email protected]> wrote: >>>> >> >>> >>>> >> >>> Ping, >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> I'm very interested in this as well. A good metric can help us >>>> benchmark and identify potential improvements in scheduler performance. >>>> >> >>> In order to understand the proposal better, can you please share >>>> where and how you intend to use "Scheduling delay"? Is it meant for >>>> benchmarking or stress testing only? Do you plan to expose it to the users >>>> in the Airflow UI? >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks >>>> >> >>> Shubham >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> From: Ping Zhang <[email protected]> >>>> >> >>> Reply-To: "[email protected]" <[email protected]> >>>> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM >>>> >> >>> To: "[email protected]" <[email protected]>, " >>>> [email protected]" <[email protected]> >>>> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric >>>> Definition >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Hi Vikram, >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks for pointing that out, 'task latency': >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> "we define task latency as the time it takes for a task to begin >>>> executing once its dependencies have been met." >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> It would be great if you could elaborate more on "begin >>>> executing" and how you calculate "its dependencies have been met". >>>> >> >>> >>>> >> >>> >>>> >> >>> If the 'begin executing' means the state of the ti becomes running, >>>> then the 'Scheduling Delay' metric focuses on the overhead introduced by >>>> the scheduler.
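As a concrete illustration of the task-lag formula quoted above (ti_lag = ti.start_date - max_upstream_ti_end_date), here is a small sketch of computing it offline from task-instance start/end times plus a static upstream map. The data structures and field names are assumptions for illustration, not Airflow's actual schema or API.

from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

def task_lag(
    ti_start: Dict[Tuple[str, str], datetime],  # (run_id, task_id) -> start_date
    ti_end: Dict[Tuple[str, str], datetime],    # (run_id, task_id) -> end_date
    upstream: Dict[str, List[str]],             # task_id -> upstream task_ids
    run_id: str,
    task_id: str,
) -> Optional[timedelta]:
    # ti_lag = ti.start_date - max(end_date of upstream tis); tasks with no
    # upstream tasks are excluded from the metric, as described above.
    parents = upstream.get(task_id, [])
    if not parents:
        return None
    dependency_met = max(ti_end[(run_id, p)] for p in parents)
    return ti_start[(run_id, task_id)] - dependency_met

Because it only needs start/end timestamps and the dependency map, this can be computed or backfilled offline, independent of the scheduler implementation.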
>>>> >> >>> >>>> >> >>> >>>> >> >>> In our prod and stress test, we use the `task_instance_audit` >>>> table (a new row is created whenever there is a state change in the >>>> task_instance table) to compute the time at which a ti should be scheduled. >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks, >>>> >> >>> >>>> >> >>> >>>> >> >>> Ping >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka >>>> <[email protected]> wrote: >>>> >> >>> >>>> >> >>> Ping, >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> I am quite interested in this topic and am trying to understand the >>>> difference between the "scheduling delay" metric as articulated, compared to >>>> the "task latency" aka "task lag" metric which we have been using before. >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> As you may recall, we have been using two specific metrics to >>>> benchmark Scheduler performance, specifically "task latency" and "task >>>> throughput", since Airflow 2.0. >>>> >> >>> >>>> >> >>> These were described in the 2.0 Scheduler blog post. >>>> >> >>> Specifically, within that we defined task latency as the time it >>>> takes for the task to begin executing once its dependencies are all met. >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks, >>>> >> >>> >>>> >> >>> Vikram >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <[email protected]> >>>> wrote: >>>> >> >>> >>>> >> >>> Hi Airflow Community, >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Airflow is a scheduling platform for data pipelines; however, >>>> there is no good metric to measure the scheduling delay in production >>>> and also in the stress test environment. This makes it hard to catch >>>> regressions in the scheduler during the stress test stage. >>>> >> >>> >>>> >> >>> >>>> >> >>> I would like to propose an airflow scheduling delay metric >>>> definition. Here is the detailed design of the metric and its >>>> implementation: >>>> >> >>> >>>> >> >>> >>>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing >>>> >> >>> >>>> >> >>> Please take a look; any feedback is welcome. >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> Thanks, >>>> >> >>> >>>> >> >>> >>>> >> >>> Ping >>>> >> >>> >>>> >> >>> >>>> >>>
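For comparison, a rough sketch of how a scheduling-delay number could be derived from task-instance state-change rows such as those in the `task_instance_audit` table mentioned above. The row shape, state names and the reading of "time at which a ti should be scheduled" are assumptions based on this thread, not the actual AirBnB implementation or the linked design doc.

from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Optional

class StateChange(NamedTuple):
    dag_id: str
    task_id: str
    run_id: str
    state: str           # e.g. "scheduled", "queued", "running"
    changed_at: datetime

def scheduling_delay(events: Iterable[StateChange]) -> Optional[timedelta]:
    # Delay between a task instance first becoming schedulable (approximated here
    # by its first transition to "scheduled") and it actually starting to run.
    ordered: List[StateChange] = sorted(events, key=lambda e: e.changed_at)
    schedulable_at = next((e.changed_at for e in ordered if e.state == "scheduled"), None)
    running_at = next((e.changed_at for e in ordered if e.state == "running"), None)
    if schedulable_at is None or running_at is None:
        return None
    return running_at - schedulable_at

Since the computation runs over recorded state-change rows, it can be done entirely offline, which is the property Ping highlights for comparing scheduling delay across Airflow versions.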
