Sure On Tue, Jul 26, 2022 at 7:41 PM Ping Zhang <[email protected]> wrote:
> Hi Jarek, > > Thanks a lot for the thorough response and they are all legitimate > concerns. > > How about that I prepare a more thorough doc to address your concerns and > we can continue the discussion from there? > > Thanks, > > Ping > > > > Thanks, > > Ping > > > On Tue, Jul 26, 2022 at 6:53 AM Jarek Potiuk <[email protected]> wrote: > >> If I understand correctly, the Idea is to run an additional set of stress >> tests before releasing a version - without impacting the production version >> of Airflow. >> >> I think if this is something that we want to make part of our release >> process, then submitting a code somewhere is the last thing to do. Just >> submitting the code does not mean that it will be executed. >> >> Note - this is my personal view on it, I am not sure if this is other's >> view as well, but it comes from years of being involved in the release >> process and doing it myself - and volunteering to do part of the process >> (and improve and perfect it). >> >> I think the first thing here is to have several answers: >> >> a) do we want to do it >> b) who will do it >> c) when this will be done in the release process >> d) what infrastructure will be used to run the tests >> >> The actiual "completion" of the release process of ours is described in >> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md >> and there is of course release manager running the release process. This is >> usually Jed, Ephraim recently but it is generally whoever from the PMC >> members (or committers if they have a PMC member ready to sign the >> artifacts) who raises their hand and say "Hey I want to be a release >> manager". Similarly we have a release process for providers >> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_PROVIDER_PACKAGES.md. >> But for "MINOR releases" the community process starts when about 90% of >> testing effort has already been spent. >> >> The important thing is that the release manager is NOT doing testing (and >> the release process from ASF does not even touch on the subject). Release >> manager has the role of executing the "mechanics" to produce >> release artifacts, start voting and has the power of deciding >> single-handedly if the release should be cancelled (fully or in parts) if >> there are some issues found (more info on the role and release process here >> https://www.apache.org/legal/release-policy.html#management). In case we >> release Providers or Airflow (especially the PATCHLEVEL ones) - we delegate >> a big part of the testing to whoever was involved in preparing fixes - by >> the "Status of testing" issue (quite successfully I think). For MINOR >> "airflow", it's much more complex, usually a lot of testing is done by >> stakeholders - mostly Astronomer who donates HUGE amount of testing time of >> Airflow MINOR releases (and this is one of the reasons why Astronomer is >> able to release new Airflow versions much faster than anyone else because >> they run a lot of tests in their own infrastructure - and this is actually >> great contribution to the community :). This has a huge mutual benefit for >> both - the community and Astronomer. >> >> Now - if we are going to do the stress testing before releasing Airflow - >> the question is who will be doing that. This is quite an effort (I believe) >> and it requires quite an infrastructure. And if we are to donate any kind >> of testing harness to the community - it only makes sense if there is a way >> the community can use it. For example there are (from what I hear) many >> test scenarios and scripts that Astronomer has and follows, but it's not >> donated to the community. It simply makes no sense because we have no >> capacity to run those tests, nor process how we can follow those test >> scenarios. >> >> So now - the question is - who will run such tests and with what >> infrastructure? I am not sure what kind of infrastructure it might >> require, but I think the only way to make it part of the community process >> is to fully automate it in our CI. >> >> This is for example what happened with Docker Image and especially with >> the ARM version of it. Building and running it requires - generally >> speaking an ARM hardware and it is a heavy cpu-and-network process - so >> until I automated it, it was my personal "commitment" that I will build the >> image (with the goal that we will be able to fully automate it). Until then >> releasing of the images was not "community" duty, but "Jarek Potiuk"'s duty >> (I did automate it from the very beginning, but it took some time and >> effort to implement - but we finally got this nice and simple CI workflow - >> https://github.com/apache/airflow/blob/main/dev/README_RELEASE_AIRFLOW.md#manually-prepare-production-docker-image >> that the release manager (whoever the release manager is) can trigger and >> release the images. >> >> So I think if we want to add such a "test harness before" release, >> answering the questions above is important. If we are to get it in the >> community, we need to know what kind of infrastructure it requires, whether >> it is fully automatable (eventually) and ready-to-use by whoever can >> trigger it before the release. And when it comes to such testing, there is >> one more important question - what do we do with the results? Is there a >> trigger that should make the release manager say "OK - those results are >> bad enough to not release"? Do we know what the trigger is ? Do we know how >> to interpret the results? Is it documented and have we run it already on a >> few releases to get some baseline? >> >> I think there are two basic paths we can follow: >> >> 1) some stakeholder (Ideally the one it came from - AirBnB in this case) >> commits to the burden on running it and reporting the results before every >> release (similar to Astronomer - if you noticed, the few days/weeks before >> a release there is a flurry of stability issues and fixes coming from >> Astronomer usually as a result of this testing). Similarly for Production >> Image it could be just "Jarek Potiuk" as a stakeholder because that was >> small enough for me to handle >> >> 2) if such a test harness is to be donated to Airflow, then it must be >> preceded with a few releases where point 1) is done by someone who commits >> to it and makes sure all the "wrinkles" are removed. The release process >> should be smooth and tested. It should not introduce any more friction to >> the process and delay it, so running it for a number of releases is a must >> (this is what I did for the Production Image first and then for the ARM >> Image version). That someone needs to volunteer and commit to it (same as I >> did for Prod Image). >> >> This is how I see it. I think commiting a code to a repo is likely >> somewhere around 50% of the project where it is already run and at least >> semi-automated for a few releases (if we are going to go the route 2). Or >> is not needed at all (can be kept in AirBnB for example or whoever wants to >> commit to doing it) if we are going the route 1)/ >> >> But I am curious if my understanding of it is also what others understand. >> >> J. >> >> >> On Tue, Jul 26, 2022 at 1:54 AM Ping Zhang <[email protected]> wrote: >> >>> Hi Jarek, >>> >>> Friendly bump this thread. What's your thoughts on having a scheduler >>> perf test before each release and incorporating this metric? >>> >>> Also, is there a devops git repo to put these files/logics? >>> >>> Thanks, >>> >>> Ping >>> >>> >>> On Wed, Jul 13, 2022 at 9:47 AM Ping Zhang <[email protected]> wrote: >>> >>>> Hi Jarek, >>>> >>>> Yep, it is more useful in the stress test stage before releasing a new >>>> version with some extra set up to ensure no scheduler performance >>>> degradation due to a release. This can also help to find the scaling limit >>>> of the scheduler with a certain SLA, like upper limit of the number of >>>> tasks in a dag, total number of dag files in a cluster, concurrent running >>>> dag runs etc. >>>> >>>> Very good point about synthetic dag files in the stress test, our team >>>> is working on a stress test framework that can directly use all production >>>> dag files to ensure the stress test has the same set of prod dags, but it >>>> will skip the task execution. It can also generate different kinds of dag >>>> (including number of tasks, levels etc). >>>> >>>> Monitoring the production issues for particular DAGs, time of the day >>>> is a different issue. I agree that in prod, we should not let the scheduler >>>> calculate the `dependency met` time. >>>> >>>> >>>> Thanks, >>>> >>>> Ping >>>> >>>> >>>> On Tue, Jul 12, 2022 at 11:02 AM Jarek Potiuk <[email protected]> wrote: >>>> >>>>> I think if we limit it to stress tests, this could be an "extra" >>>>> addition - not even necessarily part of Airflow codebase and adding >>>>> triggers with a script, on a single database, some kind of >>>>> test-harness that you always add after you installed airflow in test >>>>> environment - for that I have far less reservations to use triggers. >>>>> >>>>> But if we want to measure the delays in production, that's quite a >>>>> different story (and different purpose): >>>>> >>>>> * The stress tests are synthetic and basically what you will get out >>>>> of it is "are worse/better in this version than in the previous one"? >>>>> "How much", "Which synthetic scenarios are affected most" . Those will >>>>> be done with a few synthetic kinds of traffic/load/shape. >>>>> * The production is different - you really want to see if you have >>>>> some problems with particular DAGs, times of the day, week, load etc >>>>> and you should be able to take some corrective actions ( for example >>>>> increase number of schedulers, or queues, split your dags etc.) - so >>>>> even the "scheduling delay" metrics might sound familiar you might >>>>> want to use completely different dimensions to look at it (how about >>>>> this DAG? this time of day, this group of dags, this type of workloads >>>>> etc). >>>>> >>>>> I think those two might even be separated and calculated differently >>>>> (though having a single approach would be likely better). I am not >>>>> entirely sure but I have a feeling we do not need the scheduler to >>>>> calculate the "dependency met" while scheduling. I think for >>>>> production purposes, it would be much better (less overhead) to simply >>>>> emit "raw" mettrics such as task start/end time of each task plus >>>>> possibly simple publishing of - mostly static - task dependency rules >>>>> - then "dependency met" time can be calculated offline based on joined >>>>> data. That would be roughly equivalent to what you have in the >>>>> trigger, but without the overhead of triggers- simply instead of >>>>> storing the events in metadata db we would emit them (for example >>>>> using otel) and let the external system aggregate them and process it >>>>> offline independently. >>>>> >>>>> The OTEL integration is rather lightweight - most of them use >>>>> in-memory buffers and efficiently push the data (and even can >>>>> implement scalable forwarding of the data and pre-aggregation). The >>>>> nice thing about it is that it can scale much easier. I think that >>>>> (apart of my earlier reservation) database-trigger approach has this >>>>> not-nice property that the less workers and schedulers you have, the >>>>> more "centralized overhead" you have, where the distributed OTEL >>>>> solution scales together with the system adding more or less fixed >>>>> overhead per component (providing that the remote telemetry service is >>>>> also scalable). This makes the trigger approach far less suitable IMHO >>>>> as we are getting dangerously close to Heisen-Monitoring where the >>>>> more we observe the system the more we impact its performance. >>>>> >>>>> J. >>>>> >>>>> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <[email protected]> wrote: >>>>> > >>>>> > Hi Jarek, >>>>> > >>>>> > Thanks for the insights and pointing out the potential issues with >>>>> triggers in the prod with scheduler HA setup. >>>>> > >>>>> > The solution that I proposed is mainly for the stress test scheduler >>>>> before each airflow release. We can make changes in the airflow codebase >>>>> to >>>>> emit this metric however: >>>>> > >>>>> > 1. It will incur additional overhead for the scheduler to compute >>>>> the metric as scheduler needs to compute the dependency met time of a >>>>> task. >>>>> > 2. It couples with the implementation of the scheduler. For example, >>>>> from 1.10.4 to airflow 2, the scheduler has changed a lot. If the metric >>>>> is >>>>> emitted from the scheduler, when making the changes in the scheduler, it >>>>> also needs to update how the metric is computed and emitted. >>>>> > >>>>> > Thus, I think having it out of the airflow core makes it easier to >>>>> compare the scheduling delay across different airflow versions. >>>>> > >>>>> > Thanks for pointing out the OpenTelemetry, let me check it out. >>>>> > >>>>> > Thanks, >>>>> > >>>>> > Ping >>>>> > >>>>> > >>>>> > On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <[email protected]> >>>>> wrote: >>>>> >> >>>>> >> Sorry for the late reply - Ping. >>>>> >> >>>>> >> TL;DR; I think the metrics might be useful but I think using >>>>> triggers >>>>> >> is asking for troubles. >>>>> >> >>>>> >> While using triggers sounds like a common approach in a number of >>>>> >> installations, we do not use triggers so far. >>>>> >> Using Triggers moves some logic to the database, and in our case we >>>>> do >>>>> >> not have it at all - all logic is in Airflow, and we keep it there, >>>>> >> the database for us is merely "state" storage and "locks". Adding >>>>> >> database triggers, extends it to also keep some logic there. And >>>>> >> adding triggers has some worrying "implicitness" which goes against >>>>> >> the "Explicit is better than Implicit" Zen of Python. >>>>> >> >>>>> >> One thing that makes me think "coldly" about this is that it might >>>>> >> have some undesired side effects - such as synchronizing of changes >>>>> >> from multiple schedulers on trying to insert such audit entry (you >>>>> >> need to create an index lock when you insert rows to a table which >>>>> has >>>>> >> a primary key/unique indexes). >>>>> >> >>>>> >> And what's even more worrying is that we are using SQLAlchemy and >>>>> >> MySQl/MsSQL/Postgres and we should make sure it works the same in >>>>> all >>>>> >> of them. This is troublesome. >>>>> >> >>>>> >> Even if we could solve and verify all those problems individually >>>>> the >>>>> >> effect is - Once we open the "gate" of triggers, we will get more >>>>> "ok >>>>> >> we have trigger here so let's also use it for that and this" and >>>>> this >>>>> >> will be hard to say "no" if we already have a precedent, and this >>>>> >> might lead to more and more logica and features deferred to a >>>>> database >>>>> >> logic (and my past experience is that it leads to more complexity >>>>> and >>>>> >> implicit behaviours that are difficult to reason about). >>>>> >> >>>>> >> But this is only about the technical details of this, not the >>>>> metrics >>>>> >> itself. I think the metric you proposed is very useful. >>>>> >> >>>>> >> I think however (correct me if I am wrong) - that we do not need >>>>> >> database triggers for any of those. I have a feeling that this >>>>> >> proposal is trying to implement the (useful) metrics with very >>>>> limited >>>>> >> modification to the Airflow code, so I can understand that you might >>>>> >> think about it this way when you have your own fork - then it makes >>>>> >> sense to piggyback on the existing database and use triggers, >>>>> because >>>>> >> you do not want to modify Airflow code. >>>>> >> >>>>> >> But here - we are in a completely different situation. We CAN modify >>>>> >> Airflow code and add missing features and functionality to capture >>>>> the >>>>> >> necessary metric data in the code, rather than using triggers. We >>>>> >> could even define some kind of callbacks for the auditing events >>>>> that >>>>> >> would allow us to gather those metrics in a way that does not even >>>>> use >>>>> >> the database to store the information for the metrics. >>>>> >> >>>>> >> In fact - this leads me to conclusion that we should implement the >>>>> >> metrics you mention as part of our Open-Telemetry effort >>>>> >> >>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow >>>>> . >>>>> >> This is precisely what it was prepared for, once we have >>>>> >> Open-Telemetry integrated we could add more and more such useful >>>>> >> metrics more easily, and that could be way more useful, because >>>>> >> instead of running external custom-db-reading process for that, we >>>>> >> could not only calculate such metrics using the right metrics >>>>> tooling >>>>> >> (each company could use their preferred open-telemetry compliant >>>>> >> tool), but that would open up all the features like alerting, >>>>> >> connecting it with traces and other metrics etc. etc. >>>>> >> >>>>> >> Howard - WDYT? >>>>> >> >>>>> >> J. >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> >>>>> >> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka >>>>> >> <[email protected]> wrote: >>>>> >> > >>>>> >> > HI Ping, >>>>> >> > >>>>> >> > Apologies for the belated response. >>>>> >> > >>>>> >> > We have created a set of stress test DAGs where the tasks take >>>>> almost no time to execute at all, so that the worker task execution time >>>>> is >>>>> small, and the stress is on the Scheduler and Executor. >>>>> >> > >>>>> >> > We then calculate "task latency" aka "task lag" as: >>>>> >> > ti_lag = ti.start_date - max_upstream_ti_end_date >>>>> >> > This is effectively the time between "the downstream task >>>>> starting" and "the last dependent upstream task complete" >>>>> >> > >>>>> >> > We don't use the tasks that don't have any upstream tasks in this >>>>> metric for measuring task lag. >>>>> >> > And for tasks that have multiple upstream tasks, we use the >>>>> upstream task for which the end_date took maximum time as the scheduler >>>>> waits for completion of all parent tasks before scheduling any downstream >>>>> task. >>>>> >> > >>>>> >> > Vikram >>>>> >> > >>>>> >> > >>>>> >> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <[email protected]> >>>>> wrote: >>>>> >> >> >>>>> >> >> Hi Mehta, >>>>> >> >> >>>>> >> >> Good point. The primary goal of the metric is for stress testing >>>>> to catch airflow scheduler performance regression for 1) our internal >>>>> scheduler improvement work and 2) airflow version upgrade. >>>>> >> >> >>>>> >> >> One of the key benefits of this metric definition is it is >>>>> independent from the scheduler implementation and it can be >>>>> computed/backfilled offline. >>>>> >> >> >>>>> >> >> Currently, we expose it to the datadog and we (the airflow >>>>> cluster maintainers) are the main users for it. >>>>> >> >> >>>>> >> >> Thanks, >>>>> >> >> >>>>> >> >> Ping >>>>> >> >> >>>>> >> >> >>>>> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham >>>>> <[email protected]> wrote: >>>>> >> >>> >>>>> >> >>> Ping, >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> I’m very interested in this as well. A good metric can help us >>>>> benchmark and identify potential improvements in the scheduler >>>>> performance. >>>>> >> >>> In order to understand the proposal better, can you please >>>>> share where and how do you intend to use “Scheduling delay”? Is it meant >>>>> for benchmarking or stress testing only? Do you plan to expose it to the >>>>> users in the Airflow UI? >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Thanks >>>>> >> >>> Shubham >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> From: Ping Zhang <[email protected]> >>>>> >> >>> Reply-To: "[email protected]" <[email protected]> >>>>> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM >>>>> >> >>> To: "[email protected]" <[email protected]>, " >>>>> [email protected]" <[email protected]> >>>>> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay >>>>> Metric Definition >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> CAUTION: This email originated from outside of the >>>>> organization. Do not click links or open attachments unless you can >>>>> confirm >>>>> the sender and know the content is safe. >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Hi Vikram, >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Thanks for pointing that out, 'task latency', >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> "we define task latency as the time it takes for a task to >>>>> begin executing once its dependencies have been met." >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> It will be great if you can elaborate more about "begin >>>>> executing" and how you calculate "its dependencies have been met.". >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> If the 'begin executing' means the state of ti becomes running, >>>>> then the 'Scheduling Delay' metric focuses on the overhead introduced by >>>>> the scheduler. >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> In our prod and stress test, we use the `task_instance_audit` >>>>> table ( a new row is created whenever there is state change in >>>>> task_instance table) to compute the time of a ti should be scheduled. >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Thanks, >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Ping >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka >>>>> <[email protected]> wrote: >>>>> >> >>> >>>>> >> >>> Ping, >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> I am quite interested in this topic and trying to understand >>>>> the difference between the "scheduling delay" metric articulated as >>>>> compared to the "task latency" aka "task lag" metric which we have been >>>>> using before. >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> As you may recall, we have been using two specific metrics to >>>>> benchmark Scheduler performance, specifically "task latency" and "task >>>>> throughput" since Airflow 2.0. >>>>> >> >>> >>>>> >> >>> These were described in the 2.0 Scheduler blog post >>>>> >> >>> Specifically, within that we defined task tatency as the time >>>>> it takes for the task to begin executing once it's dependencies are all >>>>> met. >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Thanks, >>>>> >> >>> >>>>> >> >>> Vikram >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <[email protected]> >>>>> wrote: >>>>> >> >>> >>>>> >> >>> Hi Airflow Community, >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Airflow is a scheduling platform for data pipelines, however >>>>> there is no good metric to measure the scheduling delay in the production >>>>> and also the stress test environment. This makes it hard to catch >>>>> regressions in the scheduler during the stress test stage. >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> I would like to propose an airflow scheduling delay metric >>>>> definition. Here is the detailed design of the metric and its >>>>> implementation: >>>>> >> >>> >>>>> >> >>> >>>>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing >>>>> >> >>> >>>>> >> >>> Please take a look and any feedback is welcome. >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Thanks, >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> Ping >>>>> >> >>> >>>>> >> >>> >>>>> >>>>
