Yes. Long awaited, and indeed some implementation details will be needed to get it to an AIP. I also think there is one important decision to consider: should it target Airflow 2?
On Sun, May 26, 2024 at 12:26 PM Elad Kalif <elad...@apache.org> wrote:

> > In order for this to become a reality, Backfills need to be handled by the
> > Airflow Scheduler as a normal DAG execution
>
> I think it's a good idea.
> It should solve natively problems like
> https://github.com/apache/airflow/issues/11302
>
> On Fri, May 24, 2024 at 10:58 PM Vikram Koka <vik...@astronomer.io.invalid>
> wrote:
>
> > Fellow Airflowers,
> >
> > I am following up on some of the proposed changes in the Airflow 3 proposal
> > <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>,
> > where more information was requested by the community.
> >
> > One specific topic was "Running Backfills at scale". This is not yet a full
> > fledged AIP, but a starting point for the discussion leading towards an AIP
> > with fully defined technical details.
> >
> > Backfills at scale
> >
> > Backfills in Airflow 2.x are treated as an exception and executed by an
> > incarnation of the BackfillJob, rather than the regular Airflow Scheduler
> > itself. This results in unexpected interactions with the other DAGs being
> > run by the main Airflow Scheduler at the same time including resource
> > contention and possibly unexpected delays because established scalability
> > configuration settings such as Concurrency are not consistently applied,
> > and also code-level complexity by having two somewhat-similar
> > implementations of scheduling logic.
> >
> > However, with ML model training, backfills are a common operation and need
> > to be treated as a regular Airflow DAG / Task execution operation and not
> > treated as an exception. It is also not possible to run a backfill unless
> > you have direct access to the Airflow database/SSH access to the Airflow
> > server, which is not possible for many/most data engineers.
> >
> > In order for this to become a reality, Backfills need to be handled by the
> > Airflow Scheduler as a normal DAG execution, building on the Dynamic Task
> > Mapping execution pattern, rather than an exception. Additionally, Backfill
> > tasks will now ONLY be executed by the Airflow Workers, for obvious reasons
> > including scalability. A less obvious, but important reason is Security,
> > since it is ideal to have data connections to Enterprise data only happen
> > through Airflow Workers, rather than any Airflow system components.
> >
> > As part of making Backfill support cleaner in Airflow, Backfill DAG
> > execution will also be supported in the Airflow REST API.
> >
> > This proposal is purposefully light on exact implementation details but
> > will include at least:
> >
> > - Making the Airflow Scheduler responsible for scheduling decisions on all
> >   DagRuns (instead of the current where it purposefully ignores backfill
> >   runs)
> > - A new API endpoint to submit a "backfill request".
> >
> > --
> >
> > Best regards,
> > Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau