> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers,
That is exactly what I was thinking too (both that "no DB" should be the only option in v3, and that we need a bidirectional, purpose-designed interface) and I am working up the details. One of the key features of this will be giving each task try a "strong identity" that the API server can use to identify and trust the requests, likely some form of signed JWT. I just need to finish off some other work before I can move over to focus on Airflow fully.

-a

On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>I added some comments here, and I think there is one big thing that should be clarified when we get to "task isolation" - mainly its dependence on AIP-44.
>
>The Internal gRPC API (AIP-44) was designed the way it was only to allow the same codebase to be used with or without the DB. It's based on the assumption that a limited set of changes (which was underestimated) would be needed in order to support both the DB and gRPC ways of communication between workers/triggerers/DAG file processors at the same time. That was a basic assumption for AIP-44 - that we want to keep both ways and maximum backwards compatibility (including the "pull" model of the worker getting connections and variables, and updating task state in the Airflow DB). We are still using the "DB" as the way to communicate between those components, and this does not change with AIP-44.
>
>But for Airflow 3 the whole context changes. If we go with the assumption that Airflow 3 will only have isolated tasks and no DB "option", I personally think using AIP-44 for that is a mistake. AIP-44 is merely a wrapper over existing DB calls, designed to be kept updated together with the DB code; the whole synchronisation of state, heartbeats, variables and connection access still uses the same "DB communication" model, and there is basically no way we can make it more scalable this way. We will still have the same limitations on the DB - a number of DB connections is simply replaced with a number of gRPC connections. More scalability and performance have never been the goal of AIP-44 - the assumption has always been that it only brings isolation and nothing more changes. So I think it does not address some of the fundamental problems stated in this "isolation" document.
>
>Essentially, AIP-44 merely exposes a small-ish number of methods (bigger than initially anticipated) that only wrap around the existing DB mechanism. From the performance and scalability point of view, we do not get much more than we currently get when using pgbouncer, which turns a big number of connections coming from workers into a smaller number of pooled connections that it manages internally and multiplexes the calls over. The difference is that, unlike the AIP-44 Internal API server, pgbouncer does not limit the operations you can do from the worker/triggerer/DAG file processor - that's the main difference between using pgbouncer and using our own Internal API server.
>
>IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers, more in line with what Jens described in AIP-69 (for example, WebSocket and asynchronous communication come immediately to my mind if I did not have to use the DB for that communication).
>This is also why I put AIP-67 on hold: IF we go in that direction and have a "new" interface between worker, triggerer and DAG file processor, it might be way easier (and safer) to introduce multi-team in Airflow 3 rather than 2 (or we can implement it differently in Airflow 2 and differently in Airflow 3).
>
>On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:
>
>> Fellow Airflowers,
>>
>> I am following up on some of the proposed changes in the Airflow 3 proposal
>> <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>,
>> where more information was requested by the community, specifically around the injection of Task Execution Secrets. This topic has been discussed at various times under a variety of names, but here is a holistic proposal around the whole task context mechanism.
>>
>> This is not yet a full-fledged AIP, but is intended to facilitate a structured discussion, which will then be followed up with a formal AIP within the next two weeks. I have included most of the text here, but please give detailed feedback in the attached document
>> <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>,
>> so that we can have a contextual discussion around specific points which may need more detail.
>>
>> ---
>> Motivation
>>
>> Historically, Airflow’s task execution context has been oriented around local execution within a relatively trusted networking cluster.
>>
>> This includes:
>>
>> - the interaction between the Executor and the process of launching a task on Airflow Workers,
>> - the interaction between the Workers and the Airflow meta-database for connection and environment information as part of initial task startup,
>> - the interaction between the Airflow Workers and the rest of Airflow for heartbeat information, and so on.
>>
>> This has been accomplished by colocating all of the Airflow task execution code with the user task code in the same container and process.
>>
>> For Airflow users at scale, i.e. supporting multiple data teams, this has posed many operational challenges:
>>
>> - Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or Python packages.
>> - Security challenges in running customer-defined code (task code within the DAGs) for multiple customers within the same operating environment and service accounts.
>> - Scalability of Airflow, since one of the core Airflow scalability limitations has been the number of concurrent database connections supported by the underlying database instance. To alleviate this problem, we have consistently, as an Airflow community, recommended the use of PgBouncer for connection pooling as part of an Airflow deployment.
>> - Operational issues caused by unintentional reliance on internal Airflow constructs within the DAG/Task code, which only show up unexpectedly as part of Airflow production operations, coincidentally with, but not limited to, upgrades and migrations.
>> - Operational management, based on the above, for Airflow platform teams at scale, because different data teams naturally operate at different velocities. Attempting to support these different teams with a common Airflow environment is unnecessarily challenging.
>>
>> The internal API to reduce the need for interaction between the Airflow Workers and the metadatabase is a big and necessary step forward. However, it doesn’t fully address the above challenges. The proposal below builds on the internal API proposal and goes significantly further, not only to address the challenges above, but also to enable the following key use cases:
>>
>> 1. Ensure that this interface reduces the interaction between the code running within the Task and the rest of Airflow. This is to address unintended ripple effects from core Airflow changes, which have caused numerous Airflow upgrade issues because Task (i.e. DAG) code relied on Core Airflow abstractions. This has been a common problem pointed out by numerous Airflow users, including early adopters.
>> 2. Enable quick, performant execution of tasks on local, trusted networks, without requiring the Airflow workers / tasks to connect to the Airflow database to obtain all the information required for task startup.
>> 3. Enable remote execution of Airflow tasks across network boundaries, by establishing a clean interface for Airflow workers on remote networks to connect back to a central Airflow service and access all information needed for task execution. This is foundational work for remote execution.
>> 4. Enable a clean, language-agnostic interface for task execution, with support for multiple language bindings, so that Airflow tasks can be written in languages beyond Python.
>>
>> Proposal
>>
>> The proposal here has multiple parts, as detailed below.
>>
>> 1. Formally split out the Task Execution Interface as the Airflow Task SDK (possibly named the Airflow SDK), which would be the only interface between Airflow Task user code and the Airflow system components, including the meta-database, Airflow Executor, etc.
>> 2. Disable all direct database interaction between the Airflow Workers (including Tasks being run on those Airflow Workers) and the Airflow meta-database.
>> 3. The Airflow Task SDK will include interfaces to:
>>    - Access needed Airflow Connections, Variables, and XCom values
>>    - Report heartbeats
>>    - Record logs
>>    - Report metrics
>> 4. The Airflow Task SDK will support a Push mechanism for speedy local execution in trusted environments.
>> 5. The Airflow Task SDK will also support a Pull mechanism for remote Task execution environments to access information from an Airflow instance over network boundaries.
>> 6. The Airflow Task SDK will be designed to support multiple language bindings, with the first language binding of course being Python.
>>
>> Assumption: The existing AIP for the Internal API covers the interaction between the Airflow workers and the Airflow metadatabase for heartbeat information, persisting XComs, and so on.
>> --
>>
>> Best regards,
>>
>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau
>>
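[To make the proposed Task SDK surface - point 3 of the proposal quoted above - a little more concrete, here is a rough Python sketch of what the first (Python) binding of such an interface could look like. All class and method names below are illustrative assumptions, not an agreed design; the actual surface would be defined in the formal AIP.]

    # Illustrative sketch only: names and signatures are hypothetical, not the agreed SDK surface.
    from __future__ import annotations

    from abc import ABC, abstractmethod
    from typing import Any

    class TaskExecutionInterface(ABC):
        """The only interface task (user) code would use to talk to the rest of Airflow."""

        # Access to needed Airflow Connections, Variables, and XCom values
        @abstractmethod
        def get_connection(self, conn_id: str) -> dict: ...

        @abstractmethod
        def get_variable(self, key: str) -> str | None: ...

        @abstractmethod
        def xcom_pull(self, task_id: str, key: str = "return_value") -> Any: ...

        @abstractmethod
        def xcom_push(self, key: str, value: Any) -> None: ...

        # Report heartbeats
        @abstractmethod
        def heartbeat(self) -> None: ...

        # Record logs
        @abstractmethod
        def log(self, level: str, message: str) -> None: ...

        # Report metrics
        @abstractmethod
        def emit_metric(self, name: str, value: float, tags: dict[str, str] | None = None) -> None: ...

[A "Push" implementation of this interface could be handed pre-injected values at task startup for fast execution on trusted networks, while a "Pull" implementation would fetch the same values on demand from the Airflow API server across a network boundary; task code would be written against the interface and would not need to know which one it is running on.]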