I have a comment there - originating from Jens' question in the document - related to the basic setup of the API, specifically the async vs. sync approach. I have a feeling that the API for tasks would benefit a lot from websockets and an async-first approach. Previously we've been doing heartbeating on our own, while websockets have a built-in capability for heartbeating open connections, and since websocket communication is bi-directional, it would allow for things like almost instantaneous killing of running tasks rather than waiting for heartbeats.
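To make this concrete, here is a minimal sketch of the server side of such a channel, assuming FastAPI websockets. The endpoint path, message shapes, and the connection registry are all made up for the example - it illustrates the idea, not a proposed design:

```python
# Hypothetical sketch: a bi-directional task channel using FastAPI websockets.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
active_tasks: dict[str, WebSocket] = {}  # task-try id -> open socket

@app.websocket("/ws/task/{ti_id}")
async def task_channel(websocket: WebSocket, ti_id: str):
    await websocket.accept()
    active_tasks[ti_id] = websocket
    try:
        while True:
            # Application-level messages (heartbeats, state updates) arrive
            # here; protocol-level ping/pong keepalives can be left to the
            # ASGI server (e.g. uvicorn's --ws-ping-interval).
            msg = await websocket.receive_json()
            if msg.get("type") == "heartbeat":
                pass  # record liveness for this task try
    except WebSocketDisconnect:
        active_tasks.pop(ti_id, None)

async def kill_task(ti_id: str) -> None:
    # Because the channel is bi-directional, the executor side can push a
    # kill instruction immediately instead of waiting for the worker to
    # discover the request on its next heartbeat.
    ws = active_tasks.get(ti_id)
    if ws is not None:
        await ws.send_json({"type": "kill"})
```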
I'd say now is a good time to think about it - and maybe some of us have more experience with async APIs / websockets and can share it. Since we are moving to an HTTP interface for tasks anyway: async communication via websockets has been out there for quite a while, it has some undeniable advantages, a number of frameworks (including FastAPI) support it, and this is possibly the best time to consider it.

J.

On Wed, Jul 3, 2024 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> Hi all,
>
> I've made some small changes to this AIP and I'm now happy with the state of it.
>
> First, a general point: I've tried not to over-specify too many of the details on this one - for instance, how viewing in-progress logs will be handled is a TBD, but we know the constraints and the final details can shake out during implementation.
>
> A summary of changes since the previous link:
>
> - Added a section on "Extend Executor interface" to tidy up the executor interface and move us away from the "run this command string" approach. I've named this new thing "Activity". (In the past we have thrown around the name "workload", but that is too close to "workflow", which is analogous to a DAG, so I've picked a different name.)
> - Added an example of how Celery tasks might get context injected (
> - Noted that triggerers won't be allowed direct DB access anymore either; they run user code, so they are all just workers
> - Added a simple version/feature introspection idea to the API so that it's easier to build forward/backwards compatibility into workers if needed.
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=311626182&originalVersion=16&revisedVersion=20
>
> I'd like to start a vote on this soon, but given the 4th July holiday in the US I suspect we will be a bit reduced in numbers, so I'll give people until Tuesday 9th July to comment and will start the vote then if there are no major items outstanding.
>
> Thanks,
> Ash
>
>> On 14 Jun 2024, at 19:55, Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> First pass done - especially around the security aspects of it. Looks great.
>>
>> On Fri, Jun 14, 2024 at 2:55 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>>> I've written up a lot more of the implementation details into an AIP
>>> https://cwiki.apache.org/confluence/x/xgmTEg
>>>
>>> It's still marked as Draft/Work In Progress for now, as there are a few details we know we need to cover before the doc is complete.
>>>
>>> (There was also some discussion in the dev call about a different name for this AIP)
>>>
>>>> On 7 Jun 2024, at 19:25, Ash Berlin-Taylor <a...@apache.org> wrote:
>>>>
>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers,
>>>>
>>>> That is exactly what I was thinking too (both that no-DB should be the only option in v3, and that we need a bidirectional, purpose-designed interface) and I am working up the details.
>>>>
>>>> One of the key features of this will be giving each task try a "strong identity" that the API server can use to identify and trust the requests, likely some form of signed JWT.
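(For illustration: one possible shape of such a per-task-try identity token, assuming PyJWT. The claim names, audience, and lifetime below are guesses for the sake of the example, not the AIP's actual design.)

```python
# Hypothetical sketch: minting and verifying a signed per-task-try identity.
import datetime

import jwt  # PyJWT

SECRET = "api-server-signing-key"  # in practice a managed key, never a literal

def mint_task_token(dag_id: str, task_id: str, run_id: str, try_number: int) -> str:
    now = datetime.datetime.now(datetime.timezone.utc)
    claims = {
        "sub": f"{dag_id}/{task_id}/{run_id}/{try_number}",  # the task try
        "aud": "airflow-task-api",
        "iat": now,
        "exp": now + datetime.timedelta(hours=8),
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_task_token(token: str) -> dict:
    # The API server can trust any request carrying a token that verifies,
    # and can scope its responses to the task try named in "sub".
    return jwt.decode(token, SECRET, algorithms=["HS256"], audience="airflow-task-api")
```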
>>>>
>>>> I just need to finish off some other work before I can move over to focus fully on Airflow.
>>>>
>>>> -a
>>>>
>>>> On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> I added some comments here, and I think there is one big thing that should be clarified when we get to "task isolation" - mainly its dependence on AIP-44.
>>>>>
>>>>> The Internal gRPC API (AIP-44) was designed the way it was only to allow the same codebase to be used with or without the DB. It's based on the assumption that a limited set of changes would be needed (that was underestimated) in order to support both DB and gRPC communication between workers/triggerers/DAG file processors at the same time. That was a basic assumption for AIP-44 - that we would want to keep both ways and maximum backwards compatibility (including the "pull" model of the worker getting connections and variables, and updating task state in the Airflow DB). We are still using the "DB" as a way to communicate between those components, and this does not change with AIP-44.
>>>>>
>>>>> But for Airflow 3 the whole context changes. If we go with the assumption that Airflow 3 will only have isolated tasks and no DB "option", I personally think using AIP-44 for that is a mistake. AIP-44 is merely a wrapper over existing DB calls, designed to be kept updated together with the DB code, and the whole synchronisation of state, heartbeats, variables and connection access still uses the same "DB communication" model; there is basically no way we can make it more scalable this way. We will still have the same limitations on the DB - a number of DB connections will simply be replaced with a number of gRPC connections. Essentially, better scalability and performance were never goals of AIP-44 - the assumption was that it brings isolation and nothing more changes. So I think it does not address some of the fundamental problems stated in this "isolation" document.
>>>>>
>>>>> AIP-44 merely exposes a small-ish number of methods (bigger than initially anticipated), but it only wraps the existing DB mechanism. From a performance and scalability point of view we do not get much more than we currently get with pgbouncer, which turns a big number of connections coming from workers into a smaller number of pooled connections that pgbouncer manages internally and multiplexes the calls over. The difference is that, unlike the AIP-44 Internal API server, pgbouncer does not limit the operations you can perform from the worker/triggerer/DAG file processor - that's the main difference between using pgbouncer and using our own Internal API server.
>>>>>
>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers, more in line with what Jens described in AIP-69 (WebSocket and asynchronous communication, for example, come immediately to my mind if we did not have to use the DB for that communication).
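(Again purely as an illustration of the bi-directional idea discussed above: the worker side of such a channel could look roughly like this, assuming the `websockets` library. The URL and message shapes mirror the hypothetical server sketch earlier in the thread.)

```python
# Hypothetical sketch: worker side of a bi-directional Executor <-> Worker channel.
import asyncio
import json

import websockets

async def run_task_channel(ti_id: str, base_url: str = "ws://localhost:8080") -> None:
    async with websockets.connect(f"{base_url}/ws/task/{ti_id}") as ws:

        async def heartbeat() -> None:
            # Periodic application-level liveness messages to the server.
            while True:
                await ws.send(json.dumps({"type": "heartbeat"}))
                await asyncio.sleep(10)

        hb = asyncio.create_task(heartbeat())
        try:
            async for raw in ws:  # messages pushed by the server
                msg = json.loads(raw)
                if msg.get("type") == "kill":
                    break  # terminate immediately; no heartbeat round-trip needed
        finally:
            hb.cancel()

# asyncio.run(run_task_channel("example-task-try-1"))
```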
>>>>>
>>>>> This is also why I put AIP-67 on hold: IF we go in that direction and have a "new" interface between worker, triggerer and DAG file processor, it might be way easier (and safer) to introduce multi-team in Airflow 3 rather than 2 (or we can implement it differently in Airflow 2 and differently in Airflow 3).
>>>>>
>>>>> On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:
>>>>>
>>>>>> Fellow Airflowers,
>>>>>>
>>>>>> I am following up on some of the proposed changes in the Airflow 3 proposal <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>, where more information was requested by the community, specifically around the injection of Task Execution Secrets. This topic has been discussed at various times under a variety of names, but here is a holistic proposal around the whole task context mechanism.
>>>>>>
>>>>>> This is not yet a full-fledged AIP, but it is intended to facilitate a structured discussion, which will then be followed up with a formal AIP within the next two weeks. I have included most of the text here, but please give detailed feedback in the attached document <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>, so that we can have a contextual discussion around specific points which may need more detail.
>>>>>>
>>>>>> ---
>>>>>> Motivation
>>>>>>
>>>>>> Historically, Airflow's task execution context has been oriented around local execution within a relatively trusted networking cluster.
>>>>>>
>>>>>> This includes:
>>>>>>
>>>>>> - the interaction between the Executor and the process of launching a task on Airflow Workers,
>>>>>> - the interaction between the Workers and the Airflow meta-database for connection and environment information as part of initial task startup,
>>>>>> - the interaction between the Airflow Workers and the rest of Airflow for heartbeat information, and so on.
>>>>>>
>>>>>> This has been accomplished by colocating all of the Airflow task execution code with the user task code in the same container and process.
>>>>>>
>>>>>> For Airflow users at scale, i.e. supporting multiple data teams, this has posed many operational challenges:
>>>>>>
>>>>>> - Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or Python packages
>>>>>> - Security challenges in running customer-defined code (task code within the DAGs) for multiple customers within the same operating environment and service accounts
>>>>>> - Scalability of Airflow, since one of the core Airflow scalability limitations has been the number of concurrent database connections supported by the underlying database instance. To alleviate this problem we have consistently, as an Airflow community, recommended the use of PgBouncer for connection pooling as part of an Airflow deployment.
>>>>>> - Operational issues caused by unintentional reliance on internal Airflow constructs within the DAG/Task code, which only show up, unexpectedly, during Airflow production operations, often coinciding with (but not limited to) upgrades and migrations.
>>>>>> - Operational management challenges, following from the above, for Airflow platform teams at scale, because different data teams naturally operate at different velocities. Attempting to support these different teams with a common Airflow environment is unnecessarily challenging.
>>>>>>
>>>>>> The internal API, which reduces the need for interaction between the Airflow Workers and the metadatabase, is a big and necessary step forward. However, it doesn't fully address the above challenges. The proposal below builds on the internal API proposal and goes significantly further, to not only address the challenges above but also enable the following key use cases:
>>>>>>
>>>>>> 1. Ensure that this interface reduces the interaction between the code running within the Task and the rest of Airflow. This is to address unintended ripple effects from core Airflow changes, which have caused numerous Airflow upgrade issues because Task (i.e. DAG) code relied on core Airflow abstractions. This has been a common problem pointed out by numerous Airflow users, including early adopters.
>>>>>> 2. Enable quick, performant execution of tasks on local, trusted networks, without requiring the Airflow workers / tasks to connect to the Airflow database to obtain all the information required for task startup.
>>>>>> 3. Enable remote execution of Airflow tasks across network boundaries, by establishing a clean interface for Airflow workers on remote networks to be able to connect back to a central Airflow service to access all information needed for task execution. This is foundational work for remote execution.
>>>>>> 4. Enable a clean, language-agnostic interface for task execution, with support for multiple language bindings, so that Airflow tasks can be written in languages beyond Python.
>>>>>>
>>>>>> Proposal
>>>>>>
>>>>>> The proposal here has multiple parts, as detailed below.
>>>>>>
>>>>>> 1. Formally split out the Task Execution Interface as the Airflow Task SDK (possibly named the Airflow SDK), which would be the only interface between Airflow Task user code and the Airflow system components, including the meta-database, Airflow Executor, etc.
>>>>>> 2. Disable all direct database interaction between the Airflow Workers (including Tasks being run on those Airflow Workers) and the Airflow meta-database.
>>>>>> 3. The Airflow Task SDK will include interfaces for (see the sketch at the end of this thread for one possible Python shape):
>>>>>>    - Access to needed Airflow Connections, Variables, and XCom values
>>>>>>    - Reporting heartbeats
>>>>>>    - Recording logs
>>>>>>    - Reporting metrics
>>>>>> 4. The Airflow Task SDK will support a Push mechanism for speedy local execution in trusted environments.
>>>>>> 5. The Airflow Task SDK will also support a Pull mechanism for remote Task execution environments to access information from an Airflow instance across network boundaries.
>>>>>> 6. The Airflow Task SDK will be designed to support multiple language bindings, with the first language binding of course being Python.
>>>>>>
>>>>>> Assumption: The existing AIP for the Internal API covers the interaction between the Airflow workers and the Airflow metadatabase for heartbeat information, persisting XComs, and so on.
>>>>>> --
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau
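For illustration only, as referenced from item 3 of the proposal above: one possible Python shape for such a Task SDK surface. Every name here is hypothetical - the formal AIP will define the real interface - but it shows how a single surface could be satisfied by both a Push (local, trusted) and a Pull (remote, over the network) transport.

```python
# Hypothetical sketch of a Task SDK surface; all names are illustrative.
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class Connection:
    conn_id: str
    conn_type: str
    host: str | None = None
    login: str | None = None
    password: str | None = None

class TaskSDKClient(Protocol):
    # Access to needed Airflow Connections, Variables, and XCom values
    def get_connection(self, conn_id: str) -> Connection: ...
    def get_variable(self, key: str) -> str | None: ...
    def xcom_pull(self, task_id: str, key: str = "return_value") -> Any: ...
    def xcom_push(self, key: str, value: Any) -> None: ...

    # Report heartbeat, record logs, report metrics
    def heartbeat(self) -> None: ...
    def log(self, level: str, message: str) -> None: ...
    def emit_metric(self, name: str, value: float) -> None: ...

# A local "push" implementation could be handed pre-fetched values at task
# startup, while a remote "pull" implementation would satisfy the same
# Protocol by calling back to a central Airflow API across the network.
```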