> I’d love to support websockets/async but having a UI action “instantly” (or even near instantly) be sent to the task on the worker is a very difficult thing to achieve (you need some broadcast or message bus type process) so I don’t want to tie this AIP on to needing that.

Fair.

On Mon, Jul 8, 2024 at 6:08 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> My view is that async/websockets or not is (largely) an implementation detail and not something we have to worry about at this stage.
>
> By that I mean that the messages that flow back and forth between server and client are unchanged (and it’s either JSON/msgspec/gRPC etc.). In the specific case of a task termination request, I’d say we can just have the task get cancelled the next time it heartbeats, as all task status changes are asynchronous in action by definition.
>
> I’d love to support websockets/async but having a UI action “instantly” (or even near instantly) be sent to the task on the worker is a very difficult thing to achieve (you need some broadcast or message bus type process), so I don’t want to tie this AIP on to needing that.
>
> Ash
>
> > On 5 Jul 2024, at 17:31, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > I have a comment there - originated from Jens’ question in the document - related to some basic setup of the API, specifically the async vs. sync approach. I have a feeling that the API for tasks would benefit a lot from using websockets and an async-first approach. Previously we’ve been doing heartbeating on our own, while websockets have a built-in capability for heartbeating open connections, and because websocket communication is bi-directional, it would allow for things like almost instantaneous killing of running tasks rather than waiting for heartbeats.
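The heartbeat-carried cancellation model Ash describes is worth making concrete: the server only records the intent to terminate, and the worker picks it up in the reply to its next heartbeat, so no broadcast or message-bus process is required. A toy sketch of the idea; every name here is hypothetical, not an actual Airflow API:

```python
from dataclasses import dataclass


@dataclass
class HeartbeatRequest:
    task_instance_id: str


@dataclass
class HeartbeatResponse:
    # The server piggybacks state changes on the heartbeat reply, so a UI
    # "stop task" action takes effect at the next heartbeat at the latest.
    should_terminate: bool


class TaskStateServer:
    """Server-side bookkeeping: a UI action merely records intent."""

    def __init__(self) -> None:
        self._termination_requested = set()

    def request_termination(self, task_instance_id: str) -> None:
        # Called from the UI/API; asynchronous in action by definition.
        self._termination_requested.add(task_instance_id)

    def handle_heartbeat(self, req: HeartbeatRequest) -> HeartbeatResponse:
        return HeartbeatResponse(
            should_terminate=req.task_instance_id in self._termination_requested
        )


# Worker side: check the flag on every heartbeat reply.
server = TaskStateServer()
reply = server.handle_heartbeat(HeartbeatRequest("ti-123"))
assert reply.should_terminate is False

server.request_termination("ti-123")  # e.g. user clicks "Mark failed"
reply = server.handle_heartbeat(HeartbeatRequest("ti-123"))
assert reply.should_terminate is True
```

The trade-off is latency: the action takes effect at the next heartbeat at the earliest, which is exactly what a bi-directional websocket channel would avoid.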
> > > > I'd say now is a good time to think about it - and maybe some of us > > have bigger experience with async api / websockets to be able to share > > their experiences, but since we are moving to "Http" interface for > > tasks, async way of communication via websockets is out there for > > quite a while and it has some undeniable advantages, and there are a > > number of frameworks (including FastAPI) that support it and possibly > > it's the best time to consider it. > > > > J. > > > > > > On Wed, Jul 3, 2024 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote: > >> > >> Hi all, > >> > >> I’ve made some small changes to this AIP and I’m now happy with the state > >> of it. > >> > >> First, a general point: I’ve tried to not overly-specify too many of the > >> details on this one — for instance how viewing in-progress log will be > >> handled is a TBD, but we know the constraints and the final details can > >> shake our during implementation. > >> > >> A summary of changes since the previous link: > >> > >> - Added a section on "Extend Executor interface” to tidy up the executor > >> interface and move us away from the “run this command string” approach. > >> I’ve named this new thing “Activity”. (In the past we have thrown around > >> the name “workload”, but that is too close to “workflow” which is > >> analogous to a DAG so I’ve picked a different name) > >> - Add an example of how Celery tasks might get context injected ( > >> - Note that triggerers won’t be allowed direct DB access anymore either, > >> they run user code so are all just workers > >> - Add some simple version/feature introspection idea to the API so that > >> it’s easier to build forward/backwards compatibility in to workers if need. 
> >> > >> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=311626182&originalVersion=16&revisedVersion=20 > >> > >> > >> I’d like to start a vote on this soon, but given the 4th July holiday in > >> the US I suspect we will be a bit reduced in presences of people, so I’ll > >> give people until Tuesday 9th July to comment and will start the vote then > >> if there are no major items outstanding. > >> > >> Thanks, > >> Ash > >> > >> > >>> On 14 Jun 2024, at 19:55, Jarek Potiuk <ja...@potiuk.com> wrote: > >>> > >>> First pass done - especially around security aspects of it, Looks great. > >>> > >>> On Fri, Jun 14, 2024 at 2:55 PM Ash Berlin-Taylor <a...@apache.org> wrote: > >>> > >>>> I’ve written up a lot more of the implementation details into an AIP > >>>> https://cwiki.apache.org/confluence/x/xgmTEg > >>>> > >>>> It’s still marked as Draft/Work In Progress for now as there are few > >>>> details we know we need to cover before the doc is complete. > >>>> > >>>> (There was also some discussion in the dev call about a different name > >>>> for > >>>> this AIP) > >>>> > >>>>> On 7 Jun 2024, at 19:25, Ash Berlin-Taylor <a...@apache.org> wrote: > >>>>> > >>>>>> IMHO - if we do not want to support DB access at all from workers, > >>>>> triggerrers and DAG file processors, we should replace the current "DB" > >>>>> bound interface with a new one specifically designed for this > >>>>> bi-directional direct communication Executor <-> Workers, > >>>>> > >>>>> That is exactly what I was thinking too (both that no DB should be the > >>>> only option in v3, and that we need a bidirectional purpose designed > >>>> interface) and am working up the details. > >>>>> > >>>>> One of the key features of this will be giving each task try a "strong > >>>> identity" that the API server can use to identify and trust the requests, > >>>> likely some form of signed JWT. 
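The “strong identity” idea can be illustrated with a stdlib-only, JWT-shaped token signed with HMAC-SHA256. This is purely a sketch of the concept; the thread only says “likely some form of signed JWT”, and a real implementation would use a proper JWT library with expiry, audience, and key-rotation handling. The secret and claim names below are invented for illustration:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-signing-key"  # hypothetical; held by the API server


def _b64(data: bytes) -> str:
    # JWT-style URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def issue_token(dag_id: str, task_id: str, try_number: int) -> str:
    """Sign the identity of one specific task try (HS256-style)."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64(json.dumps(
        {"dag_id": dag_id, "task_id": task_id, "try_number": try_number}
    ).encode())
    signing_input = f"{header}.{claims}".encode()
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{claims}.{sig}"


def verify_token(token: str) -> dict:
    """API-server side: trust the claims only if the signature checks out."""
    header, claims, sig = token.split(".")
    signing_input = f"{header}.{claims}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("invalid task identity token")
    padded = claims + "=" * (-len(claims) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


token = issue_token("example_dag", "extract", try_number=2)
assert verify_token(token)["try_number"] == 2
```

Binding the token to one task try means a compromised worker can, at worst, act as that single task instance rather than as the whole deployment.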
> >>>>>
> >>>>> I just need to finish off some other work before I can move over to focus on Airflow fully.
> >>>>>
> >>>>> -a
> >>>>>
> >>>>> On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>>>> I added some comments here, and I think there is one big thing that should be clarified when we get to “task isolation” - mainly its dependence on AIP-44.
> >>>>>>
> >>>>>> The Internal gRPC API (AIP-44) was only designed the way it was to allow the same codebase to be used with or without the DB. It’s based on the assumption that a limited set of changes would be needed (that was underestimated) in order to support both DB and gRPC ways of communication between workers/triggerers/DAG file processors at the same time. That was a basic assumption for AIP-44 - that we would want to keep both ways and maximum backwards compatibility (including the “pull” model of the worker getting connections and variables, and updating task state in the Airflow DB). We are still using the “DB” as a way to communicate between those components, and this does not change with AIP-44.
> >>>>>>
> >>>>>> But for Airflow 3 the whole context is changed. If we go with the assumption that Airflow 3 will only have isolated tasks and no DB “option”, I personally think using AIP-44 for that is a mistake. AIP-44 is merely a wrapper over existing DB calls, designed to be kept updated together with the DB code, and the whole synchronisation of state, heartbeats, variables and connection access still uses the same “DB communication” model; there is basically no way we can make it more scalable this way. We will still have the same limitations on the DB - a number of DB connections will simply be replaced with a number of gRPC connections. Essentially, more scalability and performance has never been the goal of AIP-44 - the assumption was that it only brings isolation and nothing more changes. So I think it does not address some of the fundamental problems stated in this “isolation” document.
> >>>>>>
> >>>>>> Essentially, AIP-44 merely exposes a small-ish number of methods (bigger than initially anticipated) but only wraps the existing DB mechanism. From the performance and scalability point of view, we do not get much more than we currently do when using PgBouncer, which turns a big number of connections coming from workers into a smaller number of pooled connections that PgBouncer manages internally and multiplexes the calls over. The difference is that, unlike the AIP-44 Internal API server, PgBouncer does not limit the operations you can do from the worker/triggerer/DAG file processor - that’s the main difference between using PgBouncer and using our own Internal API server.
> >>>>>>
> >>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current “DB”-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers, more in line with what Jens described in AIP-69 (and, for example, WebSocket and asynchronous communication comes immediately to my mind if I did not have to use the DB for that communication). This is also why I put AIP-67 on hold: IF we go in the direction of a “new” interface between worker, triggerer and DAG file processor, it might be way easier (and safer) to introduce multi-team in Airflow 3 rather than 2 (or we can implement it differently in Airflow 2 and differently in Airflow 3).
> >>>>>>
> >>>>>> On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:
> >>>>>>
> >>>>>>> Fellow Airflowers,
> >>>>>>>
> >>>>>>> I am following up on some of the proposed changes in the Airflow 3 proposal <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>, where more information was requested by the community, specifically around the injection of Task Execution Secrets. This topic has been discussed at various times under a variety of names, but here is a holistic proposal around the whole task context mechanism.
> >>>>>>>
> >>>>>>> This is not yet a full-fledged AIP, but is intended to facilitate a structured discussion, which will then be followed up with a formal AIP within the next two weeks. I have included most of the text here, but please give detailed feedback in the attached document <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>, so that we can have a contextual discussion around specific points which may need more detail.
> >>>>>>>
> >>>>>>> ---
> >>>>>>> Motivation
> >>>>>>>
> >>>>>>> Historically, Airflow’s task execution context has been oriented around local execution within a relatively trusted networking cluster.
> >>>>>>>
> >>>>>>> This includes:
> >>>>>>>
> >>>>>>> - the interaction between the Executor and the process of launching a task on Airflow Workers,
> >>>>>>> - the interaction between the Workers and the Airflow meta-database for connection and environment information as part of initial task startup,
> >>>>>>> - the interaction between the Airflow Workers and the rest of Airflow for heartbeat information, and so on.
> >>>>>>>
> >>>>>>> This has been accomplished by colocating all of the Airflow task execution code with the user task code in the same container and process.
> >>>>>>>
> >>>>>>> For Airflow users at scale, i.e. those supporting multiple data teams, this has posed many operational challenges:
> >>>>>>>
> >>>>>>> - Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or Python packages
> >>>>>>> - Security challenges in running customer-defined code (task code within the DAGs) for multiple customers within the same operating environment and service accounts
> >>>>>>> - Scalability of Airflow, since one of the core Airflow scalability limitations has been the number of concurrent database connections supported by the underlying database instance. To alleviate this problem we have consistently, as an Airflow community, recommended the use of PgBouncer for connection pooling as part of an Airflow deployment.
> >>>>>>> - Operational issues caused by unintentional reliance on internal Airflow constructs within the DAG/Task code, which only show up, unexpectedly, during Airflow production operations, coincident with, but not limited to, upgrades and migrations.
> >>>>>>> - Operational management based on the above for Airflow platform teams at scale, because different data teams naturally operate at different velocities. Attempting to support these different teams with a common Airflow environment is unnecessarily challenging.
> >>>>>>>
> >>>>>>> The internal API, which reduces the need for interaction between the Airflow Workers and the metadatabase, is a big and necessary step forward. However, it doesn’t fully address the above challenges. The proposal below builds on the internal API proposal and goes significantly further, to not only address these challenges but also enable the following key use cases:
> >>>>>>>
> >>>>>>> 1. Ensure that this interface reduces the interaction between the code running within the Task and the rest of Airflow. This is to address unintended ripple effects from core Airflow changes, which have caused numerous Airflow upgrade issues because Task (i.e. DAG) code relied on core Airflow abstractions. This has been a common problem pointed out by numerous Airflow users, including early adopters.
> >>>>>>> 2. Enable quick, performant execution of tasks on local, trusted networks, without requiring the Airflow workers / tasks to connect to the Airflow database to obtain all the information required for task startup.
> >>>>>>> 3. Enable remote execution of Airflow tasks across network boundaries, by establishing a clean interface for Airflow workers on remote networks to connect back to a central Airflow service and access all information needed for task execution. This is foundational work for remote execution.
> >>>>>>> 4. Enable a clean, language-agnostic interface for task execution, with support for multiple language bindings, so that Airflow tasks can be written in languages beyond Python.
> >>>>>>>
> >>>>>>> Proposal
> >>>>>>>
> >>>>>>> The proposal here has multiple parts, as detailed below:
> >>>>>>>
> >>>>>>> 1. Formally split out the Task Execution Interface as the Airflow Task SDK (possibly named the Airflow SDK), which would be the only interface between Airflow Task user code and the Airflow system components, including the meta-database, Airflow Executor, etc.
> >>>>>>> 2. Disable all direct database interaction between the Airflow Workers (including Tasks being run on those Airflow Workers) and the Airflow meta-database.
> >>>>>>> 3. The Airflow Task SDK will include interfaces to:
> >>>>>>>    - Access needed Airflow Connections, Variables, and XCom values
> >>>>>>>    - Report heartbeats
> >>>>>>>    - Record logs
> >>>>>>>    - Report metrics
> >>>>>>> 4. The Airflow Task SDK will support a Push mechanism for speedy local execution in trusted environments.
> >>>>>>> 5. The Airflow Task SDK will also support a Pull mechanism for remote Task execution environments to access information from an Airflow instance over network boundaries.
> >>>>>>> 6. The Airflow Task SDK will be designed to support multiple language bindings, with the first language binding of course being Python.
> >>>>>>>
> >>>>>>> Assumption: The existing AIP for the Internal API covers the interaction between the Airflow workers and the Airflow metadatabase for heartbeat information, persisting XComs, and so on.
> >>>>>>>
> >>>>>>> --
> >>>>>>> Best regards,
> >>>>>>>
> >>>>>>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org
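To make part 3 of the proposal concrete, the SDK surface could be pictured as a narrow interface that task code programs against, with the transport (push or pull, local or over HTTP) hidden behind it. A rough Python sketch; every name here is illustrative, not the eventual Task SDK API:

```python
from __future__ import annotations

from typing import Any, Protocol


class TaskExecutionInterface(Protocol):
    """Hypothetical shape of the Task SDK surface from the proposal: the
    only channel between user task code and the rest of Airflow."""

    def get_connection(self, conn_id: str) -> dict[str, Any]: ...
    def get_variable(self, key: str) -> str: ...
    def xcom_push(self, key: str, value: Any) -> None: ...
    def xcom_pull(self, task_id: str, key: str) -> Any: ...
    def heartbeat(self) -> None: ...
    def log(self, message: str) -> None: ...
    def emit_metric(self, name: str, value: float) -> None: ...


class InMemoryTaskInterface:
    """Toy in-process implementation in the "push" style: everything the
    task needs is injected up front, so no database access is required."""

    def __init__(self, task_id: str, connections: dict[str, dict[str, Any]],
                 variables: dict[str, str]) -> None:
        self.task_id = task_id
        self._connections = connections
        self._variables = variables
        self._xcoms: dict[tuple[str, str], Any] = {}
        self.logs: list[str] = []
        self.metrics: list[tuple[str, float]] = []
        self.heartbeats = 0

    def get_connection(self, conn_id: str) -> dict[str, Any]:
        return self._connections[conn_id]

    def get_variable(self, key: str) -> str:
        return self._variables[key]

    def xcom_push(self, key: str, value: Any) -> None:
        self._xcoms[(self.task_id, key)] = value

    def xcom_pull(self, task_id: str, key: str) -> Any:
        return self._xcoms[(task_id, key)]

    def heartbeat(self) -> None:
        self.heartbeats += 1

    def log(self, message: str) -> None:
        self.logs.append(message)

    def emit_metric(self, name: str, value: float) -> None:
        self.metrics.append((name, value))


# Task code programs only against the interface, never the metadatabase:
sdk: TaskExecutionInterface = InMemoryTaskInterface(
    task_id="extract",
    connections={"pg_default": {"host": "db.example.com", "port": 5432}},
    variables={"env": "prod"},
)
sdk.log("starting")
sdk.xcom_push("rows", 42)
assert sdk.xcom_pull("extract", "rows") == 42
```

A remote ("pull") worker would implement the same Protocol by calling back to the Airflow API server over the network, which is what makes the interface both transport-agnostic and, eventually, amenable to non-Python language bindings.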