I have a comment there - originating from Jens' question in the document - related to the basic setup of the API, specifically the async vs. sync approach. I have a feeling that the API for tasks would benefit a lot from websockets and an async-first approach. Previously we've been doing heartbeating on our own, while websockets have a built-in capability for heartbeating open connections, and since websocket communication is bi-directional, it would allow for things like almost instantaneous killing of running tasks rather than waiting for heartbeats.
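To make this concrete, here is a minimal sketch of the server side of such a channel, assuming FastAPI websockets. The endpoint path, message shapes, and the connection registry are all made up for the example - it illustrates the idea, not a proposed design:

```python
# Hypothetical sketch: a bi-directional task channel using FastAPI websockets.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
active_tasks: dict[str, WebSocket] = {}  # task-try id -> open socket

@app.websocket("/ws/task/{ti_id}")
async def task_channel(websocket: WebSocket, ti_id: str):
    await websocket.accept()
    active_tasks[ti_id] = websocket
    try:
        while True:
            # Application-level messages (heartbeats, state updates) arrive
            # here; protocol-level ping/pong keepalives can be left to the
            # ASGI server (e.g. uvicorn's --ws-ping-interval).
            msg = await websocket.receive_json()
            if msg.get("type") == "heartbeat":
                pass  # record liveness for this task try
    except WebSocketDisconnect:
        active_tasks.pop(ti_id, None)

async def kill_task(ti_id: str) -> None:
    # Because the channel is bi-directional, the executor side can push a
    # kill instruction immediately instead of waiting for the worker to
    # discover the request on its next heartbeat.
    ws = active_tasks.get(ti_id)
    if ws is not None:
        await ws.send_json({"type": "kill"})
```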
I'd say now is a good time to think about it - and maybe some of us have more experience with async APIs / websockets and can share it. Since we are moving to an HTTP interface for tasks anyway: async communication via websockets has been out there for quite a while, it has some undeniable advantages, a number of frameworks (including FastAPI) support it, and this is possibly the best time to consider it.

J.

On Wed, Jul 3, 2024 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> Hi all,
>
> I've made some small changes to this AIP and I'm now happy with the state of it.
>
> First, a general point: I've tried not to over-specify too many of the details on this one - for instance, how viewing in-progress logs will be handled is a TBD, but we know the constraints and the final details can shake out during implementation.
>
> A summary of changes since the previous link:
>
> - Added a section on "Extend Executor interface" to tidy up the executor interface and move us away from the "run this command string" approach. I've named this new thing "Activity". (In the past we have thrown around the name "workload", but that is too close to "workflow", which is analogous to a DAG, so I've picked a different name.)
> - Added an example of how Celery tasks might get context injected (
> - Noted that triggerers won't be allowed direct DB access anymore either; they run user code, so they are all just workers
> - Added a simple version/feature introspection idea to the API so that it's easier to build forward/backwards compatibility into workers if needed.
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=311626182&originalVersion=16&revisedVersion=20
>
> I'd like to start a vote on this soon, but given the 4th July holiday in the US I suspect we will be a bit reduced in numbers, so I'll give people until Tuesday 9th July to comment and will start the vote then if there are no major items outstanding.
>
> Thanks,
> Ash
>
>> On 14 Jun 2024, at 19:55, Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> First pass done - especially around the security aspects of it. Looks great.
>>
>> On Fri, Jun 14, 2024 at 2:55 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>>> I've written up a lot more of the implementation details into an AIP
>>> https://cwiki.apache.org/confluence/x/xgmTEg
>>>
>>> It's still marked as Draft/Work In Progress for now, as there are a few details we know we need to cover before the doc is complete.
>>>
>>> (There was also some discussion in the dev call about a different name for this AIP)
>>>
>>>> On 7 Jun 2024, at 19:25, Ash Berlin-Taylor <a...@apache.org> wrote:
>>>>
>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers,
>>>>
>>>> That is exactly what I was thinking too (both that no-DB should be the only option in v3, and that we need a bidirectional, purpose-designed interface) and I am working up the details.
>>>>
>>>> One of the key features of this will be giving each task try a "strong identity" that the API server can use to identify and trust the requests, likely some form of signed JWT.
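(For illustration: one possible shape of such a per-task-try identity token, assuming PyJWT. The claim names, audience, and lifetime below are guesses for the sake of the example, not the AIP's actual design.)

```python
# Hypothetical sketch: minting and verifying a signed per-task-try identity.
import datetime

import jwt  # PyJWT

SECRET = "api-server-signing-key"  # in practice a managed key, never a literal

def mint_task_token(dag_id: str, task_id: str, run_id: str, try_number: int) -> str:
    now = datetime.datetime.now(datetime.timezone.utc)
    claims = {
        "sub": f"{dag_id}/{task_id}/{run_id}/{try_number}",  # the task try
        "aud": "airflow-task-api",
        "iat": now,
        "exp": now + datetime.timedelta(hours=8),
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def verify_task_token(token: str) -> dict:
    # The API server can trust any request carrying a token that verifies,
    # and can scope its responses to the task try named in "sub".
    return jwt.decode(token, SECRET, algorithms=["HS256"], audience="airflow-task-api")
```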
>>>>
>>>> I just need to finish off some other work before I can move over to focus fully on Airflow.
>>>>
>>>> -a
>>>>
>>>> On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> I added some comments here, and I think there is one big thing that should be clarified when we get to "task isolation" - mainly its dependence on AIP-44.
>>>>>
>>>>> The Internal gRPC API (AIP-44) was designed the way it was only to allow the same codebase to be used with or without the DB. It's based on the assumption that a limited set of changes would be needed (that was underestimated) in order to support both DB and gRPC communication between workers/triggerers/DAG file processors at the same time. That was a basic assumption for AIP-44 - that we would want to keep both ways and maximum backwards compatibility (including the "pull" model of the worker getting connections and variables, and updating task state in the Airflow DB). We are still using the "DB" as a way to communicate between those components, and this does not change with AIP-44.
>>>>>
>>>>> But for Airflow 3 the whole context changes. If we go with the assumption that Airflow 3 will only have isolated tasks and no DB "option", I personally think using AIP-44 for that is a mistake. AIP-44 is merely a wrapper over existing DB calls, designed to be kept updated together with the DB code, and the whole synchronisation of state, heartbeats, variables and connection access still uses the same "DB communication" model; there is basically no way we can make it more scalable this way. We will still have the same limitations on the DB - a number of DB connections will simply be replaced with a number of gRPC connections. Essentially, better scalability and performance were never goals of AIP-44 - the assumption was that it brings isolation and nothing more changes. So I think it does not address some of the fundamental problems stated in this "isolation" document.
>>>>>
>>>>> AIP-44 merely exposes a small-ish number of methods (bigger than initially anticipated), but it only wraps the existing DB mechanism. From a performance and scalability point of view we do not get much more than we currently get with pgbouncer, which turns a big number of connections coming from workers into a smaller number of pooled connections that pgbouncer manages internally and multiplexes the calls over. The difference is that, unlike the AIP-44 Internal API server, pgbouncer does not limit the operations you can perform from the worker/triggerer/DAG file processor - that's the main difference between using pgbouncer and using our own Internal API server.
>>>>>
>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers, more in line with what Jens described in AIP-69 (WebSocket and asynchronous communication, for example, come immediately to my mind if we did not have to use the DB for that communication).
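(Again purely as an illustration of the bi-directional idea discussed above: the worker side of such a channel could look roughly like this, assuming the `websockets` library. The URL and message shapes mirror the hypothetical server sketch earlier in the thread.)

```python
# Hypothetical sketch: worker side of a bi-directional Executor <-> Worker channel.
import asyncio
import json

import websockets

async def run_task_channel(ti_id: str, base_url: str = "ws://localhost:8080") -> None:
    async with websockets.connect(f"{base_url}/ws/task/{ti_id}") as ws:

        async def heartbeat() -> None:
            # Periodic application-level liveness messages to the server.
            while True:
                await ws.send(json.dumps({"type": "heartbeat"}))
                await asyncio.sleep(10)

        hb = asyncio.create_task(heartbeat())
        try:
            async for raw in ws:  # messages pushed by the server
                msg = json.loads(raw)
                if msg.get("type") == "kill":
                    break  # terminate immediately; no heartbeat round-trip needed
        finally:
            hb.cancel()

# asyncio.run(run_task_channel("example-task-try-1"))
```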
>>>>>
>>>>> This is also why I put AIP-67 on hold: IF we go in that direction and have a "new" interface between worker, triggerer and DAG file processor, it might be way easier (and safer) to introduce multi-team in Airflow 3 rather than 2 (or we can implement it differently in Airflow 2 and differently in Airflow 3).
>>>>>
>>>>> On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:
>>>>>
>>>>>> Fellow Airflowers,
>>>>>>
>>>>>> I am following up on some of the proposed changes in the Airflow 3 proposal <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>, where more information was requested by the community, specifically around the injection of Task Execution Secrets. This topic has been discussed at various times under a variety of names, but here is a holistic proposal around the whole task context mechanism.
>>>>>>
>>>>>> This is not yet a full-fledged AIP, but it is intended to facilitate a structured discussion, which will then be followed up with a formal AIP within the next two weeks. I have included most of the text here, but please give detailed feedback in the attached document <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>, so that we can have a contextual discussion around specific points which may need more detail.
>>>>>>
>>>>>> ---
>>>>>> Motivation
>>>>>>
>>>>>> Historically, Airflow's task execution context has been oriented around local execution within a relatively trusted networking cluster.
>>>>>>
>>>>>> This includes:
>>>>>>
>>>>>> - the interaction between the Executor and the process of launching a task on Airflow Workers,
>>>>>> - the interaction between the Workers and the Airflow meta-database for connection and environment information as part of initial task startup,
>>>>>> - the interaction between the Airflow Workers and the rest of Airflow for heartbeat information, and so on.
>>>>>>
>>>>>> This has been accomplished by colocating all of the Airflow task execution code with the user task code in the same container and process.
>>>>>>
>>>>>> For Airflow users at scale, i.e. supporting multiple data teams, this has posed many operational challenges:
>>>>>>
>>>>>> - Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or Python packages
>>>>>> - Security challenges in running customer-defined code (task code within the DAGs) for multiple customers within the same operating environment and service accounts
>>>>>> - Scalability of Airflow, since one of the core Airflow scalability limitations has been the number of concurrent database connections supported by the underlying database instance. To alleviate this problem we have consistently, as an Airflow community, recommended the use of PgBouncer for connection pooling as part of an Airflow deployment.
>>>>>> - Operational issues caused by unintentional reliance on internal Airflow constructs within the DAG/Task code, which only show up, unexpectedly, during Airflow production operations, often coinciding with (but not limited to) upgrades and migrations.
>>>>>> - Operational management challenges, following from the above, for Airflow platform teams at scale, because different data teams naturally operate at different velocities. Attempting to support these different teams with a common Airflow environment is unnecessarily challenging.
>>>>>>
>>>>>> The internal API, which reduces the need for interaction between the Airflow Workers and the metadatabase, is a big and necessary step forward. However, it doesn't fully address the above challenges. The proposal below builds on the internal API proposal and goes significantly further, to not only address the challenges above but also enable the following key use cases:
>>>>>>
>>>>>> 1. Ensure that this interface reduces the interaction between the code running within the Task and the rest of Airflow. This is to address unintended ripple effects from core Airflow changes, which have caused numerous Airflow upgrade issues because Task (i.e. DAG) code relied on core Airflow abstractions. This has been a common problem pointed out by numerous Airflow users, including early adopters.
>>>>>> 2. Enable quick, performant execution of tasks on local, trusted networks, without requiring the Airflow workers / tasks to connect to the Airflow database to obtain all the information required for task startup.
>>>>>> 3. Enable remote execution of Airflow tasks across network boundaries, by establishing a clean interface for Airflow workers on remote networks to be able to connect back to a central Airflow service to access all information needed for task execution. This is foundational work for remote execution.
>>>>>> 4. Enable a clean, language-agnostic interface for task execution, with support for multiple language bindings, so that Airflow tasks can be written in languages beyond Python.
>>>>>>
>>>>>> Proposal
>>>>>>
>>>>>> The proposal here has multiple parts, as detailed below.
>>>>>>
>>>>>> 1. Formally split out the Task Execution Interface as the Airflow Task SDK (possibly named the Airflow SDK), which would be the only interface between Airflow Task user code and the Airflow system components, including the meta-database, Airflow Executor, etc.
>>>>>> 2. Disable all direct database interaction between the Airflow Workers (including Tasks being run on those Airflow Workers) and the Airflow meta-database.
>>>>>> 3. The Airflow Task SDK will include interfaces for (see the sketch at the end of this thread for one possible Python shape):
>>>>>>    - Access to needed Airflow Connections, Variables, and XCom values
>>>>>>    - Reporting heartbeats
>>>>>>    - Recording logs
>>>>>>    - Reporting metrics
>>>>>> 4. The Airflow Task SDK will support a Push mechanism for speedy local execution in trusted environments.
>>>>>> 5. The Airflow Task SDK will also support a Pull mechanism for remote Task execution environments to access information from an Airflow instance across network boundaries.
>>>>>> 6. The Airflow Task SDK will be designed to support multiple language bindings, with the first language binding of course being Python.
>>>>>>
>>>>>> Assumption: The existing AIP for the Internal API covers the interaction between the Airflow workers and the Airflow metadatabase for heartbeat information, persisting XComs, and so on.
>>>>>> --
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau
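For illustration only, as referenced from item 3 of the proposal above: one possible Python shape for such a Task SDK surface. Every name here is hypothetical - the formal AIP will define the real interface - but it shows how a single surface could be satisfied by both a Push (local, trusted) and a Pull (remote, over the network) transport.

```python
# Hypothetical sketch of a Task SDK surface; all names are illustrative.
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class Connection:
    conn_id: str
    conn_type: str
    host: str | None = None
    login: str | None = None
    password: str | None = None

class TaskSDKClient(Protocol):
    # Access to needed Airflow Connections, Variables, and XCom values
    def get_connection(self, conn_id: str) -> Connection: ...
    def get_variable(self, key: str) -> str | None: ...
    def xcom_pull(self, task_id: str, key: str = "return_value") -> Any: ...
    def xcom_push(self, key: str, value: Any) -> None: ...

    # Report heartbeat, record logs, report metrics
    def heartbeat(self) -> None: ...
    def log(self, level: str, message: str) -> None: ...
    def emit_metric(self, name: str, value: float) -> None: ...

# A local "push" implementation could be handed pre-fetched values at task
# startup, while a remote "pull" implementation would satisfy the same
# Protocol by calling back to a central Airflow API across the network.
```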