I’ve written up a lot more of the implementation details in an AIP:
https://cwiki.apache.org/confluence/x/xgmTEg

It’s still marked as Draft/Work In Progress for now, as there are a few details we
know we need to cover before the doc is complete.

(There was also some discussion in the dev call about a different name for this 
AIP)

> On 7 Jun 2024, at 19:25, Ash Berlin-Taylor <a...@apache.org> wrote:
> 
>> IMHO - if we do not want to support DB access at all from workers,
>> triggerers and DAG file processors, we should replace the current "DB"
>> bound interface with a new one specifically designed for this
>> bi-directional direct communication Executor <-> Workers,
> 
> That is exactly what I was thinking too (both that "no DB access" should be
> the only option in v3, and that we need a bidirectional, purpose-designed
> interface), and I am working up the details.
> 
> One of the key features of this will be giving each task try a "strong 
> identity" that the API server can use to identify and trust the requests, 
> likely some form of signed JWT.
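> 
> To make that a bit more concrete, here is a very rough sketch of what minting
> and verifying such a token could look like (the claim names, the PyJWT/HS256
> choice and the 5 minute TTL are purely illustrative assumptions, not a decided
> design):
> 
>     import time
>     import jwt  # PyJWT
> 
>     SIGNING_KEY = "per-deployment-secret"  # placeholder; could equally be an asymmetric key pair
> 
>     def mint_task_token(dag_id, task_id, run_id, try_number, ttl=300):
>         # Issued by the scheduler / API server when the task instance is queued.
>         now = int(time.time())
>         claims = {
>             "sub": f"{dag_id}/{task_id}/{run_id}/{try_number}",  # identity of this task try
>             "aud": "airflow-task-api",
>             "iat": now,
>             "exp": now + ttl,
>         }
>         return jwt.encode(claims, SIGNING_KEY, algorithm="HS256")
> 
>     def verify_task_token(token):
>         # Run by the API server on every request coming from a worker.
>         return jwt.decode(token, SIGNING_KEY, algorithms=["HS256"],
>                           audience="airflow-task-api")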
> 
> I just need to finish off some other work before I can move over to focus on
> Airflow fully.
> 
> -a
> 
> On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>> I added some comments here and I think there is one big thing that should
>> be clarified when we get to "task isolation" - mainly its dependence on
>> AIP-44.
>> 
>> The Internal gRPC API (AIP-44) was designed the way it was only to allow the
>> same codebase to be used with or without DB access. It's based on the
>> assumption that a limited set of changes would be needed (which turned out to
>> be underestimated) in order to support both the DB and gRPC ways of
>> communication between workers/triggerers/DAG file processors at the same
>> time. That was a basic assumption for AIP-44 - that we would want to keep
>> both ways and maximum backwards compatibility (including the "pull" model of
>> the worker getting connections and variables, and updating task state in the
>> Airflow DB). We are still using the "DB" as the way to communicate between
>> those components, and this does not change with AIP-44.
>> 
>> But for Airflow 3 the whole context changes. If we go with the assumption
>> that Airflow 3 will only have isolated tasks and no "DB" option, I personally
>> think using AIP-44 for that is a mistake. AIP-44 is merely a wrapper over
>> existing DB calls, designed to be kept updated together with the DB code. The
>> whole synchronisation of state, heartbeats, variables and connection access
>> still uses the same "DB communication" model, and there is basically no way
>> we can make it more scalable this way. We will still have the same
>> limitations on the DB - a number of DB connections will simply be replaced
>> with a number of gRPC connections. Essentially, more scalability and
>> performance have never been goals of AIP-44 - the assumption is that it only
>> brings isolation and nothing more changes. So I think it does not address
>> some of the fundamental problems stated in this "isolation" document.
>> 
>> AIP-44 merely exposes a small-ish number of methods (bigger than initially
>> anticipated), but it only wraps around the existing DB mechanism. From a
>> performance and scalability point of view, we do not get much more than we
>> currently get by using pgbouncer, which turns a big number of connections
>> coming from the workers into a smaller number of pooled connections that
>> pgbouncer manages internally and multiplexes the calls over. The main
>> difference is that, unlike the AIP-44 Internal API server, pgbouncer does not
>> limit the operations you can perform from the worker/triggerer/DAG file
>> processor.
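>> 
>> (For reference, the pgbouncer setup we typically recommend boils down to a
>> config roughly like the one below - the exact numbers are deployment
>> specific, this is only to illustrate the many-to-few multiplexing:)
>> 
>>     [databases]
>>     airflow = host=127.0.0.1 port=5432 dbname=airflow
>> 
>>     [pgbouncer]
>>     listen_port = 6432
>>     pool_mode = transaction
>>     max_client_conn = 1000   ; many worker/triggerer connections accepted
>>     default_pool_size = 20   ; ...multiplexed over a small pool of real DB connections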
>> 
>> IMHO - if we do not want to support DB access at all from workers,
>> triggerers and DAG file processors, we should replace the current "DB"
>> bound interface with a new one specifically designed for this
>> bi-directional direct communication Executor <-> Workers, more in line with
>> what Jens described in AIP-69 (WebSocket and asynchronous communication, for
>> example, immediately come to my mind if I do not have to use the DB for that
>> communication). This is also why I put AIP-67 on hold: if we go in that
>> direction and have a "new" interface between the worker, triggerer and DAG
>> file processor, it might be way easier (and safer) to introduce multi-team in
>> Airflow 3 rather than 2 (or we can implement it differently in Airflow 2 and
>> in Airflow 3).
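>> 
>> Just to illustrate the kind of interface I have in mind (a thought experiment
>> only - the endpoint, message shapes and the use of the Python "websockets"
>> library are all made-up assumptions, not a proposal):
>> 
>>     import asyncio
>>     import json
>> 
>>     import websockets  # third-party "websockets" package
>> 
>>     async def worker_loop(uri="ws://executor.internal:8080/worker/worker-1"):
>>         async with websockets.connect(uri) as ws:
>>             # push direction: report state without ever touching the Airflow DB
>>             await ws.send(json.dumps({"type": "heartbeat", "worker": "worker-1"}))
>>             # pull direction: receive work and context asynchronously over the
>>             # same long-lived connection
>>             async for message in ws:
>>                 event = json.loads(message)
>>                 if event.get("type") == "run_task":
>>                     ...  # launch the task, stream status back over the socket
>> 
>>     asyncio.run(worker_loop())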
>> 
>> 
>> 
>> On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka <vik...@astronomer.io.invalid>
>> wrote:
>> 
>>> Fellow Airflowers,
>>> 
>>> I am following up on some of the proposed changes in the Airflow 3 proposal
>>> <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>,
>>> where more information was requested by the community, specifically around
>>> the injection of Task Execution Secrets. This topic has been discussed at
>>> various times with a variety of names, but here is a holistic proposal
>>> around the whole task context mechanism.
>>> 
>>> This is not yet a full-fledged AIP, but is intended to facilitate a
>>> structured discussion, which will then be followed up with a formal AIP
>>> within the next two weeks. I have included most of the text here, but
>>> please give detailed feedback in the attached document
>>> <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>,
>>> so that we can have a contextual discussion around specific points which
>>> may need more detail.
>>> ---
>>> Motivation
>>> 
>>> Historically, Airflow’s task execution context has been oriented around
>>> local execution within a relatively trusted network.
>>> 
>>> This includes:
>>> 
>>>   - the interaction between the Executor and the process of launching a task
>>>     on Airflow Workers,
>>>   - the interaction between the Workers and the Airflow meta-database for
>>>     connection and environment information as part of initial task startup,
>>>   - the interaction between the Airflow Workers and the rest of Airflow for
>>>     heartbeat information, and so on.
>>> 
>>> This has been accomplished by colocating all of the Airflow task execution
>>> code with the user task code in the same container and process.
>>> 
>>> 
>>> 
>>> For Airflow users at scale, i.e. those supporting multiple data teams, this
>>> has posed many operational challenges:
>>> 
>>>   - Dependency conflicts for administrators supporting data teams that use
>>>     different versions of providers, libraries, or Python packages.
>>>   - Security challenges in running customer-defined code (task code within
>>>     the DAGs) for multiple customers within the same operating environment
>>>     and under the same service accounts.
>>>   - Scalability of Airflow, since one of the core scalability limitations
>>>     has been the number of concurrent database connections supported by the
>>>     underlying database instance. To alleviate this problem, we have
>>>     consistently, as an Airflow community, recommended the use of PgBouncer
>>>     for connection pooling as part of an Airflow deployment.
>>>   - Operational issues caused by unintentional reliance on internal Airflow
>>>     constructs within DAG/Task code, which show up unexpectedly only during
>>>     Airflow production operations, most commonly (but not only) during
>>>     upgrades and migrations.
>>>   - Operational management at scale for Airflow platform teams, because
>>>     different data teams naturally operate at different velocities, and
>>>     attempting to support these different teams within a common Airflow
>>>     environment is unnecessarily challenging.
>>> 
>>> 
>>> 
>>> The Internal API, which reduces the need for interaction between the Airflow
>>> Workers and the meta-database, is a big and necessary step forward. However,
>>> it doesn’t fully address the above challenges. The proposal below builds on
>>> the Internal API proposal and goes significantly further, not only addressing
>>> the challenges above but also enabling the following key use cases:
>>> 
>>>   1. Ensure that this interface reduces the interaction between the code
>>>      running within the Task and the rest of Airflow. This addresses
>>>      unintended ripple effects from core Airflow changes, which have caused
>>>      numerous Airflow upgrade issues because Task (i.e. DAG) code relied on
>>>      core Airflow abstractions. This has been a common problem pointed out
>>>      by numerous Airflow users, including early adopters.
>>>   2. Enable quick, performant execution of tasks on local, trusted networks,
>>>      without requiring the Airflow workers / tasks to connect to the Airflow
>>>      database to obtain all the information required for task startup.
>>>   3. Enable remote execution of Airflow tasks across network boundaries, by
>>>      establishing a clean interface for Airflow workers on remote networks
>>>      to connect back to a central Airflow service and access all information
>>>      needed for task execution. This is foundational work for remote
>>>      execution.
>>>   4. Enable a clean, language-agnostic interface for task execution, with
>>>      support for multiple language bindings, so that Airflow tasks can be
>>>      written in languages beyond Python.
>>> 
>>> Proposal
>>> 
>>> The proposal here has multiple parts as detailed below.
>>> 
>>>   1. Formally split out the Task Execution Interface as the Airflow Task SDK
>>>      (possibly named the Airflow SDK), which would be the only interface
>>>      between Airflow Task user code and the Airflow system components,
>>>      including the meta-database, the Airflow Executor, etc.
>>>   2. Disable all direct database interaction between the Airflow Workers
>>>      (including Tasks being run on those Workers) and the Airflow
>>>      meta-database.
>>>   3. The Airflow Task SDK will include interfaces for (a rough sketch
>>>      follows after this list):
>>>      - access to needed Airflow Connections, Variables, and XCom values
>>>      - reporting heartbeats
>>>      - recording logs
>>>      - reporting metrics
>>>   4. The Airflow Task SDK will support a Push mechanism for speedy local
>>>      execution in trusted environments.
>>>   5. The Airflow Task SDK will also support a Pull mechanism for remote Task
>>>      execution environments to access information from an Airflow instance
>>>      across network boundaries.
>>>   6. The Airflow Task SDK will be designed to support multiple language
>>>      bindings, with the first language binding of course being Python.
>>> 
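>>> To make the SDK surface concrete, here is an illustrative sketch (the names
>>> and signatures below are placeholders of our own making, not the actual
>>> interface, which the AIP will define):
>>> 
>>>     from abc import ABC, abstractmethod
>>>     from typing import Any, Optional
>>> 
>>>     class TaskExecutionInterface(ABC):
>>>         """The only surface Task user code talks to; no direct DB access."""
>>> 
>>>         @abstractmethod
>>>         def get_connection(self, conn_id: str) -> dict: ...
>>> 
>>>         @abstractmethod
>>>         def get_variable(self, key: str) -> Optional[str]: ...
>>> 
>>>         @abstractmethod
>>>         def xcom_pull(self, task_id: str, key: str = "return_value") -> Any: ...
>>> 
>>>         @abstractmethod
>>>         def xcom_push(self, key: str, value: Any) -> None: ...
>>> 
>>>         @abstractmethod
>>>         def heartbeat(self) -> None: ...
>>> 
>>>         @abstractmethod
>>>         def log(self, level: str, message: str) -> None: ...
>>> 
>>>         @abstractmethod
>>>         def emit_metric(self, name: str, value: float) -> None: ...
>>> 
>>> Two concrete implementations could then cover the two modes above: a "push"
>>> one, where the executor hands the task all of its context up front for fast
>>> local execution, and a "pull" one, where a remote worker calls back to a
>>> central Airflow API endpoint across the network boundary.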
>>> 
>>> Assumption: The existing AIP for Internal API covers the interaction
>>> between the Airflow workers and Airflow metadatabase for heartbeat
>>> information, persisting XComs, and so on.
>>> --
>>> 
>>> Best regards,
>>> 
>>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau
>>> 
