Re: [PR] docs: update architecture overview to Airflow 3 architecture [airflow]

via GitHub Tue, 09 Jun 2026 14:08:24 -0700


ferruzzi commented on code in PR #67994:
URL: https://github.com/apache/airflow/pull/67994#discussion_r3383807516



##########
airflow-core/docs/core-concepts/overview.rst:
##########
@@ -49,15 +49,19 @@ A minimal Airflow installation consists of the following components:
   a configuration property of the *scheduler*, not a separate component and runs within the scheduler
   process. There are several executors available out of the box, and you can also write your own.
 
-* A *Dag processor*, which parses Dag files and serializes them into the
+* A *Dag processor*, which parses Dag files from a *Dag bundle* and serializes them into the
   *metadata database*. More about processing Dag files can be found in
   :doc:`/administration-and-deployment/dagfile-processing`
 
-* A *webserver*, which presents a handy user interface to inspect, trigger and debug the behaviour of
-  Dags and tasks.
+* A *Dag bundle*, which is configured for the *Dag processor* to parse Dag files from and allow *workers* to access the correct version of the Dag file. By default, this is a local folder on disk. More about Dag bundles can be found in
+  :doc:`/administration-and-deployment/dag-bundles`
 
-* A folder of *Dag files*, which is read by the *scheduler* to figure out what tasks to run and when to
-  run them.
+* An *API Server*, which serves the REST API and presents a handy user interface to inspect, trigger and debug the behaviour of
+  Dags and tasks. The API server is also used by *workers* to communicate state back to Airflow, without requiring direct access
+  to the *metadata database*.
+
+* The *Task SDK*, which is an isolated runtime environment inside the *workers* that executes the user-defined Dag code.
+  This acts as a way to isolate execution of user code by routing all execution through the API server. This protects the *metadata database* and other Airflow components from direct access, and allows for better security and stability of Airflow.

Review Comment:
   Technically correct but might be over-promising a bit.  Rephrase this just 
to be safe:  
   
   ```suggestion
     This acts as a way to isolate execution of user code by routing all execution through the API server. This protects the *metadata database* and other Airflow components from direct access from user code, and allows for better security and stability of Airflow. The worker process itself may have access depending on the deployment details.
   ```



##########
airflow-core/docs/core-concepts/overview.rst:
##########
@@ -120,25 +124,24 @@ finally with more isolated security perimeters.
 
 The meaning of the different connection types in the diagrams below is as follows:
 
-* **brown solid lines** represent *Dag files* submission and synchronization
+* **brown solid lines** represent *Dag bundles* submission and synchronization
 * **blue solid lines** represent deploying and accessing *installed packages* and *plugins*
 * **black dashed lines** represent control flow of workers by the *scheduler* (via executor)
 * **black solid lines** represent accessing the UI to manage execution of the workflows
-* **red dashed lines** represent accessing the *metadata database* by all components
+* **red dashed lines** represent accessing the *metadata database*
 
 .. _overview-basic-airflow-architecture:
 
 ..
-  TODO AIP-66 / AIP-72: These example architectures and diagrams need to be updated to reflect AF3 changes
-  like bundles, required Dag processor, execution api, etc.
+  TODO AIP-72: These diagrams need to be updated to reflect AF3 changes like bundles, required Dag processor, execution api, etc.
 
 Basic Airflow deployment
 ........................
 
 This is the simplest deployment of Airflow, usually operated and managed on a single
 machine. Such a deployment usually uses the LocalExecutor, where the *scheduler* and the *workers* are in
-the same Python process and the *Dag files* are read directly from the local filesystem by the *scheduler*.
-The *webserver* runs on the same machine as the *scheduler*. There is no *triggerer* component, which
+the same Python process. The *Dag processor* runs on the same machine, reads Dag files from the *Dag bundle* and serializes them into the *metadata database*
+for the *scheduler* to read. The *API server* runs on the same machine as the *scheduler*. There is no *triggerer* component, which
 means that task deferral is not possible.

Review Comment:
   I believe this can be updated as well: triggerers now launch by default, so I'm pretty sure we can drop this sentence entirely:
   
   > There is no *triggerer* component, which means that task deferral is not possible.



##########
airflow-core/docs/core-concepts/overview.rst:
##########
@@ -159,16 +162,16 @@ and where various roles of users are introduced - *Deployment Manager*, **Dag au
 **Operations User**. You can read more about those various roles in the :doc:`/security/security_model`.
 
 In the case of a distributed deployment, it is important to consider the security aspects of the components.
-The *webserver* does not have access to the *Dag files* directly. The code in the ``Code`` tab of the
-UI is read from the *metadata database*. The *webserver* cannot execute any code submitted by the
+The *API server* does not have access to the *Dag bundles* directly. The code in the ``Code`` tab of the
+UI is read from the *metadata database*. The *API server* cannot execute any code submitted by the
 **Dag author**. It can only execute code that is installed as an *installed package* or *plugin* by
 the **Deployment Manager**. The **Operations User** only has access to the UI and can only trigger
 Dags and tasks, but cannot author Dags.
 
-The *Dag files* need to be synchronized between all the components that use them - *scheduler*,
-*triggerer* and *workers*. The *Dag files* can be synchronized by various mechanisms - typical
-ways how Dags can be synchronized are described in :doc:`helm-chart:manage-dag-files` of our
-Helm Chart documentation. Helm chart is one of the ways how to deploy Airflow in K8S cluster.
+The *Dag processor*, *triggerer* and *workers* all need access to the *Dag bundles*. The *scheduler* reads the serialized Dag from the *metadata database* and does not need direct access to the *Dag bundles*.
+In a distributed deployment, the *workers* get a specific *Dag bundle* version defined by the *scheduler* when executing a task.
+Typical ways to
+configure DAG bundle backends are described in :doc:`/administration-and-deployment/dag-bundles`.
 

Review Comment:
   I believe Helm is still supported and references to it should stay.



##########
airflow-core/docs/core-concepts/overview.rst:
##########
@@ -159,16 +162,16 @@ and where various roles of users are introduced - *Deployment Manager*, **Dag au
 **Operations User**. You can read more about those various roles in the :doc:`/security/security_model`.
 
 In the case of a distributed deployment, it is important to consider the security aspects of the components.
-The *webserver* does not have access to the *Dag files* directly. The code in the ``Code`` tab of the
-UI is read from the *metadata database*. The *webserver* cannot execute any code submitted by the
+The *API server* does not have access to the *Dag bundles* directly. The code in the ``Code`` tab of the
+UI is read from the *metadata database*. The *API server* cannot execute any code submitted by the
 **Dag author**. It can only execute code that is installed as an *installed package* or *plugin* by
 the **Deployment Manager**. The **Operations User** only has access to the UI and can only trigger
 Dags and tasks, but cannot author Dags.
 
-The *Dag files* need to be synchronized between all the components that use them - *scheduler*,
-*triggerer* and *workers*. The *Dag files* can be synchronized by various mechanisms - typical
-ways how Dags can be synchronized are described in :doc:`helm-chart:manage-dag-files` of our
-Helm Chart documentation. Helm chart is one of the ways how to deploy Airflow in K8S cluster.
+The *Dag processor*, *triggerer* and *workers* all need access to the *Dag bundles*. The *scheduler* reads the serialized Dag from the *metadata database* and does not need direct access to the *Dag bundles*.

Review Comment:
   I don't think the triggerer needs that access, does it? I'm pretty sure it reads the bundles from the db, not directly. It's very possible I am wrong here.



