Hi Tim,

thanks for dropping the idea and the context. I think it is fine to post ideas and solutions to the devlist.

While I agree it might not be a direct candidate to embed into the core, I think such ideas contribute very well to the wider ecosystem, and there are very probably quite a few other users who would be able to leverage them. Just as Astronomer took the lead with dag-factory, this is another alternative that can be used.

I also very much favor generated DAG code, and this is a really feasible option; if it is beneficial for a use case, it is great to add it. Generated DAG code can be good, but it can also reach a lot of complexity. Full coverage might be hard, but following the Pareto principle, generating the 80% and writing the remaining special cases by hand can still help a lot. A hybrid approach is key.

If the tools you made are stable enough that you can release them, and there is a bit of generalization, then it makes sense to add them to the collection of tools for the ecosystem at https://airflow.apache.org/ecosystem/

Jens

Tech note: open a PR against https://github.com/apache/airflow-site/blob/main/landing-pages/site/content/en/ecosystem/_index.md to add it there.

On 16.07.25 03:09, Kevin Yang wrote:
Hi,

I think it is a good idea, but there are caveats. I've built a similar framework
for my team to define Airflow DAGs fully declaratively using YAML. The framework
parses the YAML and generates static Python DAG code in CI, and the generated
code is run through a linter and formatter to identify issues (a rough,
hypothetical sketch of this generation step follows the list below). The
framework offers the following benefits, especially in data engineering use
cases:


1. It standardizes our DAGs to follow a specific structure. We provide
   standardized YAML templates for developers to onboard new pipelines.
2. New developers who are not familiar with Python can easily onboard to
   Airflow, since they only need to define YAML to create new DAGs.
3. The DAG code is generated at "build time", so the Code section in the
   Airflow UI shows exactly what is in the DAG. This makes debugging and
   operating easier for the DevOps team. (Dynamic DAG Generation by default
   shows the DAG template code, which is very challenging to debug.)
4. Easier migration: DAGs onboarded through our framework can be easily
   migrated between Airflow versions. We only need to update the DAG
   generation process and use the correct API to parse the YAML.
5. Potentially enforced DAG tests: the framework can add DAG tests that are
   executed in CI.
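
For illustration, here is a minimal, hypothetical sketch of the "parse YAML,
emit static Python DAG code in CI" idea. The YAML keys (dag_id, schedule,
tasks[].name, tasks[].command) and the templates are assumptions made up for
the example, not the actual framework:

# Hypothetical sketch: generate static Python DAG source from a YAML spec.
import yaml  # PyYAML

DAG_TEMPLATE = '''\
"""Generated file -- do not edit by hand."""
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="{dag_id}",
    schedule="{schedule}",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
{tasks}
'''

TASK_TEMPLATE = '    {name} = BashOperator(task_id="{name}", bash_command={command!r})'


def generate_dag_source(spec_path: str) -> str:
    """Parse a YAML spec and return static Python DAG source code."""
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    tasks = "\n".join(
        TASK_TEMPLATE.format(name=task["name"], command=task["command"])
        for task in spec["tasks"]
    )
    return DAG_TEMPLATE.format(
        dag_id=spec["dag_id"], schedule=spec["schedule"], tasks=tasks
    )


if __name__ == "__main__":
    # In CI, this output would be written into the dags/ folder and then
    # linted/formatted before deployment.
    print(generate_dag_source("pipeline.yaml"))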

However, there are some caveats:

1. Airflow evolves very fast, with a large and growing number of providers,
   so it could be challenging to provide "full" support and checks.
2. A small error in the generated DAG can potentially pass the tests but
   still result in an import error after deployment.
3. Some teams build custom wrappers or providers on top of Airflow operators,
   which cannot be natively supported.

Therefore, I feel it could be a library of its own (not part of the framework),
offering extra support for creating DAGs. In our case, we still let users
define their own DAGs whenever a feature is not supported by the framework. I
am also interested in hearing more from the community.

Thanks,
Kevin Yang


________________________________
From: Tim Paine <t.paine...@gmail.com>
Sent: Tuesday, July 15, 2025 7:39:37 PM
To: dev@airflow.apache.org <dev@airflow.apache.org>
Subject: Declarative DAGs with Pydantic / Hydra

Hello,

Apologies for spamming the whole listserv; I wanted to share some work I've
done recently with a wider audience and wasn't sure if there was a better place
to post.

For background, many scheduling frameworks like Airflow, Dagster, Prefect, etc, 
want you to define your DAGs in Python code. Things have become increasingly 
dynamic over the years, e.g. Airflow implemented Dynamic Task Mapping.

I wanted to go in the opposite direction and eliminate Python from the equation. Astronomer has dag-factory
<https://github.com/astronomer/dag-factory> and there is also gusty
<https://github.com/pipeline-tools/gusty>, but I wanted something that leverages the extremely
configurable and extensible architecture of Hydra <https://hydra.cc/> + Pydantic
<https://docs.pydantic.dev/latest/> detailed in this blog post
<https://towardsdatascience.com/configuration-management-for-model-training-experiments-using-pydantic-and-hydra-d14a6ae84c13/>.

So I've written airflow-pydantic <https://github.com/airflow-laminar/airflow-pydantic> and
airflow-config <https://github.com/airflow-laminar/airflow-config>. The former is a collection of
Pydantic models either wrapping or validating Airflow structures, with support for instantiation (i.e.
converting to Airflow objects) and rendering (producing Python code that creates those objects). The latter
is a Hydra/Pydantic-based configuration framework which lets you define DAG/task configuration in YAML,
with support for fully declarative DAGs
<https://airflow-laminar.github.io/airflow-config/docs/src/examples.html#declarative-dags-dag-factory>.
With this, I am able to fully define DAGs in YAML
<https://github.com/airflow-laminar/validation-dags/blob/7d65eb9173602640427231861a8c36cf489140fa/validation_dags/config/config.yaml#L199>.
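
To make the general idea concrete (this is NOT the airflow-pydantic /
airflow-config API, just a toy illustration of the "validate with Pydantic,
then instantiate Airflow objects" pattern, with made-up model and field names):

from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel, Field
from airflow import DAG
from airflow.operators.bash import BashOperator


class TaskSpec(BaseModel):
    task_id: str
    bash_command: str


class DagSpec(BaseModel):
    dag_id: str
    schedule: Optional[str] = None
    start_date: datetime = datetime(2024, 1, 1)
    retries: int = Field(default=0, ge=0)
    tasks: List[TaskSpec] = []

    def instantiate(self) -> DAG:
        # Turn the validated spec into a real Airflow DAG object.
        with DAG(
            dag_id=self.dag_id,
            schedule=self.schedule,
            start_date=self.start_date,
            default_args={"retries": self.retries},
            catchup=False,
        ) as dag:
            for t in self.tasks:
                BashOperator(task_id=t.task_id, bash_command=t.bash_command)
        return dag


# The same spec could be loaded from YAML (e.g. via Hydra/OmegaConf) and
# validated before any Airflow object is built.
dag = DagSpec(
    dag_id="example",
    schedule="@daily",
    tasks=[TaskSpec(task_id="hello", bash_command="echo hello")],
).instantiate()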

I've also written a supporting cast of libraries for some things I needed:

- airflow-ha <https://github.com/airflow-laminar/airflow-ha> allows you to write
"@continuous"-style DAGs in a generic way: the DAG retriggers itself in a loop based on the
evaluation of a Python callable. I needed this for AWS MWAA, which sets time limits on DAG runs, but it can be
useful in other contexts. Here is a funny little example
<https://github.com/airflow-laminar/validation-dags/blob/main/validation_dags/config/config.yaml#L89-L104>
that retriggers a DAG repeatedly, counting down a context variable from run to run.
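
For anyone unfamiliar with the pattern, here is a rough sketch of the
self-retriggering idea using only stock operators (this is not the airflow-ha
implementation; the DAG id, callable, and conf keys are made up for the
example):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import ShortCircuitOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


def should_continue(**context) -> bool:
    # Count down a value passed through the DAG run conf from run to run.
    remaining = int((context["dag_run"].conf or {}).get("remaining", 3))
    context["ti"].xcom_push(key="remaining", value=remaining - 1)
    return remaining > 0


with DAG(
    dag_id="self_retriggering_example",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    check = ShortCircuitOperator(
        task_id="should_continue",
        python_callable=should_continue,
    )
    # Only runs (and retriggers this same DAG) if should_continue returned True.
    retrigger = TriggerDagRunOperator(
        task_id="retrigger_self",
        trigger_dag_id="self_retriggering_example",
        conf={
            "remaining": "{{ ti.xcom_pull(task_ids='should_continue', key='remaining') }}"
        },
    )
    check >> retrigger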

- airflow-supervisor <https://github.com/airflow-laminar/airflow-supervisor> integrates Airflow with supervisor
<https://supervisord.org/>, which I use for "always on" DAGs in contexts where I do not necessarily want
to rely on Airflow to be my process supervisor, or where I do not want my worker machine and my "always
on" process to be on the same machine (e.g. use the SSH Operator to go to my "always on" machine, start up a
process, and have Airflow check in periodically with supervisor to see whether the process is still running).
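
The shape of that "check in with supervisord over SSH" idea, sketched with the
stock SSH provider (not the airflow-supervisor implementation; the connection
id and program name are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="supervisor_healthcheck",
    schedule="*/15 * * * *",  # poll every 15 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Ensure the always-on program is started (idempotent: only start it if
    # supervisord does not already report it as RUNNING).
    start = SSHOperator(
        task_id="start_program",
        ssh_conn_id="always_on_host",  # placeholder connection id
        command="supervisorctl status myprogram | grep -q RUNNING || supervisorctl start myprogram",
    )

    # Fail the task (non-zero exit) if supervisord reports the program is not
    # RUNNING, so normal Airflow alerting kicks in.
    check = SSHOperator(
        task_id="check_program",
        ssh_conn_id="always_on_host",
        command="supervisorctl status myprogram | grep -q RUNNING",
    )

    start >> check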


I wanted to share these in case anyone else was working on something similar or 
found it interesting, or if anything here might be interesting as a future 
mainline feature of airflow. Apologies for spamming the full list, I wasn't 
sure where else to discuss airflow things. Feel free to ping me privately on 
any of those GitHub repos.


Tim


tim.paine.nyc