Very much agree with Jens. Publishing in the "ecosystem" page is likely the best approach.
And maybe I will add a few reasons why this might not be a good candidate for "Airflow core" and why it's better that others publish (many) similar solutions.

I think you should really think of Python as the "DAG design API" that Airflow chose (a long time ago, and it holds up well I think). Simply put, you describe your DAGs in Python - which is imperative, Turing-complete and allows you to define your DAGs in arbitrarily complex ways. Any "declarative" way is - almost by definition - not Turing-complete, and you cannot do some things with it that Turing-complete approaches can do. And while you can try to address "all" reasonable use cases with a "sufficiently complex" declarative language, it's just impossible - unless your declarative solution becomes Turing-complete itself, which (effectively) turns it into yet another programming language, and the way it is implemented often makes the declarative way of designing DAGs even more complex than writing Python code.

Been there, done that. In my long history of software engineering, I participated in or led five or six projects that started with "declarative is simpler than coding" and ended up with a terribly complex declarative "programming language" that was not really possible to reason about without (wait for it!) using some "real" programming language to generate the proper "declarative" representation. Effectively, those solutions often turned the declarative format into an intermediate format between two programming languages, not something usable by the humans who we think should be able to write those declarative representations on their own without learning a programming language. For me (and this is the lesson from those projects), you cannot define a simple-enough declarative language that covers even a slightly complex set of use cases. That's simply not possible - literally because of the Turing-completeness (or not) of such solutions.

And in the case of Airflow it also means that:

* Python is quite a cool API for designing Dags: it has IDE support, a lot of people know it and can use it, and it is flexible and Turing-complete.
* It's also quite a cool API for a big number of "declarative approaches" - each of which can generate arbitrary Python code but serves only a limited set of use cases.
* This means there is space for a number of different declarative approaches - by different authors, serving different purposes, catering to specific subsets of users.
* It also means it is way, way, way easier for whatever user of Airflow to build declarative solutions that are specific to their use cases (with simple declarative formats that are easy to reason about) than to have a "generic declarative solution". Company X, which uses only a subset of operators and has similar patterns for all their Dags, will likely have an easier job building a declarative way for their users that covers only their specific use cases than using a generic tool to do so - that's why we have so many Airflow Summit talks on "how we built OUR declarative way of writing Dags", by many companies, and they often describe quite different approaches and solutions (see the sketch after this list for how small such a solution can be).
* It also means that if we decide (in Airflow) to support some "declarative" approach, it will be heavily limited to only a subset of use cases, and it will not be easy to make it cover a wide range of cases without becoming too complex to be useful.
* So I believe that having a lot of "limited use case" approaches as reusable, 3rd-party-run projects, plus encouraging companies to build their own if needed, is a good and long-term sustainable solution.
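To make the "Company X" example concrete - here is a minimal sketch of what such a company-specific loader could look like. It is only an illustration: the YAML fields, the make_dag helper and the recent-Airflow-2.x-style imports (including the `schedule` argument) are assumptions, not a proposal for any particular project.

    import yaml
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    def make_dag(path: str) -> DAG:
        """Build a DAG from a tiny, company-specific YAML spec.

        Expected shape (all field names invented for this sketch):
            dag_id: my_pipeline
            schedule: "@daily"
            tasks:
              - id: extract
                bash: "python extract.py"
              - id: load
                bash: "python load.py"
        """
        with open(path) as f:
            spec = yaml.safe_load(f)
        dag = DAG(
            dag_id=spec["dag_id"],
            schedule=spec.get("schedule"),
            start_date=datetime(2024, 1, 1),
        )
        previous = None
        for task in spec["tasks"]:
            op = BashOperator(task_id=task["id"], bash_command=task["bash"], dag=dag)
            if previous is not None:
                previous >> op  # chain tasks in file order - this is the whole "DSL"
            previous = op
        return dag

    # In a real dag file you would expose the result at module level, e.g.:
    # dag = make_dag("/path/to/company_x_pipeline.yaml")

Because the spec only has to cover one company's patterns, the loader stays small and easy to reason about; the complexity described above only appears when you try to make such a format generic.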
J.

On Wed, Jul 16, 2025 at 6:30 PM Jens Scheffler <j_scheff...@gmx.de.invalid> wrote:

> Hi Tim,
>
> thanks for dropping the idea and the context. I think it is fine to post ideas and solutions to the devlist.
>
> While I agree it might not be a direct candidate to embed into the core, I think such ideas contribute very well to the wider ecosystem, and there are very probably a couple/many other users who would be able to leverage them. Just as Astronomer took the lead on dag-factory, this is another alternative that can be used.
>
> I also very much favor generated Dag code - it is really a feasible option, and if it is beneficial for a use case it is cool to add it. Generated Dag code can be good, but it can also reach a lot of complexity. Full coverage might be hard, but applying a Pareto principle - say 80/20 is generated - might help, and some special things could still be written manually. Hybrid is key.
>
> If the proposals you made are stable enough that you can release them and there is a bit of generalization, then it makes sense to add them to the collection of tools for the ecosystem at https://airflow.apache.org/ecosystem/
>
> Jens
>
> Tech note: drop a PR for https://github.com/apache/airflow-site/blob/main/landing-pages/site/content/en/ecosystem/_index.md to add it there.
>
> On 16.07.25 03:09, Kevin Yang wrote:
> > Hi,
> >
> > I think it is a good idea, but there are caveats. I've built a similar framework for my team to define Airflow DAGs fully declaratively using YAML. The framework parses the YAML and generates static Python DAG code in CI (sketched roughly below). The generated Python code is run through a linter and formatter to identify issues. The framework offers the following benefits, especially in data engineering use cases:
> >
> > 1. It standardizes our DAGs to follow a specific structure. We provide standardized YAML templates for developers to onboard new pipelines.
> > 2. New developers who are not familiar with Python can easily onboard to Airflow, since they only need to define YAML to create new DAGs.
> > 3. The DAG code is generated at "build time", so the Code section in the Airflow UI explicitly shows everything in the DAG. That makes debugging and operating easier for the DevOps team. (Dynamic DAG Generation by default shows the DAG template code, which is very challenging for debugging.)
> > 4. Easier migration: DAGs onboarded through our framework can easily be migrated between Airflow versions. We only need to update the DAG generation process and use the correct API to parse the YAML.
> > 5. Potentially enforce DAG tests. The framework can add dag tests which could be executed in CI.
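> >
> > Roughly, the build-time generation step has this shape (heavily simplified; the YAML fields, the template and the file layout here are placeholders for illustration, not the framework's actual schema):
> >
> >     import sys
> >
> >     import yaml
> >
> >     # The generated .py file is what gets linted, formatted, tested and
> >     # deployed - not this generator script.
> >     TEMPLATE = '''\
> >     from airflow import DAG
> >     from airflow.operators.bash import BashOperator
> >
> >     with DAG(dag_id={dag_id!r}, schedule={schedule!r}) as dag:
> >     {tasks}
> >     '''
> >
> >     def render(spec_path: str, out_path: str) -> None:
> >         with open(spec_path) as f:
> >             spec = yaml.safe_load(f)
> >         tasks = "\n".join(
> >             f'    BashOperator(task_id={t["id"]!r}, bash_command={t["bash"]!r})'
> >             for t in spec["tasks"]
> >         )
> >         with open(out_path, "w") as f:
> >             f.write(TEMPLATE.format(dag_id=spec["dag_id"],
> >                                     schedule=spec.get("schedule"),
> >                                     tasks=tasks))
> >
> >     if __name__ == "__main__":
> >         # run in CI, e.g.: python generate.py pipeline.yaml dags/pipeline.py
> >         render(sys.argv[1], sys.argv[2])
> >
> > Everything downstream - the linter, the dag tests, the Code tab in the UI - only ever sees the generated module.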
> >
> > However, below are the caveats:
> >
> > 1. Airflow evolves very fast, with nearly a hundred providers; it could be challenging to provide "full" support and checks.
> > 2. A small error in the generated DAG can potentially pass the tests but still result in an import error after deployment.
> > 3. There are teams that build custom wrappers or providers on top of Airflow operators, which could not be natively supported.
> >
> > Therefore, I feel like it could be a library on its own (not part of the core framework), offering extra support to create DAGs. In my case, we still let users define their own DAGs if a feature is not supported by the framework. I am also interested in hearing more from the community.
> >
> > Thanks,
> > Kevin Yang
> >
> > Sent from Outlook for iOS<https://aka.ms/o0ukef>
> > ________________________________
> > From: Tim Paine <t.paine...@gmail.com>
> > Sent: Tuesday, July 15, 2025 7:39:37 PM
> > To: dev@airflow.apache.org <dev@airflow.apache.org>
> > Subject: Declarative DAGs with Pydantic / Hydra
> >
> > Hello,
> >
> > Apologies for spamming the whole listserv - I wanted to share some work I've done recently with a wider audience and wasn't sure if there was a better place to post.
> >
> > For background, many scheduling frameworks like Airflow, Dagster, Prefect, etc. want you to define your DAGs in Python code. Things have become increasingly dynamic over the years, e.g. Airflow implemented Dynamic Task Mapping.
> >
> > I wanted to go in the opposite direction and eliminate Python from the equation. Astronomer has dag-factory <https://github.com/astronomer/dag-factory> and there is also gusty <https://github.com/pipeline-tools/gusty>, but I wanted something to leverage the extremely configurable and extensible architecture of Hydra <https://hydra.cc/> + Pydantic <https://docs.pydantic.dev/latest/> detailed in this blog post <https://towardsdatascience.com/configuration-management-for-model-training-experiments-using-pydantic-and-hydra-d14a6ae84c13/>.
> >
> > So I've written airflow-pydantic <https://github.com/airflow-laminar/airflow-pydantic> and airflow-config <https://github.com/airflow-laminar/airflow-config>. The former is a collection of Pydantic models either wrapping or validating Airflow structures, with support for instantiation (e.g. converting to Airflow objects) or rendering (producing Python code that creates the Python objects). The latter is a Hydra/Pydantic-based configuration framework which lets you define DAG/task configuration in YAML, with support for fully declarative DAGs <https://airflow-laminar.github.io/airflow-config/docs/src/examples.html#declarative-dags-dag-factory>. With this, I am able to fully define DAGs in YAML <https://github.com/airflow-laminar/validation-dags/blob/7d65eb9173602640427231861a8c36cf489140fa/validation_dags/config/config.yaml#L199>.
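> >
> > Roughly, the idea behind the former is that a Pydantic model validates a task definition and can then either build the real Airflow object or emit the equivalent Python source. Very simplified, with invented names (this is not the actual airflow-pydantic API, just the shape of the idea):
> >
> >     from pydantic import BaseModel
> >
> >     class BashTaskModel(BaseModel):
> >         # Illustrative only - field and method names are invented.
> >         task_id: str
> >         bash_command: str
> >
> >         def instantiate(self):
> >             # "instantiation": turn the validated model into a real Airflow operator
> >             from airflow.operators.bash import BashOperator
> >             return BashOperator(task_id=self.task_id, bash_command=self.bash_command)
> >
> >         def render(self) -> str:
> >             # "rendering": emit Python source that would create the same object
> >             return (f"BashOperator(task_id={self.task_id!r}, "
> >                     f"bash_command={self.bash_command!r})")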
> >
> > I've also written a supporting cast of libraries for some things I needed:
> >
> > - airflow-ha <https://github.com/airflow-laminar/airflow-ha> allows you to write "@continuous"-style DAGs in a generic way, by looping to retrigger the DAG based on the evaluation of a Python callable. I needed this for AWS MWAA, which sets time limits on DAG runs, but it can be useful in other contexts. Here is a funny little example <https://github.com/airflow-laminar/validation-dags/blob/main/validation_dags/config/config.yaml#L89-L104> that retriggers a DAG repeatedly, counting down a context variable from run to run.
> >
> > - airflow-supervisor <https://github.com/airflow-laminar/airflow-supervisor> integrates Airflow with supervisor <https://supervisord.org/>, which I use for "always on" DAGs in contexts where I do not necessarily want to rely on Airflow to be my process supervisor, or where I do not want my worker machine and my "always on" process to be on the same machine (e.g. use the SSH Operator to go to my "always on" machine, start up a process, and have Airflow check in periodically with supervisor to see if the process is still running).
> >
> > I wanted to share these in case anyone else was working on something similar or found it interesting, or in case anything here might be interesting as a future mainline feature of Airflow. Apologies for spamming the full list, I wasn't sure where else to discuss Airflow things. Feel free to ping me privately on any of those GitHub repos.
> >
> > Tim
> >
> > tim.paine.nyc
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
> For additional commands, e-mail: dev-h...@airflow.apache.org