I also looked at it, and I love it as well. I think of it as a missing abstraction between current Airflow users and current LLM app developers. I also proposed something a little bolder there, which I think shows the true potential of that approach. I added a comment in the doc, but I will copy it here for better visibility:
---

After thinking quite a bit about the proposal, I actually love it, and I think it should be the next frontier of making Airflow abstractions more approachable and usable by those who want to implement various patterns of interacting with LLMs.

I have a slightly different opinion than Jens regarding HITL. I see those common LLM operators as slightly "higher-level" operators that might implement a set of common LLM-related patterns that are currently either difficult or impossible to express by putting together a Dag out of individual tasks. In this sense, such an operator could make HITL call-outs for approval or selection from within its own execution - without completing the operator - and could even run those call-outs more than once, potentially an unbounded number of times, during a single execution.

This is actually a great way for us to implement some "cyclicness" without breaking the "acyclic" property of our Dags (for now at least). Making Dags "cyclic" would be quite a dramatic change, and possibly we do not even have to do it, because the "cyclic" part can likely be encapsulated within the specialized LLM operators. I can imagine an operator that performs an LLM query and refines it via additional LLM interactions "internally", during a single operator's execution, and some of those iterations might result in a HITL call-out - even multiple times during one execution.

One more proposal I have here is to use an API similar to HITL (or maybe repurpose HITL for it) to report the PROGRESS of such a task.
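To make the idea concrete, here is a minimal, purely illustrative sketch in plain Python - not a real Airflow API; the names `refine_with_hitl`, `CallOut`, and the callback parameters are all hypothetical - of an operator-internal loop that refines an LLM answer, makes HITL approval call-outs as many times as needed, and emits progress events between iterations, while the Dag itself stays acyclic:

```python
# Hypothetical sketch (not a real Airflow API): the "cyclic" refinement loop
# lives entirely inside a single task execution. The LLM, the HITL approval
# call-out, and the proposed progress API are modeled as plain callbacks.

from dataclasses import dataclass
from typing import Callable


@dataclass
class CallOut:
    kind: str      # "progress" here; a real API might also carry "approval"
    payload: str


def refine_with_hitl(
    query_llm: Callable[[str], str],    # stand-in for a real LLM call
    ask_human: Callable[[str], bool],   # stand-in for a HITL approval call-out
    report: Callable[[CallOut], None],  # stand-in for the proposed progress API
    prompt: str,
    max_iterations: int = 5,
) -> str:
    draft = query_llm(prompt)
    for i in range(1, max_iterations + 1):
        # Progress call-out: lets a human outside the loop monitor partial output.
        report(CallOut("progress", f"iteration {i}: {len(draft)} chars"))
        # HITL call-out: may fire multiple times during one execution.
        if ask_human(draft):
            return draft
        draft = query_llm(f"Refine this answer: {draft}")
    return draft


# Toy demo with canned responses: the human rejects the first draft and
# approves the second, so the loop runs twice inside one "task execution".
if __name__ == "__main__":
    events: list[CallOut] = []
    answers = iter(["draft v1", "draft v2"])
    approvals = iter([False, True])
    result = refine_with_hitl(
        query_llm=lambda p: next(answers),
        ask_human=lambda d: next(approvals),
        report=events.append,
        prompt="Generate data quality checks",
    )
    print(result)       # draft v2
    print(len(events))  # 2 progress call-outs were emitted
```

The point is that the loop (the "cyclic" part) is invisible to the Dag, which only sees one task, while the HITL and progress call-outs surface to the user as many times as the iteration needs.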
This is a typical property of a good LLM task: it provides feedback to the user. That might be HITL, when it asks for something, but it might also be HOOTL (Human Outside Of The Loop), where the task simply reports its progress and allows the user to take asynchronous actions based on that progress - for example, abort the execution (to stop the Dag), mark it as "skipped" (to trigger the skip-processing path), or mark it as "success" to simulate things being completed when they are not. While we already have those three "async" operations, we do not currently have "progress" targeted at the kind of actor who is also the HITL "actor" - someone who is not interested in detailed logs, but rather wants to monitor progress and assess the quality of the output, even if it is just partial output in an iterative process.

I think it will be easier and much more "surgical" (and applied in the right place) to embed this "iterative" feedback/progress than to modify the "acyclic" property of our Dags. Also, this kind of progress interface could be used to publish the progress of "async" tasks as the next step of [WIP] AIP-98: Add async support for PythonOperator in Airflow 3 (https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-98%3A+Add+async+support+for+PythonOperator+in+Airflow+3), which we discussed with David J.

On Sun, Dec 28, 2025 at 2:16 PM Jens Scheffler <[email protected]> wrote:

> I like the AIP very much and in my view can be made completely in a
> Provider package... with some comments (I assume non blocking) and would
> propose to really start in increments and then adjust by learning on the
> path.
>
> On 12/27/25 22:00, Pavankumar Gopidesu wrote:
> > Thanks Giorgio Zoppi, for reviewing the AIP, yes its already planned
> > part of this AIP, see the [1] example, where you can disable hitl
> > step or enable it. So its integrated part of the Operator with the
> > help of HITL operator.
> >
> > ```
> > LLMDataQualityOperator(
> >     task_id="customer_quality_analysis",
> >     data_sources=[customer_s3],
> >     prompt="Generate data quality validation queries",
> >     require_approval=True,  # Built-in HITL
> >     approval_timeout=timedelta(hours=2)
> > )
> > ```
> >
> > [1]:
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
> >
> > Regards,
> > Pavan
> >
> > On Sat, Dec 27, 2025 at 9:16 AM Giorgio Zoppi <[email protected]> wrote:
> >> Hello,
> >> Just 1c, skimming AIP,
> >> You might want to explore on how to avoid human approval for generated
> >> query using llm as judge to eval the quality. The nice thing of data
> >> pipelines is automation
> >>
> >> On Wed, Dec 24, 2025, 10:23 Pavankumar Gopidesu <[email protected]> wrote:
> >>
> >>> Hello everyone,
> >>>
> >>> The thread has been quiet for some time, and I would like to restart
> >>> the discussion with the AIP.
> >>>
> >>> First, a sincere thank you to Kaxil for presenting the idea at Airflow
> >>> Summit 2025. The session was very well received, and many attendees
> >>> expressed strong interest in the proposal. Unfortunately, I was unable
> >>> to attend the summit due to visa issues, but I am hopeful I will be
> >>> able to join next year.
> >>>
> >>> The demo included well-structured prototypes. For those who were
> >>> unable to attend the session, please refer to the recorded talk here
> >>> [1].
> >>>
> >>> I have also drafted the complete AIP proposal, which is available here
> >>> [2]. I would greatly appreciate your reviews and look forward to
> >>> feedback and further discussion.
> >>>
> >>> Finally, to those celebrating Christmas, I wish you a very happy
> >>> Christmas and a wonderful holiday season.
> >>>
> >>> Regards
> >>> Pavan
> >>>
> >>> [1] https://www.youtube.com/watch?v=XSAzSDVUi2o
> >>> [2]
> >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
> >>>
> >>> On Wed, Oct 15, 2025 at 6:13 AM Amogh Desai <[email protected]> wrote:
> >>>> Thanks Pavan and Kaxil, seems like an interesting idea and a pretty
> >>>> reasonable problem to solve.
> >>>>
> >>>> I also like the idea of starting with
> >>>> `apache-airflow-providers-common-ai`
> >>>> and expanding as / when needed.
> >>>>
> >>>> Looking forward to when the recording will be out, missed attending this
> >>>> session at the Airflow Summit.
> >>>>
> >>>> Thanks & Regards,
> >>>> Amogh Desai
> >>>>
> >>>> On Thu, Oct 9, 2025 at 10:49 AM Kaxil Naik <[email protected]> wrote:
> >>>>
> >>>>> Yea I think it should be apache-airflow-providers-common-ai
> >>>>>
> >>>>> On Wed, 8 Oct 2025 at 02:04, Pavankumar Gopidesu <[email protected]> wrote:
> >>>>>
> >>>>>> Yes its new provider starting with completely experimental, we dont
> >>>>>> want to break functionalities with existing providers :)
> >>>>>>
> >>>>>> Mostly its sql based operators, so named it as sql-ai but agree we can
> >>>>>> make it generic without specifying sql in it :)
> >>>>>>
> >>>>>> Pavan
> >>>>>>
> >>>>>> On Tue, Oct 7, 2025 at 3:48 PM Ryan Hatter via dev
> >>>>>> <[email protected]> wrote:
> >>>>>>> Would this really necessitate a new provider? Should this just be
> >>>>>>> baked into the common SQL provider?
> >>>>>>>
> >>>>>>> Alternatively, instead of a narrow `sql-ai` provider, why not have a
> >>>>>>> generic common ai provider with a SQL package, which would allow for
> >>>>>>> us to build AI-based subpackages into the provider other than just SQL?
> >>>>>>>
> >>>>>>> On Mon, Oct 6, 2025 at 4:31 PM Pavankumar Gopidesu <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> @Giorgio Yes indeed that's also a good thought to integrate.
> >>>>>>>> I will keep it in mind when I draft the AIP, and I will message
> >>>>>>>> about this a bit more :)
> >>>>>>>> Yes please join. We have great demos packed on this topic :)
> >>>>>>>>
> >>>>>>>> @kaxil, yes, that's a great blog post from Wren AI, leveraging
> >>>>>>>> Apache DataFusion as a query engine to connect to different data
> >>>>>>>> sources.
> >>>>>>>>
> >>>>>>>> Pavan
> >>>>>>>>
> >>>>>>>> On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Pavan,
> >>>>>>>>> Some notes:
> >>>>>>>>> 1. An LLM can also be very useful in detecting the root causes of
> >>>>>>>>> errors while developing and designing a pipeline. Let me explain:
> >>>>>>>>> we had several Spark processes in the past; when everything is
> >>>>>>>>> green it is fine, but when one fails, it would be nice to have an
> >>>>>>>>> integrated tool to ask why.
> >>>>>>>>> 2. Ideally such an operator could be a ModelContextProtocolOperator,
> >>>>>>>>> and you would need nothing more than to pass an LLM as a parameter
> >>>>>>>>> to that operator, then just call tools, execute queries, and so on.
> >>>>>>>>> This would be more powerful, because you create an abstraction over
> >>>>>>>>> devices, databases, servers and so on, so each source of data can
> >>>>>>>>> be injected into the pipeline.
> >>>>>>>>> 3. Good job! Looking forward to seeing the presentation.
> >>>>>>>>> Best Regards,
> >>>>>>>>> Giorgio
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 30, 2025 at 2:51 PM Pavankumar Gopidesu <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi everyone,
> >>>>>>>>>>
> >>>>>>>>>> We're exploring adding LLM-powered SQL operators to Airflow and
> >>>>>>>>>> would love community input before writing an AIP.
> >>>>>>>>>>
> >>>>>>>>>> The idea: Let users write natural language prompts like "find
> >>>>>>>>>> customers with missing emails" and have Airflow generate safe SQL
> >>>>>>>>>> queries with full context about your database schema, connections,
> >>>>>>>>>> and data sensitivity.
> >>>>>>>>>>
> >>>>>>>>>> Why this matters:
> >>>>>>>>>>
> >>>>>>>>>> Most of us spend too much time on schema drift detection and
> >>>>>>>>>> manual data quality checks. Meanwhile, AI agents are getting
> >>>>>>>>>> powerful but lack production-ready data integrations. Airflow
> >>>>>>>>>> could bridge this gap.
> >>>>>>>>>>
> >>>>>>>>>> Here's what we're dealing with at Tavant:
> >>>>>>>>>>
> >>>>>>>>>> Our team works with multiple data domain teams producing data in
> >>>>>>>>>> different formats and storage across S3, PostgreSQL, Iceberg, and
> >>>>>>>>>> Aurora. When data assets become available for consumption, we need:
> >>>>>>>>>>
> >>>>>>>>>> - Detection of breaking schema changes between systems
> >>>>>>>>>>
> >>>>>>>>>> - Data quality assessments between snapshots
> >>>>>>>>>>
> >>>>>>>>>> - Validation that assets meet mandatory metadata requirements
> >>>>>>>>>>
> >>>>>>>>>> - Lookup validation against existing data (comparing file feeds
> >>>>>>>>>> with different formats to existing data in Iceberg/Aurora)
> >>>>>>>>>>
> >>>>>>>>>> This is exactly the type of work that LLMs could automate while
> >>>>>>>>>> maintaining governance.
> >>>>>>>>>>
> >>>>>>>>>> What we're thinking:
> >>>>>>>>>>
> >>>>>>>>>> ```python
> >>>>>>>>>> # Instead of writing complex SQL by hand...
> >>>>>>>>>>
> >>>>>>>>>> quality_check = LLMSQLQueryOperator(
> >>>>>>>>>>     task_id="find_data_issues",
> >>>>>>>>>>     prompt="Find customers with invalid email formats and missing phone numbers",
> >>>>>>>>>>     data_sources=[customer_asset],  # Airflow knows the schema automatically
> >>>>>>>>>>     # Built-in safety: won't generate DROP/DELETE statements
> >>>>>>>>>> )
> >>>>>>>>>> ```
> >>>>>>>>>>
> >>>>>>>>>> The operator would:
> >>>>>>>>>>
> >>>>>>>>>> - Auto-inject database schema, sample data, and connection details
> >>>>>>>>>> - Generate safe SQL (blocks dangerous operations)
> >>>>>>>>>> - Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
> >>>>>>>>>> - Support schema drift detection between systems
> >>>>>>>>>> - Handle multi-cloud data via Apache DataFusion [1] (we did some
> >>>>>>>>>>   experiments with 50M+ records; results came back in 10-15 seconds
> >>>>>>>>>>   for common aggregations; for more info on benchmarks see [2])
> >>>>>>>>>>
> >>>>>>>>>> Key benefit: Assets become smarter with structured metadata
> >>>>>>>>>> (schema, sensitivity, format) instead of just throwing everything
> >>>>>>>>>> in `extra`.
> >>>>>>>>>>
> >>>>>>>>>> Implementation plan:
> >>>>>>>>>>
> >>>>>>>>>> Start with a separate provider (`apache-airflow-providers-sql-ai`)
> >>>>>>>>>> so we can iterate without touching the Airflow core. No breaking
> >>>>>>>>>> changes; works with existing connections and hooks.
> >>>>>>>>>>
> >>>>>>>>>> I am presenting this at Airflow Summit 2025 in Seattle with
> >>>>>>>>>> Kaxil - come see the live demo!
> >>>>>>>>>>
> >>>>>>>>>> Next steps:
> >>>>>>>>>>
> >>>>>>>>>> If this resonates after the Summit, we'll write a proper AIP with
> >>>>>>>>>> technical details and further build a working prototype.
> >>>>>>>>>>
> >>>>>>>>>> Thoughts? Concerns? Better ideas?
> >>>>>>>>>>
> >>>>>>>>>> [1]: https://datafusion.apache.org/
> >>>>>>>>>>
> >>>>>>>>>> [2]:
> >>>>>>>>>> https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Pavan
> >>>>>>>>>>
> >>>>>>>>>> P.S. - Happy to share more technical details with anyone interested.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Life is a chess game - Anonymous.
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>> For additional commands, e-mail: [email protected]
> >>>>>>
