I like the AIP very much, and in my view it can be implemented entirely in a provider package. I have some comments (which I assume are non-blocking), and I would propose really starting in increments and then adjusting as we learn along the way.

On 12/27/25 22:00, Pavankumar Gopidesu wrote:
Thanks Giorgio Zoppi for reviewing the AIP. Yes, this is already planned
as part of this AIP; see the example [1], where you can enable or
disable the HITL step. It is an integrated part of the operator, with
the help of the HITL operator.

```python
LLMDataQualityOperator(
    task_id="customer_quality_analysis",
    data_sources=[customer_s3],
    prompt="Generate data quality validation queries",
    require_approval=True,  # Built-in HITL
    approval_timeout=timedelta(hours=2),
)
```

[1]: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285

Regards,
Pavan

On Sat, Dec 27, 2025 at 9:16 AM Giorgio Zoppi <[email protected]> wrote:
Hello,
Just my 1c after skimming the AIP:
You might want to explore how to avoid human approval for generated
queries by using an LLM as judge to evaluate their quality. The nice
thing about data pipelines is automation.
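The LLM-as-judge gate suggested here could look roughly like the sketch below. Everything in it is hypothetical: the `judge_llm` callable, the prompt, and the scoring scheme stand in for a real model client and are not part of the AIP.

```python
# Sketch: auto-approve a generated query only if an LLM judge scores it
# highly enough; otherwise fall back to human-in-the-loop approval.
# `judge_llm` is a hypothetical callable returning a 0.0-1.0 quality score;
# any real client (OpenAI, Bedrock, a local model) could be plugged in.

def should_auto_approve(generated_sql: str, judge_llm, threshold: float = 0.9) -> bool:
    """True if the judge scores the query at or above the threshold."""
    score = judge_llm(
        f"Rate this SQL for correctness and safety (0 to 1): {generated_sql}"
    )
    return score >= threshold


# Example with a stubbed judge that always returns 0.95:
approved = should_auto_approve(
    "SELECT id FROM customers WHERE email IS NULL",
    judge_llm=lambda prompt: 0.95,
)
```

With a confident judge the HITL step is skipped; a low score would route the query back to the approval flow described in the AIP.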




On Wed, Dec 24, 2025, 10:23 Pavankumar Gopidesu <[email protected]>
wrote:

Hello everyone,

The thread has been quiet for some time, and I would like to restart
the discussion with the AIP.

First, a sincere thank you to Kaxil for presenting the idea at Airflow
Summit 2025. The session was very well received, and many attendees
expressed strong interest in the proposal. Unfortunately, I was unable
to attend the summit due to visa issues, but I am hopeful I will be
able to join next year.

The demo included well-structured prototypes. For those who were
unable to attend the session, please refer to the recorded talk here
[1].

I have also drafted the complete AIP proposal, which is available here
[2]. I would greatly appreciate your reviews and look forward to
feedback and further discussion.

Finally, to those celebrating Christmas, I wish you a very happy
Christmas and a wonderful holiday season.

Regards
Pavan

[1] https://www.youtube.com/watch?v=XSAzSDVUi2o
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285

On Wed, Oct 15, 2025 at 6:13 AM Amogh Desai <[email protected]> wrote:
Thanks Pavan and Kaxil, seems like an interesting idea and a pretty
reasonable problem to solve.

I also like the idea of starting with
`apache-airflow-providers-common-ai`
and expanding as / when needed.

Looking forward to when the recording will be out, missed attending this
session at the Airflow Summit.

Thanks & Regards,
Amogh Desai


On Thu, Oct 9, 2025 at 10:49 AM Kaxil Naik <[email protected]> wrote:

Yea I think it should be apache-airflow-providers-common-ai

On Wed, 8 Oct 2025 at 02:04, Pavankumar Gopidesu <
[email protected]>
wrote:

Yes, it's a new provider starting out as completely experimental; we
don't want to break functionality in existing providers :)

Mostly these are SQL-based operators, so I named it sql-ai, but I agree
we can make it generic without specifying sql in it :)

Pavan

On Tue, Oct 7, 2025 at 3:48 PM Ryan Hatter via dev
<[email protected]> wrote:
Would this really necessitate a new provider? Should this just be baked
into the common SQL provider?

Alternatively, instead of a narrow `sql-ai` provider, why not have a
generic common ai provider with a SQL package, which would allow for us
to build AI-based subpackages into the provider other than just SQL?

On Mon, Oct 6, 2025 at 4:31 PM Pavankumar Gopidesu <
[email protected]>
wrote:

@Giorgio Yes indeed, that's also a good idea to integrate. I will keep
it in mind when I draft the AIP and will write about this a bit more :)
Yes, please join. We have great demos packed on this topic :)

@kaxil Yes, that's a great blog post from Wren AI on leveraging Apache
DataFusion as a query engine to connect to different data sources.
Pavan

On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <[email protected]> wrote:

Hey Pavan,
Some notes:
1. An LLM can also be very useful in detecting the root causes of errors
while developing and designing a pipeline. Let me explain better: in the
past we had several Spark processes; when everything is green it's fine,
but when one fails, it would be nice to have an integrated tool to ask
why.
2. Ideally such an operator could be a ModelContextProtocolOperator, and
you would need nothing more than to pass an LLM as a parameter to that
operator and just call tools, execute queries, and so on. This would be
more powerful, because you create an abstraction between devices,
databases, servers, and so on, so each source of data can be injected
into the pipeline.
3. Good job! Looking forward to seeing the presentation.
Best Regards,
Giorgio
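Point 2 above could be sketched roughly as follows. Everything here is hypothetical (the class, its parameters, and the `tools` mapping); no such operator exists in the AIP yet, and this only illustrates the "LLM plus injected tools" shape being described, with stubs in place of a real model and real data sources.

```python
# Hypothetical sketch of an MCP-style operator: the LLM is just a
# parameter, and each data source is exposed as a named tool the model
# can ask to call.

class ModelContextProtocolOperator:  # hypothetical, not a real Airflow operator
    def __init__(self, task_id: str, llm, tools: dict):
        self.task_id = task_id
        self.llm = llm      # any callable LLM client
        self.tools = tools  # tool name -> callable, e.g. "execute_query"

    def execute(self, prompt: str):
        # A real implementation would loop: the model proposes a tool
        # call, the operator runs it, and the result is fed back. This
        # sketch does a single round trip.
        tool_name, tool_arg = self.llm(prompt)
        return self.tools[tool_name](tool_arg)


# Stubbed usage: the "LLM" always asks to run one query.
op = ModelContextProtocolOperator(
    task_id="ask_why_it_failed",
    llm=lambda prompt: ("execute_query", "SELECT count(*) FROM errors"),
    tools={"execute_query": lambda sql: f"ran: {sql}"},
)
result = op.execute("Why did the Spark job fail?")
```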

On Tue, Sep 30, 2025 at 2:51 PM Pavankumar Gopidesu <
[email protected]> wrote:

Hi everyone,

We're exploring adding LLM-powered SQL operators to Airflow and would
love community input before writing an AIP.

The idea: let users write natural language prompts like "find customers
with missing emails" and have Airflow generate safe SQL queries with
full context about your database schema, connections, and data
sensitivity.

Why this matters:

Most of us spend too much time on schema drift detection and manual
data quality checks. Meanwhile, AI agents are getting powerful but lack
production-ready data integrations. Airflow could bridge this gap.

Here's what we're dealing with at Tavant:


Our team works with multiple data domain teams producing data in
different formats and storage across S3, PostgreSQL, Iceberg, and
Aurora. When data assets become available for consumption, we need:

- Detection of breaking schema changes between systems

- Data quality assessments between snapshots

- Validation that assets meet mandatory metadata requirements

- Lookup validation against existing data (comparing file feeds with
different formats to existing data in Iceberg/Aurora)

This is exactly the type of work that LLMs could automate while
maintaining governance.
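As one concrete shape for the "breaking schema changes" check above, here is a minimal sketch. The column-dict representation and the `find_breaking_changes` helper are assumptions for illustration; a real check would read schemas from the Airflow connection or asset metadata.

```python
# Sketch: detect breaking schema changes between two systems by
# comparing simple column-name -> type mappings. Removed columns and
# changed types are treated as breaking; added columns are not.

def find_breaking_changes(old: dict, new: dict) -> list:
    changes = []
    for col, col_type in old.items():
        if col not in new:
            changes.append(f"removed column: {col}")
        elif new[col] != col_type:
            changes.append(f"type change on {col}: {col_type} -> {new[col]}")
    return changes


# Example: a file feed's schema vs. the target Aurora table.
s3_schema = {"id": "bigint", "email": "varchar", "phone": "varchar"}
aurora_schema = {"id": "bigint", "email": "text"}  # phone dropped, email retyped

breaking = find_breaking_changes(s3_schema, aurora_schema)
```

An operator could fail the task (or route to HITL approval) whenever this list is non-empty.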

What we're thinking:

```python
# Instead of writing complex SQL by hand...
quality_check = LLMSQLQueryOperator(
    task_id="find_data_issues",
    prompt="Find customers with invalid email formats and missing phone numbers",
    data_sources=[customer_asset],  # Airflow knows the schema automatically
    # Built-in safety: won't generate DROP/DELETE statements
)
```

The operator would:

- Auto-inject database schema, sample data, and connection details
- Generate safe SQL (blocks dangerous operations)
- Work across PostgreSQL, Snowflake, and BigQuery with dialect awareness
- Support schema drift detection between systems
- Handle multi-cloud data via Apache DataFusion [1] (we did some
experiments with 50M+ records, and results come back in 10-15 seconds
for common aggregations)

For more info on benchmarks, see [2].
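The "blocks dangerous operations" point could be enforced with something as simple as a statement allow-list. This is a minimal sketch; the keyword set and the `is_safe_sql` name are assumptions, and a real implementation would run the query through a proper SQL parser rather than inspecting the leading token.

```python
import re

# Sketch: reject generated SQL whose first keyword is not in an
# allow-list of read-only statements. A production check would parse
# the statement (e.g. with sqlglot or sqlparse) rather than rely on
# the leading token alone.

ALLOWED_STATEMENTS = {"SELECT", "WITH", "EXPLAIN"}

def is_safe_sql(sql: str) -> bool:
    match = re.match(r"\s*([A-Za-z]+)", sql)
    return bool(match) and match.group(1).upper() in ALLOWED_STATEMENTS


# The operator would run this gate before executing generated SQL:
assert is_safe_sql("SELECT * FROM customers WHERE email IS NULL")
assert not is_safe_sql("DROP TABLE customers")
```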

Key benefit: assets become smarter with structured metadata (schema,
sensitivity, format) instead of just throwing everything in `extra`.

Implementation plan:

Start with a separate provider (`apache-airflow-providers-sql-ai`) so
we can iterate without touching the Airflow core. No breaking changes;
it works with existing connections and hooks.

I am presenting this at Airflow Summit 2025 in Seattle with Kaxil -
come see the live demo!

Next steps:

If this resonates after the Summit, we'll write a proper AIP with
technical details and further build a working prototype.

Thoughts? Concerns? Better ideas?


[1]: https://datafusion.apache.org/

[2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
Thanks,

Pavan

P.S. - Happy to share more technical details with anyone interested.

--
Life is a chess game - Anonymous.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

