@Giorgio Yes indeed that's also a good thought to integrate. I will keep in
mind to think about when I draft AIP and message about this a bit more :)
Yes please join. We have great demos packed on this topic :)

@kaxil , Yes that's a great blog post from the wren AI and leveraging the
Apache DataFusion as a query engine to connect to different data sources.

Pavan

On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <[email protected]>
wrote:

> Hey Pavan,
> Some notes:
> 1. LLM can be also very useful in detecting root causes of your error while
> developing and design a pipeline. I explain me better, we'd in the past
> several
> Spark processes, when it is all green is ok, but when on fails, it will be
> nice to have a tool integrated to ask why.
> 2. Ideally such operator could be a ModelContextProtocolOperator and you
> would not need nothing else that put an LLM as parameter with that
> operator,
> and just call for tools, execute query, and so on. This would be more
> powerful, because you create an abstraction between devices, databases,
> server and so on, so each source of data can be injected on the pipeline.
> 3.  Good job! Looking forward to see the presentation.
> Best Regards,
> Giorgio
>
> Il giorno mar 30 set 2025 alle ore 14:51 Pavankumar Gopidesu <
> [email protected]> ha scritto:
>
> > Hi everyone,
> >
> > We're exploring adding LLM-powered SQL operators to Airflow and would
> love
> > community input before writing an AIP.
> >
> > The idea: Let users write natural language prompts like "find customers
> > with missing emails" and have Airflow generate safe SQL queries with full
> > context about your database schema, connections, and data sensitivity.
> >
> > Why this matters:
> >
> >
> > Most of us spend too much time on schema drift detection and manual data
> > quality checks. Meanwhile, AI agents are getting powerful but lack
> > production-ready data integrations. Airflow could bridge this gap.
> >
> > Here's what we're dealing with at Tavant:
> >
> >
> > Our team works with multiple data domain teams producing data in
> different
> > formats and storage across S3, PostgreSQL, Iceberg, and Aurora. When data
> > assets become available for consumption, we need:
> >
> > - Detection of breaking schema changes between systems
> >
> > - Data quality assessments between snapshots
> >
> > - Validation that assets meet mandatory metadata requirements
> >
> > - Lookup validation against existing data (comparing file feeds with
> > different formats to existing data in Iceberg/Aurora)
> >
> > This is exactly the type of work that LLMs  could automate while
> > maintaining governance.
> >
> > What we're thinking:
> >
> > ```python
> >
> > # Instead of writing complex SQL by hand...
> >
> > quality_check = LLMSQLQueryOperator(
> >
> >     task_id="find_data_issues",
> >
> >     prompt="Find customers with invalid email formats and missing phone
> > numbers",
> >
> >     data_sources=[customer_asset],  # Airflow knows the schema
> > automatically
> >
> >     # Built-in safety: won't generate DROP/DELETE statements
> >
> > )
> >
> > ```
> >
> > The operator would:
> >
> > - Auto-inject database schema, sample data, and connection details
> >
> > - Generate safe SQL (blocks dangerous operations)
> >
> > - Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
> >
> > - Support schema drift detection between systems
> >
> > - Handle multi-cloud data via Apache DataFusion[1] (Did some experiments
> > with 50M+          records and results are in 10-15 seconds for common
> > aggregations)
> >
> > for more info on benchmarks [2]
> >
> > Key benefit: Assets become smarter with structured metadata (schema,
> > sensitivity, format) instead of just throwing everything in `extra`.
> >
> > Implementation plan:
> >
> > Start with a separate provider (`apache-airflow-providers-sql-ai`) so we
> > can iterate without touching the Airflow core. No breaking changes, works
> > with existing connections and hooks.
> >
> > I am presenting this at Airflow Summit 2025 in Seattle with Kaxil - come
> > see the live demo!
> >
> > Next steps:
> >
> > If this resonates after the Summit, we'll write a proper AIP with
> technical
> > details and further build a working prototype.
> >
> > Thoughts? Concerns? Better ideas?
> >
> >
> > [1]: https://datafusion.apache.org/
> >
> > [2]:
> >
> >
> https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> >
> > Thanks,
> >
> > Pavan
> >
> > P.S. - Happy to share more technical details with anyone interested.
> >
>
>
> --
> Life is a chess game - Anonymous.
>

Reply via email to