Looks like even Wren did something similar with DataFusion:
https://getwren.ai/post/powering-semantic-sql-for-ai-agents-with-apache-datafusion

so there is definitely interest.

On Tue, 30 Sept 2025 at 16:19, Kaxil Naik <[email protected]> wrote:

> Love the idea, Pavan! Looking forward to hearing more from you at the
> Summit and speaking "a little" while at it.
>
> On Tue, 30 Sept 2025 at 14:51, Pavankumar Gopidesu <
> [email protected]> wrote:
>
>> Hi everyone,
>>
>> We're exploring adding LLM-powered SQL operators to Airflow and would love
>> community input before writing an AIP.
>>
>> The idea: Let users write natural language prompts like "find customers
>> with missing emails" and have Airflow generate safe SQL queries with full
>> context about your database schema, connections, and data sensitivity.
>>
>> Why this matters:
>>
>>
>> Most of us spend too much time on schema drift detection and manual data
>> quality checks. Meanwhile, AI agents are getting powerful but lack
>> production-ready data integrations. Airflow could bridge this gap.
>>
>> Here's what we're dealing with at Tavant:
>>
>>
>> Our team works with multiple data domain teams producing data in different
>> formats across storage systems such as S3, PostgreSQL, Iceberg, and Aurora.
>> When data assets become available for consumption, we need:
>>
>> - Detection of breaking schema changes between systems (see the sketch
>> after this list)
>>
>> - Data quality assessments between snapshots
>>
>> - Validation that assets meet mandatory metadata requirements
>>
>> - Lookup validation against existing data (comparing file feeds in
>> different formats against what already lives in Iceberg/Aurora)
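>>
>> As a rough sketch of the schema-drift item above (a hypothetical helper,
>> not an existing Airflow API), a breaking-change check could boil down to
>> comparing column maps pulled from each system:
>>
>> ```python
>> # Hypothetical helper: compare two {column: type} maps and flag
>> # breaking changes (dropped columns or changed types).
>> def find_breaking_changes(source: dict[str, str],
>>                           target: dict[str, str]) -> list[str]:
>>     issues = []
>>     for col, dtype in source.items():
>>         if col not in target:
>>             issues.append(f"column dropped: {col}")
>>         elif target[col] != dtype:
>>             issues.append(f"type changed: {col} {dtype} -> {target[col]}")
>>     return issues
>>
>> find_breaking_changes({"id": "bigint", "email": "text"},
>>                       {"id": "bigint"})  # ["column dropped: email"]
>> ```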
>>
>> This is exactly the type of work that LLMs could automate while
>> maintaining governance.
>>
>> What we're thinking:
>>
>> ```python
>> # Instead of writing complex SQL by hand...
>> quality_check = LLMSQLQueryOperator(
>>     task_id="find_data_issues",
>>     prompt="Find customers with invalid email formats and missing phone numbers",
>>     data_sources=[customer_asset],  # Airflow knows the schema automatically
>>     # Built-in safety: won't generate DROP/DELETE statements
>> )
>> ```
>>
>> The operator would:
>>
>> - Auto-inject database schema, sample data, and connection details
>>
>> - Generate safe SQL (blocks dangerous operations; see the sketch after
>> this list)
>>
>> - Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
>>
>> - Support schema drift detection between systems
>>
>> - Handle multi-cloud data via Apache DataFusion [1] (in some experiments
>> with 50M+ records, common aggregations completed in 10-15 seconds; see [2]
>> for DataFusion benchmarks)
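>>
>> A minimal sketch of the "blocks dangerous operations" guardrail, using
>> the sqlparse library (an assumption on our side: the real operator would
>> likely need a more robust, dialect-aware parser):
>>
>> ```python
>> import sqlparse
>>
>> # Statement types we refuse to run, regardless of what the LLM generates.
>> BLOCKED_TYPES = {"DROP", "DELETE", "UPDATE", "INSERT", "ALTER", "CREATE"}
>>
>> def is_safe_sql(sql: str) -> bool:
>>     """Return True only if every parsed statement is read-only."""
>>     return all(stmt.get_type() not in BLOCKED_TYPES
>>                for stmt in sqlparse.parse(sql))
>>
>> assert is_safe_sql("SELECT * FROM customers WHERE email IS NULL")
>> assert not is_safe_sql("DROP TABLE customers")
>> ```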
>>
>> Key benefit: Assets become smarter with structured metadata (schema,
>> sensitivity, format) instead of just throwing everything in `extra`.
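>>
>> To illustrate (the field names here are hypothetical, not a settled API),
>> the structured metadata might look something like:
>>
>> ```python
>> from dataclasses import dataclass, field
>>
>> # Hypothetical structured metadata an asset could carry, instead of an
>> # untyped `extra` dict. Field names are illustrative only.
>> @dataclass
>> class AssetMetadata:
>>     schema: dict[str, str]           # column name -> SQL type
>>     sensitivity: str = "internal"    # e.g. "public", "internal", "pii"
>>     format: str = "table"           # e.g. "table", "parquet", "csv"
>>     extra: dict = field(default_factory=dict)  # everything else
>>
>> customer_meta = AssetMetadata(
>>     schema={"id": "bigint", "email": "text", "phone": "text"},
>>     sensitivity="pii",
>> )
>> ```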
>>
>> Implementation plan:
>>
>> Start with a separate provider (`apache-airflow-providers-sql-ai`) so we
>> can iterate without touching the Airflow core. No breaking changes; it
>> works with existing connections and hooks.
>>
>> I am presenting this at Airflow Summit 2025 in Seattle with Kaxil - come
>> see the live demo!
>>
>> Next steps:
>>
>> If this resonates after the Summit, we'll write a proper AIP with
>> technical details and build a working prototype.
>>
>> Thoughts? Concerns? Better ideas?
>>
>>
>> [1]: https://datafusion.apache.org/
>>
>> [2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
>>
>> Thanks,
>>
>> Pavan
>>
>> P.S. - Happy to share more technical details with anyone interested.
>>
>
