Hi everyone,
We're exploring adding LLM-powered SQL operators to Airflow and would love
community input before writing an AIP.
The idea: Let users write natural language prompts like "find customers
with missing emails" and have Airflow generate safe SQL queries with full
context about your database schema, connections, and data sensitivity.
Why this matters:
Most of us spend too much time on schema drift detection and manual data
quality checks. Meanwhile, AI agents are getting powerful but lack
production-ready data integrations. Airflow could bridge this gap.
Here's what we're dealing with at Tavant:
Our team works with multiple data domain teams that produce data in different
formats across S3, PostgreSQL, Iceberg, and Aurora. When data assets become
available for consumption, we need:
- Detection of breaking schema changes between systems
- Data quality assessments between snapshots
- Validation that assets meet mandatory metadata requirements
- Lookup validation against existing data (comparing incoming file feeds in
different formats with what is already in Iceberg/Aurora)
This is exactly the type of work that LLMs could automate while
maintaining governance.
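For concreteness, the first check is what we currently hand-roll. A minimal
sketch (the function, schemas, and names here are hypothetical, just to show
the shape of the work):

```python
# Hand-rolled drift check: flag columns that disappeared or changed type
# between two systems. Schemas are illustrative {column: type} dicts.
def breaking_changes(source: dict[str, str], target: dict[str, str]) -> list[str]:
    issues = []
    for col, dtype in source.items():
        if col not in target:
            issues.append(f"column missing downstream: {col}")
        elif target[col] != dtype:
            issues.append(f"type changed for {col}: {dtype} -> {target[col]}")
    return issues

# breaking_changes({"email": "text"}, {"email": "varchar(255)"})
# -> ["type changed for email: text -> varchar(255)"]
```

Multiply that by every pair of systems and every asset, and it's a lot of
boilerplate that an LLM with schema context could generate for us.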
What we're thinking:
```python
# Instead of writing complex SQL by hand...
quality_check = LLMSQLQueryOperator(
    task_id="find_data_issues",
    prompt="Find customers with invalid email formats and missing phone numbers",
    data_sources=[customer_asset],  # Airflow knows the schema automatically
    # Built-in safety: won't generate DROP/DELETE statements
)
```
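On the "built-in safety" comment: one plausible guardrail is to parse the
generated SQL and allow only read-only statements. A minimal sketch using
sqlglot (an assumption on our side, not a committed dependency):

```python
import sqlglot
from sqlglot import exp

def validate_generated_sql(sql: str, dialect: str = "postgres") -> str:
    """Reject anything the LLM produced that isn't a plain SELECT."""
    for statement in sqlglot.parse(sql, read=dialect):
        # Conservative allow-list: a real version would also permit
        # unions/CTE wrappers, but never DROP/DELETE/INSERT/UPDATE/DDL.
        if not isinstance(statement, exp.Select):
            raise ValueError(f"Blocked non-SELECT statement: {statement.key}")
    return sql
```

A parser-based check also gives us dialect-aware handling for free, which
lines up with the cross-database point below.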
The operator would:
- Auto-inject database schema, sample data, and connection details
- Generate safe SQL (blocks dangerous operations)
- Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
- Support schema drift detection between systems
- Handle multi-cloud data via Apache DataFusion [1]; in our experiments with
50M+ records, common aggregations returned in 10-15 seconds (see [2] for
benchmarks)
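The DataFusion experiment was roughly this shape (paths and table names are
hypothetical): register files of different formats in one session and run
plain SQL across them, with no warehouse load step.

```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("orders", "/data/orders.parquet")  # e.g. a staged S3 feed
ctx.register_csv("customers", "/data/customers.csv")

# Cross-format lookup validation expressed as one SQL query.
batches = ctx.sql("""
    SELECT c.region, COUNT(*) AS missing_emails
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE c.email IS NULL OR c.email = ''
    GROUP BY c.region
""").collect()
```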
Key benefit: Assets become smarter with structured metadata (schema,
sensitivity, format) instead of just throwing everything in `extra`.
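Today the closest we can get is cramming this into `extra` with no contract;
the proposal would make fields like these first-class. A sketch of the intent
(field names are illustrative, not a settled API):

```python
from airflow.sdk import Asset

# Current state: all metadata lives in the untyped `extra` dict.
customer_asset = Asset(
    name="customer_profiles",
    uri="postgresql://analytics/public/customers",
    extra={
        "schema": {"customer_id": "bigint", "email": "text", "phone": "text"},
        "sensitivity": "pii",
        "format": "table",
    },
)
```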
Implementation plan:
Start with a separate provider (`apache-airflow-providers-sql-ai`) so we
can iterate without touching the Airflow core. No breaking changes, works
with existing connections and hooks.
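The user-facing surface would then be just a pip install and an import,
something like this (package and module paths are hypothetical until an AIP
settles them):

```python
# pip install apache-airflow-providers-sql-ai
from airflow.providers.sql_ai.operators.llm_sql import LLMSQLQueryOperator
```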
I am presenting this at Airflow Summit 2025 in Seattle with Kaxil - come
see the live demo!
Next steps:
If this resonates after the Summit, we'll write a proper AIP with the
technical details and build a working prototype.
Thoughts? Concerns? Better ideas?
[1]: https://datafusion.apache.org/
[2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
Thanks,
Pavan
P.S. - Happy to share more technical details with anyone interested.