Love the idea, Pavan! Looking forward to hearing more from you at the Summit
and speaking "a little" while at it.

On Tue, 30 Sept 2025 at 14:51, Pavankumar Gopidesu <[email protected]>
wrote:

> Hi everyone,
>
> We're exploring adding LLM-powered SQL operators to Airflow and would love
> community input before writing an AIP.
>
> The idea: Let users write natural language prompts like "find customers
> with missing emails" and have Airflow generate safe SQL queries with full
> context about your database schema, connections, and data sensitivity.
>
> Why this matters:
>
>
> Most of us spend too much time on schema drift detection and manual data
> quality checks. Meanwhile, AI agents are getting powerful but lack
> production-ready data integrations. Airflow could bridge this gap.
>
> Here's what we're dealing with at Tavant:
>
>
> Our team works with multiple data domain teams producing data in different
> formats and storage across S3, PostgreSQL, Iceberg, and Aurora. When data
> assets become available for consumption, we need:
>
> - Detection of breaking schema changes between systems
>
> - Data quality assessments between snapshots
>
> - Validation that assets meet mandatory metadata requirements
>
> - Lookup validation against existing data (comparing file feeds with
> different formats to existing data in Iceberg/Aurora)
>
> This is exactly the type of work that LLMs could automate while
> maintaining governance.
>
> What we're thinking:
>
> ```python
> # Instead of writing complex SQL by hand...
> quality_check = LLMSQLQueryOperator(
>     task_id="find_data_issues",
>     prompt="Find customers with invalid email formats and missing phone numbers",
>     data_sources=[customer_asset],  # Airflow knows the schema automatically
>     # Built-in safety: won't generate DROP/DELETE statements
> )
> ```
>
> The operator would:
>
> - Auto-inject database schema, sample data, and connection details
>
> - Generate safe SQL (blocks dangerous operations)
>
> - Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
>
> - Support schema drift detection between systems
>
> - Handle multi-cloud data via Apache DataFusion [1] (in my experiments with
> 50M+ records, common aggregations returned results in 10-15 seconds)
>
> For more info on benchmarks, see [2].
>
> Key benefit: Assets become smarter with structured metadata (schema,
> sensitivity, format) instead of just throwing everything in `extra`.
>
> Implementation plan:
>
> Start with a separate provider (`apache-airflow-providers-sql-ai`) so we
> can iterate without touching the Airflow core. No breaking changes, works
> with existing connections and hooks.
>
> I am presenting this at Airflow Summit 2025 in Seattle with Kaxil - come
> see the live demo!
>
> Next steps:
>
> If this resonates after the Summit, we'll write a proper AIP with technical
> details and further build a working prototype.
>
> Thoughts? Concerns? Better ideas?
>
>
> [1]: https://datafusion.apache.org/
>
> [2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
>
> Thanks,
>
> Pavan
>
> P.S. - Happy to share more technical details with anyone interested.
>
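On the "Generate safe SQL (blocks dangerous operations)" bullet: one way to picture the guard is a read-only check on the generated statement before it reaches the database. The sketch below is only an illustration under stated assumptions. `is_read_only` and the `FORBIDDEN` keyword list are hypothetical names, not part of Airflow or the proposed provider, and a real implementation would more likely rely on a proper SQL parser than on regular expressions.

```python
import re

# Hypothetical denylist (an assumption for illustration, not the proposal's list).
FORBIDDEN = {"DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT", "GRANT", "CREATE"}

# Matches single-quoted string literals, "--" line comments, and /* */ block comments.
_STRING_OR_COMMENT = re.compile(r"'(?:[^']|'')*'|--[^\n]*|/\*.*?\*/", re.DOTALL)


def is_read_only(sql: str) -> bool:
    """Return True only for a single SELECT/WITH statement with no denylisted keyword."""
    # Blank out literals and comments so their contents are not scanned as keywords.
    stripped = _STRING_OR_COMMENT.sub(" ", sql).strip().rstrip(";").strip()
    if not stripped or ";" in stripped:
        return False  # empty input, or more than one statement
    tokens = re.findall(r"[A-Za-z_]+", stripped.upper())
    if not tokens or tokens[0] not in {"SELECT", "WITH"}:
        return False
    return FORBIDDEN.isdisjoint(tokens)
```

A check like this would sit between the LLM output and the hook that executes the query, so a rejected statement never runs. Running with a read-only database role would still be the stronger safeguard.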

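For the schema drift piece ("Detection of breaking schema changes between systems"), a minimal sketch of comparing two column-to-type snapshots might look like this. `SchemaDrift` and `diff_schemas` are hypothetical helpers, and treating removed or retyped columns as breaking (and added columns as non-breaking) is an assumption, not something the proposal specifies.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    """Differences between two schema snapshots (column name -> type name)."""

    added: dict = field(default_factory=dict)
    removed: dict = field(default_factory=dict)
    retyped: dict = field(default_factory=dict)

    @property
    def breaking(self) -> bool:
        # Assumption: only removed or retyped columns break downstream consumers.
        return bool(self.removed or self.retyped)


def diff_schemas(old: dict[str, str], new: dict[str, str]) -> SchemaDrift:
    """Compare two snapshots and collect added, removed, and retyped columns."""
    drift = SchemaDrift()
    for col, typ in new.items():
        if col not in old:
            drift.added[col] = typ
        elif old[col] != typ:
            drift.retyped[col] = (old[col], typ)
    for col, typ in old.items():
        if col not in new:
            drift.removed[col] = typ
    return drift
```

The snapshots here are plain dictionaries. In the proposal they would presumably come from the structured asset metadata rather than being passed in by hand.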