Great to hear, Jarek and Niko!
If so, then let's roll. Maybe it's a good time for me to re-instantiate the
MCP AIP as well :)


Shahar

On Tue, Jan 13, 2026 at 9:00 PM Jarek Potiuk <[email protected]> wrote:

> FYI: Legally it is fine. Financially - I think we could get a donation for
> that if we asked :)
>
> On Tue, Jan 13, 2026 at 6:36 PM Shahar Epstein <[email protected]> wrote:
>
> > Great idea Pavan, and I would really love to see it happening!
> >
> > One thing I'm quite concerned about: how are we going to test,
> > evaluate, and assure the quality of the system prompts of these
> > operators? Given that we currently cannot officially use AI in our CI
> > to do all of that (legally & financially, AFAIK), I do not feel
> > comfortable delivering the system prompts out of the box; I would
> > rather let the user define them explicitly instead. We could recommend
> > prompts in the docs based on the community's experience - but in any
> > case I think it should be a required field with a clear disclaimer
> > that the user is fully responsible for the system prompt.
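A required, user-owned system prompt could be enforced along these lines (operator name and signature are hypothetical, purely to illustrate the suggestion, not an actual Airflow API):

```python
# Hypothetical sketch of a user-owned system prompt: the operator ships no
# default and refuses to instantiate without one. Names are illustrative only.
class LLMSQLQueryOperator:
    def __init__(self, *, task_id: str, prompt: str, system_prompt: str):
        if not system_prompt.strip():
            # No default shipped: the user is fully responsible for this.
            raise ValueError("system_prompt is required and must be user-defined")
        self.task_id = task_id
        self.prompt = prompt
        self.system_prompt = system_prompt
```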
> >
> >
> > Shahar
> >
> >
> > On Tue, Sep 30, 2025, 16:51 Pavankumar Gopidesu <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > We're exploring adding LLM-powered SQL operators to Airflow and would
> > > love community input before writing an AIP.
> > >
> > > The idea: let users write natural-language prompts like "find customers
> > > with missing emails" and have Airflow generate safe SQL queries with
> > > full context about your database schema, connections, and data
> > > sensitivity.
> > >
> > > Why this matters:
> > >
> > > Most of us spend too much time on schema-drift detection and manual
> > > data quality checks. Meanwhile, AI agents are getting powerful but lack
> > > production-ready data integrations. Airflow could bridge this gap.
> > >
> > > Here's what we're dealing with at Tavant:
> > >
> > > Our team works with multiple data domain teams producing data in
> > > different formats and storage across S3, PostgreSQL, Iceberg, and
> > > Aurora. When data assets become available for consumption, we need:
> > >
> > > - Detection of breaking schema changes between systems
> > >
> > > - Data quality assessments between snapshots
> > >
> > > - Validation that assets meet mandatory metadata requirements
> > >
> > > - Lookup validation against existing data (comparing file feeds with
> > > different formats to existing data in Iceberg/Aurora)
> > >
> > > This is exactly the type of work that LLMs could automate while
> > > maintaining governance.
> > >
> > > What we're thinking:
> > >
> > > ```python
> > > # Instead of writing complex SQL by hand...
> > > quality_check = LLMSQLQueryOperator(
> > >     task_id="find_data_issues",
> > >     prompt="Find customers with invalid email formats and missing phone numbers",
> > >     data_sources=[customer_asset],  # Airflow knows the schema automatically
> > >     # Built-in safety: won't generate DROP/DELETE statements
> > > )
> > > ```
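The "built-in safety" comment above could be realized with a read-only gate along these lines (a minimal sketch; a real implementation would parse the SQL with dialect awareness rather than pattern-match):

```python
import re

# Minimal sketch of a read-only SQL gate: allow only SELECT/WITH statements
# and reject anything containing destructive keywords. Illustrative only.
BLOCKED = re.compile(
    r"\b(DROP|DELETE|TRUNCATE|ALTER|UPDATE|INSERT|GRANT|REVOKE)\b",
    re.IGNORECASE,
)

def is_safe_sql(sql: str) -> bool:
    """Return True only for statements that look read-only."""
    stripped = sql.strip().rstrip(";")
    return stripped.upper().startswith(("SELECT", "WITH")) and not BLOCKED.search(stripped)
```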
> > >
> > > The operator would:
> > >
> > > - Auto-inject database schema, sample data, and connection details
> > >
> > > - Generate safe SQL (blocks dangerous operations)
> > >
> > > - Work across PostgreSQL, Snowflake, and BigQuery with dialect awareness
> > >
> > > - Support schema drift detection between systems
> > >
> > > - Handle multi-cloud data via Apache DataFusion [1] (in some
> > > experiments with 50M+ records, common aggregations returned results in
> > > 10-15 seconds; see [2] for more on benchmarks)
> > >
> > > Key benefit: Assets become smarter with structured metadata (schema,
> > > sensitivity, format) instead of just throwing everything in `extra`.
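One way to picture "structured metadata instead of `extra`" (the fields below are hypothetical, sketching what the proposal suggests rather than any current Airflow Asset API):

```python
from dataclasses import dataclass, field

# Hypothetical structured asset: explicit schema/sensitivity/format fields
# instead of an untyped `extra` dict. Not a real Airflow class.
@dataclass
class StructuredAsset:
    uri: str
    schema: dict = field(default_factory=dict)  # column name -> SQL type
    sensitivity: str = "internal"               # e.g. "public", "pii"
    format: str = "table"                       # e.g. "parquet", "iceberg"

customer_asset = StructuredAsset(
    uri="postgresql://prod/customers",
    schema={"customer_id": "BIGINT", "email": "TEXT", "phone": "TEXT"},
    sensitivity="pii",
)
```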
> > >
> > > Implementation plan:
> > >
> > > Start with a separate provider (`apache-airflow-providers-sql-ai`) so
> > > we can iterate without touching the Airflow core. No breaking changes;
> > > it works with existing connections and hooks.
> > >
> > > I am presenting this at Airflow Summit 2025 in Seattle with Kaxil -
> > > come see the live demo!
> > >
> > > Next steps:
> > >
> > > If this resonates after the Summit, we'll write a proper AIP with
> > > technical details and then build a working prototype.
> > >
> > > Thoughts? Concerns? Better ideas?
> > >
> > >
> > > [1]: https://datafusion.apache.org/
> > >
> > > [2]:
> > > https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> > >
> > > Thanks,
> > >
> > > Pavan
> > >
> > > P.S. - Happy to share more technical details with anyone interested.
> > >
> >
>