Yes, it's a new provider, starting out as completely experimental; we don't
want to break functionality in existing providers :)

They are mostly SQL-based operators, so I named it sql-ai, but I agree we can
make the name generic without specifying SQL in it :)

Pavan

On Tue, Oct 7, 2025 at 3:48 PM Ryan Hatter via dev
<[email protected]> wrote:
>
> Would this really necessitate a new provider? Should this just be baked
> into the common SQL provider?
>
> Alternatively, instead of a narrow `sql-ai` provider, why not have a
> generic common ai provider with a SQL package, which would allow us to
> build AI-based subpackages into the provider beyond just SQL?
>
> On Mon, Oct 6, 2025 at 4:31 PM Pavankumar Gopidesu <[email protected]>
> wrote:
>
> > @Giorgio Yes indeed, that's also a good idea to integrate. I will keep it
> > in mind when I draft the AIP and write about this in a bit more detail :)
> > Yes, please join. We have great demos packed on this topic :)
> >
> > @kaxil, yes, that's a great blog post from Wren AI on leveraging Apache
> > DataFusion as a query engine to connect to different data sources.
> >
> > Pavan
> >
> > On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <[email protected]>
> > wrote:
> >
> > > Hey Pavan,
> > > Some notes:
> > > 1. An LLM can also be very useful in detecting the root cause of errors
> > > while developing and designing a pipeline. To explain better: in the past
> > > we had several Spark processes, and when everything is green all is fine,
> > > but when one fails it would be nice to have an integrated tool we can ask
> > > why.
> > > 2. Ideally such an operator could be a ModelContextProtocolOperator; you
> > > would need nothing more than to pass an LLM as a parameter to that
> > > operator and then call tools, execute queries, and so on (a rough sketch
> > > of the idea follows below). This would be more powerful, because you
> > > create an abstraction between devices, databases, servers and so on, so
> > > each source of data can be injected into the pipeline.
> > > 3. Good job! Looking forward to seeing the presentation.
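> > >
> > > Something along these lines, purely as a sketch (every class and method
> > > name here is hypothetical, and no particular MCP client library is
> > > assumed):
> > >
> > > ```python
> > > # Sketch only: an operator parameterized by an LLM that can call tools.
> > > from airflow.models import BaseOperator
> > >
> > >
> > > class ModelContextProtocolOperator(BaseOperator):
> > >     """Run a natural-language task by letting an LLM call registered tools."""
> > >
> > >     def __init__(self, *, llm, tools, prompt, **kwargs):
> > >         super().__init__(**kwargs)
> > >         self.llm = llm      # any client exposing generate(prompt, tools=...)
> > >         self.tools = tools  # tool descriptions: "execute_query", "list_tables", ...
> > >         self.prompt = prompt
> > >
> > >     def execute(self, context):
> > >         # The LLM picks which tool to invoke (query, schema lookup, ...)
> > >         # and the operator returns the final answer.
> > >         return self.llm.generate(self.prompt, tools=self.tools)
> > > ```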
> > > Best Regards,
> > > Giorgio
> > >
> > > On Tue, Sep 30, 2025 at 2:51 PM Pavankumar Gopidesu <
> > > [email protected]> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > We're exploring adding LLM-powered SQL operators to Airflow and would
> > > > love community input before writing an AIP.
> > > >
> > > > The idea: Let users write natural language prompts like "find customers
> > > > with missing emails" and have Airflow generate safe SQL queries with
> > > > full context about your database schema, connections, and data sensitivity.
> > > >
> > > > Why this matters:
> > > >
> > > >
> > > > Most of us spend too much time on schema drift detection and manual
> > > > data quality checks. Meanwhile, AI agents are getting powerful but lack
> > > > production-ready data integrations. Airflow could bridge this gap.
> > > >
> > > > Here's what we're dealing with at Tavant:
> > > >
> > > >
> > > > Our team works with multiple data domain teams producing data in
> > > > different formats and storage systems across S3, PostgreSQL, Iceberg,
> > > > and Aurora. When data assets become available for consumption, we need:
> > > >
> > > > - Detection of breaking schema changes between systems
> > > >
> > > > - Data quality assessments between snapshots
> > > >
> > > > - Validation that assets meet mandatory metadata requirements
> > > >
> > > > - Lookup validation against existing data (comparing file feeds with
> > > > different formats to existing data in Iceberg/Aurora)
> > > >
> > > > This is exactly the type of work that LLMs could automate while
> > > > maintaining governance.
> > > >
> > > > What we're thinking:
> > > >
> > > > ```python
> > > > # Instead of writing complex SQL by hand...
> > > > quality_check = LLMSQLQueryOperator(
> > > >     task_id="find_data_issues",
> > > >     prompt="Find customers with invalid email formats and missing phone numbers",
> > > >     data_sources=[customer_asset],  # Airflow knows the schema automatically
> > > >     # Built-in safety: won't generate DROP/DELETE statements
> > > > )
> > > > ```
> > > >
> > > > The operator would:
> > > >
> > > > - Auto-inject database schema, sample data, and connection details
> > > >
> > > > - Generate safe SQL (blocks dangerous operations)
> > > >
> > > > - Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
> > > >
> > > > - Support schema drift detection between systems
> > > >
> > > > - Handle multi-cloud data via Apache DataFusion [1] (we ran some
> > > >   experiments with 50M+ records; common aggregations completed in
> > > >   10-15 seconds; see [2] for more on the benchmarks)
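> > > >
> > > > As a very rough illustration of the guardrail piece (all names below are
> > > > placeholders, not a proposed API; it assumes the LLM returns a single
> > > > SELECT statement as plain text):
> > > >
> > > > ```python
> > > > # Sketch only: schema-aware prompt assembly plus a read-only safety check.
> > > > import re
> > > >
> > > > # Statements the operator would refuse to run (read-only by design).
> > > > _BLOCKED = re.compile(
> > > >     r"\b(DROP|DELETE|TRUNCATE|ALTER|INSERT|UPDATE|GRANT)\b", re.IGNORECASE
> > > > )
> > > >
> > > >
> > > > def build_prompt(user_prompt: str, schema_ddl: str, dialect: str) -> str:
> > > >     """Inject the asset's schema and target SQL dialect into the LLM prompt."""
> > > >     return (
> > > >         f"Generate one read-only {dialect} SELECT statement.\n"
> > > >         f"Schema:\n{schema_ddl}\n"
> > > >         f"Task: {user_prompt}"
> > > >     )
> > > >
> > > >
> > > > def validate_sql(sql: str) -> str:
> > > >     """Reject anything that is not a plain SELECT before execution."""
> > > >     if _BLOCKED.search(sql) or not sql.lstrip().lower().startswith("select"):
> > > >         raise ValueError(f"Generated SQL rejected by safety policy: {sql!r}")
> > > >     return sql
> > > > ```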
> > > >
> > > > Key benefit: Assets become smarter with structured metadata (schema,
> > > > sensitivity, format) instead of just throwing everything in `extra`.
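> > > >
> > > > One possible structured shape, just as an illustration (field names are
> > > > assumptions, not a settled design):
> > > >
> > > > ```python
> > > > # Sketch: typed asset metadata instead of an untyped `extra` dict.
> > > > from dataclasses import dataclass, field
> > > >
> > > >
> > > > @dataclass
> > > > class AssetSchemaInfo:
> > > >     columns: dict[str, str] = field(default_factory=dict)  # column name -> SQL type
> > > >     sensitivity: str = "internal"  # e.g. "public", "internal", "pii"
> > > >     format: str = "iceberg"        # storage/serialization format
> > > > ```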
> > > >
> > > > Implementation plan:
> > > >
> > > > Start with a separate provider (`apache-airflow-providers-sql-ai`) so
> > > > we can iterate without touching the Airflow core. No breaking changes;
> > > > it works with existing connections and hooks.
> > > >
> > > > I am presenting this at Airflow Summit 2025 in Seattle with Kaxil -
> > > > come see the live demo!
> > > >
> > > > Next steps:
> > > >
> > > > If this resonates after the Summit, we'll write a proper AIP with
> > > > technical details and then build a working prototype.
> > > >
> > > > Thoughts? Concerns? Better ideas?
> > > >
> > > >
> > > > [1]: https://datafusion.apache.org/
> > > >
> > > > [2]:
> > > > https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> > > >
> > > > Thanks,
> > > >
> > > > Pavan
> > > >
> > > > P.S. - Happy to share more technical details with anyone interested.
> > > >
> > >
> > >
> > > --
> > > Life is a chess game - Anonymous.
> > >
> >

