Yeah, I think it should be apache-airflow-providers-common-ai.

On Wed, 8 Oct 2025 at 02:04, Pavankumar Gopidesu <[email protected]> wrote:
> Yes, it's a new provider, starting out as completely experimental; we don't
> want to break functionality in existing providers :)
>
> Mostly it's SQL-based operators, so I named it sql-ai, but I agree we can
> make it generic without specifying sql in it :)
>
> Pavan
>
> On Tue, Oct 7, 2025 at 3:48 PM Ryan Hatter via dev
> <[email protected]> wrote:
> >
> > Would this really necessitate a new provider? Should this just be baked
> > into the common SQL provider?
> >
> > Alternatively, instead of a narrow `sql-ai` provider, why not have a
> > generic common ai provider with a SQL package, which would allow us to
> > build AI-based subpackages into the provider other than just SQL?
> >
> > On Mon, Oct 6, 2025 at 4:31 PM Pavankumar Gopidesu <[email protected]>
> > wrote:
> >
> > > @Giorgio Yes indeed, that's also a good thought to integrate. I will
> > > keep it in mind when I draft the AIP and message about this a bit more :)
> > > Yes, please join. We have great demos packed on this topic :)
> > >
> > > @kaxil, Yes, that's a great blog post from Wren AI on leveraging
> > > Apache DataFusion as a query engine to connect to different data sources.
> > >
> > > Pavan
> > >
> > > On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <[email protected]>
> > > wrote:
> > >
> > > > Hey Pavan,
> > > > Some notes:
> > > > 1. An LLM can also be very useful for detecting the root cause of
> > > > errors while developing and designing a pipeline. To explain: in the
> > > > past we had several Spark processes. When everything is green it is
> > > > fine, but when one fails, it would be nice to have an integrated tool
> > > > to ask why.
> > > > 2. Ideally such an operator could be a ModelContextProtocolOperator,
> > > > and you would need nothing else than to pass an LLM as a parameter to
> > > > that operator and just call tools, execute queries, and so on.
> > > > This would be more powerful, because you create an abstraction
> > > > between devices, databases, servers and so on, so each source of data
> > > > can be injected into the pipeline.
> > > > 3. Good job! Looking forward to seeing the presentation.
> > > > Best Regards,
> > > > Giorgio
> > > >
> > > > On Tue, 30 Sep 2025 at 14:51, Pavankumar Gopidesu
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We're exploring adding LLM-powered SQL operators to Airflow and would
> > > > > love community input before writing an AIP.
> > > > >
> > > > > The idea: Let users write natural language prompts like "find customers
> > > > > with missing emails" and have Airflow generate safe SQL queries with full
> > > > > context about your database schema, connections, and data sensitivity.
> > > > >
> > > > > Why this matters:
> > > > >
> > > > > Most of us spend too much time on schema drift detection and manual data
> > > > > quality checks. Meanwhile, AI agents are getting powerful but lack
> > > > > production-ready data integrations. Airflow could bridge this gap.
> > > > >
> > > > > Here's what we're dealing with at Tavant:
> > > > >
> > > > > Our team works with multiple data domain teams producing data in different
> > > > > formats and storage across S3, PostgreSQL, Iceberg, and Aurora. When data
> > > > > assets become available for consumption, we need:
> > > > >
> > > > > - Detection of breaking schema changes between systems
> > > > > - Data quality assessments between snapshots
> > > > > - Validation that assets meet mandatory metadata requirements
> > > > > - Lookup validation against existing data (comparing file feeds with
> > > > >   different formats to existing data in Iceberg/Aurora)
> > > > >
> > > > > This is exactly the type of work that LLMs could automate while
> > > > > maintaining governance.
> > > > >
> > > > > What we're thinking:
> > > > >
> > > > > ```python
> > > > > # Instead of writing complex SQL by hand...
> > > > > quality_check = LLMSQLQueryOperator(
> > > > >     task_id="find_data_issues",
> > > > >     prompt="Find customers with invalid email formats and missing phone numbers",
> > > > >     data_sources=[customer_asset],  # Airflow knows the schema automatically
> > > > >     # Built-in safety: won't generate DROP/DELETE statements
> > > > > )
> > > > > ```
> > > > >
> > > > > The operator would:
> > > > >
> > > > > - Auto-inject database schema, sample data, and connection details
> > > > > - Generate safe SQL (blocks dangerous operations)
> > > > > - Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
> > > > > - Support schema drift detection between systems
> > > > > - Handle multi-cloud data via Apache DataFusion[1] (Did some experiments
> > > > >   with 50M+ records and results are in 10-15 seconds for common
> > > > >   aggregations)
> > > > >
> > > > > for more info on benchmarks [2]
> > > > >
> > > > > Key benefit: Assets become smarter with structured metadata (schema,
> > > > > sensitivity, format) instead of just throwing everything in `extra`.
> > > > >
> > > > > Implementation plan:
> > > > >
> > > > > Start with a separate provider (`apache-airflow-providers-sql-ai`) so we
> > > > > can iterate without touching the Airflow core. No breaking changes, works
> > > > > with existing connections and hooks.
> > > > >
> > > > > I am presenting this at Airflow Summit 2025 in Seattle with Kaxil - come
> > > > > see the live demo!
> > > > >
> > > > > Next steps:
> > > > >
> > > > > If this resonates after the Summit, we'll write a proper AIP with technical
> > > > > details and further build a working prototype.
> > > > >
> > > > > Thoughts? Concerns?
> > > > > Better ideas?
> > > > >
> > > > > [1]: https://datafusion.apache.org/
> > > > > [2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> > > > >
> > > > > Thanks,
> > > > > Pavan
> > > > >
> > > > > P.S. - Happy to share more technical details with anyone interested.
> > > >
> > > > --
> > > > Life is a chess game - Anonymous.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
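
To make the "generate safe SQL (blocks dangerous operations)" point from the original proposal a bit more concrete, here is a minimal sketch of what such a guard could look like. This is only an illustration, not how the proposed operator is implemented: `is_read_only_sql` and `FORBIDDEN_KEYWORDS` are hypothetical names, and a keyword deny-list is a naive stand-in for a real SQL parser with dialect awareness.

```python
import re

# Hypothetical deny-list (an assumption for illustration only).
FORBIDDEN_KEYWORDS = {
    "DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE",
    "INSERT", "CREATE", "GRANT", "REVOKE", "MERGE",
}

# Matches single-quoted string literals, "--" line comments and
# "/* ... */" block comments, whichever starts first in the text.
_LITERALS_AND_COMMENTS = re.compile(
    r"'(?:[^']|'')*'|--[^\n]*|/\*.*?\*/", re.DOTALL
)


def is_read_only_sql(sql: str) -> bool:
    """Return True if `sql` looks like a single read-only statement."""
    # Blank out literals and comments so their contents are not
    # mistaken for keywords or statement separators.
    cleaned = _LITERALS_AND_COMMENTS.sub(" ", sql)

    # Reject anything that contains more than one statement.
    statements = [s.strip() for s in cleaned.split(";") if s.strip()]
    if len(statements) != 1:
        return False

    # The statement must start with SELECT or WITH ...
    tokens = re.findall(r"[A-Z_][A-Z_0-9]*", statements[0].upper())
    if not tokens or tokens[0] not in {"SELECT", "WITH"}:
        return False

    # ... and must not contain a forbidden keyword anywhere else,
    # e.g. a DELETE hidden inside a CTE.
    return FORBIDDEN_KEYWORDS.isdisjoint(tokens)


if __name__ == "__main__":
    for query in (
        "SELECT id FROM customers WHERE email IS NULL",
        "SELECT 1; DELETE FROM customers",
        "WITH d AS (DELETE FROM t RETURNING *) SELECT * FROM d",
    ):
        print(is_read_only_sql(query), "<-", query)
```

A check like this errs on the side of rejecting queries (for example, a double-quoted identifier named "drop" would be refused), which seems like the right default for LLM-generated SQL, but it would need to be backed by read-only database credentials rather than relied on by itself.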
