Thanks, Giorgio Zoppi, for reviewing the AIP. Yes, this is already
planned as part of the AIP: see the example in [1], where the HITL step
can be enabled or disabled. It is an integrated part of the operator,
built with the help of the HITL operator.

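```
from datetime import timedelta

# LLMDataQualityOperator is the operator proposed in the AIP [1];
# customer_s3 is assumed to be defined elsewhere in the DAG file.
LLMDataQualityOperator(
    task_id="customer_quality_analysis",
    data_sources=[customer_s3],
    prompt="Generate data quality validation queries",
    require_approval=True,  # Built-in HITL
    approval_timeout=timedelta(hours=2),
)
```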
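
To make the enable/disable behaviour concrete, here is a minimal sketch
in plain Python. It is only an illustration, not the AIP's
implementation: ApprovalGate and the reviewer callback are hypothetical
names, and the reviewer could be a human or, along the lines of your
suggestion, an LLM judge. The timeout is carried as a field but not
enforced here; a reviewer that returns None stands in for "no answer in
time".

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable, Optional


@dataclass
class ApprovalGate:
    """Hypothetical stand-in for the approval step; not an Airflow API."""

    require_approval: bool = True
    approval_timeout: timedelta = timedelta(hours=2)  # not enforced in this sketch

    def allows(self, generated_sql: str,
               reviewer: Callable[[str], Optional[bool]]) -> bool:
        """Return True if the generated SQL may run."""
        if not self.require_approval:
            # Approval disabled: the query runs unattended.
            return True
        # True = approved, False = rejected, None = no answer in time.
        return reviewer(generated_sql) is True


sql = "SELECT COUNT(*) FROM customers WHERE email IS NULL"
unattended = ApprovalGate(require_approval=False)
gated = ApprovalGate(require_approval=True)
```

With require_approval=False the reviewer is never consulted; with
require_approval=True, anything other than an explicit approval blocks
the query.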

[1]: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285

Regards,
Pavan

On Sat, Dec 27, 2025 at 9:16 AM Giorgio Zoppi <[email protected]> wrote:
>
> Hello,
> Just my 1c from skimming the AIP:
> You might want to explore how to avoid human approval of the generated
> queries by using an LLM as a judge to evaluate their quality. The nice
> thing about data pipelines is automation.
>
>
>
>
> On Wed, Dec 24, 2025, 10:23 Pavankumar Gopidesu <[email protected]>
> wrote:
>
> > Hello everyone,
> >
> > The thread has been quiet for some time, and I would like to restart
> > the discussion with the AIP.
> >
> > First, a sincere thank you to Kaxil for presenting the idea at Airflow
> > Summit 2025. The session was very well received, and many attendees
> > expressed strong interest in the proposal. Unfortunately, I was unable
> > to attend the summit due to visa issues, but I am hopeful I will be
> > able to join next year.
> >
> > The demo included well-structured prototypes. For those who were
> > unable to attend the session, please refer to the recorded talk here
> > [1].
> >
> > I have also drafted the complete AIP proposal, which is available here
> > [2]. I would greatly appreciate your reviews and look forward to
> > feedback and further discussion.
> >
> > Finally, to those celebrating Christmas, I wish you a very happy
> > Christmas and a wonderful holiday season.
> >
> > Regards
> > Pavan
> >
> > [1] https://www.youtube.com/watch?v=XSAzSDVUi2o
> > [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
> >
> > On Wed, Oct 15, 2025 at 6:13 AM Amogh Desai <[email protected]> wrote:
> > >
> > > Thanks Pavan and Kaxil, seems like an interesting idea and a pretty
> > > reasonable problem to solve.
> > >
> > > I also like the idea of starting with
> > > `apache-airflow-providers-common-ai` and expanding as / when needed.
> > >
> > > Looking forward to when the recording will be out, missed attending this
> > > session at the Airflow Summit.
> > >
> > > Thanks & Regards,
> > > Amogh Desai
> > >
> > >
> > > On Thu, Oct 9, 2025 at 10:49 AM Kaxil Naik <[email protected]> wrote:
> > >
> > > > Yea I think it should be apache-airflow-providers-common-ai
> > > >
> > > > On Wed, 8 Oct 2025 at 02:04, Pavankumar Gopidesu <[email protected]>
> > > > wrote:
> > > >
> > > > > Yes, it's a new provider, starting out as completely experimental; we
> > > > > don't want to break functionality in existing providers :)
> > > > >
> > > > > It's mostly SQL-based operators, so we named it sql-ai, but I agree we
> > > > > can make it generic without specifying SQL in the name :)
> > > > >
> > > > > Pavan
> > > > >
> > > > > On Tue, Oct 7, 2025 at 3:48 PM Ryan Hatter via dev
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Would this really necessitate a new provider? Should this just be
> > > > > > baked into the common SQL provider?
> > > > > >
> > > > > > Alternatively, instead of a narrow `sql-ai` provider, why not have a
> > > > > > generic common ai provider with a SQL package, which would allow for
> > > > > > us to build AI-based subpackages into the provider other than just SQL?
> > > > > >
> > > > > > On Mon, Oct 6, 2025 at 4:31 PM Pavankumar Gopidesu
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > @Giorgio Yes, indeed, that's also a good idea to integrate. I will
> > > > > > > keep it in mind when I draft the AIP and will write a bit more
> > > > > > > about this :)
> > > > > > > Yes, please join. We have great demos packed on this topic :)
> > > > > > >
> > > > > > > @kaxil, yes, that's a great blog post from Wren AI on leveraging
> > > > > > > Apache DataFusion as a query engine to connect to different data
> > > > > > > sources.
> > > > > > >
> > > > > > > Pavan
> > > > > > >
> > > > > > > On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Pavan,
> > > > > > > > Some notes:
> > > > > > > > 1. An LLM can also be very useful in detecting the root causes of
> > > > > > > > errors while developing and designing a pipeline. Let me explain:
> > > > > > > > in the past we had several Spark processes; when everything is
> > > > > > > > green it is fine, but when one fails, it would be nice to have an
> > > > > > > > integrated tool to ask why.
> > > > > > > > 2. Ideally such an operator could be a ModelContextProtocolOperator,
> > > > > > > > and you would need nothing more than to pass an LLM as a parameter
> > > > > > > > to that operator and just call tools, execute queries, and so on.
> > > > > > > > This would be more powerful, because you create an abstraction
> > > > > > > > between devices, databases, servers and so on, so each source of
> > > > > > > > data can be injected into the pipeline.
> > > > > > > > 3. Good job! Looking forward to seeing the presentation.
> > > > > > > > Best Regards,
> > > > > > > > Giorgio
> > > > > > > >
> > > > > > > > On Tue, Sep 30, 2025 at 14:51 Pavankumar Gopidesu
> > > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > We're exploring adding LLM-powered SQL operators to Airflow and
> > > > > > > > > would love community input before writing an AIP.
> > > > > > > > >
> > > > > > > > > The idea: Let users write natural language prompts like "find
> > > > > > > > > customers with missing emails" and have Airflow generate safe SQL
> > > > > > > > > queries with full context about your database schema, connections,
> > > > > > > > > and data sensitivity.
> > > > > > > > >
> > > > > > > > > Why this matters:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Most of us spend too much time on schema drift detection and manual
> > > > > > > > > data quality checks. Meanwhile, AI agents are getting powerful but
> > > > > > > > > lack production-ready data integrations. Airflow could bridge this
> > > > > > > > > gap.
> > > > > > > > >
> > > > > > > > > Here's what we're dealing with at Tavant:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Our team works with multiple data domain teams producing data in
> > > > > > > > > different formats and storage across S3, PostgreSQL, Iceberg, and
> > > > > > > > > Aurora. When data assets become available for consumption, we need:
> > > > > > > > >
> > > > > > > > > - Detection of breaking schema changes between systems
> > > > > > > > >
> > > > > > > > > - Data quality assessments between snapshots
> > > > > > > > >
> > > > > > > > > - Validation that assets meet mandatory metadata requirements
> > > > > > > > >
> > > > > > > > > - Lookup validation against existing data (comparing file feeds with
> > > > > > > > >   different formats to existing data in Iceberg/Aurora)
> > > > > > > > >
> > > > > > > > > This is exactly the type of work that LLMs could automate while
> > > > > > > > > maintaining governance.
> > > > > > > > >
> > > > > > > > > What we're thinking:
> > > > > > > > >
> > > > > > > > > ```python
> > > > > > > > > # Instead of writing complex SQL by hand...
> > > > > > > > > quality_check = LLMSQLQueryOperator(
> > > > > > > > >     task_id="find_data_issues",
> > > > > > > > >     prompt="Find customers with invalid email formats and missing phone numbers",
> > > > > > > > >     data_sources=[customer_asset],  # Airflow knows the schema automatically
> > > > > > > > >     # Built-in safety: won't generate DROP/DELETE statements
> > > > > > > > > )
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > The operator would:
> > > > > > > > >
> > > > > > > > > - Auto-inject database schema, sample data, and connection details
> > > > > > > > >
> > > > > > > > > - Generate safe SQL (blocks dangerous operations)
> > > > > > > > >
> > > > > > > > > - Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
> > > > > > > > >
> > > > > > > > > - Support schema drift detection between systems
> > > > > > > > >
> > > > > > > > > - Handle multi-cloud data via Apache DataFusion [1] (I did some
> > > > > > > > >   experiments with 50M+ records, and results came back in 10-15
> > > > > > > > >   seconds for common aggregations)
> > > > > > > > >
> > > > > > > > > For more info on benchmarks, see [2].
> > > > > > > > >
> > > > > > > > > Key benefit: Assets become smarter with structured metadata (schema,
> > > > > > > > > sensitivity, format) instead of just throwing everything in `extra`.
> > > > > > > > >
> > > > > > > > > Implementation plan:
> > > > > > > > >
> > > > > > > > > Start with a separate provider (`apache-airflow-providers-sql-ai`) so
> > > > > > > > > we can iterate without touching the Airflow core. No breaking changes,
> > > > > > > > > works with existing connections and hooks.
> > > > > > > > >
> > > > > > > > > I am presenting this at Airflow Summit 2025 in Seattle with Kaxil -
> > > > > > > > > come see the live demo!
> > > > > > > > >
> > > > > > > > > Next steps:
> > > > > > > > >
> > > > > > > > > If this resonates after the Summit, we'll write a proper AIP with
> > > > > > > > > technical details and further build a working prototype.
> > > > > > > > >
> > > > > > > > > Thoughts? Concerns? Better ideas?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1]: https://datafusion.apache.org/
> > > > > > > > >
> > > > > > > > > [2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Pavan
> > > > > > > > >
> > > > > > > > > P.S. - Happy to share more technical details with anyone interested.
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Life is a chess game - Anonymous.
> > > > > > > >
> > > > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
> > > > >
> > > > >
> > > >
> >
> >
> >
