Re: AI-Native Airflow - LLM-Driven Intelligence for Production Data Workflows

Pavankumar Gopidesu Wed, 14 Jan 2026 07:37:48 -0800

Thank you, everyone, for the thoughtful discussions on this AIP. We have sent
it out for voting.


Regards,
Pavan

On Tue, Jan 13, 2026 at 10:52 PM Pavankumar Gopidesu <
[email protected]> wrote:

> Thanks Alex,
>
> I agree that evals will be a core part of the operator implementations.
> I haven’t yet fully thought through the structure or how best to expose and
> serve evals across operators, so your perspective is very timely. The idea
> of a BaseEvals operator is interesting as well.
>
> Thank you for offering your support, we’ll definitely take you up on that.
> I’ll reach out when we move into implementation so we can
> collaborate on this.
>
> Regards.
> Pavan
>
>
>
>
> On Tue, Jan 13, 2026 at 11:43 AM Alex <[email protected]> wrote:
>
>> Thanks Pavan, this thread and the AIP are awesome!
>>
>> I've been starting to use and advocate an eval-first approach (including a
>> lightning talk at the Airflow Summit [1]), not just for traditional
>> software developers but also for new builders from other domains (so I can't
>> just say "it's like TDD with integration tests for AI apps"), and I'd be
>> happy to help build the evals for, test, design or brainstorm components in
>> this space.
>>
>> I firmly believe evals are a key area and I'm starting to contact the MCP
>> server pioneers I met at last summit so we can experiment building a
>> testbed [2] to evaluate operators/agents/mcps/skills.
>>
>> Including a BaseEvals operator (which I believe differs from the goal of
>> LLMDataQualityOperator) in the proposal might be worth it (unless the evals
>> scope deserves its own place).
>>
>> Any specific area where you'd like support?
>>
>> Thanks,
>> Alex
>>
>> - [1]
>>
>> https://alexhans.github.io/posts/talk.toward-a-shared-vision-of-llm-evals-in-airflow-ecosystem.html
>> - [2] https://github.com/Alexhans/evals-playground
>>
>> On Thu, Jan 8, 2026 at 9:43 PM Pavankumar Gopidesu <
>> [email protected]>
>> wrote:
>>
>> > Thanks, Niko, for reviewing.
>> >
>> > For now I am moving the cyclicness implementation to future scope,
>> > maybe to a new AIP where we can bring it in and rethink it.
>> >
>> > Regards,
>> > Pavan
>> >
>> > On Wed, Jan 7, 2026 at 9:59 PM Oliveira, Niko <[email protected]>
>> wrote:
>> > >
>> > > I read through the AIP and I like the idea a lot! I see both sides of
>> > where to put the HITL portion. But I think that's something we can
>> adjust
>> > one way or another (in an additive way), so if we find out that it's not
>> > the right fit later, we can pivot.
>> > >
>> > > ________________________________
>> > > From: Pavankumar Gopidesu <[email protected]>
>> > > Sent: Monday, January 5, 2026 9:08:00 AM
>> > > To: [email protected]
>> > > Subject: RE: [EXT] AI-Native Airflow - LLM-Driven Intelligence for
>> > Production Data Workflows
>> > >
>> > >
>> > > Yes Zoppi, as Kaxil mentioned, we will be using PydanticAI, and it
>> > > provides nice interfaces for integrating validations, etc.
>> > >
>> > > Pavan
>> > >
>> > > On Wed, Dec 31, 2025 at 2:28 AM Kaxil Naik <[email protected]>
>> wrote:
>> > > >
>> > > > Evals will be part of it as this will be built on top of PydanticAI
>> > that
>> > > > supports it.
>> > > >
>> > > > On Mon, 29 Dec 2025 at 19:03, Giorgio Zoppi <
>> [email protected]>
>> > wrote:
>> > > >
>> > > > > Hey Pavan.
>> > > > > If you are going to introduce this, have you thought about the
>> > > > > evaluation framework?
>> > > > > How do you evaluate the LLM operator?
>> > > > >
>> > > > > On Mon, Dec 29, 2025, 09:40 Pavankumar Gopidesu <
>> > [email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > Thanks Jens and Jarek, agree on both points raised in comments.
>> > > > > >
>> > > > > > I am happy to defer the embedding of the HITL to a separate AIP.
>> > > > > >
>> > > > > > To Jens:
>> > > > > > Yes, it's planned in phases; our plan starts with provider-only
>> > > > > > changes.
>> > > > > >
>> > > > > > Regards
>> > > > > > Pavan
>> > > > > >
>> > > > > > On Sun, Dec 28, 2025 at 2:03 PM Jarek Potiuk <[email protected]>
>> > wrote:
>> > > > > > >
>> > > > > > > I also looked at it and I love it as well. I think of it as a
>> > missing
>> > > > > > > abstraction between current Airflow users and current LLM app
>> > > > > > developers, I
>> > > > > > > also proposed something a little bit bolder there, which I
>> think
>> > shows
>> > > > > > the
>> > > > > > > true potential of that approach.
>> > > > > > > I added comment in the doc, but I will copy it here for better
>> > > > > visibility
>> > > > > > >
>> > > > > > > ---
>> > > > > > >
>> > > > > > > After thinking quite a bit about the proposal, I actually love
>> > it and I
>> > > > > > > think that should be the next frontier of making Airflow
>> > abstractions
>> > > > > > more
>> > > > > > > approachable and usable by those who want to implement various
>> > patterns
>> > > > > > of
>> > > > > > > interacting with LLMS.
>> > > > > > >
>> > > > > > > And I have a little different opinion than Jens regarding
>> HITL.
>> > I see
>> > > > > > those
>> > > > > > > common LLM operators as slightly "higher" level operators that
>> > might
>> > > > > > > implement a set of common LLM-related patterns that are
>> currently
>> > > > > either
>> > > > > > > difficult or impossible to express via putting together things
>> > via Dag
>> > > > > > and
>> > > > > > > individual tasks. In this sense, the capability of making a HITL
>> > > > > > > call-out for approval or selection from within such an operator -
>> > > > > > > without completing the operator, and even running those "call-outs"
>> > > > > > > more than once, actually even an unbounded number of times during a
>> > > > > > > single operator's execution - is a natural fit.
>> > > > > > >
>> > > > > > > Actually it's a great way for us to implement some
>> "cyclicness" -
>> > > > > without
>> > > > > > > breaking the "acyclic" property of our Dags (for now at
>> least).
>> > Making
>> > > > > > Dag
>> > > > > > > "cyclic" is quite a dramatic change, and possibly we do not
>> even
>> > have
>> > > > > to
>> > > > > > do
>> > > > > > > it, because the "cyclic" part can be likely encompassed within
>> > the
>> > > > > > > specialized LLM operators. I can imagine an operator that
>> > performs LLM
>> > > > > > > querying and refining it via additional interactions with LLMs
>> > > > > > "internally"
>> > > > > > > - during a single operator's execution. And some of those
>> > iterations
>> > > > > > might
>> > > > > > > result in HITL "call-out" - even multiple times during one
>> > execution.
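
A minimal sketch of the "cyclic inside one operator" idea described above,
in plain Python rather than against any Airflow or PydanticAI API. The names
`refine_until_accepted`, `generate` and `review` are hypothetical; `review`
stands in for a HITL call-out and returns feedback text, or None to accept.
The loop is bounded by `max_rounds` here as an assumption, whereas the text
above also allows an unbounded number of call-outs.

```python
from typing import Callable, Optional


def refine_until_accepted(
    generate: Callable[[str, Optional[str]], str],
    review: Callable[[str], Optional[str]],
    prompt: str,
    max_rounds: int = 5,
) -> str:
    """Iterate generate -> review inside a single task execution.

    The Dag stays acyclic because the loop lives entirely in this function.
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        draft = generate(prompt, feedback)
        feedback = review(draft)  # hypothetical HITL call-out; None means accepted
        if feedback is None:
            return draft
    raise RuntimeError(f"no draft accepted after {max_rounds} rounds")


# Stub callables standing in for an LLM call and a human reviewer.
def fake_generate(prompt: str, feedback: Optional[str]) -> str:
    return "SELECT * FROM t" + (" LIMIT 10" if feedback else "")


def fake_review(draft: str) -> Optional[str]:
    return None if "LIMIT" in draft else "add a LIMIT"


result = refine_until_accepted(fake_generate, fake_review, "list rows")
```

With these stubs the first draft is rejected once and the second is accepted,
so the whole exchange completes within one call.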
>> > > > > > >
>> > > > > > > Also one more proposal I have here is to use an API similar to
>> > HITL (or
>> > > > > > > maybe repurpose HITL for that) - to report PROGRESS of such a
>> > task.
>> > > > > This
>> > > > > > is
>> > > > > > > the typical property of good LLM task that it provides some
>> > feedback to
>> > > > > > the
>> > > > > > > user - it might be HITL when it asks for something but also it
>> > might be
>> > > > > > > HOOTL (Human Outside Of The Loop) - where the task is simply
>> > reporting
>> > > > > > its
>> > > > > > > progress and allows the user to perform asynchronous actions
>> > based on
>> > > > > > that
>> > > > > > > progress → for example abort the execution (to stop the Dag)
>> or
>> > mark it
>> > > > > > as
>> > > > > > > "skipped" (to trigger - skip processing path), or mark it as
>> > "success"
>> > > > > to
>> > > > > > > simulate things being completed when they are not. While the
>> > three
>> > > > > > "async"
>> > > > > > > operations we already have, we do not currently have
>> "progress"
>> > > > > targeted
>> > > > > > > for the kind of actor who is also HITL "actor" - someone who
>> is
>> > not
>> > > > > > > interested in detailed logs, but rather want to monitor
>> progress
>> > and
>> > > > > > assess
>> > > > > > > quality of the output - even if it is just a partial output in
>> > the
>> > > > > > > iterative process).
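
The progress idea above could look roughly like the following sketch. It is
plain Python, not an existing Airflow interface: `Control`, `ProgressUpdate`
and `run_with_progress` are hypothetical names, and the `report` callback
stands in for whatever channel would show partial output to the HITL/HOOTL
actor and carry back an abort, skip or mark-success decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Control(Enum):
    CONTINUE = "continue"
    ABORT = "abort"
    SKIP = "skip"
    MARK_SUCCESS = "mark_success"


@dataclass
class ProgressUpdate:
    step: int
    total: int
    partial_output: str


def run_with_progress(
    steps: List[str],
    report: Callable[[ProgressUpdate], Control],
) -> Tuple[Optional[Control], List[str]]:
    """Run steps in order, reporting after each one.

    Returns (None, outputs) if all steps ran, or (decision, outputs so far)
    as soon as the observer answers with anything other than CONTINUE.
    """
    outputs: List[str] = []
    for index, step in enumerate(steps, start=1):
        outputs.append(f"done: {step}")  # placeholder for the real work
        decision = report(ProgressUpdate(index, len(steps), outputs[-1]))
        if decision is not Control.CONTINUE:
            return decision, outputs
    return None, outputs


# An observer that aborts after seeing the second partial output.
def observer(update: ProgressUpdate) -> Control:
    return Control.ABORT if update.step == 2 else Control.CONTINUE


decision, outputs = run_with_progress(["a", "b", "c"], observer)
```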
>> > > > > > >
>> > > > > > > I think that it will be easier and much more "surgical" (and
>> > applied in
>> > > > > > the
>> > > > > > > right place) to embed this "iterative" feedback / progress
>> than
>> > to
>> > > > > modify
>> > > > > > > the "acyclic" property into our Dags.
>> > > > > > >
>> > > > > > > Also - this kind of Progress interface can also be used to
>> > publish the
>> > > > > > > "async" tasks progress as the next step of [WIP] AIP-98: Add
>> > async
>> > > > > > support
>> > > > > > > for PythonOperator in Airflow 3:
>> > > > > > >
>> > > > > >
>> > > > >
>> >
>> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-98%3A+Add+async+support+for+PythonOperator+in+Airflow+3
>> > > > > > > that we discussed with David  .
>> > > > > > >
>> > > > > > > J.
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sun, Dec 28, 2025 at 2:16 PM Jens Scheffler <
>> > [email protected]>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > I like the AIP very much and in my view can be made
>> completely
>> > in a
>> > > > > > > > Provider package... with some comments (I assume non
>> blocking)
>> > and
>> > > > > > would
>> > > > > > > > propose to really start in increments and then adjust by
>> > learning on
>> > > > > > the
>> > > > > > > > path.
>> > > > > > > >
>> > > > > > > > On 12/27/25 22:00, Pavankumar Gopidesu wrote:
>> > > > > > > > > Thanks, Giorgio Zoppi, for reviewing the AIP. Yes, it's already a
>> > > > > > > > > planned part of this AIP; see the example in [1], where you can
>> > > > > > > > > disable or enable the HITL step. So it's an integrated part of the
>> > > > > > > > > operator, with the help of the HITL operator.
>> > > > > > > > >
>> > > > > > > > > ```
>> > > > > > > > > from datetime import timedelta
>> > > > > > > > >
>> > > > > > > > > LLMDataQualityOperator(
>> > > > > > > > >     task_id="customer_quality_analysis",
>> > > > > > > > >     data_sources=[customer_s3],
>> > > > > > > > >     prompt="Generate data quality validation queries",
>> > > > > > > > >     require_approval=True,  # Built-in HITL
>> > > > > > > > >     approval_timeout=timedelta(hours=2),
>> > > > > > > > > )
>> > > > > > > > > ```
>> > > > > > > > >
>> > > > > > > > > [1]:
>> > > > > > > >
>> > > > > >
>> > > > >
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
>> > > > > > > > >
>> > > > > > > > > Regards,
>> > > > > > > > > Pavan
>> > > > > > > > >
>> > > > > > > > > On Sat, Dec 27, 2025 at 9:16 AM Giorgio Zoppi <
>> > > > > > [email protected]>
>> > > > > > > > wrote:
>> > > > > > > > >> Hello,
>> > > > > > > > >> Just my 1c after skimming the AIP:
>> > > > > > > > >> You might want to explore how to avoid human approval for generated
>> > > > > > > > >> queries by using an LLM as a judge to evaluate their quality. The nice
>> > > > > > > > >> thing about data pipelines is automation.
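
One way to read the LLM-as-judge suggestion above is as a gate in front of
the human approval step. The sketch below is an assumption, not part of the
AIP: `Verdict`, `route_generated_query` and the 0.9 threshold are
hypothetical, and `judge` is any callable that scores a generated query
between 0 and 1 (in practice it would wrap an LLM call).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: float  # 0.0 (reject) .. 1.0 (fully confident)
    reason: str


def route_generated_query(
    sql: str,
    judge: Callable[[str], Verdict],
    auto_approve_threshold: float = 0.9,
) -> str:
    """Skip human approval only when the judge is confident enough."""
    verdict = judge(sql)
    if not 0.0 <= verdict.score <= 1.0:
        raise ValueError(f"judge returned out-of-range score: {verdict.score}")
    if verdict.score >= auto_approve_threshold:
        return "auto_approved"
    return "needs_human_review"


# Stub judges standing in for a real LLM-backed evaluator.
confident = route_generated_query("SELECT 1", lambda sql: Verdict(0.95, "simple"))
unsure = route_generated_query("SELECT 1", lambda sql: Verdict(0.40, "ambiguous"))
```

Low-scoring queries still fall back to a person, so automation is added
without removing the existing approval path.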
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >> On Wed, Dec 24, 2025, 10:23 Pavankumar Gopidesu <
>> > > > > > > > [email protected]>
>> > > > > > > > >> wrote:
>> > > > > > > > >>
>> > > > > > > > >>> Hello everyone,
>> > > > > > > > >>>
>> > > > > > > > >>> The thread has been quiet for some time, and I would
>> like
>> > to
>> > > > > > restart
>> > > > > > > > >>> the discussion with the AIP.
>> > > > > > > > >>>
>> > > > > > > > >>> First, a sincere thank you to Kaxil for presenting the
>> > idea at
>> > > > > > Airflow
>> > > > > > > > >>> Summit 2025. The session was very well received, and
>> many
>> > > > > attendees
>> > > > > > > > >>> expressed strong interest in the proposal.
>> Unfortunately,
>> > I was
>> > > > > > unable
>> > > > > > > > >>> to attend the summit due to visa issues, but I am
>> hopeful
>> > I will
>> > > > > be
>> > > > > > > > >>> able to join next year.
>> > > > > > > > >>>
>> > > > > > > > >>> The demo included well-structured prototypes. For those
>> > who were
>> > > > > > > > >>> unable to attend the session, please refer to the
>> recorded
>> > talk
>> > > > > > here
>> > > > > > > > >>> [1].
>> > > > > > > > >>>
>> > > > > > > > >>> I have also drafted the complete AIP proposal, which is
>> > available
>> > > > > > here
>> > > > > > > > >>> [2]. I would greatly appreciate your reviews and look
>> > forward to
>> > > > > > > > >>> feedback and further discussion.
>> > > > > > > > >>>
>> > > > > > > > >>> Finally, to those celebrating Christmas, I wish you a
>> very
>> > happy
>> > > > > > > > >>> Christmas and a wonderful holiday season.
>> > > > > > > > >>>
>> > > > > > > > >>> Regards
>> > > > > > > > >>> Pavan
>> > > > > > > > >>>
>> > > > > > > > >>> [1] https://www.youtube.com/watch?v=XSAzSDVUi2o
>> > > > > > > > >>> [2]
>> > > > > > > > >>>
>> > > > > > > >
>> > > > > >
>> > > > >
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
>> > > > > > > > >>>
>> > > > > > > > >>> On Wed, Oct 15, 2025 at 6:13 AM Amogh Desai <
>> > > > > [email protected]
>> > > > > > >
>> > > > > > > > wrote:
>> > > > > > > > >>>> Thanks Pavan and Kaxil, seems like an interesting idea
>> > and a
>> > > > > > pretty
>> > > > > > > > >>>> reasonable problem to solve.
>> > > > > > > > >>>>
>> > > > > > > > >>>> I also like the idea of starting with
>> > > > > > > > >>> `apache-airflow-providers-common-ai`
>> > > > > > > > >>>> and expanding as / when needed.
>> > > > > > > > >>>>
>> > > > > > > > >>>> Looking forward to when the recording will be out,
>> missed
>> > > > > > attending
>> > > > > > > > this
>> > > > > > > > >>>> session at the Airflow Summit.
>> > > > > > > > >>>>
>> > > > > > > > >>>> Thanks & Regards,
>> > > > > > > > >>>> Amogh Desai
>> > > > > > > > >>>>
>> > > > > > > > >>>>
>> > > > > > > > >>>> On Thu, Oct 9, 2025 at 10:49 AM Kaxil Naik <
>> > [email protected]
>> > > > > >
>> > > > > > > > wrote:
>> > > > > > > > >>>>
>> > > > > > > > >>>>> Yea I think it should be
>> > apache-airflow-providers-common-ai
>> > > > > > > > >>>>>
>> > > > > > > > >>>>> On Wed, 8 Oct 2025 at 02:04, Pavankumar Gopidesu <
>> > > > > > > > >>> [email protected]>
>> > > > > > > > >>>>> wrote:
>> > > > > > > > >>>>>
>> > > > > > > > >>>>>> Yes, it's a new provider, starting out as completely experimental; we
>> > > > > > > > >>>>>> don't want to break functionality in existing providers :)
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> It's mostly SQL-based operators, so I named it sql-ai, but I agree we
>> > > > > > > > >>>>>> can make it generic without specifying SQL in it :)
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> Pavan
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>> On Tue, Oct 7, 2025 at 3:48 PM Ryan Hatter via dev
>> > > > > > > > >>>>>> <[email protected]> wrote:
>> > > > > > > > >>>>>>> Would this really necessitate a new provider? Should
>> > this
>> > > > > just
>> > > > > > be
>> > > > > > > > >>> baked
>> > > > > > > > >>>>>>> into the common SQL provider?
>> > > > > > > > >>>>>>>
>> > > > > > > > >>>>>>> Alternatively, instead of a narrow `sql-ai`
>> provider,
>> > why not
>> > > > > > have
>> > > > > > > > >>> a
>> > > > > > > > >>>>>>> generic common ai provider with a SQL package, which
>> > would
>> > > > > > allow
>> > > > > > > > >>> for us
>> > > > > > > > >>>>>> to
>> > > > > > > > >>>>>>> build AI-based subpackages into the provider other
>> > than just
>> > > > > > SQL?
>> > > > > > > > >>>>>>>
>> > > > > > > > >>>>>>> On Mon, Oct 6, 2025 at 4:31 PM Pavankumar Gopidesu <
>> > > > > > > > >>>>>> [email protected]>
>> > > > > > > > >>>>>>> wrote:
>> > > > > > > > >>>>>>>
>> > > > > > > > >>>>>>>> @Giorgio Yes, indeed, that's also a good idea to integrate. I will
>> > > > > > > > >>>>>>>> keep it in mind when I draft the AIP and will write about this a bit
>> > > > > > > > >>>>>>>> more :)
>> > > > > > > > >>>>>>>> Yes, please join. We have great demos packed on this topic :)
>> > > > > > > > >>>>>>>>
>> > > > > > > > >>>>>>>> @kaxil, yes, that's a great blog post from Wren AI on leveraging
>> > > > > > > > >>>>>>>> Apache DataFusion as a query engine to connect to different data
>> > > > > > > > >>>>>>>> sources.
>> > > > > > > > >>>>>>>> Pavan
>> > > > > > > > >>>>>>>>
>> > > > > > > > >>>>>>>> On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <
>> > > > > > > > >>>>> [email protected]
>> > > > > > > > >>>>>>>> wrote:
>> > > > > > > > >>>>>>>>
>> > > > > > > > >>>>>>>>> Hey Pavan,
>> > > > > > > > >>>>>>>>> Some notes:
>> > > > > > > > >>>>>>>>> 1. LLMs can also be very useful for detecting the root causes of
>> > > > > > > > >>>>>>>>> errors while developing and designing a pipeline. To explain: in the
>> > > > > > > > >>>>>>>>> past we had several Spark processes; when everything is green it's
>> > > > > > > > >>>>>>>>> fine, but when one fails, it would be nice to have an integrated tool
>> > > > > > > > >>>>>>>>> to ask why.
>> > > > > > > > >>>>>>>>> 2. Ideally such an operator could be a ModelContextProtocolOperator,
>> > > > > > > > >>>>>>>>> and you would need nothing more than to pass an LLM as a parameter to
>> > > > > > > > >>>>>>>>> that operator and just call tools, execute queries, and so on. This
>> > > > > > > > >>>>>>>>> would be more powerful, because you create an abstraction between
>> > > > > > > > >>>>>>>>> devices, databases, servers and so on, so each source of data can be
>> > > > > > > > >>>>>>>>> injected into the pipeline.
>> > > > > > > > >>>>>>>>> 3. Good job! Looking forward to seeing the presentation.
>> > > > > > > > >>>>>>>>> Best Regards,
>> > > > > > > > >>>>>>>>> Giorgio
>> > > > > > > > >>>>>>>>>
>> > > > > > > > >>>>>>>>> On Tue, Sep 30, 2025 at 14:51 Pavankumar Gopidesu <
>> > > > > > > > >>>>>>>>> [email protected]> wrote:
>> > > > > > > > >>>>>>>>>
>> > > > > > > > >>>>>>>>>> Hi everyone,
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> We're exploring adding LLM-powered SQL operators
>> to
>> > > > > Airflow
>> > > > > > > > >>> and
>> > > > > > > > >>>>>> would
>> > > > > > > > >>>>>>>>> love
>> > > > > > > > >>>>>>>>>> community input before writing an AIP.
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> The idea: Let users write natural language
>> prompts
>> > like
>> > > > > > "find
>> > > > > > > > >>>>>> customers
>> > > > > > > > >>>>>>>>>> with missing emails" and have Airflow generate
>> safe
>> > SQL
>> > > > > > > > >>> queries
>> > > > > > > > >>>>>> with
>> > > > > > > > >>>>>>>> full
>> > > > > > > > >>>>>>>>>> context about your database schema, connections,
>> > and data
>> > > > > > > > >>>>>> sensitivity.
>> > > > > > > > >>>>>>>>>> Why this matters:
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> Most of us spend too much time on schema drift
>> > detection
>> > > > > and
>> > > > > > > > >>>>> manual
>> > > > > > > > >>>>>>>> data
>> > > > > > > > >>>>>>>>>> quality checks. Meanwhile, AI agents are getting
>> > powerful
>> > > > > > but
>> > > > > > > > >>>>> lack
>> > > > > > > > >>>>>>>>>> production-ready data integrations. Airflow could
>> > bridge
>> > > > > > this
>> > > > > > > > >>>>> gap.
>> > > > > > > > >>>>>>>>>> Here's what we're dealing with at Tavant:
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> Our team works with multiple data domain teams
>> > producing
>> > > > > > > > >>> data in
>> > > > > > > > >>>>>>>>> different
>> > > > > > > > >>>>>>>>>> formats and storage across S3, PostgreSQL,
>> Iceberg,
>> > and
>> > > > > > > > >>> Aurora.
>> > > > > > > > >>>>>> When
>> > > > > > > > >>>>>>>> data
>> > > > > > > > >>>>>>>>>> assets become available for consumption, we need:
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> - Detection of breaking schema changes between systems
>> > > > > > > > >>>>>>>>>> - Data quality assessments between snapshots
>> > > > > > > > >>>>>>>>>> - Validation that assets meet mandatory metadata requirements
>> > > > > > > > >>>>>>>>>> - Lookup validation against existing data (comparing file feeds with
>> > > > > > > > >>>>>>>>>>   different formats to existing data in Iceberg/Aurora)
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> This is exactly the type of work that LLMs could automate while
>> > > > > > > > >>>>>>>>>> maintaining governance.
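
To make the "breaking schema changes" item concrete, here is a small
deterministic sketch of the kind of check an operator could run before (or
alongside) any LLM step. It assumes schemas are available as plain
column-name to type-name mappings; `SchemaDiff` and `diff_schemas` are
hypothetical names, and treating removed or retyped columns as breaking is
an assumption, not a rule taken from the proposal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SchemaDiff:
    removed: List[str] = field(default_factory=list)
    retyped: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    added: List[str] = field(default_factory=list)

    @property
    def is_breaking(self) -> bool:
        # Assumption: new columns are harmless; removals and type changes are not.
        return bool(self.removed or self.retyped)


def diff_schemas(old: Dict[str, str], new: Dict[str, str]) -> SchemaDiff:
    """Compare two {column: type} mappings and classify the differences."""
    diff = SchemaDiff()
    for column, old_type in old.items():
        if column not in new:
            diff.removed.append(column)
        elif new[column] != old_type:
            diff.retyped[column] = (old_type, new[column])
    diff.added = [column for column in new if column not in old]
    return diff


previous = {"id": "bigint", "email": "varchar", "phone": "varchar"}
current = {"id": "bigint", "email": "text", "signup_ts": "timestamp"}
drift = diff_schemas(previous, current)
```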
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> What we're thinking:
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> ```python
>> > > > > > > > >>>>>>>>>> # Instead of writing complex SQL by hand...
>> > > > > > > > >>>>>>>>>> quality_check = LLMSQLQueryOperator(
>> > > > > > > > >>>>>>>>>>     task_id="find_data_issues",
>> > > > > > > > >>>>>>>>>>     prompt="Find customers with invalid email formats and missing phone numbers",
>> > > > > > > > >>>>>>>>>>     data_sources=[customer_asset],  # Airflow knows the schema automatically
>> > > > > > > > >>>>>>>>>>     # Built-in safety: won't generate DROP/DELETE statements
>> > > > > > > > >>>>>>>>>> )
>> > > > > > > > >>>>>>>>>> ```
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> The operator would:
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> - Auto-inject database schema, sample data, and connection details
>> > > > > > > > >>>>>>>>>> - Generate safe SQL (blocks dangerous operations)
>> > > > > > > > >>>>>>>>>> - Work across PostgreSQL, Snowflake, BigQuery with dialect awareness
>> > > > > > > > >>>>>>>>>> - Support schema drift detection between systems
>> > > > > > > > >>>>>>>>>> - Handle multi-cloud data via Apache DataFusion[1] (did some experiments
>> > > > > > > > >>>>>>>>>>   with 50M+ records and results are in 10-15 seconds for common
>> > > > > > > > >>>>>>>>>>   aggregations)
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> For more info on benchmarks, see [2].
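
As an illustration of the "blocks dangerous operations" bullet, the sketch
below shows one conservative guard using only the standard library. It is an
assumption about how such a check might look, not the proposed
implementation: `is_read_only` and `BLOCKED_KEYWORDS` are hypothetical, it
only understands standard single-quoted literals, and it rejects any query
containing comments rather than trying to parse them. Keyword matching like
this is not a security boundary; a real provider would more likely combine a
SQL parser with read-only database credentials.

```python
import re

# Hypothetical deny-list of statement keywords treated as unsafe.
BLOCKED_KEYWORDS = ("DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT", "GRANT")

_BLOCKED_RE = re.compile(r"\b(?:" + "|".join(BLOCKED_KEYWORDS) + r")\b", re.IGNORECASE)
_STRING_LITERAL_RE = re.compile(r"'(?:[^']|'')*'")


def is_read_only(sql: str) -> bool:
    """Return True only for a single SELECT/WITH statement with no blocked keywords."""
    # Blank out string literals so words inside them are not mistaken for keywords.
    without_literals = _STRING_LITERAL_RE.sub("''", sql)
    # Be conservative: refuse anything that still contains a comment marker.
    if "--" in without_literals or "/*" in without_literals:
        return False
    statements = [part.strip() for part in without_literals.split(";") if part.strip()]
    if len(statements) != 1:
        return False
    if not re.match(r"(?:SELECT|WITH)\b", statements[0], re.IGNORECASE):
        return False
    return _BLOCKED_RE.search(statements[0]) is None


safe = is_read_only("SELECT id FROM customers WHERE email IS NULL")
unsafe = is_read_only("SELECT 1; DELETE FROM customers")
```

The guard errs on the side of rejecting: for example, a column legitimately
named "delete" would also be refused.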
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> Key benefit: Assets become smarter with
>> structured
>> > > > > metadata
>> > > > > > > > >>>>>> (schema,
>> > > > > > > > >>>>>>>>>> sensitivity, format) instead of just throwing
>> > everything
>> > > > > in
>> > > > > > > > >>>>>> `extra`.
>> > > > > > > > >>>>>>>>>> Implementation plan:
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> Start with a separate provider
>> > > > > > > > >>>>> (`apache-airflow-providers-sql-ai`)
>> > > > > > > > >>>>>> so
>> > > > > > > > >>>>>>>> we
>> > > > > > > > >>>>>>>>>> can iterate without touching the Airflow core. No
>> > breaking
>> > > > > > > > >>>>> changes,
>> > > > > > > > >>>>>>>> works
>> > > > > > > > >>>>>>>>>> with existing connections and hooks.
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> I am presenting this at Airflow Summit 2025 in
>> > Seattle
>> > > > > with
>> > > > > > > > >>>>> Kaxil -
>> > > > > > > > >>>>>>>> come
>> > > > > > > > >>>>>>>>>> see the live demo!
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> Next steps:
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> If this resonates after the Summit, we'll write a
>> > proper
>> > > > > AIP
>> > > > > > > > >>> with
>> > > > > > > > >>>>>>>>> technical
>> > > > > > > > >>>>>>>>>> details and further build a working prototype.
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> Thoughts? Concerns? Better ideas?
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> [1]: https://datafusion.apache.org/
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> [2]:
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>
>> > > > > > > >
>> > > > > >
>> > > > >
>> >
>> https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
>> > > > > > > > >>>>>>>>>> Thanks,
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> Pavan
>> > > > > > > > >>>>>>>>>>
>> > > > > > > > >>>>>>>>>> P.S. - Happy to share more technical details with
>> > anyone
>> > > > > > > > >>>>>> interested.
>> > > > > > > > >>>>>>>>>
>> > > > > > > > >>>>>>>>> --
>> > > > > > > > >>>>>>>>> Life is a chess game - Anonymous.
>> > > > > > > > >>>>>>>>>
>> > > > > > > > >>>>>>
>> > > > > > > >
>> > ---------------------------------------------------------------------
>> > > > > > > > >>>>>> To unsubscribe, e-mail:
>> > [email protected]
>> > > > > > > > >>>>>> For additional commands, e-mail:
>> > [email protected]
>> > > > > > > > >>>>>>
>> > > > > > > > >>>>>>
>
