Looking for some more eyes on this one.

Thanks & Regards,
Amogh Desai


On Thu, Nov 6, 2025 at 12:55 PM Amogh Desai <[email protected]> wrote:

> > Yes, the API could do this with five times more code, including the
> > per-response limits, where you need to loop over all pages until you
> > have the full list (e.g. the API is limited to 100 results). Not
> > impossible, but a lot of re-implementation.
>
> Just wondering, why not vanilla task mapping?
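>
> (For context, a minimal sketch of what I mean by vanilla task mapping,
> with made-up task names:)
>
>     from airflow.decorators import dag, task
>
>     @dag(schedule=None)
>     def fan_out():
>         @task
>         def get_items():
>             # each returned element becomes one mapped task instance
>             return [1, 2, 3]
>
>         @task
>         def double(item):
>             return item * 2
>
>         double.expand(item=get_items())
>
>     fan_out()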
>
> > Might be something that could be a potential contribution to "airflow
> > db clean".
>
> Maybe, yes.
>
> Thanks & Regards,
> Amogh Desai
>
>
> On Thu, Nov 6, 2025 at 12:53 PM Amogh Desai <[email protected]> wrote:
>
>> > I think our efforts should be way more focused on adding some missing
>> > API calls to the Task SDK that our users miss, rather than on allowing
>> > them to use "old ways". Every time someone says "I cannot migrate
>> > because I did this", our first thought should be:
>> >
>> > * is it a valid way?
>> > * is it acceptable to have an API call for it in the SDK?
>> > * should we do it?
>>
>>
>> That is currently a grey zone we need to define better, I think. Certain
>> use cases might be general enough that we need an execution API endpoint
>> for them, and we can certainly add one. But there will also be cases
>> where the use case is niche and we will NOT want to add execution API
>> endpoints for it, for various reasons. The harder problem to solve is
>> the latter.
>>
>> But you make a fair point here.
>>
>>
>>
>> Thanks & Regards,
>> Amogh Desai
>>
>>
>> On Thu, Nov 6, 2025 at 2:33 AM Jens Scheffler <[email protected]>
>> wrote:
>>
>>> > Thanks for your comments too, Jens.
>>> >
>>> >>    * Aggregating the status of upstream tasks in the same Dag
>>> >>      (pass, fail, listing)
>>> >>
>>> >> Does the DAG run page not show that?
>>> Partly yes, but in our environment it is a bit more complex than
>>> "pass/fail". The longer story: we want to know more details about the
>>> failed tasks and aggregate those details. At a high level: get the XCom
>>> from the failed tasks and then aggregate the details. Imagine all tasks
>>> have an owner and we want to send a notification to each owner, but if
>>> 10 tasks from one owner fail we want to send 1 notification listing all
>>> 10 failures in the text. And, yes, this can be done via the API.
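>>>
>>> (Roughly the aggregation step, sketched in plain Python; the failure
>>> list would be collected via the API and send_notification is a
>>> placeholder:)
>>>
>>>     from collections import defaultdict
>>>
>>>     def notify_owners(failures, send_notification):
>>>         """failures: list of (owner, task_id) tuples for failed tasks."""
>>>         by_owner = defaultdict(list)
>>>         for owner, task_id in failures:
>>>             by_owner[owner].append(task_id)
>>>         for owner, task_ids in by_owner.items():
>>>             # one message per owner, however many of their tasks failed
>>>             send_notification(
>>>                 owner, f"{len(task_ids)} failed: {', '.join(task_ids)}"
>>>             )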
>>> >>    * Custom mass-triggering of other Dags and collecting results from
>>> >>      the triggered Dags, as a scale-out option for dynamic task mapping
>>> >>
>>> >> Can't an API do that?
>>> Yes, the API could do this with five times more code, including the
>>> per-response limits, where you need to loop over all pages until you
>>> have the full list (e.g. the API is limited to 100 results). Not
>>> impossible, but a lot of re-implementation.
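>>>
>>> (The kind of loop I mean, sketched against the stable REST API with
>>> requests; the exact path differs between 2.x and 3.x:)
>>>
>>>     import requests
>>>
>>>     def list_all_dag_runs(base_url, dag_id, headers, page_size=100):
>>>         runs, offset = [], 0
>>>         while True:
>>>             resp = requests.get(
>>>                 f"{base_url}/api/v1/dags/{dag_id}/dagRuns",
>>>                 params={"limit": page_size, "offset": offset},
>>>                 headers=headers,
>>>             )
>>>             resp.raise_for_status()
>>>             batch = resp.json()["dag_runs"]
>>>             runs.extend(batch)
>>>             # the server caps each page, so keep paging until a short page
>>>             if len(batch) < page_size:
>>>                 return runs
>>>             offset += page_size
>>>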
>>> >>    * And the famous one: partial database clean at a per-Dag level
>>> >>      with different retention
>>> >>
>>> >> Can you elaborate this one a bit :D
>>>
>>> Yes. We have one Dag that is called 50k-100k times per day and others
>>> that are called 12 times a day, and a lot of others in between, like
>>> 25k runs per month. For the Dag with 100k runs per day we want to
>>> archive runs ASAP, probably after 3 days for all non-failed runs, to
>>> reduce DB overhead. The failed ones we keep for 14 days for potential
>>> re-processing in case there was an outage.
>>>
>>> Most other Dag runs we keep for a month. And for some we put a cap on
>>> it: we archive once there are more than 25k runs.
>>>
>>> Might be something that could be a potential contribution to "airflow
>>> db clean".
>>>
>>> >>
>>> >> Thanks & Regards,
>>> >> Amogh Desai
>>> >>
>>> >>
>>> >> On Wed, Nov 5, 2025 at 3:12 AM Jens Scheffler <[email protected]>
>>> >> wrote:
>>> >>
>>> >> Thanks Amogh for adding docs with migration hints.
>>> >>
>>> >> We actually carry a lot of integrations that were built in the past,
>>> >> which now makes it a hard and serious effort to migrate to version 3.
>>> >> So most probably we ourselves need to take option 2, knowing (as in
>>> >> the past) that you cannot ask for support. But at least this
>>> >> un-blocks us from being stuck on 2.x.
>>> >>
>>> >> I'd love to take route 1 as well, but then a lot of code needs to be
>>> >> rewritten. This will take time, and in the mid term we will migrate
>>> >> to (1).
>>> >>
>>> >> As said in the dev call, I'd love it if in Airflow 3.2 we could have
>>> >> option 1 supported out of the box - knowing that some security
>>> >> discussion is implied, so it may need to be turned on explicitly and
>>> >> not be enabled by default.
>>> >>
>>> >> The use cases we have which require some kind of DB access, and
>>> >> where the Task SDK currently does not help:
>>> >>
>>> >>    * Adding task and Dag run notes as a more readable status during
>>> >>      and after execution
>>> >>    * Aggregating the status of upstream tasks in the same Dag
>>> >>      (pass, fail, listing)
>>> >>    * Custom mass-triggering of other Dags and collecting results from
>>> >>      the triggered Dags, as a scale-out option for dynamic task mapping
>>> >>    * Adjusting Pools based on available workers (sketched below)
>>> >>    * Checking pass/fail results per Edge worker and, depending on
>>> >>      stability, adjusting Queues on Edge workers based on worker
>>> >>      status and errors
>>> >>    * Adjusting Pools based on time of day
>>> >>    * And the famous one: partial database clean at a per-Dag level
>>> >>      with different retention
>>> >>
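>>> >> (For the Pool cases, roughly this kind of call from a maintenance
>>> >> Dag; host, pool name, and auth are placeholders:)
>>> >>
>>> >>     import requests
>>> >>
>>> >>     # resize a pool through the stable REST API instead of the DB
>>> >>     requests.patch(
>>> >>         "http://localhost:8080/api/v1/pools/default_pool",
>>> >>         json={"name": "default_pool", "slots": 64},
>>> >>         headers={"Authorization": "Bearer <token>"},
>>> >>     ).raise_for_status()
>>> >>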
>>> >> I would be okay with removing option 3, and a clear warning on
>>> >> option 2 is also okay.
>>> >>
>>> >> Jens
>>> >>
>>> >> On 11/4/25 13:06, Jarek Potiuk wrote:
>>> >>> My take (and details can be found in the discussion):
>>> >>>
>>> >>> 2. Don't give the impression it is something that we will support -
>>> >>> explain to the users that it **WILL** break in the future and it's
>>> >>> on **THEM** to fix it when it breaks.
>>> >>>
>>> >>> Option 2 is **kinda** possible, but we should strongly discourage it
>>> >>> and say "this will break at any time and it's you who has to adapt
>>> >>> to any future changes in the schema" - we had a lot of similar cases
>>> >>> in the past where our users felt entitled to get **something** they
>>> >>> saw as a "valid way of using things" broken by our changes. If we
>>> >>> say "recommended", they will take it as "all the usage there is
>>> >>> expected to keep working when Airflow gets a new version, so I
>>> >>> should be fully entitled to open a valid issue when things change".
>>> >>> I think "recommended" in this case is far too strong from our side.
>>> >>>
>>> >>> 3. Absolutely remove.
>>> >>>
>>> >>> Sounds like we are going back to Airflow 2 behaviour - and we've
>>> >>> made all that effort to break out of it. Various things will start
>>> >>> breaking in Airflow 3.2 and beyond. Once we complete the task
>>> >>> isolation work, Airflow workers will NOT have the sqlalchemy package
>>> >>> installed by default - it simply will not be a task-sdk dependency.
>>> >>> The fact that you **can** use sqlalchemy now is mostly a by-product
>>> >>> of the fact that we have not completed the split yet - but it was
>>> >>> not even **SUPPOSED** to work.
>>> >>>
>>> >>> J.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Tue, Nov 4, 2025 at 10:03 AM Amogh Desai <[email protected]>
>>> >>> wrote:
>>> >>>> Hi All,
>>> >>>>
>>> >>>> I'm working on expanding the Airflow 3 upgrade documentation to
>>> >>>> address a frequently asked question from users migrating from
>>> >>>> Airflow 2.x: "How do I access the metadata database from my tasks
>>> >>>> now that direct database access is blocked?"
>>> >>>>
>>> >>>> Currently, Step 5 of the upgrade guide [1] only mentions that
>>> >>>> direct DB access is blocked and points to a GitHub issue. However,
>>> >>>> users need concrete guidance on migration options.
>>> >>>>
>>> >>>> I've drafted documentation in [2] describing three approaches, but
>>> >>>> before finalising it, I'd like to get community consensus on how we
>>> >>>> should present these options, especially given the architectural
>>> >>>> principles we've established with Airflow 3.
>>> >>>>
>>> >>>> ## Proposed Approaches
>>> >>>>
>>> >>>> Approach 1: Airflow Python Client (REST API)
>>> >>>> - Uses `apache-airflow-client` [3] to interact via the REST API
>>> >>>> - Pros: No DB drivers needed, aligned with the Airflow 3
>>> >>>>   architecture, API-first
>>> >>>> - Cons: Requires package installation, API server dependency, auth
>>> >>>>   token management, limited operations possible
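>>> >>>>
>>> >>>> (A rough sketch with the Python client; host and credentials are
>>> >>>> placeholders, and auth setup depends on your API configuration:)
>>> >>>>
>>> >>>>     import airflow_client.client as client
>>> >>>>     from airflow_client.client.api import dag_run_api
>>> >>>>
>>> >>>>     configuration = client.Configuration(
>>> >>>>         host="http://localhost:8080/api/v1",
>>> >>>>         username="user",
>>> >>>>         password="pass",
>>> >>>>     )
>>> >>>>     with client.ApiClient(configuration) as api_client:
>>> >>>>         # list runs via the API server, no DB drivers involved
>>> >>>>         runs = dag_run_api.DAGRunApi(api_client).get_dag_runs(
>>> >>>>             dag_id="my_dag", limit=100
>>> >>>>         )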
>>> >>>>
>>> >>>> Approach 2: Database Hooks (PostgresHook/MySqlHook)
>>> >>>> - Create a connection to the metadata DB and use DB hooks to
>>> >>>>   execute SQL directly
>>> >>>> - Pros: Uses Airflow connection management, simple SQL interface
>>> >>>> - Cons: Requires DB drivers, direct network access; bypasses the
>>> >>>>   Airflow API server and connects to the DB directly
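>>> >>>>
>>> >>>> (A sketch of Approach 2; "airflow_metadata_db" is a connection you
>>> >>>> would create yourself, pointing at the metadata DB:)
>>> >>>>
>>> >>>>     from airflow.providers.postgres.hooks.postgres import PostgresHook
>>> >>>>
>>> >>>>     hook = PostgresHook(postgres_conn_id="airflow_metadata_db")
>>> >>>>     # the schema is NOT a public interface; a query like this can
>>> >>>>     # break on any upgrade
>>> >>>>     rows = hook.get_records(
>>> >>>>         "SELECT dag_id, state, count(*) FROM dag_run GROUP BY dag_id, state"
>>> >>>>     )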
>>> >>>>
>>> >>>> Approach 3: Direct SQLAlchemy Access (last resort)
>>> >>>> - Use an environment variable with the DB connection string and
>>> >>>>   create a SQLAlchemy session directly
>>> >>>> - Pros: Maximum flexibility
>>> >>>> - Cons: Bypasses all Airflow protections, schema coupling, manual
>>> >>>>   connection management; the worst possible option
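>>> >>>>
>>> >>>> (And the last-resort shape of Approach 3; the env var name here is
>>> >>>> just an example you would manage yourself:)
>>> >>>>
>>> >>>>     import os
>>> >>>>     from sqlalchemy import create_engine, text
>>> >>>>
>>> >>>>     engine = create_engine(os.environ["AIRFLOW_METADATA_SQL_CONN"])
>>> >>>>     with engine.connect() as conn:
>>> >>>>         # bypasses every Airflow protection; schema may change any time
>>> >>>>         count = conn.execute(
>>> >>>>             text("SELECT count(*) FROM task_instance")
>>> >>>>         ).scalar()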
>>> >>>>
>>> >>>> I was expecting some pushback regarding these approaches, and
>>> >>>> there were (rightly) some important concerns raised by Jarek about
>>> >>>> Approaches 2 and 3:
>>> >>>>
>>> >>>> 1. Breaks Task Isolation - Contradicts Airflow 3's core promise
>>> >>>> 2. DB as Public Interface - Schema changes would require release
>>> >>>>    notes and would break user code
>>> >>>> 3. Performance Impact - Approach 2 creates direct DB access and can
>>> >>>>    bring back Airflow 2's connection-per-task overhead
>>> >>>> 4. Security Model Violation - Contradicts documented isolation
>>> >>>>    principles
>>> >>>>
>>> >>>> Considering these comments, this is what I want to document now:
>>> >>>>
>>> >>>> 1. Approach 1 - Keep as the primary/recommended solution (aligns
>>> >>>>    with the Airflow 3 architecture)
>>> >>>> 2. Approach 2 - Present as a "known workaround" (not a
>>> >>>>    recommendation) with explicit warnings about breaking isolation,
>>> >>>>    the schema not being a public API, the performance implications,
>>> >>>>    and no support guarantees
>>> >>>> 3. Approach 3 - Remove entirely, or keep with the strongest
>>> >>>>    possible warnings (I would love to hear what others think about
>>> >>>>    this one in particular)
>>> >>>>
>>> >>>> Once we converge on this discussion, I would like to call for a
>>> >>>> lazy consensus, for posterity and for visibility in the community.
>>> >>>>
>>> >>>> Looking forward to your feedback!
>>> >>>>
>>> >>>> [1] https://github.com/apache/airflow/blob/main/airflow-core/docs/installation/upgrading_to_airflow3.rst#step-5-review-custom-operators-for-direct-db-access
>>> >>>> [2] https://github.com/apache/airflow/pull/57479
>>> >>>> [3] https://github.com/apache/airflow-client-python
>>> >>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
