Hi All,

I'm working on expanding the Airflow 3 upgrade documentation to address a
frequently asked question from users migrating from Airflow 2.x: "How do I
access the metadata database from my tasks now that direct database access
is blocked?"

Currently, Step 5 of the upgrade guide[1] only mentions that direct DB
access is blocked and points to a GitHub issue.
However, users need concrete guidance on migration options.

I've drafted documentation in [2] describing three approaches, but before
finalising it I'd like to get community consensus on how we should present
these options, especially given the architectural principles we've
established with Airflow 3.

## Proposed Approaches

Approach 1: Airflow Python Client (REST API)
- Uses `apache-airflow-client` [3] to interact via REST API
- Pros: No DB drivers needed, aligned with Airflow 3 architecture, API-first
- Cons: Requires package installation, depends on the API server, needs auth
token management, supports only a limited set of operations
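
To make Approach 1 concrete, here is a minimal sketch of a task helper that
tallies DAG run states through the REST API. It assumes `apache-airflow-client`
is installed where the task runs and that the API host and a valid token are
exposed as `AIRFLOW_API_HOST` and `AIRFLOW_API_TOKEN` (both variable names are
hypothetical). The class and method names reflect my understanding of the
Airflow 3 client and should be checked against the installed version.

```python
import os
from collections import Counter


def count_by_state(states):
    """Tally run states, e.g. ["success", "failed", "success"]."""
    return dict(Counter(states))


def dag_run_states(dag_id, limit=25):
    """Fetch the states of up to `limit` runs of `dag_id` through the REST API."""
    # Imported lazily: apache-airflow-client must be installed for the task.
    import airflow_client.client

    # AIRFLOW_API_HOST and AIRFLOW_API_TOKEN are hypothetical variable names.
    configuration = airflow_client.client.Configuration(
        host=os.environ["AIRFLOW_API_HOST"]
    )
    configuration.access_token = os.environ["AIRFLOW_API_TOKEN"]

    with airflow_client.client.ApiClient(configuration) as api_client:
        api = airflow_client.client.DagRunApi(api_client)
        response = api.get_dag_runs(dag_id, limit=limit)

    # `state` may be an enum member or a plain string depending on the version.
    return [getattr(run.state, "value", run.state) for run in response.dag_runs]
```

Inside a task this would be called as `count_by_state(dag_run_states("my_dag"))`,
where `my_dag` is a placeholder DAG id. No DB driver or DB network route is
involved; the task only needs to reach the API server.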

Approach 2: Database Hooks (PostgresHook/MySqlHook)
- Create a connection to the metadata DB and use DB hooks to execute SQL
directly
- Pros: Uses Airflow connection management, simple SQL interface
- Cons: Requires DB drivers and direct network access, bypasses the Airflow
API server by connecting to the DB directly
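
For comparison, a rough sketch of what Approach 2 looks like in a task,
assuming a Postgres metadata DB and an Airflow connection with the
hypothetical id `airflow_metadata_db`. The query reads the internal `dag_run`
table, which is exactly the schema coupling discussed further down.

```python
# Hypothetical connection id; the connection must point at the metadata DB.
METADATA_CONN_ID = "airflow_metadata_db"

# `dag_run` is an internal table, not a public API; it can change between releases.
RUNS_BY_STATE_SQL = """
    SELECT state, COUNT(*)
    FROM dag_run
    WHERE dag_id = %(dag_id)s
    GROUP BY state
"""


def count_runs_by_state(hook, dag_id):
    """Run the query through any DB-API hook that exposes get_records()."""
    rows = hook.get_records(RUNS_BY_STATE_SQL, parameters={"dag_id": dag_id})
    return {state: count for state, count in rows}


def count_runs_task(dag_id):
    """Task body: build the hook inside the task, then run the query."""
    # Imported lazily: needs apache-airflow-providers-postgres and a DB driver.
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    hook = PostgresHook(postgres_conn_id=METADATA_CONN_ID)
    return count_runs_by_state(hook, dag_id)
```

Every task that calls `count_runs_task` opens its own connection to the
metadata DB, so the worker needs both the driver and a network route to it.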

Approach 3: Direct SQLAlchemy Access (last resort)
- Read the DB connection string from an environment variable and create a
SQLAlchemy session directly
- Pros: Maximum flexibility
- Cons: Bypasses all Airflow protections, couples user code to the schema,
requires manual connection management; the worst of the three options
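
Purely to illustrate what Approach 3 involves (not as a recommendation), here
is a sketch that assumes the connection string is exposed in a hypothetical
`METADATA_DB_URL` environment variable and that SQLAlchemy 1.4+ and a DB
driver are installed for the task.

```python
import os

# Hypothetical variable name holding a SQLAlchemy URL for the metadata DB.
DB_URL_ENV_VAR = "METADATA_DB_URL"


def metadata_db_url(environ=os.environ):
    """Return the connection string, failing loudly if it is not configured."""
    try:
        return environ[DB_URL_ENV_VAR]
    except KeyError:
        raise RuntimeError(f"{DB_URL_ENV_VAR} is not set") from None


def count_runs_by_state(dag_id):
    """Query the internal `dag_run` table with a hand-managed engine and session."""
    # Imported lazily: SQLAlchemy and a DB driver must be installed for the task.
    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import Session

    engine = create_engine(metadata_db_url())
    try:
        with Session(engine) as session:
            rows = session.execute(
                text(
                    "SELECT state, COUNT(*) FROM dag_run "
                    "WHERE dag_id = :dag_id GROUP BY state"
                ),
                {"dag_id": dag_id},
            ).all()
    finally:
        # Connection lifecycle is entirely the task author's responsibility here.
        engine.dispose()
    return {state: count for state, count in rows}
```

Nothing in this path goes through Airflow: credentials, pooling, and schema
compatibility all become the user's problem.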

I expected some pushback on these approaches, and Jarek (rightly) raised
some important concerns about Approaches 2 and 3:

1. Breaks Task Isolation - Contradicts Airflow 3's core promise
2. DB as Public Interface - Schema changes would require release notes and
break user code
3. Performance Impact - Approach 2 opens direct DB connections from tasks
and can bring back Airflow 2's connection-per-task overhead
4. Security Model Violation - Contradicts documented isolation principles

Considering these comments, this is what I now propose to document:

1. Approach 1 - Keep as primary/recommended solution (aligns with Airflow 3
architecture)
2. Approach 2 - Present as "known workaround" (not recommendation) with
explicit warnings
about breaking isolation, schema not being public API, performance
implications, and no support guarantees
3. Approach 3 - Remove entirely, or keep with the strongest possible warnings
(I would particularly love to hear what others think about this one)

Once the discussion settles, I would like to call for a lazy consensus, both
for the record and for visibility to the community.

Looking forward to your feedback!

[1] https://github.com/apache/airflow/blob/main/airflow-core/docs/installation/upgrading_to_airflow3.rst#step-5-review-custom-operators-for-direct-db-access
[2] https://github.com/apache/airflow/pull/57479
[3] https://github.com/apache/airflow-client-python
