antonlin1 opened a new issue, #66493:
URL: https://github.com/apache/airflow/issues/66493

   ### Apache Airflow version
   
   3.2.0 (introduced in commit b3306f15cd, "AIP-84: Add JWT token revokation 
for logout invalidation", PR #61339 / #47952)
   
   ### What happened?
   
   Every authenticated API request now performs a synchronous DB query inside 
the FastAPI auth dependency:
   
   ```python
   # airflow-core/src/airflow/api_fastapi/auth/managers/base_auth_manager.py:153
   if (jti := payload.get("jti")) and RevokedToken.is_revoked(jti):
       raise InvalidTokenError("Token has been revoked")
   ```
   
   `RevokedToken.is_revoked` 
([revoked_token.py:58-61](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/revoked_token.py#L58-L61))
 runs `session.scalar(...)` via `@provide_session`, holding a SQLAlchemy 
connection per in-flight request. With the default pool of `5+10=15` shared 
across api-server, scheduler, dag-processor, and triggerer, modest concurrent 
load (UI multi-endpoint polling, fan-out DAGs) exhausts the pool and request 
handlers time out in `QueuePool._do_get` after 30s.
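
   The exhaustion mechanism can be illustrated with a toy model. This is a
sketch, not SQLAlchemy's `QueuePool`: a semaphore stands in for the pool, and
`ToyPool`, `simulate`, and all numbers are hypothetical names and values
chosen for illustration only.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ToyPool:
    """Stand-in for a bounded connection pool (not SQLAlchemy's QueuePool)."""

    def __init__(self, capacity: int, timeout: float) -> None:
        self._slots = threading.BoundedSemaphore(capacity)
        self._timeout = timeout

    def run(self, hold_seconds: float) -> str:
        # Wait up to `timeout` for a free slot, like a pool checkout.
        if not self._slots.acquire(timeout=self._timeout):
            return "timeout"
        try:
            time.sleep(hold_seconds)  # slot is held while the "query" runs
            return "ok"
        finally:
            self._slots.release()


def simulate(capacity: int, requests: int, hold: float, timeout: float) -> dict[str, int]:
    """Fire `requests` concurrent checkouts and count the outcomes."""
    pool = ToyPool(capacity, timeout)
    with ThreadPoolExecutor(max_workers=requests) as executor:
        outcomes = list(executor.map(lambda _: pool.run(hold), range(requests)))
    return {"ok": outcomes.count("ok"), "timeout": outcomes.count("timeout")}


if __name__ == "__main__":
    # More concurrent holders than slots, each holding longer than the
    # checkout timeout: the surplus requests give up instead of queueing.
    print(simulate(capacity=2, requests=6, hold=0.5, timeout=0.1))
```

   In this model, once concurrent holders exceed capacity and hold time
exceeds the checkout timeout, every surplus request fails, which matches the
shape of the `QueuePool` timeouts in the stacktrace.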
   
   The UI freezes once a few task instances start running because every poll 
request blocks on connection checkout. Stacktrace from a stock 3.2.0 standalone:
   
   ```
   File "airflow/api_fastapi/auth/managers/base_auth_manager.py", line 153, in get_user_from_token
       if (jti := payload.get("jti")) and RevokedToken.is_revoked(jti):
   File "airflow/utils/session.py", line 100, in wrapper
       return func(*args, session=session, **kwargs)
   File "airflow/models/revoked_token.py", line 61, in is_revoked
       return bool(session.scalar(select(exists().where(cls.jti == jti))))
   ...
   File "sqlalchemy/pool/impl.py", line 166, in _do_get
       raise exc.TimeoutError(
   sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00
   ```
   
   `is_revoked` was the bottom of the failing stack on every endpoint we hit: 
`/ui/config`, `/ui/backfills`, `/api/v2/dags/.../details`, 
`/api/v2/dags/.../dagRuns/...`. The `duration_us` values are multiples of 30s 
(60s, 90s, 120s, 150s) because FastAPI resolves the auth dependency multiple 
times in the same handler, and each checkout times out at 30s independently.
   
   ### What you think should happen instead?
   
   A cache in front of `is_revoked` would hit ≈100% of the time in practice, 
because revocation only happens on explicit logout. The check should not 
require a DB roundtrip on every request. An in-process TTL cache (with bounded 
staleness across uvicorn workers) collapses the per-request DB roundtrip into 
a near-free in-memory lookup.
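
   A minimal stdlib-only sketch of such a cache follows. It is an assumption
about the shape of a fix, not the proposed patch: `RevocationCache` and its
parameters are hypothetical, and the DB-backed check is injected as a plain
callable so the sketch runs without Airflow.

```python
import threading
import time
from typing import Callable


class RevocationCache:
    """Caches per-jti revocation lookups for `ttl` seconds (hypothetical helper)."""

    def __init__(
        self,
        lookup: Callable[[str], bool],
        ttl: float,
        maxsize: int = 10_000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # the expensive check, e.g. a DB query
        self._ttl = ttl
        self._maxsize = maxsize
        self._clock = clock
        self._entries: dict[str, tuple[float, bool]] = {}  # jti -> (expiry, revoked)
        self._lock = threading.Lock()

    def is_revoked(self, jti: str) -> bool:
        now = self._clock()
        with self._lock:
            entry = self._entries.get(jti)
            if entry is not None and entry[0] > now:
                return entry[1]  # fresh hit: no lookup needed
        # Miss or expired entry: run the lookup outside the lock.
        revoked = self._lookup(jti)
        with self._lock:
            self._entries.pop(jti, None)
            if len(self._entries) >= self._maxsize:
                # Drop expired entries first to make room.
                for key in [k for k, (exp, _) in self._entries.items() if exp <= now]:
                    del self._entries[key]
            if len(self._entries) >= self._maxsize:
                # Still full: evict the oldest inserted entry.
                del self._entries[next(iter(self._entries))]
            self._entries[jti] = (now + self._ttl, revoked)
        return revoked
```

   Wrapping the existing check would then look like
`RevocationCache(RevokedToken.is_revoked, ttl=30)`, where the TTL value is an
assumption. The trade-off is the bounded staleness mentioned above: a token
revoked elsewhere can still be accepted by a worker until its cached entry
expires.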
   
   ### How to reproduce
   
   Stock 3.2.0, default config except for a deliberately small pool (3+2) so 
the failure is deterministic:
   
   ```bash
   pip install apache-airflow==3.2.0
   AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE=3 \
   AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW=2 \
   airflow standalone &
   # Get a JWT (admin password is in
   # ~/airflow/simple_auth_manager_passwords.json.generated):
   JWT=$(curl -sf -X POST http://localhost:8080/auth/token \
     -H 'Content-Type: application/json' \
     -d '{"username":"admin","password":"<PWD>"}' \
     | python3 -c 'import json,sys;print(json.load(sys.stdin)["access_token"])')
   # 60 concurrent requests against an authenticated DB-backed endpoint:
   seq 60 | xargs -P 30 -I {} curl -sS -o /dev/null \
     -w "%{http_code} %{time_total}s\n" \
     -H "Authorization: Bearer $JWT" \
     http://localhost:8080/api/v2/dags
   ```
   
   Observed: ~12% of requests return 500 with `QueuePool TimeoutError`, p50 
latency ~15s, p99 ~30s.
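
   To get the error rate and percentiles in one step, the same load can be
driven from Python. This is a sketch under assumptions: `timed_get`,
`percentile`, `summarize`, and `load_test` are hypothetical helpers, the
percentile uses the nearest-rank method, and the URL and token are the ones
from the shell repro above.

```python
import math
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_get(url: str, token: str, timeout: float = 120.0) -> tuple[int, float]:
    """Return (HTTP status, elapsed seconds) for one authenticated GET."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # 4xx/5xx responses are raised as HTTPError
    # Connection-level failures (URLError, socket timeouts) propagate to the caller.
    return status, time.monotonic() - start


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(samples: list[tuple[int, float]]) -> dict[str, float]:
    """Share of 5xx responses plus p50/p99 latency in seconds."""
    durations = sorted(seconds for _, seconds in samples)
    errors = sum(1 for status, _ in samples if status >= 500)
    return {
        "error_rate": errors / len(samples),
        "p50": percentile(durations, 50),
        "p99": percentile(durations, 99),
    }


def load_test(url: str, token: str, total: int = 60, concurrency: int = 30) -> dict[str, float]:
    """Issue `total` requests with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        samples = list(executor.map(lambda _: timed_get(url, token), range(total)))
    return summarize(samples)
```

   Calling `load_test("http://localhost:8080/api/v2/dags", jwt)` with the
token obtained above mirrors the `xargs -P 30` run.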
   
   ### Operating System
   
   macOS 24.4.0 (also reproduces on Linux per the standard SQLAlchemy/uvicorn 
pool dynamics).
   
   ### Versions of Apache Airflow Providers
   
   N/A — affects core auth path.
   
   ### Deployment
   
   Standalone (issue is independent of deployment topology — affects any 
uvicorn-driven api-server with a shared SQLAlchemy pool).
   
   ### Deployment details
   
   - Default `[database]` pool config (5+10) reproduces with sufficient 
concurrent request volume; deliberately small pool (3+2) makes it deterministic.
   - Backed by SQLite (default standalone) but the failure is at the 
pool-checkout level, not SQLite-specific.
   
   ### Anything else?
   
   - Bisected to commit b3306f15cd (PR #47952 / #61339, "AIP-84: Add JWT token 
revokation for logout invalidation"). Pre-3.2 the auth path did no DB work.
   - Pull request with proposed fix coming next (in-process 
`cachetools.TTLCache` mirroring the existing `DBDagBag` cache pattern in 
`airflow/models/dagbag.py`). Cache hit rate is ≈100% in practice (revocation is 
rare). Local before/after with the fix at the same `pool_size=3, 
max_overflow=2`: 60 requests / 30 concurrent dropped from 31s wall + 12% 
timeouts + 30s p99 → 1.04s wall + 0% timeouts + 0.6s p99.
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
