antonlin1 opened a new pull request, #66494:
URL: https://github.com/apache/airflow/pull/66494

   Closes #66493
   
   ## What does this change?
   
   Adds a process-local `cachetools.TTLCache` to `RevokedToken.is_revoked`. 
Every authenticated API request currently runs a synchronous 
`session.scalar(...)` to check whether the JWT's `jti` is in the 
`revoked_token` table — once the SQLAlchemy pool saturates, request handlers 
stall in `QueuePool._do_get` for the full `pool_timeout` (30 s by default) 
before failing 500.
   
   Cache hit rate on `is_revoked` is ≈100% in practice (revocation only happens 
on explicit logout), so the cache collapses the per-request DB roundtrip into a 
near-free in-memory lookup.
   
   The implementation mirrors the existing `DBDagBag` cache pattern at 
`airflow/models/dagbag.py` line-for-line: `TTLCache` + `RLock`, double-checked 
locking on miss, `Stats.cache_hit/cache_miss/cache_size` metrics, and a public 
`clear_cache()` for operators. `revoke()` populates the local cache on success 
so the worker that processes a logout is immediately consistent.
   
   Two new `[api_auth]` config keys make the cache tunable:
   - `revoked_token_cache_size` (default 10000) — bounded LRU+TTL cache size
   - `revoked_token_cache_ttl_seconds` (default 60) — entry lifetime
   
   Setting either to 0 disables caching and reverts to the per-request DB query 
behavior.
   
   ## Why?
   
   3.2 introduced AIP-84 (commit b3306f15cd, PR #47952 / #61339) which added a 
per-request DB query to the auth path. Under modest concurrent load — UI 
multi-endpoint polling, fan-out DAG runs, or just a few users — the default 
pool of `5+10=15` (shared across api-server, scheduler, dag-processor, and 
triggerer) exhausts and the API server stops serving requests. See #66493 for 
the full repro and stacktrace.
   
   ## Trade-offs
   
   - **Worker-local cache**. uvicorn workers don't share memory, so a logout on 
worker A is not immediately reflected on worker B — worker B serves the cached 
pre-logout result for up to `revoked_token_cache_ttl_seconds` seconds. 
Operators needing strict cross-worker logout consistency can lower or zero out 
the TTL.
   - **Cached `True` cannot leak past JWT expiry**. 
`JWTValidator.avalidated_claims` (`auth/tokens.py:318-339`) enforces `exp` via 
PyJWT before `is_revoked` runs, so an expired token is rejected upstream 
regardless of cache state.
   - **`_maybe_cleanup_expired` is called BEFORE the cache lookup** so the 
periodic TTL sweep keeps running even when most calls are cache hits — a future 
refactor that short-circuits earlier would silently break this and a regression 
test (`test_cleanup_runs_even_on_cache_hit`) locks the ordering.
   - **Cold-cache thundering herd** is bounded by worker concurrency and 
self-corrects after the first DB populate; the implementation does not add 
`singleflight` coalescing (same pragmatic call as `DBDagBag._get_dag`).
   
   ## Local before/after
   
   Stock 3.2.0 vs this PR, identical `pool_size=3, max_overflow=2`, 60 requests 
at concurrency 30 against `GET /api/v2/dags`:
   
   | Metric | STOCK 3.2.0 | This PR | improvement |
   |---|---|---|---|
   | Wall time | 31.0 s | 1.04 s | ~30× |
   | Success rate | 53/60 (88%) | 60/60 (100%) | +12pp |
   | Pool timeouts (≥25 s + non-200) | 7/60 | 0 | gone |
   | Latency p50 | 15.3 s | 0.51 s | ~30× |
   | Latency p95 | 30.5 s | 0.57 s | ~53× |
   | Latency p99 | 30.7 s | 0.60 s | ~51× |
   
   ## Tests
   
   `airflow-core/tests/unit/models/test_revoked_token.py` adds 8 new tests:
   
   - `test_caches_negative_result` / `test_caches_positive_result` — both 
polarities cached
   - `test_revoke_populates_cache` — `revoke()` makes the local worker 
immediately consistent
   - `test_cache_disabled_when_size_is_zero` / 
`test_cache_disabled_when_ttl_is_zero` — opt-out paths
   - `test_cleanup_runs_even_on_cache_hit` — locks in the cleanup-ordering 
invariant
   - `test_clear_cache_returns_count_and_empties` / 
`test_clear_cache_is_noop_when_caching_disabled`
   - `test_revoke_does_not_cache_on_db_failure` — cache stays clean if 
`session.merge` raises
   - `test_cache_lazy_init_is_idempotent`
   
   Plus an autouse fixture that resets `_cache`, `_cache_initialized`, and 
`_last_cleanup_time` so existing tests don't leak class-level state through.
   
   The 4 pre-existing tests in `TestRevokedTokenModel` and 
`TestRevokedTokenCleanup` are preserved with their assertions intact (just 
patched to point conf.getint at cache-disabled values for hermetic behavior).
   
   ---
   
   ^ Add a description of your Pull Request here. Try to keep it as short as 
possible while still describing all relevant changes.
   ^ Provide enough information about your change. Don't make reviewers work 
too much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to