antonlin1 opened a new pull request, #66494: URL: https://github.com/apache/airflow/pull/66494
Closes #66493 ## What does this change? Adds a process-local `cachetools.TTLCache` to `RevokedToken.is_revoked`. Every authenticated API request currently runs a synchronous `session.scalar(...)` to check whether the JWT's `jti` is in the `revoked_token` table — once the SQLAlchemy pool saturates, request handlers stall in `QueuePool._do_get` for the full `pool_timeout` (30 s by default) before failing 500. Cache hit rate on `is_revoked` is ≈100% in practice (revocation only happens on explicit logout), so the cache collapses the per-request DB roundtrip into a near-free in-memory lookup. The implementation mirrors the existing `DBDagBag` cache pattern at `airflow/models/dagbag.py` line-for-line: `TTLCache` + `RLock`, double-checked locking on miss, `Stats.cache_hit/cache_miss/cache_size` metrics, and a public `clear_cache()` for operators. `revoke()` populates the local cache on success so the worker that processes a logout is immediately consistent. Two new `[api_auth]` config keys make the cache tunable: - `revoked_token_cache_size` (default 10000) — bounded LRU+TTL cache size - `revoked_token_cache_ttl_seconds` (default 60) — entry lifetime Setting either to 0 disables caching and reverts to the per-request DB query behavior. ## Why? 3.2 introduced AIP-84 (commit b3306f15cd, PR #47952 / #61339) which added a per-request DB query to the auth path. Under modest concurrent load — UI multi-endpoint polling, fan-out DAG runs, or just a few users — the default pool of `5+10=15` (shared across api-server, scheduler, dag-processor, and triggerer) exhausts and the API server stops serving requests. See #66493 for the full repro and stacktrace. ## Trade-offs - **Worker-local cache**. uvicorn workers don't share memory, so a logout on worker A is not immediately reflected on worker B — worker B serves the cached pre-logout result for up to `revoked_token_cache_ttl_seconds` seconds. Operators needing strict cross-worker logout consistency can lower or zero out the TTL. - **Cached `True` cannot leak past JWT expiry**. `JWTValidator.avalidated_claims` (`auth/tokens.py:318-339`) enforces `exp` via PyJWT before `is_revoked` runs, so an expired token is rejected upstream regardless of cache state. - **`_maybe_cleanup_expired` is called BEFORE the cache lookup** so the periodic TTL sweep keeps running even when most calls are cache hits — a future refactor that short-circuits earlier would silently break this and a regression test (`test_cleanup_runs_even_on_cache_hit`) locks the ordering. - **Cold-cache thundering herd** is bounded by worker concurrency and self-corrects after the first DB populate; the implementation does not add `singleflight` coalescing (same pragmatic call as `DBDagBag._get_dag`). ## Local before/after Stock 3.2.0 vs this PR, identical `pool_size=3, max_overflow=2`, 60 requests at concurrency 30 against `GET /api/v2/dags`: | Metric | STOCK 3.2.0 | This PR | improvement | |---|---|---|---| | Wall time | 31.0 s | 1.04 s | ~30× | | Success rate | 53/60 (88%) | 60/60 (100%) | +12pp | | Pool timeouts (≥25 s + non-200) | 7/60 | 0 | gone | | Latency p50 | 15.3 s | 0.51 s | ~30× | | Latency p95 | 30.5 s | 0.57 s | ~53× | | Latency p99 | 30.7 s | 0.60 s | ~51× | ## Tests `airflow-core/tests/unit/models/test_revoked_token.py` adds 8 new tests: - `test_caches_negative_result` / `test_caches_positive_result` — both polarities cached - `test_revoke_populates_cache` — `revoke()` makes the local worker immediately consistent - `test_cache_disabled_when_size_is_zero` / `test_cache_disabled_when_ttl_is_zero` — opt-out paths - `test_cleanup_runs_even_on_cache_hit` — locks in the cleanup-ordering invariant - `test_clear_cache_returns_count_and_empties` / `test_clear_cache_is_noop_when_caching_disabled` - `test_revoke_does_not_cache_on_db_failure` — cache stays clean if `session.merge` raises - `test_cache_lazy_init_is_idempotent` Plus an autouse fixture that resets `_cache`, `_cache_initialized`, and `_last_cleanup_time` so existing tests don't leak class-level state through. The 4 pre-existing tests in `TestRevokedTokenModel` and `TestRevokedTokenCleanup` are preserved with their assertions intact (just patched to point conf.getint at cache-disabled values for hermetic behavior). --- ^ Add a description of your Pull Request here. Try to keep it as short as possible while still describing all relevant changes. ^ Provide enough information about your change. Don't make reviewers work too much. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
