The GitHub Actions job "Tests" on airflow.git/cache-revoked-token-is-revoked 
has failed.
Run started by GitHub user antonlin1 (triggered by antonlin1).

Head commit for run:
c4a8c1dde8ed0c3614a76df3dd19334fa832e53a / Anton Lin
Cache RevokedToken.is_revoked to avoid per-request DB roundtrip

Since 3.2 (b3306f15cd, "AIP-84: Add JWT token revokation for logout
invalidation"), every authenticated API request runs a synchronous
``RevokedToken.is_revoked(jti)`` DB query inside the FastAPI auth
dependency. The query is dispatched via ``@provide_session`` which checks
out a SQLAlchemy connection per in-flight request. With the default pool
of 15 connections (``pool_size=5`` plus ``max_overflow=10``) shared across
api-server, scheduler, dag-processor, and triggerer, modest concurrent
load (UI multi-endpoint polling, fan-out DAGs) exhausts the pool and
request handlers time out in ``QueuePool._do_get`` after 30 s. Observed
locally as::

    sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
    connection timed out, timeout 30.00
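
The arithmetic behind that error can be sketched with a toy pool. This is
not SQLAlchemy's ``QueuePool``: ``TinyPool`` and ``count_timeouts`` are
hypothetical stand-ins built on a semaphore, assuming only that at most
``pool_size + max_overflow`` connections can be checked out at once and that
a blocked checkout gives up after a timeout.

```python
import threading


class TinyPool:
    """Hypothetical stand-in for a bounded connection pool (not SQLAlchemy)."""

    def __init__(self, pool_size: int = 5, max_overflow: int = 10, timeout: float = 30.0):
        self.capacity = pool_size + max_overflow
        self._slots = threading.Semaphore(self.capacity)
        self._timeout = timeout

    def checkout(self) -> None:
        # Block until a slot frees up, or give up after the timeout.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError(f"pool of {self.capacity} exhausted after {self._timeout} s")

    def checkin(self) -> None:
        self._slots.release()


def count_timeouts(pool: TinyPool, in_flight: int) -> int:
    """Check out `in_flight` connections without returning any; count failures."""
    failures = 0
    for _ in range(in_flight):
        try:
            pool.checkout()
        except TimeoutError:
            failures += 1
    return failures
```

With the default sizes, the sixteenth simultaneous holder is the first one
that has to wait; passing a short ``timeout`` keeps the demonstration quick.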

Cache hit rate on ``is_revoked`` is ≈100% in practice — revocation only
happens on explicit logout. Wrap ``is_revoked`` in a process-local
``cachetools.TTLCache`` (existing dep) guarded by an ``RLock``, mirroring
the ``DBDagBag`` pattern at ``airflow/models/dagbag.py``: cache lookup
first, double-checked locking on miss, ``Stats.cache_hit/cache_miss``
metrics, and a public ``clear_cache()`` for operators. ``revoke()``
populates the local cache on success so the worker that processes a
logout is immediately consistent.
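
A minimal sketch of the lookup path described above, not the code in this
commit. To stay dependency-free it uses a plain dict with expiry timestamps
in place of ``cachetools.TTLCache`` and plain counters in place of ``Stats``;
``RevocationCache``, ``db_lookup``, and ``mark_revoked`` are hypothetical
names, and ``db_lookup`` stands in for the real DB query.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple


class RevocationCache:
    """Process-local TTL cache in front of a revocation lookup (illustrative)."""

    def __init__(
        self,
        db_lookup: Callable[[str], bool],
        ttl: float = 60.0,
        maxsize: int = 10_000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._db_lookup = db_lookup  # hypothetical stand-in for the DB query
        self._ttl = ttl
        self._maxsize = maxsize
        self._clock = clock
        self._lock = threading.RLock()
        self._entries: Dict[str, Tuple[bool, float]] = {}  # jti -> (revoked, expires_at)
        self.hits = 0    # approximate counters; the source uses Stats metrics
        self.misses = 0

    def _fresh(self, jti: str) -> Optional[bool]:
        entry = self._entries.get(jti)
        if entry is not None and entry[1] > self._clock():
            return entry[0]
        return None

    def _store(self, jti: str, revoked: bool) -> None:
        if jti not in self._entries and len(self._entries) >= self._maxsize:
            # Evict the oldest inserted entry (dicts keep insertion order).
            self._entries.pop(next(iter(self._entries)))
        self._entries[jti] = (revoked, self._clock() + self._ttl)

    def is_revoked(self, jti: str) -> bool:
        cached = self._fresh(jti)  # first check, without the lock
        if cached is not None:
            self.hits += 1
            return cached
        with self._lock:
            cached = self._fresh(jti)  # second check, under the lock
            if cached is not None:
                self.hits += 1
                return cached
            self.misses += 1
            revoked = self._db_lookup(jti)
            self._store(jti, revoked)
            return revoked

    def mark_revoked(self, jti: str) -> None:
        """Called after a successful revoke so this worker is consistent at once."""
        with self._lock:
            self._store(jti, True)

    def clear_cache(self) -> None:
        with self._lock:
            self._entries.clear()
```

Holding the lock across the lookup serializes concurrent misses, which is a
simplification made here for brevity; whether the real implementation does
the same is not stated in the message.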

Two new ``[api_auth]`` config keys (``revoked_token_cache_size`` default
10000, ``revoked_token_cache_ttl_seconds`` default 60) make the cache
tunable; setting either to 0 disables caching and reverts to the
per-request DB query behavior.
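
The opt-out rule can be illustrated with the standard library's
``configparser`` rather than Airflow's own config loader. The key names and
defaults come from the paragraph above; ``SAMPLE_CFG``, ``cache_settings``,
and the choice to parse both values as integers are assumptions made for
this sketch.

```python
import configparser
from typing import Tuple

# Example airflow.cfg fragment with the TTL zeroed out (caching disabled).
SAMPLE_CFG = """
[api_auth]
revoked_token_cache_size = 10000
revoked_token_cache_ttl_seconds = 0
"""


def cache_settings(cfg_text: str) -> Tuple[int, int, bool]:
    """Return (size, ttl_seconds, caching_enabled) for the given config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    size = parser.getint("api_auth", "revoked_token_cache_size", fallback=10000)
    ttl = parser.getint("api_auth", "revoked_token_cache_ttl_seconds", fallback=60)
    # Setting either key to 0 disables caching.
    return size, ttl, size > 0 and ttl > 0
```

An empty config falls back to the documented defaults, so caching is on
unless an operator explicitly zeroes one of the two keys.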

Trade-offs:

* uvicorn workers don't share memory, so a logout on worker A is not
  immediately reflected on worker B — worker B serves the cached
  pre-logout result for up to ``revoked_token_cache_ttl_seconds`` seconds.
  Operators needing strict cross-worker logout consistency can reduce or
  zero out the TTL.
* Expired JWTs are rejected by ``avalidated_claims`` (PyJWT ``exp`` check)
  before ``is_revoked`` runs, so cached entries cannot leak past the
  token's natural lifetime.
* ``_maybe_cleanup_expired`` is called BEFORE the cache lookup so the
  periodic TTL sweep keeps running even when most calls are cache hits.
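
The first trade-off can be made concrete with two simulated workers and a
fake clock. Everything here is illustrative: ``Worker`` and
``staleness_demo`` are hypothetical names, a shared ``set`` stands in for the
revoked-token table, and each worker keeps its own dict as its process-local
cache.

```python
from typing import Callable, Dict, Set, Tuple


class Worker:
    """One API worker with its own process-local cache (illustrative only)."""

    def __init__(self, shared_db: Set[str], ttl: float, clock: Callable[[], float]) -> None:
        self._db = shared_db  # stands in for the revoked-token table
        self._ttl = ttl
        self._clock = clock
        self._cache: Dict[str, Tuple[bool, float]] = {}  # jti -> (revoked, expires_at)

    def is_revoked(self, jti: str) -> bool:
        entry = self._cache.get(jti)
        if self._ttl > 0 and entry is not None and entry[1] > self._clock():
            return entry[0]
        revoked = jti in self._db
        if self._ttl > 0:
            self._cache[jti] = (revoked, self._clock() + self._ttl)
        return revoked

    def revoke(self, jti: str) -> None:
        self._db.add(jti)
        if self._ttl > 0:
            # Populate the local cache so this worker is consistent immediately.
            self._cache[jti] = (True, self._clock() + self._ttl)


def staleness_demo(ttl: float) -> Tuple[bool, bool, bool]:
    now = [0.0]
    db: Set[str] = set()
    worker_a = Worker(db, ttl, lambda: now[0])
    worker_b = Worker(db, ttl, lambda: now[0])
    worker_b.is_revoked("jti-1")  # B caches "not revoked" at t=0
    now[0] = 10.0
    worker_a.revoke("jti-1")      # logout handled by A at t=10
    seen_by_a = worker_a.is_revoked("jti-1")
    seen_by_b_early = worker_b.is_revoked("jti-1")
    now[0] = ttl + 1.0            # past the expiry of B's cached entry
    seen_by_b_late = worker_b.is_revoked("jti-1")
    return seen_by_a, seen_by_b_early, seen_by_b_late
```

With a 60 s TTL, worker B keeps answering "not revoked" until its entry
expires; with a TTL of 0 both workers see the logout at once, at the cost of
a DB query per request.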

Local before/after comparison on stock 3.2.0 with ``pool_size=3`` and
``max_overflow=2``, sending 60 requests at concurrency 30 against
``GET /api/v2/dags``:

================  ========  ==========  ============
Metric            STOCK     CACHE_FIX   Improvement
================  ========  ==========  ============
Wall time         31.0 s    1.04 s      ~30x
Success rate      88%       100%        +12pp
Pool timeouts     7/60      0           gone
Latency p50       15.3 s    0.51 s      ~30x
Latency p95       30.5 s    0.57 s      ~53x
Latency p99       30.7 s    0.60 s      ~51x
================  ========  ==========  ============
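
The improvement column can be rechecked from the raw figures. The numbers
below are copied from the table; ``speedups`` and ``success_rate`` are
hypothetical helper names, and the success-rate check assumes the 7 pool
timeouts were the only failed requests.

```python
from typing import Dict

# Seconds, copied from the STOCK and CACHE_FIX columns.
stock = {"wall": 31.0, "p50": 15.3, "p95": 30.5, "p99": 30.7}
fixed = {"wall": 1.04, "p50": 0.51, "p95": 0.57, "p99": 0.60}


def speedups(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Ratio of the stock figure to the cached figure for each metric."""
    return {name: before[name] / after[name] for name in before}


def success_rate(timeouts: int, total: int) -> float:
    """Percentage of requests that did not time out."""
    return 100.0 * (total - timeouts) / total
```

The ratios come out near 30x for wall time and p50 and a little over 50x
for p95 and p99, and 53 of 60 successful requests is roughly 88%.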

Eight new unit tests in ``tests/unit/models/test_revoked_token.py`` cover
both polarities of caching, TTL/size opt-out, ``revoke()`` cache
population, the cleanup-still-runs-on-cache-hit invariant, ``clear_cache``,
and the ``revoke()`` no-op-on-merge-failure path.

Report URL: https://github.com/apache/airflow/actions/runs/25461235521

With regards,
GitHub Actions via GitBox

