potiuk opened a new pull request, #67882:
URL: https://github.com/apache/airflow/pull/67882

   ## Symptom
   
   [run 26764917219, job 78891341528](https://github.com/apache/airflow/actions/runs/26764917219/job/78891341528)
(Compat 3.2.2 / P3.10, Tests ARM): **33 errors**, all 
`pymongo.errors.ServerSelectionTimeoutError: ... Connection refused` at the 
**setup of every `TestMongoHook` test**; 10907 passed.
   
   ## Root cause
   
   The mongo hook tests use a session-scoped `MongoDbContainer` 
(testcontainers). The container started fine early in the run (`Container 
started` at 16:26:51, and the `_wait_for_mongo_ready` ping-gate passed — there 
is no "did not answer ping" in the log), but the mongo module runs much later 
in this ~17-minute compat suite. By then **testcontainers' `ryuk` reaper had 
removed the container** — ryuk reaps spawned containers a short time after the 
controlling connection drops.
   
   Corroboration from the job log:
   
   - ryuk was enabled (`TESTCONTAINERS_RYUK_DISABLED` unset; `Pulling image 
testcontainers/ryuk:0.8.1` / `Container started`).
   - breeze's failure handler tried to dump the mongo container's logs, but 
they were **empty** — the container was already gone.
   
   The existing 3× start-retry + ping-gate cannot help once ryuk removes the 
container mid-suite (and a per-test connection retry wouldn't either — the 
container is gone for the rest of the run).
   
   ## Fix
   
   Set `TESTCONTAINERS_RYUK_DISABLED=true` in 
`providers/mongo/tests/conftest.py` before any `MongoDbContainer` is created — 
**only in CI** (`CI` / `GITHUB_ACTIONS`). The fixture already stops the 
container explicitly in its `finally` block and CI runners are ephemeral, so 
ryuk's auto-reaping is unnecessary there. Local runs keep ryuk enabled so a 
container left by an interrupted test run is still cleaned up.
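
A minimal sketch of what the CI-only guard in `conftest.py` could look like. This is not the PR's actual diff: the helper name `disable_ryuk_in_ci` and the rule that a variable counts as set only when it equals `true` are assumptions made for illustration. Only the variable names `CI`, `GITHUB_ACTIONS`, and `TESTCONTAINERS_RYUK_DISABLED` come from the description above.

```python
import os
from collections.abc import MutableMapping

# Variables that mark a CI run, per the PR description.
_CI_MARKERS = ("CI", "GITHUB_ACTIONS")


def disable_ryuk_in_ci(env: MutableMapping[str, str] = os.environ) -> bool:
    """Disable the testcontainers ryuk reaper, but only when running in CI.

    Hypothetical helper: returns True when a CI marker was found and the
    variable was set, False when the environment was left untouched.
    """
    # Assumption: a marker counts only when its value is "true" (any case).
    in_ci = any(env.get(name, "").lower() == "true" for name in _CI_MARKERS)
    if in_ci:
        env["TESTCONTAINERS_RYUK_DISABLED"] = "true"
    return in_ci


# Called at import time of conftest.py, so it runs before any fixture
# creates a MongoDbContainer.
disable_ryuk_in_ci()
```

Depending on the testcontainers version, the variable may be read when the library is first imported instead of when a container starts, so it is probably safest to run this guard above any `testcontainers` import in the file. Passing the environment as a parameter keeps the helper testable without touching the real process environment.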
   
   Test-infra only (`providers/mongo/tests/conftest.py`). No newsfragment 
(providers don't consume them).
   
   > Note: this flake only reproduces in the long compat / docker-in-docker CI 
run, so it can't be reproduced locally; the fix is grounded in the job-log 
evidence (ryuk enabled + empty dumped container logs + 
container-gone-mid-suite), and disabling ryuk is the documented testcontainers 
remedy when container lifecycle is managed explicitly.
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes — Claude Code (Opus 4.8)
   
   Generated-by: Claude Code (Opus 4.8) following [the 
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)

