acelyc111 commented on PR #2403:
URL: 
https://github.com/apache/incubator-pegasus/pull/2403#issuecomment-4614325338

   ## Update on the Test ASAN/Release failures
   
   After looking deeper, the `defaults.run.shell: bash` fix did **not** resolve 
the ZK startup failure. Logs confirm the step now runs under `bash --noprofile 
--norc -e -o pipefail {0}` (matching what works elsewhere), but ZK still exits 
~1 second after launch with `Starting zookeeper ... FAILED TO START`. So 
`shell` is not the root cause.
   
   ### Key finding: this is a pre-existing v2.5 problem, not a regression from this PR
   
   I'd been mentally comparing this PR against the most recent successful `Cpp 
CI` run on what I thought was a v2.5 PR. Re-checking: **PR #2387 
(`limowang/fix/disk_abnormal`, run 
[`24069709672`](https://github.com/apache/incubator-pegasus/actions/runs/24069709672))
 is actually master-base, not v2.5-base.** It pulled 
`apache/pegasus:thirdparties-bin-test-asan-ubuntu2204-master`, not `…-v2.5`.
   
   Querying the `thirdparty-regular-push` workflow's run history directly:
   
   | Branch | Successful image rebuilds | Latest success |
   |---|---|---|
   | `master` | 33 | 2025-11-25 |
   | `build-env-ubuntu-cmake-3` | 1 | 2025-05-06 |
   | `build-env-ubuntu-sasl2-modules` | 1 | 2025-09-16 |
   | **`v2.5`** | **0** | **(never)** |
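
The comment does not say how the run history was queried. One way to get per-branch counts like those in the table is the GitHub CLI, sketched below. The function names, the `--limit 500` cap, and passing the workflow as a name or file name are assumptions made for illustration, not details from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the PR.

# Count lines on stdin that are exactly "success".
count_successes() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
  grep -c '^success$' || true
}

# Print one conclusion per line for a workflow's runs on one branch.
# $1: workflow name or file name (assumed), $2: branch.
list_run_conclusions() {
  gh run list \
    --repo apache/incubator-pegasus \
    --workflow "$1" \
    --branch "$2" \
    --limit 500 \
    --json conclusion \
    --jq '.[].conclusion'
}

# Example (needs an authenticated gh and network access):
#   list_run_conclusions thirdparty-regular-push v2.5 | count_successes
```

Keeping the counting step separate from the `gh` call means it can be checked against canned input without touching the network.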
   
   v2.5's only run of that workflow was a `push`-triggered run on 2026-04-09 
that **ended in `startup_failure` before any job started** (almost certainly 
the same ASF allow-list block this PR exists to fix).
   
   So `apache/pegasus:thirdparties-bin-test-ubuntu2204-v2.5` is a one-shot 
image, presumably hand-built when v2.5 was cut and never refreshed since. 
Whatever is wrong with its JVM / ZK runtime has most likely been wrong for the 
whole life of v2.5; the issue only became visible now because **this is the 
first time `Test ASAN`/`Test Release` jobs have ever made it past the ASF 
allow-list / startup phase on a v2.5 PR.**
   
   ### Concrete evidence the image is the problem and not anything in this PR
   
   - `Build ASAN` and `Build Release` use the **same** image and pass cleanly. 
Those jobs never start a JVM; they only link against it.
   - `Test ASAN/Release` are the only jobs that invoke `./run.sh test`, which 
calls `zkServer.sh start`. The JVM exits ~1 second after printing `Using 
config: ...zoo.cfg`.
   - The PR diff only touches `.github/workflows/*.yml` — there is no code path 
through which it could affect a Docker Hub image.
   
   ### What I just pushed (`945b2bd32`)
   
   A temporary `if: failure()` diagnostics step after each of the two `Unit 
Testing` steps. It dumps `java -version`, ulimit, the ZK install dir, and every 
candidate location of `zookeeper.out`. Once the next run finishes, we'll have 
the JVM's actual stderr, which should narrow the failure down to one of: a 
broken JDK in the image, a glibc/libc++ mismatch, dataDir permissions, or 
stack/mmap being denied by container security.
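
For readers who want to reproduce the diagnostics locally, here is a minimal sketch of what such a step's script body could look like. It is not the content of `945b2bd32`: the function name and the candidate `zookeeper.out` paths are assumptions chosen for illustration, and the real ZK install directory has to be passed in by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the script from the PR.
# Usage: dump_zk_diagnostics <zk_install_dir>
dump_zk_diagnostics() {
  local zk_home="${1:?usage: dump_zk_diagnostics <zk_install_dir>}"

  echo "== java -version =="
  if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr; fold it into stdout.
    java -version 2>&1 || echo "java exited with status $?"
  else
    echo "java not found on PATH"
  fi

  echo "== ulimit -a =="
  ulimit -a

  echo "== ZK install dir: ${zk_home} =="
  ls -la "${zk_home}" 2>&1 || true

  echo "== zookeeper.out candidates =="
  # Candidate locations are guesses; adjust to the actual image layout.
  local candidate
  for candidate in \
    "${zk_home}/zookeeper.out" \
    "${zk_home}/logs/zookeeper.out" \
    "${zk_home}/bin/zookeeper.out" \
    "${PWD}/zookeeper.out"; do
    if [ -f "${candidate}" ]; then
      echo "--- ${candidate} ---"
      cat "${candidate}"
    fi
  done
}
```

Each probe tolerates failure, so a missing `java` binary or an absent log file does not stop the remaining output from being collected.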
   
   **This commit should be reverted before merging this PR.** It exists only to 
diagnose the unrelated v2.5 image issue.
   
   ### Suggested path forward for this PR
   
   The PR's stated goal is to make v2.5 PRs comply with the ASF allow-list and 
surface real CI results instead of `startup_failure`. That goal is achieved: 
Lint, Build ASAN, Build Release, Build with jemalloc, IWYU, Standardization 
Lint, Module Labeler, and Golang Lint/Test all pass. The remaining `Test 
ASAN`/`Test Release` failures are a separate, longstanding v2.5 release-branch 
infrastructure issue: the third-party image needs to be rebuilt before C++ unit 
tests on v2.5 can ever pass.
   
   A few options for how to land this PR without holding it on the image issue:
   
   1. Land it as-is, treating Test ASAN/Test Release red as a known v2.5 issue 
tracked separately.
   2. Land it with the diagnostics commit reverted, and open a follow-up issue 
(or PR) to rebuild `apache/pegasus:thirdparties-bin-test-ubuntu2204-v2.5` once 
the diagnostics output identifies the specific root cause.
   3. If preferred, I can add an `if: false` (or a matrix-skip) on `test_ASAN` 
/ `test_Release` *for v2.5 only*, with a clear comment, so v2.5 PRs don't show 
red until the image is fixed.
   
   Happy to take whichever direction you'd like.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

