akshayjadiyanv commented on PR #38701:
URL: https://github.com/apache/beam/pull/38701#issuecomment-4628241399

   Thanks for pushing the image! The PostCommit ran but didn't give a clean 
read on the Dynamo IT, and I don't think our test is the cause.
   
   The 3.12 job was **cancelled at the 4-hour cap, not failed**, and our Dynamo 
test never actually ran: `vllmTests` runs completion → chat → Dynamo 
sequentially, and only the first job (native `opt-125m`, job 
`2026-06-04_15_32_20-13766057237933605706`) was submitted — it hit `RUNNING` at 
22:32 and never returned before the cap, so the Dynamo exec never started. 
3.11/3.13 passed; 3.10 and 3.14 failed (3.14 was a `libpython3.14` segfault). 
Is PostCommit Python red on master too right now? Looks like it from the recent 
runs.
   
   I did validate the Dynamo path independently first: built an image from the 
updated `vllm.dockerfile.old` in my own GCP project and ran the example on 
Dataflow (T4, `Qwen3-0.6B`, `--use_dynamo`). It finished `JOB_STATE_DONE` with 
every `Completion` carrying an `nvext.timing` field — which only the Dynamo 
frontend emits, so the Dynamo path was definitely exercised. T4 confirmed in 
worker logs.
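
For anyone repeating the check, here is a minimal sketch of the `nvext.timing` assertion. It assumes each completion has already been parsed into a dict with a nested `nvext` → `timing` entry; that shape and the helper names `has_dynamo_timing` and `all_via_dynamo` are hypothetical, not taken from the example pipeline.

```python
from collections.abc import Iterable, Mapping
from typing import Any


def has_dynamo_timing(completion: Mapping[str, Any]) -> bool:
    """Return True if the completion carries an `nvext.timing` entry.

    Assumption: the completion is a parsed JSON object shaped like
    {"nvext": {"timing": ...}}; the real output shape may differ.
    """
    nvext = completion.get("nvext")
    return isinstance(nvext, Mapping) and "timing" in nvext


def all_via_dynamo(completions: Iterable[Mapping[str, Any]]) -> bool:
    """Return True only if there is at least one completion and all carry timing."""
    items = list(completions)
    # An empty result set proves nothing, so treat it as a failure.
    return bool(items) and all(has_dynamo_timing(c) for c in items)
```

Treating an empty result set as a failure keeps a job that produced no output from looking like a pass.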
   
   A couple of **suggestions**, your call:
   - Run/observe just `vllmTests` (or the native `opt-125m` IT) in isolation; 
the hang in the first exec (the native `opt-125m` job) looks pre-existing.
   - Or reorder/split so the Dynamo run isn't starved behind it when the suite 
runs long.
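
To illustrate the second suggestion, a rough sketch of giving each stage its own time budget so one hung stage cannot consume the whole cap. This is not how `vllmTests` is wired today; `run_stages`, `EXAMPLE_STAGES`, and the placeholder commands are hypothetical stand-ins for the real test invocations.

```python
import subprocess
import sys


def run_stages(stages, timeout_s):
    """Run each (name, command) stage in order with its own timeout.

    A stage that exceeds timeout_s is killed and recorded as "timed_out",
    and the remaining stages still run.
    """
    results = {}
    for name, cmd in stages:
        try:
            proc = subprocess.run(cmd, timeout=timeout_s)
            results[name] = "passed" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            results[name] = "timed_out"
    return results


# Placeholder commands only; the real stages would launch the actual ITs.
EXAMPLE_STAGES = [
    ("completion", [sys.executable, "-c", "pass"]),
    ("chat", [sys.executable, "-c", "pass"]),
    ("dynamo", [sys.executable, "-c", "pass"]),
]
```

With a per-stage budget, a hang in the completion stage would show up as `timed_out` for that stage while chat and Dynamo still get their turn.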
   
   The job id above should let you pull the `apache-beam-testing` worker logs 
for why the native job didn't return.
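
In case it saves a step, a small sketch that builds a Cloud Logging filter for that job. The helper name `dataflow_log_filter` and the default severity are my own choices; the filter relies on the `dataflow_step` resource type and its `job_id` label.

```python
def dataflow_log_filter(job_id: str, min_severity: str = "WARNING") -> str:
    """Build a Cloud Logging filter for one Dataflow job's log entries."""
    return (
        'resource.type="dataflow_step" '
        f'AND resource.labels.job_id="{job_id}" '
        f"AND severity>={min_severity}"
    )


# Job id of the native opt-125m run quoted above.
JOB_ID = "2026-06-04_15_32_20-13766057237933605706"
```

The resulting string can be passed to `gcloud logging read` with `--project=apache-beam-testing`, assuming you have read access to that project's logs.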


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
