markap14 commented on PR #11325:
URL: https://github.com/apache/nifi/pull/11325#issuecomment-4674752084

   [claude-opus-4.7] Rerun attempt 2 of the system-tests run also failed on 
`ubuntu-24.04 Java 21` and `ubuntu-24.04 Java 25` (`macos-15 Java 21` is still 
in progress). Different tests fail on each attempt, which is the classic flake 
signature, but `LoadBalanceIT.testPartitionByAttribute` has now failed in both 
attempts on Java 25.
   
   **`ubuntu-24.04 Java 21` attempt 2** 
([job](https://github.com/apache/nifi/actions/runs/27299783298/job/80657376495))
   - `ClusteredStatelessFlowIT.testUpdateParameterReferencedByStatelessFlow` — 
`listQueue` HTTP 409 / 500 (same `RST_STREAM` family)
   - 
`OffloadContentClaimTruncationIT.testOffloadedFlowFileContentNotPrematurelyTruncated`
 — same symptom
   
   **`ubuntu-24.04 Java 25` attempt 2** 
([job](https://github.com/apache/nifi/actions/runs/27299783298/job/80657376375))
   - `LoadBalanceIT.testPartitionByAttribute` — this time the 409 is on `DELETE 
/listing-requests/{id}`, with message `Node localhost:5671 is currently 
connecting`. The test runs immediately after 
`LoadBalanceIT.testRoundRobinWithRestartAndPortChange`, which restarts a node. 
Because `NiFiInstanceCache` reuses the cluster between tests in the same class, 
the restarted node may still be reconnecting when this test starts. That is a 
separate issue from the `RST_STREAM` flake.
   - `LoadBalanceIT.testRoundRobinWithRestartAndPortChange` — `emptyQueue` 
teardown timed out after 132s, downstream of the same node restart.
   - `ClusteredReplayProvenanceIT.testReplayLastEvent[1] PRIMARY` — 
`emptyQueue` 409 / 500.
   
   All other jobs (Windows FR, Scan, Ubuntu integration-tests, CodeQL, 
Corretto EN, macOS JP, and `macos-15 Java 25` attempt 2) are green.
   
   The reduction from 100 → 20 FlowFiles in `testPartitionByAttribute` 
materially lowers the rate of the `RST_STREAM` failure but is clearly not 
sufficient on the GitHub Actions `ubuntu-24.04` runners. I recommend we hold 
off on additional reruns and decide on a direction. I see three options:
   
   1. Accept that this PR is a partial mitigation and merge anyway (still 
strictly better than `main`, where 8/10 recent system-tests runs have failed 
with the same family of errors).
   2. Push a follow-up change that further reduces load in 
`testPartitionByAttribute` (smaller batch, fewer distinct attribute values) 
and/or addresses the `testRoundRobinWithRestartAndPortChange` → 
`testPartitionByAttribute` ordering by waiting for the cluster to be fully 
reconnected before `testPartitionByAttribute` proceeds.
   3. Pursue a real fix at the framework layer for the `RST_STREAM` on cluster 
replication (the original goal earlier in this investigation), separate from 
this PR.
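
   For the ordering part of option 2, a minimal sketch of the "wait until the 
cluster is fully reconnected" guard could look like the following. 
`ClusterReadiness`, `waitUntil`, and `allConnected` are hypothetical names 
used only for illustration, and the status strings are assumed to come from 
whatever cluster summary the test client exposes. The system-test framework 
may already provide an equivalent helper, in which case that should be 
preferred.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.function.BooleanSupplier;

/** Hypothetical helper; not part of the NiFi system-test framework. */
final class ClusterReadiness {

    private ClusterReadiness() {
    }

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /**
     * True only when the expected number of nodes is reported and every
     * reported status is "CONNECTED" (assumed status string).
     */
    static boolean allConnected(Collection<String> nodeStatuses, int expectedNodes) {
        return nodeStatuses.size() == expectedNodes
                && nodeStatuses.stream().allMatch("CONNECTED"::equals);
    }
}
```

   A test would call `waitUntil(() -> allConnected(statuses(), 2), ...)` 
before touching any queue, where `statuses()` stands in for a call that reads 
the current node states, and fail fast with a clear message if it returns 
`false`. That separates "cluster not ready" failures from the `RST_STREAM` 
failures in the test report.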
   
   @markap14 please advise — I will pause aggressive polling and switch to 
once-per-hour until you weigh in.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

