markap14 commented on PR #11325: URL: https://github.com/apache/nifi/pull/11325#issuecomment-4674752084
[claude-opus-4.7] Rerun attempt 2 of the system-tests run also failed on `ubuntu-24.04 Java 21` and `ubuntu-24.04 Java 25` (`macos-15 Java 21` is still in progress). Different tests fail on each attempt, which is the classic flake signature, but `LoadBalanceIT.testPartitionByAttribute` has now failed in both attempts on Java 25.

**`ubuntu-24.04 Java 21` attempt 2** ([job](https://github.com/apache/nifi/actions/runs/27299783298/job/80657376495))

- `ClusteredStatelessFlowIT.testUpdateParameterReferencedByStatelessFlow`: `listQueue` HTTP 409 / 500 (same `RST_STREAM` family)
- `OffloadContentClaimTruncationIT.testOffloadedFlowFileContentNotPrematurelyTruncated`: same symptom

**`ubuntu-24.04 Java 25` attempt 2** ([job](https://github.com/apache/nifi/actions/runs/27299783298/job/80657376375))

- `LoadBalanceIT.testPartitionByAttribute`: this time the 409 is on `DELETE /listing-requests/{id}`, with message `Node localhost:5671 is currently connecting`. The test runs immediately after `LoadBalanceIT.testRoundRobinWithRestartAndPortChange`, which restarts a node. Because `NiFiInstanceCache` reuses the cluster between tests in the same class, when this test runs after the restart test the cluster may still be reconnecting. That is a separate issue from the `RST_STREAM` flake.
- `LoadBalanceIT.testRoundRobinWithRestartAndPortChange`: `emptyQueue` teardown timed out after 132s, downstream of the same node restart.
- `ClusteredReplayProvenanceIT.testReplayLastEvent[1] PRIMARY`: `emptyQueue` 409 / 500.

The non-system-test jobs (Windows FR, Scan, Ubuntu integration-tests, CodeQL, Corretto EN, macOS JP, macos-15 Java 25 attempt 2) are all green.

The reduction from 100 → 20 FlowFiles in `testPartitionByAttribute` materially lowers the rate of the `RST_STREAM` failure but is clearly not sufficient on the GitHub Actions `ubuntu-24.04` runners.

Recommending we hold off on additional reruns and decide on a direction. Three options I see:

1. Accept that this PR is a partial mitigation and merge anyway (still strictly better than `main`, where 8/10 recent system-tests runs have failed with the same family of errors).
2. Push a follow-up change that further reduces load in `testPartitionByAttribute` (smaller batch, fewer distinct attribute values) and/or addresses the `testRoundRobinWithRestartAndPortChange` → `testPartitionByAttribute` ordering by waiting for the cluster to be fully reconnected before `testPartitionByAttribute` proceeds.
3. Pursue a real fix at the framework layer for the `RST_STREAM` on cluster replication (the original goal earlier in this investigation), separate from this PR.

@markap14 please advise. I will pause aggressive polling and switch to once-per-hour until you weigh in.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
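
For reference, the "wait for the cluster to be fully reconnected" part of option 2 could look roughly like the sketch below. It is illustrative only: `ClusterReadyWait`, `NodeStatus`, and the status supplier are hypothetical names, not existing NiFi system-test APIs, and the system-test base class may already provide an equivalent helper that would be preferable. The idea is simply to poll the reported node states until none is still connecting, with a bounded timeout.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

public class ClusterReadyWait {

    /** Hypothetical stand-in for a node's connection state as reported by the cluster. */
    public enum NodeStatus { CONNECTED, CONNECTING, DISCONNECTED }

    /**
     * Polls the supplier until every reported node is CONNECTED or the timeout elapses.
     * Returns true if the cluster became fully connected in time, false on timeout
     * or if the waiting thread is interrupted.
     */
    public static boolean waitForAllNodesConnected(Supplier<List<NodeStatus>> statusSupplier,
                                                   Duration timeout,
                                                   Duration pollInterval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            final List<NodeStatus> statuses = statusSupplier.get();
            // An empty list is treated as "not ready" rather than vacuously connected.
            final boolean allConnected = !statuses.isEmpty()
                    && statuses.stream().allMatch(s -> s == NodeStatus.CONNECTED);
            if (allConnected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a test class this would run before `testPartitionByAttribute` issues any requests (for example from a per-test setup method), with the supplier backed by whatever cluster summary the test client exposes. A `false` return would fail the test early with a clear "cluster not reconnected" message instead of surfacing later as a 409.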
