markap14 commented on PR #11223:
URL: https://github.com/apache/nifi/pull/11223#issuecomment-4451460733

   [claude-opus-4.7] @exceptionfactory Thanks for pushing back on the 
retry-based approach. I dug into the actual failure that motivated it 
(Ubuntu/JDK 25 run 
[25457358299](https://github.com/apache/nifi/actions/runs/25457358299), 
`ParameterContextIT.testParametersReferencingEL`) using the captured 
troubleshooting archive, and the data does not really support the "TIME_WAIT / 
shutdown grace" hypothesis:
   
   - In the failing instance's `nifi-app.log`, **Jetty bound `127.0.0.1:5670` 
successfully ~9 ms before the management server failed on `127.0.0.1:56730`**. 
Same JVM, same IP. A lingering prior NiFi instance or a leftover socket pair 
from this test would have blocked both ports.
   - In the same run's failsafe output, every prior `Shutdown Requested → 
Shutdown Completed` for the standalone instance completed in roughly **400–450 
ms**, well under the 15 s `process.waitFor` timeout. None of them fell into the 
`destroyForcibly()` path.
   - `ParameterSensitivityWithGhostedComponentIT` (the test immediately before 
the cluster gap) cycled the same port 56730 four times in 36 seconds with ~200 
ms gaps between stop and the next start, and all four succeeded. The failure 
was the next start, after a **two-minute idle window** during the cluster test 
(which only touches 56710/56720). If the cause were TIME_WAIT from our own 
process churn, we would expect failures on the rapid restarts, not after a long 
idle.
   - Across the seven failed system-test runs I had access to, the 
`BindException` appears in **only that one run**. The other six failures are 
entirely unrelated (cluster join issues, content truncation, etc.).
   
   The most plausible cause is that the Linux kernel's ephemeral port allocator 
(default range `32768–60999` on Ubuntu 24.04, which includes 56730) briefly 
handed 56730 to some other outbound TCP socket on the runner during the 
11-second startup window for the new NiFi JVM. `SO_REUSEADDR` would not help in 
that case: it only matters for `TIME_WAIT` 4-tuples, not for a port that 
another live socket currently holds.
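
The `SO_REUSEADDR` point can be checked with a small standalone sketch. It is not NiFi code; `ReuseAddrDemo` and `secondBindRejected` are hypothetical names, and a plain listener stands in for whatever socket happened to hold the port on the runner. The behavior shown is what I would expect on Linux; other platforms may treat `SO_REUSEADDR` differently.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class, not part of NiFi.
final class ReuseAddrDemo {

    /**
     * Returns true if a second bind to a port that is already held by a
     * live listener is rejected even though SO_REUSEADDR is set.
     */
    static boolean secondBindRejected() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket holder = new ServerSocket()) {
            // Port 0 asks the kernel for any free port; the holder keeps it open.
            holder.bind(new InetSocketAddress(loopback, 0));
            int port = holder.getLocalPort();

            try (ServerSocket contender = new ServerSocket()) {
                // SO_REUSEADDR must be set before bind() to have any effect.
                contender.setReuseAddress(true);
                contender.bind(new InetSocketAddress(loopback, port));
                return false; // bind unexpectedly succeeded
            } catch (BindException e) {
                return true;  // "Address already in use" despite SO_REUSEADDR
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second bind rejected: " + secondBindRejected());
    }
}
```

If the port is held by a live socket, setting the option on the management server's socket would not have changed the outcome.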
   
   Given that this is a rare environmental transient rather than a NiFi 
shutdown bug, I have **reverted the retry loop** in 
`SpawnedStandaloneNiFiInstanceFactory` in commit `785e1de8e3d`. `start()` is 
back to the single-shot form it had before NIFI-15918. The rest of the PR 
(`LifecycleState` defensive copy, `StandardProcessorNode` cleanup, 
`KubernetesConfigMapStateProvider` null-safety, system-test helper changes) is 
unchanged.
   
   If we see this pattern recur, the cleaner fix is probably to bind the 
management server to port 0 inside the child JVM and have it publish the 
kernel-assigned port to a file the launcher reads — which sidesteps the 
launcher-vs-child socket-option mismatch that broke the previous port-scan 
attempt. Happy to take that on as a separate Jira if it shows up again.
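
A minimal sketch of that port-0 handshake, to make the idea concrete. This is an illustration under assumptions, not the proposed NiFi change: `PortFileHandshake`, `bindAndPublish`, `awaitPort`, and the port-file location are all hypothetical, and the real management server would bind through its own server stack rather than a bare `ServerSocket`.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of NiFi.
final class PortFileHandshake {

    /** Child side: let the kernel pick a free port, then publish it to portFile. */
    static ServerSocket bindAndPublish(Path portFile) throws IOException {
        ServerSocket server = new ServerSocket();
        try {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            byte[] content = Integer.toString(server.getLocalPort())
                    .getBytes(StandardCharsets.UTF_8);
            // Write to a sibling temp file and rename it into place, so the
            // launcher never observes a partially written port number.
            Path tmp = portFile.resolveSibling(portFile.getFileName() + ".tmp");
            Files.write(tmp, content);
            Files.move(tmp, portFile,
                    StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
            return server;
        } catch (IOException e) {
            server.close();
            throw e;
        }
    }

    /** Launcher side: poll for the port file until it appears or the timeout expires. */
    static int awaitPort(Path portFile, long timeoutMillis)
            throws IOException, InterruptedException {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (!Files.exists(portFile)) {
            if (System.nanoTime() - start >= timeoutNanos) {
                throw new IOException("Timed out waiting for " + portFile);
            }
            Thread.sleep(50);
        }
        String text = new String(Files.readAllBytes(portFile), StandardCharsets.UTF_8);
        return Integer.parseInt(text.trim());
    }
}
```

Because the child binds the socket before the port number is ever written, there is no window in which the launcher picks a port that something else can take before the child uses it.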

