This is an automated email from the ASF dual-hosted git repository.
yhu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 7431dea675b [#37930][Runners] Reap child processes in ProcessManager
to prevent zombie accumulation (#37932)
7431dea675b is described below
commit 7431dea675b5c3da2ae27be207954cc9abbe3eb2
Author: Andres Tiko <[email protected]>
AuthorDate: Wed Mar 25 21:23:39 2026 +0200
[#37930][Runners] Reap child processes in ProcessManager to prevent zombie
accumulation (#37932)
* [#37930][Runners] Reap child processes in ProcessManager to prevent
zombie accumulation
ProcessManager.stopProcess() calls destroy()/destroyForcibly() to terminate
child processes, but never calls Process.waitFor() to collect the exit
status.
On POSIX systems, this means the terminated child process entry remains in
the kernel process table as a zombie (state Z/defunct) until the parent
process itself exits.
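A minimal sketch of the terminate-then-wait sequence described above. It is not the Beam code: the class `ReapSketch`, the helper `stopAndReap`, and the grace period are hypothetical names chosen for illustration, and the `main` method assumes a POSIX host with a `sleep` binary on the PATH. Only `Process.destroy()`, `Process.destroyForcibly()`, and `Process.waitFor()` from the JDK are used.

```java
import java.util.concurrent.TimeUnit;

public class ReapSketch {
  /**
   * Hypothetical helper, not part of ProcessManager: asks the child to
   * terminate, escalates after a grace period, then blocks until the exit
   * status has been collected.
   */
  static int stopAndReap(Process process, long graceMillis) throws InterruptedException {
    process.destroy();
    // Bounded wait so that a child ignoring the polite request gets escalated.
    if (!process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
      process.destroyForcibly();
    }
    // Unbounded wait: returns the exit value once the child has terminated.
    return process.waitFor();
  }

  public static void main(String[] args) throws Exception {
    // Assumption: a POSIX host where `sleep` is available.
    Process child = new ProcessBuilder("sleep", "60").start();
    int exitCode = stopAndReap(child, 2_000);
    System.out.println("alive=" + child.isAlive() + " exit=" + exitCode);
  }
}
```

The final `waitFor()` is the step the commit message identifies as missing: without it the caller never observes the exit status of the child it terminated.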
In long-running environments like Flink TaskManagers using
--environment_type=PROCESS, the expansion service processes are repeatedly
spawned and stopped but never reaped. Over time this leads to significant
zombie accumulation (176+ observed on production hosts).
The fix adds process.waitFor() calls after process termination in:
- stopProcess(): after the destroy/destroyForcibly sequence
- killAllProcesses(): after destroyForcibly in the shutdown hook path
Fixes #37930
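The shutdown-hook path can be sketched in the same spirit. The class `ShutdownReapSketch`, the method `killAndReapAll`, and the `Map<String, Process>` shape are assumptions made for this illustration and may differ from the fields ProcessManager actually uses. The sketch restores the interrupt flag rather than swallowing the exception, which matches the handling in the diff below.

```java
import java.util.Map;

public class ShutdownReapSketch {
  /**
   * Hypothetical helper: forcibly kills every process in the map and waits
   * for each one to exit. Returns how many waits completed normally.
   */
  static int killAndReapAll(Map<String, Process> processes) {
    int reaped = 0;
    for (Process process : processes.values()) {
      process.destroyForcibly();
      try {
        process.waitFor();
        reaped++;
      } catch (InterruptedException e) {
        // Preserve the interrupt for the caller. With the flag set again,
        // later waitFor() calls in this loop will also throw immediately.
        Thread.currentThread().interrupt();
      }
    }
    return reaped;
  }
}
```

One design point to weigh: `waitFor()` without a timeout blocks until the child exits. After `destroyForcibly()` that is normally prompt, but a caller that needs a hard upper bound on shutdown time could use the `waitFor(long, TimeUnit)` overload instead.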
* Update CHANGES.md with ProcessManager zombie fix
---
CHANGES.md | 1 +
.../fnexecution/environment/ProcessManager.java | 20 ++++++++++++++++++--
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/CHANGES.md b/CHANGES.md
index 064b1485449..99b05ab9aaa 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -84,6 +84,7 @@
## Bugfixes
* Fixed X (Java/Python) ([#X](https://github.com/apache/beam/issues/X)).
+* Fixed ProcessManager not reaping child processes, causing zombie process accumulation on long-running Flink deployments (Java) ([#37930](https://github.com/apache/beam/issues/37930)).
## Security Fixes
diff --git a/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ProcessManager.java b/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ProcessManager.java
index 3570fef00df..86299762783 100644
--- a/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ProcessManager.java
+++ b/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ProcessManager.java
@@ -203,6 +203,14 @@ public class ProcessManager {
}
}
}
+ // Reap the child process to prevent zombie accumulation. destroy()/destroyForcibly() send
+ // signals but do not call waitpid(), so the terminated process remains in the kernel process
+ // table as a zombie until waitFor() collects its exit status.
+ try {
+ process.waitFor();
+ } catch (InterruptedException e) {
+ Thread.currentThread().interrupt();
+ }
}
/** Returns true if the process exists within maxWaitTimeMillis. */
@@ -249,8 +257,16 @@ public class ProcessManager {
processes.forEach((id, process) -> process.destroy());
}
- /** Kill all remaining processes forcibly, i.e. upon JVM shutdown */
+ /** Kill all remaining processes forcibly and reap them, i.e. upon JVM shutdown. */
private void killAllProcesses() {
- processes.forEach((id, process) -> process.destroyForcibly());
+ processes.forEach(
+ (id, process) -> {
+ process.destroyForcibly();
+ try {
+ process.waitFor();
+ } catch (InterruptedException e) {
+ Thread.currentThread().interrupt();
+ }
+ });
}
}