m1a2st opened a new pull request, #20899:
URL: https://github.com/apache/kafka/pull/20899

   ## Problem
   
   While running the Kafka e2e tests, various tests failed with 
`TimeoutError('Kafka node failed to stop in 60 seconds')`. The e2e tests check 
the PID to confirm that the Kafka server has shut down. On investigation, I 
found that the Kafka process had become a zombie process in the container:
   ```bash
   ducker@ducker05:/$ jcmd
   285 kafka.Kafka /mnt/kafka/kafka.properties
   18207 jdk.jcmd/sun.tools.jcmd.JCmd
   
   ducker@ducker05:/$ cat /proc/285/status | grep -i state
   State:       Z (zombie)
   ```
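
   The manual check above can be generalized into a small helper. This is only 
a sketch, not part of the PR: the function name `list_zombies` and its optional 
`proc_root` argument are hypothetical, and it assumes a Linux-style `/proc` 
layout in which each `/proc/<pid>/status` file contains a `State:` line.
   ```bash
   # Print the PID of every process whose status file reports state Z (zombie).
   # proc_root is a hypothetical parameter; it defaults to /proc and exists
   # only so the function can be pointed at a fake directory tree.
   list_zombies() {
     local proc_root="${1:-/proc}"
     local status pid
     for status in "$proc_root"/[0-9]*/status; do
       # Skip unmatched globs and processes that exited during the scan.
       [ -r "$status" ] || continue
       if grep -q '^State:[[:space:]]*Z' "$status" 2>/dev/null; then
         pid="${status%/status}"   # strip the trailing /status
         echo "${pid##*/}"         # keep only the numeric directory name
       fi
     done
   }
   ```
   Calling `list_zombies` with no argument inside the container scans the real 
`/proc`; an empty result means no zombie processes were found at that moment.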
   
   ## Root Cause
   
   This issue is related to [this 
change](https://github.com/apache/kafka/pull/17554/files#r1845737954). With the 
exec form `CMD ["sudo", "service", "ssh", "start", "-D"]`, PID 1 is the 
`sudo service ssh start -D` command itself, which does not handle `SIGCHLD` and 
therefore never reaps zombie processes:
   ```bash
   ducker@ducker05:/$ cat /proc/1/cmdline | tr '\0' ' '
   sudo service ssh start -D
   ```
   
   However, with the old shell-form syntax `CMD sudo service ssh start && tail 
-f /dev/null`, PID 1 is `/bin/sh`, a shell that properly reaps zombie 
processes:
   ```bash
   ducker@ducker05:/$ cat /proc/1/cmdline | tr '\0' ' '
   /bin/sh -c sudo service ssh start && tail -f /dev/null
   ```
   
   ## Solution
   
   Use `tini` as PID 1 so that orphaned child processes are reaped and zombie 
processes do not linger in the container.
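
   One way to confirm the fix inside a running container is to inspect PID 1 in 
the same way as the commands above. The following is a hedged sketch, not part 
of the PR: the function name `pid1_is_tini` and its optional file argument are 
hypothetical, and it assumes the init binary's file name is `tini`.
   ```bash
   # Succeed if the first argument of PID 1's command line is a tini binary.
   # cmdline_file is a hypothetical parameter; it defaults to /proc/1/cmdline
   # and exists only so the check can be run against a sample file.
   pid1_is_tini() {
     local cmdline_file="${1:-/proc/1/cmdline}"
     local first_arg
     # Arguments in /proc/<pid>/cmdline are NUL-separated; keep the first one.
     first_arg="$(tr '\0' '\n' < "$cmdline_file" | head -n 1)"
     case "${first_arg##*/}" in
       tini) return 0 ;;
       *)    return 1 ;;
     esac
   }
   ```
   For example, `pid1_is_tini && echo "PID 1 is tini"` prints the message only 
when the check passes.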

