Hello,

Adding a call to dcc_reap_kids() at the end of the main loop seems to fix the problem.
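
For anyone unfamiliar with the pattern, here is a minimal standalone sketch of non-blocking child reaping. It assumes that dcc_reap_kids(FALSE) amounts to a waitpid() loop with WNOHANG; I have not reproduced the distcc implementation, and the names reap_exited_children() and spawn_and_reap() are hypothetical helpers made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not distcc code: collect every child that has
 * already exited, without blocking. Returns the number reaped. */
static int reap_exited_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            reaped++;       /* one zombie removed from the process table */
        } else if (pid == 0) {
            break;          /* children exist, but none has exited yet */
        } else if (errno == EINTR) {
            continue;       /* interrupted by a signal: retry */
        } else {
            break;          /* ECHILD (no children left) or another error */
        }
    }
    return reaped;
}

/* Hypothetical demo: fork n children that exit at once, then poll
 * (bounded, about 5 seconds) until they have all been reaped. */
static int spawn_and_reap(int n)
{
    const struct timespec pause = { 0, 1000000L };   /* 1 ms */
    int started = 0;
    int reaped = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);       /* child: terminate immediately */
        if (pid > 0)
            started++;
    }

    for (int tries = 0; reaped < started && tries < 5000; tries++) {
        reaped += reap_exited_children();
        if (reaped < started)
            nanosleep(&pause, NULL);
    }
    return reaped;
}
```

Calling something like reap_exited_children() once per pass through a worker's accept loop means a compiler process that exits after its job was abandoned is collected on a later pass instead of staying <defunct>.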

$ git --no-pager diff --no-prefix src/prefork.c
diff --git src/prefork.c src/prefork.c
index d4d70d5..39f3c8a 100644
--- src/prefork.c
+++ src/prefork.c
@@ -196,6 +196,9 @@ static int dcc_preforked_child(int listen_fd)
         dcc_close(acc_fd);
 
         now = time(NULL);
+
+        /* wait for any children to exit */
+        dcc_reap_kids(FALSE);
     }
 
     rs_log_info("worn out");

On Wed, Mar 29, 2023 at 12:53 AM George Cox <george....@gmail.com> wrote:
>
> Hello,
>
> I am using distcc 3.4 (compiled by me from source) on CentOS (CentOS
> Linux release 7.9.2009 (Core)). Successful compilations work OK, but
> interrupted compilations (where one presses ctrl-C on the client
> machine, interrupting the make or whatever process), lead to errors in
> the server-side distccd log, and zombie compiler processes remaining
> on the servers. This is concerning because they appear to be
> permanently using up worker slots, eventually leading to a situation
> where none are left and no remote compilation is possible. I am *not*
> using "distcc-pump" mode.
>
> I am configuring distcc like this:
> export DISTCC_HOSTS="build01.example.com/40,lzo
> build03.example.com/40,lzo build05.example.com/40,lzo
> build06.example.com/32,lzo build07.example.com/32,lzo"
> export DISTCC_DIR="/var/tmp/distcc.${LOGNAME}"
>
> I am running distcc like this:
> /opt/distcc/3.4/bin/distcc /opt/gcc/7.3.0/bin/g++ [...compiler
> arguments elided...]
>
> I am starting distccd like this:
> /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure
> --allow 10.101.201.0/24 --daemon --log-file
> /var/tmp/distccd.log--log-level debug
>
> I am running distccd in Docker, but I see the same behaviour when I
> run it under systemd.
>
> What I'm seeing in the distccd.log is
> distccd[17] compile from RuntimeInfo.cpp to RuntimeInfo.cpp.o
> distccd[17] (dcc_run_job) output file CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o
> distccd[17] (dcc_input_tmpnam) input file /ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp
> distccd[17] (dcc_r_token_int) got DOTI001175cd
> distccd[17] (dcc_r_bulk_lzo1x) decompressed 1144269 bytes to 4869619 bytes: 23%
> distccd[17] (dcc_r_file) received 1144269 bytes to file /tmp/distccd_fcf8c291.ii
> distccd[17] (dcc_r_file_timed) 1144269 bytes received in 0.015365s, rate 72727kB/s
> distccd[17] (dcc_set_input) changed input from "/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp" to "/tmp/distccd_fcf8c291.ii"
> distccd[17] (dcc_set_input) command after: /opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o -c /tmp/distccd_fcf8c291.ii
> distccd[17] (dcc_set_output) changed output from "CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o" to "/tmp/distccd_fcbcc291.o"
> distccd[17] (dcc_set_output) command after: /opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o /tmp/distccd_fcbcc291.o -c /tmp/distccd_fcf8c291.ii
> distccd[17] (dcc_spawn_child) forking to execute: /opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o /tmp/distccd_fcbcc291.o -c /tmp/distccd_fcf8c291.ii
> distccd[17] (dcc_spawn_child) child started as pid72
> distccd[17] (dcc_collect_child) ERROR: Client fd disconnected, killing job
> distccd[17] (dcc_x_token_int) send DONE00000002
> distccd[17] (dcc_x_token_int) send STAT00006b00
> distccd[17] (dcc_writex) ERROR: failed to write: Broken pipe
> distccd[17] /opt/gcc/7.3.0/bin/g++ /ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp on localhost failed with exit code 107
> distccd[17] job complete
> distccd[17] (dcc_cleanup_tempfiles_inner) deleted 5 temporary files
> distccd[17] (dcc_job_summary) client: 10.101.201.171:51212 CLI_DISCONN exit:107 sig:0 core:0 ret:107 time:6545ms
> distccd[17] (dcc_cleanup_tempfiles_inner) deleted 0 temporary files
>
> What I see on the remote hosts is:
> root   15995 0.0 0.0 712432  6440 ? Sl 18:49 0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id ab40c598131e195767b36c9795c964e9ae477a1a86bda39c43aba8376a674519 -address /run/containerd/containerd.sock
> nobody 16016 0.0 0.0   1120     4 ? Ss 18:49 0:00  \_ /sbin/docker-init -- /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
> nobody 16110 0.0 0.0   7052   772 ? SN 18:49 0:00      \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
> nobody 16111 0.0 0.0  20440  8604 ? SN 18:49 0:00          \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
> nobody 16195 0.0 0.0      0     0 ? ZN 18:49 0:00          |   \_ [g++] <defunct>
> nobody 17479 0.0 0.0      0     0 ? ZN 18:55 0:00          |   \_ [g++] <defunct>
> nobody 20346 0.0 0.0      0     0 ? ZN 19:12 0:00          |   \_ [g++] <defunct>
> nobody 16112 0.0 0.0  20436  8604 ? SN 18:49 0:00          \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
> nobody 17486 0.0 0.0      0     0 ? ZN 18:55 0:00          |   \_ [g++] <defunct>
> nobody 20335 0.0 0.0      0     0 ? ZN 19:12 0:00          |   \_ [g++] <defunct>
> nobody 16113 0.0 0.0  22096 10608 ? SN 18:49 0:00          \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
> nobody 16204 0.0 0.0      0     0 ? ZN 18:49 0:00          |   \_ [g++] <defunct>
> nobody 16114 0.0 0.0  22920 11380 ? SN 18:49 0:00          \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
> nobody 17539 0.0 0.0      0     0 ? ZN 18:56 0:00          |   \_ [g++] <defunct>
> nobody 20369 0.0 0.0      0     0 ? ZN 19:12 0:00          |   \_ [g++] <defunct>
>
> Note the STIME field on the zombie processes -- this shows they have
> been lingering for a while.
>
> From "man distcc" and the code, I can see that exit code 107 is "I/O
> Error", which is fair enough -- the client process went away
> unexpectedly, but whatever happens, the child process should be
> reaped.
>
> After doing this a few times, one can see the number of zombie
> compiler processes increasing (as seen in the above excerpt from the
> output of "ps faux"). The fact that there are multiple zombies under
> a single distccd process suggests that I should not be concerned about
> running out of slots as mentioned above, but it is clear that these
> compiler processes are not being reaped as they should be. At the
> very least, it looks messy in the output of "ps faux" :-)
>
> Any and all suggestions welcome. Thank you very much!
>
>
>
> gjvc
__
distcc mailing list http://distcc.samba.org/
To unsubscribe or change options:
https://lists.samba.org/mailman/listinfo/distcc