Hello,

I am using distcc 3.4 (compiled by me from source) on CentOS (CentOS Linux release 7.9.2009 (Core)). Successful compilations work OK, but interrupted compilations (where one presses Ctrl-C on the client machine, interrupting make or whatever process is driving the build) lead to errors in the server-side distccd log and leave zombie compiler processes on the servers. This is concerning because they appear to be permanently using up worker slots, eventually leading to a situation where none are left and no remote compilation is possible. I am *not* using "distcc-pump" mode.
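To be concrete about what I mean by a zombie and by reaping, here is a minimal sketch in Python. It is not distcc's code (distcc is written in C), and the names kill_and_reap and is_zombie are mine, purely for illustration. The idea is only that whoever kills a child must also wait() on it, otherwise the kernel keeps the dead child's process-table entry around in state Z.

```python
import signal
import subprocess
import sys


def kill_and_reap(proc: subprocess.Popen, grace: float = 1.0) -> int:
    """Terminate a child and wait for it, so it cannot linger as a zombie.

    Hypothetical helper: sends SIGTERM, escalates to SIGKILL if the child
    is still alive after `grace` seconds, and always collects the status.
    """
    proc.terminate()  # SIGTERM
    try:
        return proc.wait(timeout=grace)  # wait() is what reaps the child
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL
        return proc.wait()


def is_zombie(pid: int) -> bool:
    """Linux-only check: is the process in state 'Z' in /proc/<pid>/stat?"""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return False  # no such process, so it has been fully reaped
    # The state letter is the first field after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0] == "Z"


if __name__ == "__main__":
    # Stand-in for a long-running compiler job.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    status = kill_and_reap(child)
    print("exit status:", status, "zombie:", is_zombie(child.pid))
```

In this sketch the wait() call is the important part: skipping it after the kill is the pattern that would leave <defunct> entries behind.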
I am configuring distcc like this:

    export DISTCC_HOSTS="build01.example.com/40,lzo build03.example.com/40,lzo build05.example.com/40,lzo build06.example.com/32,lzo build07.example.com/32,lzo"
    export DISTCC_DIR="/var/tmp/distcc.${LOGNAME}"

I am running distcc like this:

    /opt/distcc/3.4/bin/distcc /opt/gcc/7.3.0/bin/g++ [...compiler arguments elided...]

I am starting distccd like this:

    /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug

I am running distccd in Docker, but I see the same behaviour when I run it under systemd.

What I'm seeing in the distccd.log is

    distccd[17] compile from RuntimeInfo.cpp to RuntimeInfo.cpp.o
    distccd[17] (dcc_run_job) output file CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o
    distccd[17] (dcc_input_tmpnam) input file /ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp
    distccd[17] (dcc_r_token_int) got DOTI001175cd
    distccd[17] (dcc_r_bulk_lzo1x) decompressed 1144269 bytes to 4869619 bytes: 23%
    distccd[17] (dcc_r_file) received 1144269 bytes to file /tmp/distccd_fcf8c291.ii
    distccd[17] (dcc_r_file_timed) 1144269 bytes received in 0.015365s, rate 72727kB/s
    distccd[17] (dcc_set_input) changed input from "/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp" to "/tmp/distccd_fcf8c291.ii"
    distccd[17] (dcc_set_input) command after: /opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o -c /tmp/distccd_fcf8c291.ii
    distccd[17] (dcc_set_output) changed output from "CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o" to "/tmp/distccd_fcbcc291.o"
    distccd[17] (dcc_set_output) command after: /opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o /tmp/distccd_fcbcc291.o -c /tmp/distccd_fcf8c291.ii
    distccd[17] (dcc_spawn_child) forking to execute: /opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o /tmp/distccd_fcbcc291.o -c /tmp/distccd_fcf8c291.ii
    distccd[17] (dcc_spawn_child) child started as pid72
    distccd[17] (dcc_collect_child) ERROR: Client fd disconnected, killing job
    distccd[17] (dcc_x_token_int) send DONE00000002
    distccd[17] (dcc_x_token_int) send STAT00006b00
    distccd[17] (dcc_writex) ERROR: failed to write: Broken pipe
    distccd[17] /opt/gcc/7.3.0/bin/g++ /ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp on localhost failed with exit code 107
    distccd[17] job complete
    distccd[17] (dcc_cleanup_tempfiles_inner) deleted 5 temporary files
    distccd[17] (dcc_job_summary) client: 10.101.201.171:51212 CLI_DISCONN exit:107 sig:0 core:0 ret:107 time:6545ms
    distccd[17] (dcc_cleanup_tempfiles_inner) deleted 0 temporary files

What I see on the remote hosts is:

    root     15995  0.0  0.0 712432  6440 ?  Sl   18:49   0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id ab40c598131e195767b36c9795c964e9ae477a1a86bda39c43aba8376a674519 -address /run/containerd/containerd.sock
    nobody   16016  0.0  0.0   1120     4 ?  Ss   18:49   0:00  \_ /sbin/docker-init -- /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   16110  0.0  0.0   7052   772 ?  SN   18:49   0:00      \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   16111  0.0  0.0  20440  8604 ?  SN   18:49   0:00          \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   16195  0.0  0.0      0     0 ?  ZN   18:49   0:00          |   \_ [g++] <defunct>
    nobody   17479  0.0  0.0      0     0 ?  ZN   18:55   0:00          |   \_ [g++] <defunct>
    nobody   20346  0.0  0.0      0     0 ?  ZN   19:12   0:00          |   \_ [g++] <defunct>
    nobody   16112  0.0  0.0  20436  8604 ?  SN   18:49   0:00          \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   17486  0.0  0.0      0     0 ?  ZN   18:55   0:00          |   \_ [g++] <defunct>
    nobody   20335  0.0  0.0      0     0 ?  ZN   19:12   0:00          |   \_ [g++] <defunct>
    nobody   16113  0.0  0.0  22096 10608 ?  SN   18:49   0:00          \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   16204  0.0  0.0      0     0 ?  ZN   18:49   0:00          |   \_ [g++] <defunct>
    nobody   16114  0.0  0.0  22920 11380 ?  SN   18:49   0:00          \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   17539  0.0  0.0      0     0 ?  ZN   18:56   0:00          |   \_ [g++] <defunct>
    nobody   20369  0.0  0.0      0     0 ?  ZN   19:12   0:00          |   \_ [g++] <defunct>

Note the STIME field on the zombie processes -- this shows they have been lingering for a while.

From "man distcc" and the code, I can see that exit code 107 is "I/O Error", which is fair enough -- the client process went away unexpectedly, but whatever happens, the child process should be reaped.

After doing this a few times, one can see the number of zombie compiler processes increasing (as seen in the above excerpt from the output of "ps faux"). The fact that there are multiple zombies under a single distccd process suggests that I should not be concerned about running out of slots as mentioned above, but it is clear that these compiler processes are not being reaped as they should be. At the very least, it looks messy in the output of "ps faux" :-)

Any and all suggestions welcome.

Thank you very much!

gjvc

__
distcc mailing list http://distcc.samba.org/
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/distcc