Hi all,

More than 9 months ago I discovered a problem with graceful restarts on a
default Virtualmin installation with the default execution mode
(mod_fcgid), but only recently did I have the time to dig deeper and
experiment. Since Virtualmin uses Apache + mod_fcgid by default, the
experiments will probably lead to the same results on any Apache 2.2 +
mod_fcgid 2.3.7 installation. This is not the widely known problem with
leftover processes that never get killed on a graceful restart. This is
something else: the processes get forcefully killed way too soon and the
output never reaches the browser. Please test it on your setup and report
back the result.

What is the setup:

CentOS 6.4 x86_64 minimal installation

Virtualmin 4.02.gpl GPL installed by the automatic .sh script, all default
settings (you can skip this, the problem is probably not Virtualmin related)

mod_fcgid.x86_64 2.3.7-1.el6 from the Virtualmin repo (others should work
too)
httpd.x86_64 1:2.2.15-29.el6.vm.1 from the Virtualmin repo (others should
work too)

php 5.3.3 from the official repo

Single virtual domain, running under the default FCGId execution mode, with
a 90-second PHP execution time limit and fcgid IO wait.

Single test.php file containing:

<?php
for($i = 1; $i <= 30; $i++) {
    echo $i."\n";
    sleep(1);
}
?>

What is the error:

Run the script in a browser, then do a graceful restart of Apache
(service httpd graceful). After around 12 seconds you will see a "No
data received" error in your browser (Chrome) and the following in the
Apache error log:

(22)Invalid argument: mod_fcgid: can't lock process table in pid 25570

(the pid number will be different of course)
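
If you would rather not click around in a browser, the steps above can be
sketched as a small shell helper. This is only a rough sketch under my own
assumptions: curl is installed, you run it as root, and the function names
and the URL in the usage line are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count the numbered lines test.php managed to send.
count_lines() {
    # test.php prints one integer per line; count only those lines
    grep -c '^[0-9][0-9]*$' "$1"
}

# Hypothetical helper: request the script, restart Apache gracefully while
# the request is still running, then report how many lines arrived.
run_repro() {
    url=$1
    expected=${2:-30}            # loop limit in test.php
    out=$(mktemp)
    curl -s -N -o "$out" "$url" &
    curl_pid=$!
    sleep 3                      # give the script time to start printing
    service httpd graceful       # the restart from this report; needs root
    wait "$curl_pid"
    got=$(count_lines "$out")
    rm -f "$out"
    echo "received $got of $expected lines"
    [ "$got" -eq "$expected" ]   # non-zero exit status if output was cut off
}
```

Usage, with a placeholder URL: run_repro http://example.test/test.php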

Further experiments show that this script gets forcefully killed before
ending.

If you reduce the time the script executes to 5 seconds ($i <= 4), you'll
get the same result, this time after 5 seconds.

Further experiments show that this time the process completes, but you
still get the errors both in the browser and in the error log.

Try it and post your result.

Dig:

It is probably a problem in mod_fcgid.

I tweaked the experiment by adding a file write to the script, which shows
which run completes and which gets killed before finishing. That is how I
got the results above.

Add this inside the loop:
file_put_contents("test.txt", "test run for: ".$i." seconds");
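
To read the result back after a run, something like the following could be
used. It is a sketch only: the helper names are invented, the path to
test.txt is passed in because it depends on where PHP writes it on your
system, and the default limit of 30 is just the loop limit from test.php.

```shell
#!/bin/sh
# Hypothetical helper: print the last count the script wrote to test.txt.
seconds_reached() {
    sed -n 's/^test run for: \([0-9][0-9]*\) seconds$/\1/p' "$1"
}

# Hypothetical helper: compare that count with the loop limit.
run_verdict() {
    reached=$(seconds_reached "$1")
    limit=${2:-30}
    if [ "${reached:-0}" -ge "$limit" ]; then
        echo "completed"
    else
        # the script stopped writing before the last iteration
        echo "stopped at ${reached:-0} of $limit"
    fi
}
```

Usage: run_verdict /path/to/test.txt 30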

So why 12 seconds, and where is this set? After some time I discovered that
increasing FcgidErrorScanInterval to 60 lets the second process complete
(but you still get the errors).

If you check the mod_fcgid code in fcgid_pm_main.c, the graceful restart
should be handled by the function kill_all_subprocess(), but apparently
scan_errorlist() is also executed, even though there is a check for
procmgr_must_exit().

The error in the log "can't lock process table in pid 25570" probably means
that some information about the process is destroyed immediately upon the
graceful restart (the mutex), so we will never get the result back.
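
When comparing results, it may help to pull the affected pids out of the
error log instead of eyeballing it. A minimal sketch, assuming the message
text is exactly as quoted above; the function name is made up, and the log
path in the usage line is an assumption, so use whichever error log your
virtual host writes to.

```shell
#!/bin/sh
# Hypothetical helper: list the pids named in "can't lock process table"
# errors, one per line, in the order they appear in the log.
lock_error_pids() {
    sed -n "s/.*mod_fcgid: can't lock process table in pid \([0-9][0-9]*\).*/\1/p" "$1"
}
```

Usage: lock_error_pids /var/log/httpd/error_log | sort -u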

Even if we get around the early termination of the processes by increasing
FcgidErrorScanInterval, the second problem is actually bigger: all your
users are going to see this error.

Do you get the same errors, and do you have any idea how to fix mod_fcgid?

Thanks for your time, testing and commenting!

Georgi Petrov
