Tom Lane wrote:

So my opinion is that the real issue here is why is the kill()
implementation failing when it should not.  We need to fix that,
not put band-aids in async.c.

I think Tom makes a valid point. To that end, the Windows kill implementation needs to be changed so that it has these properties:
1. kill should generally return quickly and should never hang
2. kill should never fail for a transient reason
3. if kill fails, it should be a good indication that the process is dead or we are permanently unable to communicate with it.

The current implementation has property #1, but not #2 or #3. However, I think I've figured out a simple way to modify the Windows implementation of pgkill to achieve all three properties. I will run long-term tests of the changes over the weekend to check that my solution holds up over longer time scales.

Here is what I've found so far:
* Contrary to my previous reports, the notification error is always the result of pgkill failing with error code 2 (ERROR_FILE_NOT_FOUND). I had previously thought it had issued error 31, but this was just an error in my debug message (the signal was 31, i.e. SIGUSR2).
* Also contrary to what I wrote previously, long or infinite timeouts do not fix the problem. With an infinite timeout, I've avoided the problem for as long as ten minutes, but it eventually happens. In some cases, the problem even occurred quickly with an infinite timeout.

The solution that seems to work is to call CallNamedPipe repeatedly in a loop if it fails. Currently, I call the function up to a maximum of 5 times, although in all test cases so far the code has never needed more than 1 retry to succeed. Depending on the test results, I may reduce the maximum number of retries. The code sleeps for 20 ms between retries. I reduced the timeout for CallNamedPipe from 1000 ms to 250 ms after the first call, to reduce the total time for the signal if we hit a case that needs to time out.
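
To make the retry policy above concrete, here is a minimal, portable sketch of the loop. It is not the actual patch and not PostgreSQL source: call_with_retry, pipe_call_fn, sleep_fn, and the fake_pipe stand-in are all hypothetical names, and the real CallNamedPipe call is replaced by a callback so the logic can be exercised anywhere. Only the numbers (5 attempts, 1000 ms then 250 ms timeouts, 20 ms sleep) come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical constants; the values are the ones described in the text. */
#define MAX_ATTEMPTS      5     /* upper bound on the number of calls */
#define FIRST_TIMEOUT_MS  1000  /* timeout used for the first call */
#define RETRY_TIMEOUT_MS  250   /* shorter timeout for every later call */
#define RETRY_SLEEP_MS    20    /* pause between consecutive attempts */

/* Stand-ins for the real pipe call and the real sleep (assumed interfaces). */
typedef bool (*pipe_call_fn)(int timeout_ms, void *ctx);
typedef void (*sleep_fn)(int ms, void *ctx);

/*
 * Try the call up to MAX_ATTEMPTS times.
 * Returns the 1-based attempt number that succeeded, or 0 if all failed.
 */
int
call_with_retry(pipe_call_fn call, sleep_fn pause, void *ctx)
{
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++)
    {
        int timeout_ms = (attempt == 1) ? FIRST_TIMEOUT_MS : RETRY_TIMEOUT_MS;

        if (call(timeout_ms, ctx))
            return attempt;

        /* Do not sleep after the final failed attempt. */
        if (attempt < MAX_ATTEMPTS)
            pause(RETRY_SLEEP_MS, ctx);
    }
    return 0;
}

/* Simulated flaky pipe, for demonstration only. */
struct fake_pipe
{
    int failures_left;    /* how many calls fail before one succeeds */
    int calls;            /* total calls made */
    int slept_ms;         /* total time "slept" between attempts */
    int last_timeout_ms;  /* timeout passed to the most recent call */
};

bool
fake_call(int timeout_ms, void *ctx)
{
    struct fake_pipe *p = ctx;

    p->calls++;
    p->last_timeout_ms = timeout_ms;
    if (p->failures_left > 0)
    {
        p->failures_left--;
        return false;
    }
    return true;
}

void
fake_sleep(int ms, void *ctx)
{
    struct fake_pipe *p = ctx;

    p->slept_ms += ms;
}
```

With these numbers, a target that never answers costs at most 1000 + 4 * 250 ms of timeouts plus 4 * 20 ms of sleeps, assuming each call blocks no longer than its timeout. In the real function, the callback would wrap CallNamedPipe and the pause would be a Sleep call.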

I also notice that signals that should fail do fail. For example, signal 30 seems to be regularly sent to pid 1, and this fails in both the original code and my modified version.

I'm not entirely sure why CallNamedPipe fails occasionally but succeeds when called with the same arguments a very short while later. It's hard to know without being able to see the source code. However, from the Windows documentation, it seems that a single named pipe in Windows can have several "instances", which seem to be access interfaces to the pipe. I suspect there is some race condition where the code erroneously decides it needs to create an "instance" of the pipe, rather than waiting for an instance to become available. When the instance creation fails, it generates the ERROR_FILE_NOT_FOUND error.

I'll post back on Monday with more complete test results and, if all goes well, a patch. If anyone has ideas on what else should be tested, please let me know.

Steve

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
