I wrote: > I've been trying to figure out why my recent attempt to latch-ify the > stats collector didn't work well on the Windows buildfarm machines. > ... > What I intend to try doing about this is making > pgwin32_waitforsinglesocket detach its event object from the socket before > returning, ie WSAEventSelect(socket, NULL, 0). This will hopefully > decouple it a bit better from WaitLatchOrSocket.
Well, I tried that, and it made things better on some of the Windows buildfarm critters, but not all. I then added some debug printouts to pgstat.c, and after a few go-rounds with that I have learned the following things: * In the original state of the latch-ified stats collector patch, control never actually reaches the WaitLatchOrSocket call at all, until the end of the regression test run when the postmaster sends a shutdown signal. pgstat.c only iterates the inner loop containing the recv() call, because the forced blocking behavior in pgwin32_recv causes it to wait at the recv() call until data is available. However, on the machines where the regression tests fails, somewhere during the proceedings recv() just stops returning at all. This can be seen for example at http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=narwhal&dt=2012-05-14%2004%3A00%3A06 Now the really interesting point here is that given that WaitLatchOrSocket isn't being reached, there is hardly any difference between the pre-patched code and the patched code: before, we called pgwin32_waitforsinglesocket from pgstat.c and then did a recv() when it said OK, and in the patched code a similar call happens internally to pgwin32_recv. As far as I can see, the only material difference there is that pgstat.c specified a finite timeout when calling pgwin32_waitforsinglesocket while pgwin32_recv uses an infinite timeout. I now believe that the only reason the pre-patched code appeared to work is that sometimes it would time out and retry. With the infinite timeout in place, after whatever goes wrong goes wrong, it just hangs up until a signal event arrives. * I then modified pgstat.c to force pgwin32_noblock on while calling pgwin32_recv(). This caused pgstat's new loop logic to start behaving as intended, that is it would loop calling recv() as long as data was available and then fall out to the WaitLatchOrSocket call. However, it was still capable of randomly freezing at the WaitLatchOrSocket call, as in http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=narwhal&dt=2012-05-14%2016%3A00%3A05 So whatever is wrong affects both the code in pgwin32_waitforsinglesocket and that in WaitLatchOrSocket. * Getting desperate, I then changed the WaitLatchOrSocket call to specify a two-second timeout, which is the same as we had been using in the pgwin32_waitforsinglesocket call before all this started. And behold, narwhal passes the regression tests again: http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=narwhal&dt=2012-05-14%2022%3A00%3A05&stg=check This trace shows positively that the stats collector occasionally fails to receive incoming data, causing it to stop dead until a timeout occurs (look for "pgstat: wait result 0x8" log entries). After that it's again able to receive data. My conclusion is that the WSAEventSelect coding pattern we're using tickles something seriously broken in Windows Server 2003, causing it to sometimes lose FD_READ events. The other flavors of Windows we have in the buildfarm seem to perform more nearly according to Microsoft's documentation. Having already spent way more time on this than I cared to, my intention is to use the hack solution of making pgstat.c call WaitLatchOrSocket with a two-second timeout on Windows only, while using an infinite timeout elsewhere. I don't feel too awful about this because anybody who's expecting minimal power consumption needs to be using something other than Windows anyway. However, I'm now more convinced than ever that our Windows socket code is rickety to the point that it will fall over anytime somebody looks at it cross-eyed. I think that we need to get rid of the technique of detaching and re-attaching event objects to sockets; at least some versions of Windows just don't work with that. And we definitely need to get rid of that global variable pgwin32_noblock. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers