I've been trying to figure out why my recent attempt to latch-ify the stats collector didn't work well on the Windows buildfarm machines. After a good deal of staring at our code and Microsoft's documentation I have a theory, which I intend to try out shortly. However, it appears to me that this code is riddled with problems that are going to take a lot of work to solve properly. I have neither the interest nor the Windows development environment to do more than getting the pgstats code to work, so somebody else is going to have to step up here.
The first issue is that the patched stats collector code expected recv() on a non-blocking socket to fail with EWOULDBLOCK or a related error code. pgwin32_recv() gets this exactly backwards, and instead waits for the socket to come ready using pgwin32_waitforsinglesocket(). I initially thought this was broken by commit 215cbc90f8db3fc8a70af5572b156f49216c2f70 which introduced the "pgwin32_noblock" hack, but on checking the git history I think it was wrong before then too. This wouldn't actually be fatal in itself, as long as the recv call exits with EINTR on receipt of a signal, which it does; it doesn't much matter whether the stats collector waits at the recv() or the WaitLatchOrSocket() call when it has nothing to do. However, pgwin32_waitforsinglesocket creates an event object and attaches it to the socket, and leaves things that way on exit. Assuming that we do sometimes get to the WaitLatchOrSocket() call, that code creates a different event object and attaches it to the socket. The WSAEventSelect man page http://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx warns against this, saying Issuing a WSAEventSelect for a socket cancels any previous WSAAsyncSelect or WSAEventSelect for the same socket and clears the internal network event record. For example, to associate an event object with both reading and writing network events, the application must call WSAEventSelect with both FD_READ and FD_WRITE, as follows: rc = WSAEventSelect(s, hEventObject, FD_READ|FD_WRITE); It is not possible to specify different event objects for different network events. The following code will not work; the second call will cancel the effects of the first, and only the FD_WRITE network event will be associated with hEventObject2: rc = WSAEventSelect(s, hEventObject1, FD_READ); rc = WSAEventSelect(s, hEventObject2, FD_WRITE); //bad So the socket's attachment is getting moved back and forth between two different event objects. What I suspect is happening is that an incoming FD_READ event is getting "lost" when it happens concurrently with such a switch; probably the wrong one of the two event objects gets marked signaled, and then we just sit and wait. The phraseology about "clearing the internal network event record" is pretty scary too, even though elsewhere on the page it's claimed that FD_READ is level-triggered and so should be re-asserted against a newly selected event object. Another point is that googling suggests that it's unwise to ResetEvent an event that's attached to a socket, which is what will happen during repeated calls to pgwin32_waitforsinglesocket. What I intend to try doing about this is making pgwin32_waitforsinglesocket detach its event object from the socket before returning, ie WSAEventSelect(socket, NULL, 0). This will hopefully decouple it a bit better from WaitLatchOrSocket. Other interesting things I learned from the WSAEventSelect man page: The WSAEventSelect function automatically sets socket s to nonblocking mode, regardless of the value of lNetworkEvents. To set socket s back to blocking mode, it is first necessary to clear the event record associated with socket s via a call to WSAEventSelect with lNetworkEvents set to zero and the hEventObject parameter set to NULL. You can then call ioctlsocket or WSAIoctl to set the socket back to blocking mode. I imagine this is the reason for the otherwise seemingly brain-dead decision that pgwin32_recv() should act like non-blocking sockets are blocking. However, it's still extremely distressing that code that *wants* to operate a socket in non-blocking mode is not able to do so. This ought to be fixed, and not with a global flag like pgwin32_noblock --- it needs to be per-socket. Also, although it's stated that WSAEventSelect will generally set the event to signaled state when any requested network event has already happened, we read this: The FD_WRITE network event is handled slightly differently. An FD_WRITE network event is recorded when a socket is first connected with a call to the connect, ConnectEx, WSAConnect, WSAConnectByList, or WSAConnectByName function or when a socket is accepted with accept, AcceptEx, or WSAAccept function and then after a send fails with WSAEWOULDBLOCK and buffer space becomes available. Therefore, an application can assume that sends are possible starting from the first FD_WRITE network event setting and lasting until a send returns WSAEWOULDBLOCK. After such a failure the application will find out that sends are again possible when an FD_WRITE network event is recorded and the associated event object is set. That is, if you request FD_WRITE events for a pre-existing socket with WSAEventSelect, you will not get one until the outbound network buffer has been filled and then has partially emptied. (This is incredibly broken, but Microsoft evidently has no intention of fixing it.) This behavior probably explains the bogus-looking code in pgwin32_waitforsinglesocket and pgwin32_select where we actually issue a zero-length send() to see if the socket is write-ready. However, it doesn't mean that that code is right or properly documented. AFAICS the appropriate behavior is to try *one* send, without any preceding wait, and if that fails with WSAEWOULDBLOCK then wait normally for FD_WRITE to be signaled. The claimed connection to UDP-ness of the connection looks like it's probably a figment of our imagination, as well. More generally, it seems clear to me that Microsoft's code is designed around the assumption that an event object remains attached to a socket for the lifetime of the socket. This business of transiently associating event objects with sockets looks quite inefficient and is evidently triggering a lot of unpleasant corner-case behaviors. I wonder whether we should not try to make "pgsocket" encapsulate a socket and an associated event object as a single entity on Windows. (Such a struct would be a good place to keep a per-socket noblock flag, too.) I'm not going to tackle that myself though. One other bug I noted is that if pgwin32_waitforsinglesocket gets a timeout condition, it returns 0 without bothering to set errno. This seems only latent at the moment since most callers pass INFINITE timeout, but it probably ought to get cleaned up. Or maybe we should just remove the timeout option, since once I get pgstat.c fixed there will be no callers not passing INFINITE timeout. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers