On 15/03/17 17:55, Tom Lane wrote:
> Andrew Dunstan <andrew.duns...@2ndquadrant.com> writes:
>> On 03/03/2017 11:11 PM, Tom Lane wrote:
>>> Yeah, I was wondering if this is just exposing a pre-existing bug.
>>> However, the "normal" path operates by repeatedly invoking PQconnectPoll
>>> (cf. connectDBComplete) so it's not immediately obvious how such a bug
>>> would've escaped detection.
>
>> (After a long period of fruitless empirical testing I turned to the code)
>> Maybe I'm missing something, but connectDBComplete() handles a return of
>> PGRES_POLLING_OK as a success while connectDBStart() seems not to. I
>> don't find anywhere in our code other than libpqwalreceiver that
>> actually uses that interface, so it's not surprising if it's now
>> failing. So my bet is it is indeed a long-standing bug.
>
> Meh ... that argument doesn't hold water, because the old code here called
> PQconnectdbParams which is just PQconnectStartParams then
> connectDBComplete. So the problem cannot be in connectDBStart; that's
> common to both paths. It has to be some discrepancy between what
> connectDBComplete does and what the new loop in libpqwalreceiver is doing.
>
> The original loop coding in 1e8a85009 was not very close to the documented
> spec for PQconnectPoll at all, and while e434ad39a made it closer, it's
> still not really the same: connectDBComplete doesn't call PQconnectPoll
> until the socket is known read-ready or write-ready. The walreceiver loop
> does not guarantee that, but would make an additional call after any
> random other wakeup. It's not very clear why bowerbird, and only
> bowerbird, would be seeing such wakeups --- but I'm having a really hard
> time seeing any other explanation for the change in behavior. (I wonder
> whether bowerbird is telling us that WaitLatchOrSocket can sometimes
> return prematurely on Windows.)
>
> I'm also pretty sure that the ResetLatch call is in the wrong place which
> could lead to missed wakeups, though that's the opposite of the immediate
> problem.
>
> I'll try correcting these things and we'll see if it gets any better.
>
Looks like that didn't help either. I set up my own Windows machine and
can reproduce the issue. I played around a bit and could not find a fix
other than adding WL_TIMEOUT and a short timeout to WaitLatchOrSocket
(otherwise it waits a very long time in WaitLatchOrSocket before failing).
So I wonder if this is the same issue that led us to use different coding
for WaitLatchOrSocket in pgstat.c (lines ~3918-3940).

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services