Hello,

25.06.2024 11:24, Jelte Fennema-Nio wrote:
My expectation is that that should remove all failure cases. If it
doesn't, I think our best bet is removing the test again.

It looks like that test eventually showed what could be called a virtue.
Please take a look at a recent BF failure [1]:
timed out after 10800 secs
...
# +++ regress install-check in contrib/postgres_fdw +++
# using postmaster on /home/andrew/bf/root/tmp/buildfarm-e2ahpQ, port 5878

So the postgres_fdw test hanged for several hours while running on the
Cygwin animal lorikeet.

I've managed to reproduce this issue in my Cygwin environment by running
the postgres_fdw test in a loop (10 iterations are enough to get the
described effect). And what I'm seeing is that a query-cancelling backend
is stuck inside pgfdw_xact_callback() -> pgfdw_abort_cleanup() ->
pgfdw_cancel_query() -> pgfdw_cancel_query_begin() -> libpqsrv_cancel() ->
WaitLatchOrSocket() -> WaitEventSetWait() -> WaitEventSetWaitBlock() ->
poll().

The timeout value (approximately 30 seconds), which is passed to poll(),
is effectively ignored by this call — the waiting lasts for unlimited time.

This definitely is caused by 2466d6654. (I applied the test change from that
commit to 2466d6654~1 and saw no issue when running the same test in a
loop.)

With gdb attached to a hanging backend, I see the following stack trace:
#0  0x00007ffb7f70d5e4 in ntdll!ZwWaitForSingleObject () from 
/cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ffb7d2e920e in WaitForSingleObjectEx () from 
/cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x00007ffb5ce78862 in fhandler_socket_wsock::evaluate_events (this=0x800126968, event_mask=50, events=@0x7ffffb208: 0, erase=erase@entry=false)
    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/fhandler/socket_inet.cc:268
#3  0x00007ffb5cdef0f5 in peek_socket (me=0xa001a43c0) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:1771
#4  0x00007ffb5cdf211e in select_stuff::poll (this=this@entry=0x7ffffb300, readfds=0x7ffffb570, readfds@entry=0x800000000, writefds=0x7ffffb560, writefds@entry=0x7ffffb5c0, exceptfds=0x7ffffb550,     exceptfds@entry=0x7ffb5cdf2c97 <cygwin_select(int, fd_set*, fd_set*, fd_set*, timeval*)+71>) at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:554 #5  0x00007ffb5cdf257e in select (maxfds=maxfds@entry=45, readfds=0x800000000, writefds=0x7ffffb5c0, exceptfds=0x7ffb5cdf2c97 <cygwin_select(int, fd_set*, fd_set*, fd_set*, timeval*)+71>, us=4308570016,
    us@entry=29973000) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:204
#6  0x00007ffb5cdf2927 in pselect (maxfds=45, readfds=0x7ffffb570, writefds=0x7ffffb560, exceptfds=0x7ffffb550, to=<optimized out>, to@entry=0x7ffffb500, set=<optimized out>, set@entry=0x0)
    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:120
#7  0x00007ffb5cdf2c97 in cygwin_select (maxfds=<optimized out>, readfds=<optimized out>, writefds=<optimized out>, exceptfds=<optimized out>, to=0x7ffffb5b0)
    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:147
#8  0x00007ffb5cddc112 in poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/poll.cc:83
...
and socket_inet.c:268 ([2]) indeed contains an infinite wait call
(LOCK_EVENTS; / WaitForSingleObject (wsock_mtx, INFINITE)).

So it looks like a Cygwin bug, but maybe something should be done on our side
too, at least to prevent such lorikeet failures.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2024-07-12%2010%3A05%3A27
[2] 
https://www.cygwin.com/cgit/newlib-cygwin/tree/winsup/cygwin/fhandler/socket_inet.cc?h=cygwin-3.5.3

Best regards,
Alexander


Reply via email to