Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

Alexander Lakhin Tue, 16 Jul 2024 04:00:18 -0700

Hello,

25.06.2024 11:24, Jelte Fennema-Nio wrote:

My expectation is that that should remove all failure cases. If it
doesn't, I think our best bet is removing the test again.


It looks like that test eventually showed what could be called a virtue.
Please take a look at a recent BF failure [1]:
timed out after 10800 secs
...
# +++ regress install-check in contrib/postgres_fdw +++
# using postmaster on /home/andrew/bf/root/tmp/buildfarm-e2ahpQ, port 5878

So the postgres_fdw test hanged for several hours while running on the
Cygwin animal lorikeet.

I've managed to reproduce this issue in my Cygwin environment by running
the postgres_fdw test in a loop (10 iterations are enough to get the
described effect). And what I'm seeing is that a query-cancelling backend
is stuck inside pgfdw_xact_callback() -> pgfdw_abort_cleanup() ->
pgfdw_cancel_query() -> pgfdw_cancel_query_begin() -> libpqsrv_cancel() ->
WaitLatchOrSocket() -> WaitEventSetWait() -> WaitEventSetWaitBlock() ->
poll().

The timeout value (approximately 30 seconds), which is passed to poll(),
is effectively ignored by this call — the waiting lasts for unlimited time.

This definitely is caused by 2466d6654. (I applied the test change from that
commit to 2466d6654~1 and saw no issue when running the same test in a
loop.)

With gdb attached to a hanging backend, I see the following stack trace:
#0  0x00007ffb7f70d5e4 in ntdll!ZwWaitForSingleObject () from 
/cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ffb7d2e920e in WaitForSingleObjectEx () from 
/cygdrive/c/Windows/System32/KERNELBASE.dll

#2 0x00007ffb5ce78862 in fhandler_socket_wsock::evaluate_events (this=0x800126968, event_mask=50, events=@0x7ffffb208:0, erase=erase@entry=false)

    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/fhandler/socket_inet.cc:268
#3  0x00007ffb5cdef0f5 in peek_socket (me=0xa001a43c0) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:1771

#4 0x00007ffb5cdf211e in select_stuff::poll (this=this@entry=0x7ffffb300, readfds=0x7ffffb570,readfds@entry=0x800000000, writefds=0x7ffffb560, writefds@entry=0x7ffffb5c0, exceptfds=0x7ffffb550, exceptfds@entry=0x7ffb5cdf2c97 <cygwin_select(int, fd_set*, fd_set*, fd_set*, timeval*)+71>) at/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:554#5 0x00007ffb5cdf257e in select (maxfds=maxfds@entry=45, readfds=0x800000000, writefds=0x7ffffb5c0,exceptfds=0x7ffb5cdf2c97 <cygwin_select(int, fd_set*, fd_set*, fd_set*, timeval*)+71>, us=4308570016,

    us@entry=29973000) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:204

#6 0x00007ffb5cdf2927 in pselect (maxfds=45, readfds=0x7ffffb570, writefds=0x7ffffb560, exceptfds=0x7ffffb550,to=<optimized out>, to@entry=0x7ffffb500, set=<optimized out>, set@entry=0x0)

    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:120

#7 0x00007ffb5cdf2c97 in cygwin_select (maxfds=<optimized out>, readfds=<optimized out>, writefds=<optimized out>,exceptfds=<optimized out>, to=0x7ffffb5b0)

    at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:147

#8 0x00007ffb5cddc112 in poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/poll.cc:83

...
and socket_inet.c:268 ([2]) indeed contains an infinite wait call
(LOCK_EVENTS; / WaitForSingleObject (wsock_mtx, INFINITE)).

So it looks like a Cygwin bug, but maybe something should be done on our side
too, at least to prevent such lorikeet failures.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2024-07-12%2010%3A05%3A27
[2] 
https://www.cygwin.com/cgit/newlib-cygwin/tree/winsup/cygwin/fhandler/socket_inet.cc?h=cygwin-3.5.3

Best regards,
Alexander

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

Reply via email to