13.05.2024 23:17, Tom Lane wrote:
3. We don't know exactly why hamerkop suddenly started seeing these
failures, but a plausible theory emerges after noting that its
reported time for the successful "make check" step dropped pretty
substantially right when this started.  In the v13 branch, "make
check" was taking 2:18 or more in the several runs right before the
first isolationcheck failure, but 1:40 or less just after.  So it
looks like the animal was moved onto faster hardware.  That feeds
into this problem because, with a successful isolationcheck run
taking more than a minute, there was enough time for some of the
earlier stuck sessions to time out and exit before their
postmaster-child slots were needed.

Yes, and one thing I can't explain yet, is why REL_14_STABLE+ timings
substantially differ from REL_13_STABLE-, say, for the check stage:
REL_14_STABLE: the oldest available test log (from 2021-10-30) shows
check (00:03:47) and the newest one (from 2024-05-12): check (00:03:18).
Whilst on REL_13_STABLE the oldest available log (from 2021-08-06) shows
check duration 00:03:00, then it decreased to 00:02:24 (2021-09-22)/
00:02:14 (2021-11-07), and now it's 1:40, as you said.

Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27
secs) and on REL_14_STABLE (33, 29, 29 secs).

14.05.2024 03:38, Thomas Munro wrote:
I was beginning to suspect that lingering odour myself.  I haven't
look at the GSS code but I was imagining that what we have here is
perhaps not unsent data dropped on the floor due to linger policy
(unclean socket close on process exist), but rather that the server
didn't see the socket as ready to read because it lost track of the
FD_CLOSE somewhere because the client closed it gracefully, and our
server-side FD_CLOSE handling has always been a bit suspect.  I wonder
if the GSS code is somehow more prone to brokenness.  One thing we
learned in earlier problems was that abortive/error disconnections
generate FD_CLOSE repeatedly, while graceful ones give you only one.
In other words, if the other end politely calls closesocket(), the
server had better not miss the FD_CLOSE event, because it won't come
again.   That's what

https://commitfest.postgresql.org/46/3523/

is intended to fix.  Does it help here?  Unfortunately that's
unpleasantly complicated and unbackpatchable (keeping a side-table of
socket FDs and event handles, so we don't lose events between the
cracks).

Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
state when that patch set can be applied).

An excerpt (all lines related to process 12500) from a failed run log
without the patch set:
2024-05-14 05:57:29.526 UTC [8228:128] DEBUG:  forked new backend, pid=12500 
socket=5524
2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG:  connection received: 
host=::1 port=51394
2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| 
before ProcessStartupPacket
2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| 
before secure_open_gssapi(), GSSok: G
2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| 
before read_or_wait
2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before 
secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after 
secure_raw_read: -1, errno: 10035
2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before 
WaitLatchOrSocket()
2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after 
WaitLatchOrSocket
2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before 
secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after 
secure_raw_read: 0, errno: 10035
2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before 
WaitLatchOrSocket()
...
2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG:  server process (PID 12500) 
exited with exit code 1
# at the end of the test run

And an excerpt (all lines related to process 11736) from a successful run
log with the patch set applied:
2024-05-14 06:03:57.216 UTC [4524:130] DEBUG:  forked new backend, pid=11736 
socket=5540
2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG:  connection received: 
host=::1 port=51914
2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| 
before ProcessStartupPacket
2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| 
before secure_open_gssapi(), GSSok: G
2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| 
before read_or_wait
2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before 
secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after 
secure_raw_read: -1, errno: 10035
2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before 
WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after 
WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before 
secure_raw_read(); PqGSSRecvLength: 0, len: 4
2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after 
secure_raw_read: 0, errno: 10035
2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before 
WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after 
WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| 
read_or_wait returned -1
2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| 
secure_open_gssapi() returned error
2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| 
after ProcessStartupPacket
2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| 
status: -1
2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 
before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 
on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 
callbacks to make
2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG:  exit(0)
2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 
before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 
on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 
callbacks to make
2024-05-14 06:03:57.243 UTC [4524:132] DEBUG:  forked new backend, pid=10536 
socket=5540
2024-05-14 06:03:57.243 UTC [4524:133] DEBUG:  server process (PID 11736) 
exited with exit code 0

Best regards,
Alexander


Reply via email to