13.05.2024 23:17, Tom Lane wrote:
3. We don't know exactly why hamerkop suddenly started seeing these failures, but a plausible theory emerges after noting that its reported time for the successful "make check" step dropped pretty substantially right when this started. In the v13 branch, "make check" was taking 2:18 or more in the several runs right before the first isolationcheck failure, but 1:40 or less just after. So it looks like the animal was moved onto faster hardware. That feeds into this problem because, with a successful isolationcheck run taking more than a minute, there was enough time for some of the earlier stuck sessions to time out and exit before their postmaster-child slots were needed.
Yes, and one thing I can't explain yet, is why REL_14_STABLE+ timings substantially differ from REL_13_STABLE-, say, for the check stage: REL_14_STABLE: the oldest available test log (from 2021-10-30) shows check (00:03:47) and the newest one (from 2024-05-12): check (00:03:18). Whilst on REL_13_STABLE the oldest available log (from 2021-08-06) shows check duration 00:03:00, then it decreased to 00:02:24 (2021-09-22)/ 00:02:14 (2021-11-07), and now it's 1:40, as you said. Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27 secs) and on REL_14_STABLE (33, 29, 29 secs). 14.05.2024 03:38, Thomas Munro wrote:
I was beginning to suspect that lingering odour myself. I haven't look at the GSS code but I was imagining that what we have here is perhaps not unsent data dropped on the floor due to linger policy (unclean socket close on process exist), but rather that the server didn't see the socket as ready to read because it lost track of the FD_CLOSE somewhere because the client closed it gracefully, and our server-side FD_CLOSE handling has always been a bit suspect. I wonder if the GSS code is somehow more prone to brokenness. One thing we learned in earlier problems was that abortive/error disconnections generate FD_CLOSE repeatedly, while graceful ones give you only one. In other words, if the other end politely calls closesocket(), the server had better not miss the FD_CLOSE event, because it won't come again. That's what https://commitfest.postgresql.org/46/3523/ is intended to fix. Does it help here? Unfortunately that's unpleasantly complicated and unbackpatchable (keeping a side-table of socket FDs and event handles, so we don't lose events between the cracks).
Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last state when that patch set can be applied). An excerpt (all lines related to process 12500) from a failed run log without the patch set: 2024-05-14 05:57:29.526 UTC [8228:128] DEBUG: forked new backend, pid=12500 socket=5524 2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG: connection received: host=::1 port=51394 2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket 2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G 2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait 2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4 2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035 2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket() 2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket 2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4 2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035 2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket() ... 2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG: server process (PID 12500) exited with exit code 1 # at the end of the test run And an excerpt (all lines related to process 11736) from a successful run log with the patch set applied: 2024-05-14 06:03:57.216 UTC [4524:130] DEBUG: forked new backend, pid=11736 socket=5540 2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG: connection received: host=::1 port=51914 2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket 2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G 2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait 2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4 2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035 2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket() 2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket 2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len: 4 2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035 2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket() 2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket 2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| read_or_wait returned -1 2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| secure_open_gssapi() returned error 2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| after ProcessStartupPacket 2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| status: -1 2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make 2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make 2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 callbacks to make 2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG: exit(0) 2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make 2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make 2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 callbacks to make 2024-05-14 06:03:57.243 UTC [4524:132] DEBUG: forked new backend, pid=10536 socket=5540 2024-05-14 06:03:57.243 UTC [4524:133] DEBUG: server process (PID 11736) exited with exit code 0 Best regards, Alexander