Hi,

On 2024-02-09 18:00:01 +0300, Alexander Lakhin wrote:
> I've managed to reproduce this issue (which still persists:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-02-04%2001%3A53%3A44
> ) and saw that it's not checkpointer, but walsender is hanging:
How did you reproduce this?

> And I see the walsender process still running (I've increased the timeout
> to keep the test running and to connect to the process in question), with
> the following stack trace:
> #0  0x00007fe4feac3d16 in epoll_wait (epfd=5, events=0x55b279b70f38,
>     maxevents=1, timeout=timeout@entry=-1)
>     at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> #1  0x000055b278b9ab32 in WaitEventSetWaitBlock (set=set@entry=0x55b279b70eb8,
>     cur_timeout=cur_timeout@entry=-1,
>     occurred_events=occurred_events@entry=0x7ffda5ffac90,
>     nevents=nevents@entry=1) at latch.c:1571
> #2  0x000055b278b9b6b6 in WaitEventSetWait (set=0x55b279b70eb8,
>     timeout=timeout@entry=-1,
>     occurred_events=occurred_events@entry=0x7ffda5ffac90,
>     nevents=nevents@entry=1,
>     wait_event_info=wait_event_info@entry=100663297) at latch.c:1517
> #3  0x000055b278a3f11f in secure_write (port=0x55b279b65aa0,
>     ptr=ptr@entry=0x55b279bfbd08, len=len@entry=21470) at be-secure.c:296
> #4  0x000055b278a460dc in internal_flush () at pqcomm.c:1356
> #5  0x000055b278a461d4 in internal_putbytes (s=s@entry=0x7ffda5ffad3c "E\177",
>     len=len@entry=1) at pqcomm.c:1302

So it's the issue that we wait effectively forever to send a FATAL. I've
previously proposed that we should not block sending out fatal errors, given
that blocking there allows clients to prevent graceful restarts and a lot of
other things.

Greetings,

Andres Freund
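For readers following along: the pattern behind the stack trace is generic.
The walsender's socket buffer is full because the peer stopped reading, and
the wait (timeout=-1 in WaitEventSetWait above) never returns. The standalone
sketch below is not PostgreSQL code; it just illustrates, with plain poll()
and send() on a socketpair, how a bounded timeout in that wait would let the
sender give up instead of hanging forever. The function names and the 1-second
timeout are purely illustrative.

/*
 * Illustrative sketch only, not PostgreSQL code.  A non-blocking writer
 * whose peer never reads eventually fills the socket buffer; waiting for
 * writability with timeout -1 then blocks indefinitely, while a bounded
 * timeout lets the sender abandon the message.
 */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Write "len" bytes to a non-blocking socket, waiting at most "timeout_ms"
 * milliseconds (forever if -1) whenever the buffer is full.
 * Returns 0 on success, -1 on error or timeout.
 */
static int
send_with_timeout(int sock, const char *buf, size_t len, int timeout_ms)
{
	size_t		sent = 0;

	while (sent < len)
	{
		ssize_t		n = send(sock, buf + sent, len - sent, 0);

		if (n > 0)
		{
			sent += (size_t) n;
			continue;
		}
		if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		{
			struct pollfd pfd = {.fd = sock, .events = POLLOUT};
			int			rc = poll(&pfd, 1, timeout_ms);

			if (rc == 0)
			{
				/* timed out; with timeout_ms == -1 we would hang here */
				fprintf(stderr, "send timed out after %d ms\n", timeout_ms);
				return -1;
			}
			if (rc < 0 && errno != EINTR)
				return -1;
			continue;
		}
		if (n < 0 && errno == EINTR)
			continue;
		return -1;				/* hard error */
	}
	return 0;
}

int
main(void)
{
	int			sv[2];
	char		payload[64 * 1024];

	/* socketpair stands in for the client connection */
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return 1;
	fcntl(sv[0], F_SETFL, O_NONBLOCK);
	memset(payload, 'E', sizeof(payload));	/* stand-in for the FATAL message */

	/*
	 * Nobody ever reads from sv[1], so the buffer fills up.  With a one
	 * second timeout the loop terminates; with -1 (the behaviour in the
	 * stack trace above) the send would never return.
	 */
	while (send_with_timeout(sv[0], payload, sizeof(payload), 1000) == 0)
		;

	close(sv[0]);
	close(sv[1]);
	return 0;
}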