On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> Hi,
>
> On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > In March, I noticed that a backend got stuck overnight doing:
> >
> > backend_start   | 2024-03-27 22:34:12.774195-07
> > xact_start      | 2024-03-27 22:34:39.741933-07
> > query_start     | 2024-03-27 22:34:41.997276-07
> > state_change    | 2024-03-27 22:34:41.997307-07
> > wait_event_type | IO
> > wait_event      | DataFileExtend
> > state           | active
> > backend_xid     | 330896991
> > backend_xmin    | 330896991
> > query_id        | -3255287420265634540
> > query           | PREPARE mml_1 AS INSERT INTO child.un...
> >
> > The backend was spinning at 100% CPU:
> >
> > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> >   PID WCHAN  %CPU S TTY          TIME COMMAND
> >  7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> >
> > This was postgres 16 STABLE compiled at 14e991db8.
> >
> > It's possible that this is a rare issue that we haven't hit before.
> > It's also possible this is a recent regression.  We previously
> > compiled at b2c9936a7 without hitting any issue (possibly by chance).
> >
> > I could neither strace the process nor attach a debugger; they got
> > stuck attaching.  Maybe there's a kernel issue.  This is a VM running
> > centos7 / 3.10.0-1160.49.1.el7.x86_64.
> >
> > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > 229 3088448
> >
> > When I tried to shut down postgres (hoping to finally be able to attach
> > a debugger), instead it got stuck:
>
> That strongly indicates either a kernel bug or storage having an issue.
> It can't be postgres' fault if an IO never completes.
Is that for sure, even though wchan=? (which I take to mean "not in a
system call") and the process is stuck in user mode?

> What do /proc/$pid/stack say?

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 0 0 17 3 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 140732995874511 140732995874511 140732995874781 0

> > In both cases, the backend got stuck after 10pm, which is when a backup
> > job kicks off, followed by other DB maintenance.  Our backup job uses
> > pg_export_snapshot() + pg_dump --snapshot.  In today's case, the pg_dump
> > would've finished and the snapshot closed at 2023-05-05 22:15.  The
> > backup job did some more pg_dumps involving historic tables without
> > snapshots and finished at 01:11:40, at which point a reindex job started,
> > but it looks like the DB was already stuck for the purpose of reindex,
> > and so the script ended after a handful of commands were "[canceled] due
> > to statement timeout".
>
> Is it possible that you're "just" waiting for very slow IO?  Is there a
> lot of dirty memory?  Particularly on these old kernels that can lead to
> very extreme delays.
>
> grep -Ei 'dirty|writeback' /proc/meminfo

[pryzbyj@telsasoft-centos7 ~]$ grep -Ei 'dirty|writeback' /proc/meminfo
Dirty:                28 kB
Writeback:             0 kB
WritebackTmp:          0 kB

> > Full disclosure: the VM that hit this issue today has had storage-level
> > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as
> > 3 days ago.
>
> So indeed, my suspicion from above is confirmed.

I'd be fine with that conclusion (as in the earlier thread), except that
this has now happened on two different VMs, and the first one has no I/O
issues.  If this were another symptom of a storage failure, and hadn't
previously happened on another VM, I wouldn't be re-reporting it.

--
Justin
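
For completeness, a quick way to read those /proc/<pid>/stat counters (a
minimal sketch, assuming the standard proc(5) field layout, a comm field
without spaces, and the usual CLK_TCK of 100; fields 14 and 15 are utime
and stime in clock ticks):

  awk -v tck=$(getconf CLK_TCK) \
      '{printf "state=%s utime=%.0fs stime=%.0fs\n", $3, $14/tck, $15/tck}' \
      /proc/18468/stat

For the stat line pasted above, that works out to roughly 3.6 seconds of
user time against roughly 10.5 hours of system time.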