On Mon, May 06, 2024 at 10:04:13AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-05-06 09:05:38 -0500, Justin Pryzby wrote:
> > In March, I noticed that a backend got stuck overnight doing:
> > 
> > backend_start    | 2024-03-27 22:34:12.774195-07
> > xact_start       | 2024-03-27 22:34:39.741933-07
> > query_start      | 2024-03-27 22:34:41.997276-07
> > state_change     | 2024-03-27 22:34:41.997307-07
> > wait_event_type  | IO
> > wait_event       | DataFileExtend
> > state            | active
> > backend_xid      | 330896991
> > backend_xmin     | 330896991
> > query_id         | -3255287420265634540
> > query            | PREPARE mml_1 AS INSERT INTO child.un...
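
For reference, the above is expanded psql output from pg_stat_activity; a
query along these lines shows the same columns (the wait_event filter below
is an illustrative guess, not necessarily what was actually run):

    SELECT backend_start, xact_start, query_start, state_change,
           wait_event_type, wait_event, state,
           backend_xid, backend_xmin, query_id, query
    FROM pg_stat_activity
    WHERE wait_event = 'DataFileExtend';
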
> > 
> > The backend was spinning at 100% CPU:
> > 
> > [pryzbyj@telsa2021 ~]$ ps -O wchan,pcpu 7881
> >    PID WCHAN  %CPU S TTY          TIME COMMAND
> >   7881 ?      99.4 R ?        08:14:55 postgres: telsasoft ts [local] INSERT
> > 
> > This was postgres 16 STABLE compiled at 14e991db8.
> > 
> > It's possible that this is a rare issue that we haven't hit before.
> > It's also possible that this is a recent regression.  We previously
> > compiled at b2c9936a7 without hitting any issue (possibly by chance).
> > 
> > I could neither strace the process nor attach a debugger; both got
> > stuck while attaching.  It's possible there's a kernel issue.  This is a
> > VM running centos7 / 3.10.0-1160.49.1.el7.x86_64.
> 
> > $ awk '{print $14, $15}' /proc/7881/stat # usr / sys
> > 229 3088448
> > 
> > When I tried to shut down postgres (hoping to finally be able to attach
> > a debugger), instead it got stuck:
> 
> That strongly indicates either a kernel bug or storage having an issue. It
> can't be postgres' fault if an IO never completes.

Is that certain even though wchan is "?" (which I take to mean "not in a
system call") and the process is stuck in user mode?

> What does /proc/$pid/stack say?

[pryzbyj@telsasoft-centos7 ~]$ sudo cat /proc/18468/stat
18468 (postgres) R 2274 18468 18468 0 -1 4857920 91836 0 3985 0 364 3794271 0 0 
20 0 1 0 6092292660 941846528 10 18446744073709551615 4194304 12848820 
140732995870240 140732995857304 139726958536394 0 4194304 19929088 536896135 0 
0 0 17 3 0 0 1682 0 0 14949632 15052146 34668544 140732995874457 
140732995874511 140732995874511 140732995874781 0
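
For reference, the same usr/sys fields as in the earlier awk one-liner
(fields 14 and 15 of /proc/$pid/stat, i.e. utime/stime in clock ticks per
proc(5)) come out of that line as:

    $ awk '{print $14, $15}' /proc/18468/stat
    364 3794271

i.e. nearly all of the accumulated CPU time is system time here too.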

> > In both cases, the backend got stuck after 10pm, which is when a backup
> > job kicks off, followed by other DB maintenance.  Our backup job uses
> > pg_export_snapshot() + pg_dump --snapshot.  In today's case, the pg_dump
> > would've finished and the snapshot closed at 2024-05-05 22:15.  The backup
> > job did some more pg_dumps involving historic tables without snapshots
> > and finished at 01:11:40, at which point a reindex job started, but it
> > looks like the DB was already stuck for the purpose of reindex, and so
> > the script ended after a handful of commands were "[canceled] due to
> > statement timeout".
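
For context, the pg_export_snapshot() + pg_dump --snapshot pattern mentioned
above looks roughly like the sketch below; the snapshot ID and database name
are made up, and the exporting transaction has to stay open for the whole
dump.  In one psql session, kept open until the dump finishes:

    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();  -- returns something like '00000003-0000001B-1'

Then, from the shell, while that transaction is still open:

    $ pg_dump --snapshot=00000003-0000001B-1 -Fc -f thedb.dump thedb
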
> 
> Is it possible that you're "just" waiting for very slow IO? Is there a lot of
> dirty memory? On these old kernels in particular, that can lead to very
> extreme delays.
> 
> grep -Ei 'dirty|writeback' /proc/meminfo

[pryzbyj@telsasoft-centos7 ~]$ grep -Ei 'dirty|writeback' /proc/meminfo
Dirty:                28 kB
Writeback:             0 kB
WritebackTmp:          0 kB

> > Full disclosure: the VM that hit this issue today has had storage-level
> > errors (reported here at ZZqr_GTaHyuW7fLp@pryzbyj2023), as recently as 3
> > days ago.
> 
> So indeed, my suspicion from above is confirmed.

I'd be fine with that conclusion (as in the earlier thread), except that this
has now happened on two different VMs, and the first one has no I/O
issues.  If this were just another symptom of a storage failure, and it
hadn't also happened on a second VM, I wouldn't be re-reporting it.

-- 
Justin

