Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?

2025-09-14 Thread Pawel Kudzia
On Sun, Sep 14, 2025 at 12:35 PM Laurenz Albe  wrote:
>
> On Sun, 2025-09-14 at 10:30 +0200, Pawel Kudzia wrote:
> > I've hit silent data corruption in TOAST data, leading to an
> > infinite loop when accessing a bytea column for one particular row.
> > I did not suffer data loss - the data on the streaming replica was
> > fine, and I used it to rebuild the primary server.
> >
> > I'm wondering if there's any proactive way of detecting that type of
> > issue, rather than discovering a pile-up of SELECT queries leading to
> > CPU starvation or finding hung backup jobs.
> >
> > Thanks in advance for your suggestions!
> >
> > I was originally running PostgreSQL 17.2 installed from the project's
> > deb packages, under Debian 12 on amd64. The environment has had no
> > recent crashes and uses ECC memory and server-grade hardware. I run
> > pg_dumpall every 24h; on a certain day it hung. On re-run it hung
> > again, each time leaving a single PostgreSQL process using 100% of
> > one CPU core.
> >
> > [...]
> >
> > PostgreSQL's log does not have any error messages indicating data 
> > corruption.
> >
> > This server does not have checksums enabled, but - as I understand -
> > such checksums cannot be checked online anyway. So - how can I detect
> > similar corruption?
>
> A proactive way of detecting TOAST corruption... how about
>
>   pg_dump -f /dev/null yourdatabase
>
> If there is TOAST corruption, that should give you an error.
>
> If your standby does not have the same problem, that increases the
> likelihood that the cause is a hardware problem.  Of course, it could
> still be a software bug.
>
> Yours,
> Laurenz Albe

Thanks for your answer! pg_dump mydatabase ends up in roughly the same
state as the SELECT statement: it hangs, leaving 'postgres: 17/main:
postgres application_cache [local] COPY' using 100% of a single CPU
core, with no messages in PostgreSQL's logs.

gdb stack trace for that process:

#0  0x55cb571ef444 in hash_search_with_hash_value ()
#1  0x55cb5706217a in BufTableLookup ()
#2  0x55cb5706715f in StartReadBuffer ()
#3  0x55cb57068671 in ReadBufferExtended ()
#4  0x55cb56d91573 in _bt_relandgetbuf ()
#5  0x55cb56d96b9b in _bt_moveright ()
#6  0x55cb56d96d69 in _bt_search ()
#7  0x55cb56d97b83 in _bt_first ()
#8  0x55cb56d93bbf in btgettuple ()
#9  0x55cb56d8b289 in index_getnext_tid ()
#10 0x55cb56d8b3ed in index_getnext_slot ()
#11 0x55cb56d8a957 in systable_getnext_ordered ()
#12 0x55cb56d812bd in heap_fetch_toast_slice ()
#13 0x55cb56d3abae in ?? ()
#14 0x55cb56d3afc5 in detoast_attr ()
#15 0x55cb571b7bd6 in byteaout ()
#16 0x55cb571eb9a1 in OutputFunctionCall ()
#17 0x55cb56e711c7 in ?? ()
#18 0x55cb56e71b2f in DoCopyTo ()
#19 0x55cb56e6c06f in DoCopy ()
#20 0x55cb570a2726 in standard_ProcessUtility ()
#21 0x7fd02f890a81 in ?? () from
/usr/lib/postgresql/17/lib/pg_stat_statements.so
#22 0x55cb570a0c01 in ?? ()
#23 0x55cb570a0d18 in ?? ()
#24 0x55cb570a1296 in PortalRun ()
#25 0x55cb5709d4a5 in ?? ()
#26 0x55cb5709e82d in PostgresMain ()
#27 0x55cb57099a3f in BackendMain ()
#28 0x55cb570072aa in postmaster_child_launch ()
#29 0x55cb5700adb9 in ?? ()
#30 0x55cb5700ccc8 in PostmasterMain ()
#31 0x55cb56d2ea11 in main ()

I was curious if there's any other tool that can spot that type of
corruption and state it explicitly rather than guess at it by
monitoring execution time.
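
For illustration, that crude time-based monitoring boils down to
something like the following (the 15-minute threshold is an arbitrary
example, not a recommendation):

  psql -X -c "SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query
              FROM pg_stat_activity
              WHERE state = 'active' AND now() - query_start > interval '15 minutes';"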





--
regards,
Pawel Kudzia




Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?

2025-09-14 Thread Laurenz Albe
On Sun, 2025-09-14 at 14:47 +0200, Pawel Kudzia wrote:
> On Sun, Sep 14, 2025 at 12:35 PM Laurenz Albe  
> wrote:
> > On Sun, 2025-09-14 at 10:30 +0200, Pawel Kudzia wrote:
> > > I've hit silent data corruption in TOAST data, leading to an
> > > infinite loop when accessing a bytea column for one particular row.
> > > I did not suffer data loss - the data on the streaming replica was
> > > fine, and I used it to rebuild the primary server.
> > >
> > > I'm wondering if there's any proactive way of detecting that type of
> > > issue, rather than discovering a pile-up of SELECT queries leading to
> > > CPU starvation or finding hung backup jobs.
> > > 
> > > PostgreSQL's log does not have any error messages indicating data 
> > > corruption.
> > > 
> > > This server does not have checksums enabled, but - as I understand -
> > > such checksums cannot be checked online anyway. So - how can I detect
> > > similar corruption?
> > 
> > A proactive way of detecting TOAST corruption... how about
> > 
> >   pg_dump -f /dev/null yourdatabase
> > 
> > If there is TOAST corruption, that should give you an error.
> 
> Thanks for your answer! pg_dump mydatabase ends up in roughly the same
> state as the SELECT statement: it hangs, leaving 'postgres: 17/main:
> postgres application_cache [local] COPY' using 100% of a single CPU
> core, with no messages in PostgreSQL's logs.
> 
> I was curious if there's any other tool that can spot that type of
> corruption and state it explicitly rather than guess at it by
> monitoring execution time.

Data checksums would certainly be a step in the right direction.
They are checked whenever the data are read from storage.
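
For illustration (the paths below assume the Debian default cluster
layout and are not taken from this thread), you can check whether
checksums are enabled and, with the cluster cleanly shut down, turn
them on offline:

  psql -c 'SHOW data_checksums;'   # prints "on" or "off"

  # offline enablement, run as the postgres OS user:
  pg_ctlcluster 17 main stop
  /usr/lib/postgresql/17/bin/pg_checksums --enable --pgdata=/var/lib/postgresql/17/main
  pg_ctlcluster 17 main start

Enabling checksums this way rewrites every block, so the downtime
scales with the cluster size.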

There is no tool that will reliably detect all kinds of data corruption.

Other tools at your disposal are the functions from the amcheck
contrib module (or the pg_amcheck executable that calls them).
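
For example (one plausible invocation only; adjust connection options
and job count to your environment):

  # checks every database in the cluster; add --install-missing if the
  # amcheck extension has not been created yet
  pg_amcheck --all --heapallindexed --progress --jobs=4

Under the hood that runs verify_heapam() plus bt_index_check() from
the amcheck extension, which you can also call directly from SQL
against individual tables and indexes. The --parent-check option does
a more thorough btree check but takes locks that block writes.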

Yours,
Laurenz Albe




Re: Silent data corruption in PostgreSQL 17 - how to detect it proactively?

2025-09-14 Thread Laurenz Albe
On Sun, 2025-09-14 at 10:30 +0200, Pawel Kudzia wrote:
> I've hit silent data corruption in TOAST data, leading to an infinite
> loop when accessing a bytea column for one particular row. I did not
> suffer data loss - the data on the streaming replica was fine, and I
> used it to rebuild the primary server.
>
> I'm wondering if there's any proactive way of detecting that type of
> issue, rather than discovering a pile-up of SELECT queries leading to
> CPU starvation or finding hung backup jobs.
> 
> Thanks in advance for your suggestions!
> 
> I was originally running PostgreSQL 17.2 installed from the project's
> deb packages, under Debian 12 on amd64. The environment has had no
> recent crashes and uses ECC memory and server-grade hardware. I run
> pg_dumpall every 24h; on a certain day it hung. On re-run it hung
> again, each time leaving a single PostgreSQL process using 100% of one
> CPU core.
> 
> [...]
> 
> PostgreSQL's log does not have any error messages indicating data corruption.
> 
> This server does not have checksums enabled, but - as I understand -
> such checksums cannot be checked online anyway. So - how can I detect
> similar corruption?

A proactive way of detecting TOAST corruption... how about

  pg_dump -f /dev/null yourdatabase

If there is TOAST corruption, that should give you an error.
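
For illustration, a cron-able variant of that check (the timeout value,
database name and notification command are made-up examples, not part
of this thread):

  # non-zero exit means pg_dump failed; exit code 124 means it exceeded
  # the time budget, i.e. it is probably stuck rather than erroring out
  timeout 6h pg_dump -f /dev/null yourdatabase \
    || echo "pg_dump check failed on $(hostname)" | mail -s 'pg_dump check' dba@example.com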

If your standby does not have the same problem, that increases the
likelihood that the cause is a hardware problem.  Of course, it could
still be a software bug.

Yours,
Laurenz Albe




Silent data corruption in PostgreSQL 17 - how to detect it proactively?

2025-09-14 Thread Pawel Kudzia
I've hit silent data corruption in TOAST data, leading to an infinite
loop when accessing a bytea column for one particular row. I did not
suffer data loss - the data on the streaming replica was fine, and I
used it to rebuild the primary server.

I'm wondering if there's any proactive way of detecting that type of
issue, rather than discovering a pile-up of SELECT queries leading to
CPU starvation or finding hung backup jobs.

Thanks in advance for your suggestions!

I was originally running PostgreSQL 17.2 installed from the project's
deb packages, under Debian 12 on amd64. The environment has had no
recent crashes and uses ECC memory and server-grade hardware. I run
pg_dumpall every 24h; on a certain day it hung. On re-run it hung
again, each time leaving a single PostgreSQL process using 100% of one
CPU core.

I've narrowed this problem to specific rows in a specific table.

Table definition:

CREATE TABLE public.generic_storage (
id integer NOT NULL,
content_type smallint NOT NULL,
model_type smallint NOT NULL,
content_updated_date timestamp without time zone NOT NULL,
stored_date timestamp without time zone NOT NULL,
content_data bytea NOT NULL,
data_source smallint DEFAULT '-1'::integer NOT NULL
);

ALTER TABLE ONLY public.generic_storage
ADD CONSTRAINT generic_storage_unique UNIQUE (id, content_type,
data_source);

CREATE INDEX idx_generic_storage_content_type_id ON
public.generic_storage USING btree (content_type, id);

CREATE INDEX idx_generic_storage_content_updated_date ON
public.generic_storage USING btree (content_updated_date);

Query that works fine:

SELECT id, content_type, model_type, content_updated_date,
stored_date, data_source FROM generic_storage WHERE id=79923514 AND
content_type=18;

EXPLAIN ANALYZE SELECT id, content_type, model_type,
content_updated_date, stored_date, data_source FROM generic_storage
WHERE id=79923514 AND content_type=18;

 Index Scan using idx_generic_storage_content_type_id on
generic_storage  (cost=0.57..88.04 rows=30 width=26) (actual
time=0.055..0.059 rows=1 loops=1)
   Index Cond: ((content_type = 18) AND (id = 79923514))
 Planning Time: 0.191 ms
 Execution Time: 0.119 ms
(4 rows)

Query that hangs:

SELECT content_data FROM generic_storage WHERE id=79923514 AND content_type=18;

I've upgraded PostgreSQL to the latest 17.6-1.pgdg12+1 and still get
this issue. Here is the gdb stack trace dumped for the process
"postgres: 17/main: postgres application_cache [local] SELECT" that is
taking 100% of CPU time:

#0  0x561b4a829fd1 in ?? ()
#1  0x561b4a82eb49 in ReleaseAndReadBuffer ()
#2  0x561b4a557573 in _bt_relandgetbuf ()
#3  0x561b4a55cb9b in _bt_moveright ()
#4  0x561b4a55cd69 in _bt_search ()
#5  0x561b4a55db83 in _bt_first ()
#6  0x561b4a559bbf in btgettuple ()
#7  0x561b4a551289 in index_getnext_tid ()
#8  0x561b4a5513ed in index_getnext_slot ()
#9  0x561b4a550957 in systable_getnext_ordered ()
#10 0x561b4a5472bd in heap_fetch_toast_slice ()
#11 0x561b4a500bae in ?? ()
#12 0x561b4a500fc5 in detoast_attr ()
#13 0x561b4a97dbd6 in byteaout ()
#14 0x561b4a9b19a1 in OutputFunctionCall ()
#15 0x561b4a504f85 in ?? ()
#16 0x561b4a6b01ef in standard_ExecutorRun ()
#17 0x7f0aa982a47d in ?? () from
/usr/lib/postgresql/17/lib/pg_stat_statements.so
#18 0x561b4a865ce9 in ?? ()
#19 0x561b4a867172 in PortalRun ()
#20 0x561b4a8634a5 in ?? ()
#21 0x561b4a86482d in PostgresMain ()
#22 0x561b4a85fa3f in BackendMain ()
#23 0x561b4a7cd2aa in postmaster_child_launch ()
#24 0x561b4a7d0db9 in ?? ()
#25 0x561b4a7d2cc8 in PostmasterMain ()
#26 0x561b4a4f4a11 in main ()
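
(For completeness: a backtrace like the one above can be captured
non-interactively with something like

  gdb --batch -ex bt -p <pid-of-spinning-backend>

given matching debug symbols; the "??" frames are simply symbols gdb
could not resolve.)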

PostgreSQL's log does not have any error messages indicating data corruption.

This server does not have checksums enabled, but - as I understand -
such checksums cannot be checked online anyway. So - how can I detect
similar corruption?

Thank you!


-- 
regards,
Pawel Kudzia