Re: Changing the state of data checksums in a running cluster

2025-09-01 Thread Tomas Vondra
On 8/29/25 16:38, Tomas Vondra wrote: > On 8/29/25 16:26, Tomas Vondra wrote: >> ... >> >> I've seen these failures after changing checksums in both directions, >> both after enabling and disabling. But I've only ever saw this after >> immediate shutdown, never after fast shutdown. (It's interestin

Re: Changing the state of data checksums in a running cluster

2025-08-29 Thread Tomas Vondra
On 8/29/25 16:26, Tomas Vondra wrote: > ... > > I've seen these failures after changing checksums in both directions, > both after enabling and disabling. But I've only ever saw this after > immediate shutdown, never after fast shutdown. (It's interesting the > pg_checksums failed only after fast

Re: Changing the state of data checksums in a running cluster

2025-08-29 Thread Tomas Vondra
On 8/27/25 14:42, Tomas Vondra wrote: > On 8/27/25 14:39, Tomas Vondra wrote: >> ... >> >> And this happened on Friday: >> >> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0 >> Author: Alexander Korotkov >> Date: Fri Aug 22 18:44:39 2025 +0300 >> >> Revert "Get rid of WALBufMappingLock" >> >

Re: Changing the state of data checksums in a running cluster

2025-08-28 Thread Tomas Vondra
Hi, I spent a bit more time fixing the TAP test. The attached patch makes it "work" for me (or I think it should, in principle). I'm not saying it's the best way to do stuff. With the patch applied, I tried running it, and I got a failure when running pg_checksums. There's a log snippet describin

Re: Changing the state of data checksums in a running cluster

2025-08-27 Thread Tomas Vondra
On 8/27/25 14:39, Tomas Vondra wrote: > ... > > And this happened on Friday: > > commit c13070a27b63d9ce4850d88a63bf889a6fde26f0 > Author: Alexander Korotkov > Date: Fri Aug 22 18:44:39 2025 +0300 > > Revert "Get rid of WALBufMappingLock" > > This reverts commit bc22dc0e0ddc2dcb6043a7

Re: Changing the state of data checksums in a running cluster

2025-08-27 Thread Tomas Vondra
On 8/27/25 13:00, Daniel Gustafsson wrote: >> On 27 Aug 2025, at 11:39, Tomas Vondra wrote: > >> Just to be clear - I don't see any pg_checksums failures either. I only >> see failures in the standby log, and I don't think the script checks >> that (it probably should). > > Right, that's what I'

Re: Changing the state of data checksums in a running cluster

2025-08-27 Thread Tomas Vondra
On 8/27/25 10:30, Daniel Gustafsson wrote: >> On 26 Aug 2025, at 01:06, Tomas Vondra wrote: > >> I think this TAP looks very nice, but there's a couple issues with it. >> See the attached patch fixing those. > > Thanks, I have incorporated (most of) your patch in the attached. I did keep > t

Re: Changing the state of data checksums in a running cluster

2025-08-25 Thread Tomas Vondra
On 8/25/25 20:32, Daniel Gustafsson wrote: >> On 20 Aug 2025, at 16:37, Tomas Vondra wrote: > >> This happens quite regularly, it's not hard to hit. But I've only seen >> it to happen on a FSM, and only right after immediate shutdown. I don't >> think that's quite expected. >> >> I believe the bu

Re: Changing the state of data checksums in a running cluster

2025-08-20 Thread Tomas Vondra
Hi, I think there's a minor issue in how pg_checksums validates state before checking the data. The current patch simply does: if (ControlFile->data_checksum_version == 0 && mode == PG_MODE_CHECK) pg_fatal("data checksums are not enabled in cluster"); and that worked when the vers

Re: Changing the state of data checksums in a running cluster

2025-08-20 Thread Tomas Vondra
On 8/16/25 21:34, Daniel Gustafsson wrote: > Attached is a rebase on top of the func.sgml changes which caused this to no > longer apply. > > This version is also substantially updated with a new injection point based > test suite, fixed a few bugs (found by said test suite), added checkpoint to >

Re: Changing the state of data checksums in a running cluster

2025-08-19 Thread Bruce Momjian
On Sat, Aug 16, 2025 at 09:34:03PM +0200, Daniel Gustafsson wrote: > Attached is a rebase on top of the func.sgml changes which caused this to no > longer apply. > > This version is also substantially updated with a new injection point based > test suite, fixed a few bugs (found by said test suite

Re: Changing the state of data checksums in a running cluster

2025-07-12 Thread Daniel Gustafsson
> On 11 Jul 2025, at 17:53, Bernd Helmle wrote: > Since i wanted to dig a little deeper in this patch i took the > opportunity and rebased it to current master, hopefully not having > broken something seriously. Thanks, much appreciated! -- Daniel Gustafsson

Re: Changing the state of data checksums in a running cluster

2025-04-04 Thread Alexander Korotkov
Hi! On Sat, Mar 15, 2025 at 7:33 PM Tomas Vondra wrote: > On 3/15/25 17:26, Andres Freund wrote: > > Jo. > > > > On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote: > >> Thanks, here's an updated patch version > > > > FWIW, this fails in CI; > > > > https://cirrus-ci.com/build/4678473324691456 > >

Re: Changing the state of data checksums in a running cluster

2025-03-15 Thread Tomas Vondra
On 3/15/25 17:26, Andres Freund wrote: > Jo. > > On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote: >> Thanks, here's an updated patch version > > FWIW, this fails in CI; > > https://cirrus-ci.com/build/4678473324691456 > On all OSs: > [16:08:36.331] # Failed test 'options --locale-provider=icu

Re: Changing the state of data checksums in a running cluster

2025-03-15 Thread Andres Freund
Jo. On 2025-03-15 16:50:02 +0100, Tomas Vondra wrote: > Thanks, here's an updated patch version FWIW, this fails in CI; https://cirrus-ci.com/build/4678473324691456 On all OSs: [16:08:36.331] # Failed test 'options --locale-provider=icu --locale=und --lc-*=C: no stderr' [16:08:36.331] # at

Re: Changing the state of data checksums in a running cluster

2025-03-15 Thread Daniel Gustafsson
> On 14 Mar 2025, at 13:20, Tomas Vondra wrote: > I experimented with this a little bit, and unfortunately I ran into not > one, but two race conditions in this :-( I don't have reproducers, all > of this was done by manually adding sleep() calls / gdb breakpoints to > pause the processes for a w

Re: Changing the state of data checksums in a running cluster

2025-03-14 Thread Tomas Vondra
On 3/14/25 00:11, Tomas Vondra wrote: > ... >> One issue I ran into is the postmaster does not seem to be processing >> the barriers, and thus not getting info about the data_checksum_version >> changes. > > Makes sense, that seems like a pretty reasonable constraint for the >>

Re: Changing the state of data checksums in a running cluster

2025-03-13 Thread Tomas Vondra
On 3/13/25 17:26, Tomas Vondra wrote: > On 3/13/25 13:32, Daniel Gustafsson wrote: >>> On 13 Mar 2025, at 12:03, Tomas Vondra wrote: >>> >>> ... >>> >>> This also reminds me I had a question about the barrier - can't it >>> happen a process gets to process multiple barriers at the same time? I >>>

Re: Changing the state of data checksums in a running cluster

2025-03-13 Thread Tomas Vondra
On 3/13/25 13:32, Daniel Gustafsson wrote: >> On 13 Mar 2025, at 12:03, Tomas Vondra wrote: >> On 3/13/25 10:54, Daniel Gustafsson wrote: On 12 Mar 2025, at 14:16, Tomas Vondra wrote: > I believe the approach is correct, but the number of possible states (e.g. after a crash/restar

Re: Changing the state of data checksums in a running cluster

2025-03-13 Thread Daniel Gustafsson
> On 13 Mar 2025, at 12:03, Tomas Vondra wrote: > On 3/13/25 10:54, Daniel Gustafsson wrote: >>> On 12 Mar 2025, at 14:16, Tomas Vondra wrote: >>> I believe the approach is correct, but the number of possible states >>> (e.g. after a crash/restart) seems a bit complex. I wonder if there's a >>>

Re: Changing the state of data checksums in a running cluster

2025-03-13 Thread Tomas Vondra
On 3/13/25 10:54, Daniel Gustafsson wrote: >> On 12 Mar 2025, at 14:16, Tomas Vondra wrote: > >> I continued investigating this and experimenting with alternative >> approaches, and I think the way the patch relies on ControlFile is not >> quite right. That is, it always sets data_checksum_versio

Re: Changing the state of data checksums in a running cluster

2025-03-13 Thread Daniel Gustafsson
> On 12 Mar 2025, at 14:16, Tomas Vondra wrote: > I continued investigating this and experimenting with alternative > approaches, and I think the way the patch relies on ControlFile is not > quite right. That is, it always sets data_checksum_version to the last > ("current") value, but that's not

Re: Changing the state of data checksums in a running cluster

2025-03-12 Thread Dagfinn Ilmari Mannsåker
Tomas Vondra writes: > On 3/11/25 14:07, Dagfinn Ilmari Mannsåker wrote: >> As the resident perl style pedant, I'd just like to complain about the >> below: >> >> Tomas Vondra writes: >> >>> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm >>> b/src/test/perl/PostgreSQL/Test/Cluster.pm >

Re: Changing the state of data checksums in a running cluster

2025-03-12 Thread Tomas Vondra
On 3/11/25 14:07, Dagfinn Ilmari Mannsåker wrote: > As the resident perl style pedant, I'd just like to complain about the > below: > > Tomas Vondra writes: > >> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm >> b/src/test/perl/PostgreSQL/Test/Cluster.pm >> index 666bd2a2d4c..1c66360c16c

Re: Changing the state of data checksums in a running cluster

2025-03-11 Thread Tomas Vondra
Hi, I continued stress testing this, as I was rather unsure why the assert failures reported in [1] disappeared. And I managed to reproduce that again, and I think I actually understand why it happens. I modified the test script (attached) to setup replication, not just a single instance. And the

Re: Changing the state of data checksums in a running cluster

2025-03-11 Thread Dagfinn Ilmari Mannsåker
As the resident perl style pedant, I'd just like to complain about the below: Tomas Vondra writes: > diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm > b/src/test/perl/PostgreSQL/Test/Cluster.pm > index 666bd2a2d4c..1c66360c16c 100644 > --- a/src/test/perl/PostgreSQL/Test/Cluster.pm > +++

Re: Changing the state of data checksums in a running cluster

2025-03-10 Thread Tomas Vondra
One thing I forgot to mention is the progress reporting only updates blocks for the FORK_MAIN. It wouldn't be difficult to report blocks for each fork, but it'd be confusing - the relation counters would remain the same, but the block counters would change for each fork. I guess we could report th

Re: Changing the state of data checksums in a running cluster

2025-03-10 Thread Daniel Gustafsson
> On 10 Mar 2025, at 12:17, Tomas Vondra wrote: > > On 3/10/25 10:46, Tomas Vondra wrote: >> On 3/10/25 01:18, Tomas Vondra wrote: Thank you so much for picking up and fixing the blockers, it's highly appreciated! >> For me, this passes all CI tests, hopefully cfbot will be happy too. Confirm

Re: Changing the state of data checksums in a running cluster

2024-12-10 Thread Michael Paquier
On Tue, Nov 26, 2024 at 11:07:12PM +0100, Tomas Vondra wrote: > I spent a bit more time doing some testing on the last version of the > patch from [1]. And I ran into this assert in PostmasterStateMachine() > when stopping the cluster: > > /* All types should be included in targetMask or remainM

Re: Changing the state of data checksums in a running cluster

2024-11-26 Thread Tomas Vondra
Hi, I spent a bit more time doing some testing on the last version of the patch from [1]. And I ran into this assert in PostmasterStateMachine() when stopping the cluster: /* All types should be included in targetMask or remainMask */ Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_A

Re: Changing the state of data checksums in a running cluster

2024-11-07 Thread Tomas Vondra
Hi, Unfortunately it seems we're not out of the woods yet :-( I started doing some more testing on the v8 patch. My plan was to do some stress testing with physical replication, random restarts and stuff like that. But I ran into issues before that. Attached is a reproducer script, that does thi

Re: Changing the state of data checksums in a running cluster

2024-10-09 Thread Tomas Vondra
On 10/8/24 22:38, Daniel Gustafsson wrote: > >> 7) controldata.c - maybe this >> >> if (oldctrl->data_checksum_version == 2) >> >> should use PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION instead of the magic >> constant? For "off" we use "0" which seems somewhat acceptable, but for >> other values

Re: Changing the state of data checksums in a running cluster

2024-10-08 Thread Daniel Gustafsson
> On 7 Oct 2024, at 20:42, Dagfinn Ilmari Mannsåker wrote: > > Tomas Vondra writes: > >> 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345, >> shadowing earlier variable > > All the ListCell variables can be eliminated by using the foreach_ptr > and foreach_oid macros instead

Re: Changing the state of data checksums in a running cluster

2024-10-07 Thread Dagfinn Ilmari Mannsåker
Tomas Vondra writes: > 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345, >shadowing earlier variable All the ListCell variables can be eliminated by using the foreach_ptr and foreach_oid macros instead of plain foreach. - ilmari

Re: Changing the state of data checksums in a running cluster

2024-10-07 Thread Tomas Vondra
Hi, I did a quick review today. First a couple minor comments: 1) monitoring.sgml typos: number of database -> number of databases calcuated -> calculated 2) unnecessary newline in heapam.c (no other changes) 3) unnecessary ListCell in DataChecksumsWorkerMain() on line 1345, shadowing e

Re: Changing the state of data checksums in a running cluster

2024-10-01 Thread Daniel Gustafsson
> On 1 Oct 2024, at 00:43, Michael Banck wrote: > > Hi, > > On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote: >>> Yeah, I think a view like pg_stat_progress_checksums would work. >> >> Added in the attached version. It probably needs some polish (the docs for >> sure do) but i

Re: Changing the state of data checksums in a running cluster

2024-09-30 Thread Michael Banck
Hi, On Mon, Sep 30, 2024 at 11:21:30PM +0200, Daniel Gustafsson wrote: > > Yeah, I think a view like pg_stat_progress_checksums would work. > > Added in the attached version. It probably needs some polish (the docs for > sure do) but it's at least a start. Just a nitpick, but we call it data_ch

Re: Changing the state of data checksums in a running cluster

2024-07-05 Thread Bruce Momjian
On Wed, Jul 3, 2024 at 01:20:10PM +0200, Tomas Vondra wrote: > > * Restartability - the initial version of the patch did not support stateful > > restarts, a shutdown performed (or crash) before checksums were enabled > > would > > result in a need to start over from the beginning. This was deem

Re: Changing the state of data checksums in a running cluster

2024-07-03 Thread Tomas Vondra
Hi Daniel, Thanks for rebasing the patch and submitting it again! On 7/3/24 08:41, Daniel Gustafsson wrote: > After some off-list discussion about the desirability of this feature, where > several hackers opined that it's something that we should have, I've decided > to > rebase this patch and s