Re: 039_end_of_wal: error in "xl_tot_len zero" test
Pushed. Thanks!
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Thu, Aug 29, 2024 at 01:55:27AM -0400, Tom Lane wrote:
> Thomas Munro writes:
>> The fix I propose to commit shortly is just the first of those new
>> lines, to homogenise the initial state.  See attached.  The previous
>> idea works too, I think, but this bigger hammer is more obviously
>> removing variation.
>
> +1, but a comment explaining the need for the pg_switch_wal call
> seems in order.

+1

-- 
nathan
Re: 039_end_of_wal: error in "xl_tot_len zero" test
Thomas Munro writes:
> The fix I propose to commit shortly is just the first of those new
> lines, to homogenise the initial state.  See attached.  The previous
> idea works too, I think, but this bigger hammer is more obviously
> removing variation.

+1, but a comment explaining the need for the pg_switch_wal call
seems in order.

			regards, tom lane
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Sat, Aug 24, 2024 at 10:43 AM Thomas Munro wrote:
> On Sat, Aug 24, 2024 at 10:33 AM Nathan Bossart wrote:
> > I am seeing the exact problem described in this thread on my laptop since
> > commit 490f869.  I have yet to do a thorough investigation, but what I've
> > seen thus far does seem to fit the subtle-differences-in-generated-WAL
> > theory.  If no one is planning to pick up the fix soon, I will try to.
>
> Sorry for dropping that.  It looks like we know approximately how to
> stabilise it, and I'll look at it early next week if you don't beat me
> to it, but please feel free if you would like to.

It fails reliably if you nail down the initial conditions like this:

 $TLI = $node->safe_psql('postgres',
 	"SELECT timeline_id FROM pg_control_checkpoint();");

+$node->safe_psql('postgres', "SELECT pg_switch_wal();");
+emit_message($node, 7956);
+
 my $end_lsn;
 my $prev_lsn;

The fix I propose to commit shortly is just the first of those new
lines, to homogenise the initial state.  See attached.  The previous
idea works too, I think, but this bigger hammer is more obviously
removing variation.

From 9f9f16f6325fc12a83d62683503dab7a9bdfdc45 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 29 Aug 2024 17:15:24 +1200
Subject: [PATCH] Stabilize 039_end_of_wal test.

The first test was sensitive to the insert LSN after setting up the
catalogs, which depended on environmental things like the locales on
the OS and usernames.  Switch to a new WAL file before the first test,
as a simple way to put every computer into the same state.

Back-patch to all supported releases.
Reported-by: Anton Voloshin
Reported-by: Nathan Bossart
Discussion: https://postgr.es/m/b26aeac2-cb6d-4633-a7ea-945baae83dcf%40postgrespro.ru
---
 src/test/recovery/t/039_end_of_wal.pl | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index f9acc83c7d0..8498e83f4a7 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -251,6 +251,8 @@ $WAL_BLOCK_SIZE = get_int_setting($node, 'wal_block_size');
 $TLI = $node->safe_psql('postgres',
 	"SELECT timeline_id FROM pg_control_checkpoint();");
 
+$node->safe_psql('postgres', "SELECT pg_switch_wal();");
+
 my $end_lsn;
 my $prev_lsn;
-- 
2.46.0
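[Editorial note: the reason a segment switch homogenises the initial state can be modelled with a little arithmetic. The Python sketch below is an illustration, not PostgreSQL code; the 8 kB block size, 16 MB segment size, and 40-byte long page header at the start of each segment are assumptions matching common defaults.]

```python
# Illustrative model (not PostgreSQL code) of why a WAL segment switch
# removes environment-dependent variation in the insert position.
# Assumed constants: 8 kB WAL blocks, 16 MB segments, and a 40-byte
# long page header beginning each segment.
WAL_BLOCK_SIZE = 8192
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
LONG_PAGE_HEADER = 40  # header of the first page in a segment


def page_offset(lsn: int) -> int:
    """Offset of an LSN within its WAL block."""
    return lsn % WAL_BLOCK_SIZE


def switch_wal(lsn: int) -> int:
    """Model pg_switch_wal(): jump to the start of the next segment,
    where insertion resumes just after the long page header."""
    next_seg = (lsn // WAL_SEGMENT_SIZE + 1) * WAL_SEGMENT_SIZE
    return next_seg + LONG_PAGE_HEADER


# Two machines whose initdb output differed (locales, username length, ...)
# end up at different page offsets before the first test:
machine_a, machine_b = 0x1234F0, 0x125678
assert page_offset(machine_a) != page_offset(machine_b)

# After the switch both machines start the first test at the same offset:
assert page_offset(switch_wal(machine_a)) == page_offset(switch_wal(machine_b))
```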
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Sat, Aug 24, 2024 at 10:33 AM Nathan Bossart wrote:
> I am seeing the exact problem described in this thread on my laptop since
> commit 490f869.  I have yet to do a thorough investigation, but what I've
> seen thus far does seem to fit the subtle-differences-in-generated-WAL
> theory.  If no one is planning to pick up the fix soon, I will try to.

Sorry for dropping that.  It looks like we know approximately how to
stabilise it, and I'll look at it early next week if you don't beat me
to it, but please feel free if you would like to.
Re: 039_end_of_wal: error in "xl_tot_len zero" test
I am seeing the exact problem described in this thread on my laptop since
commit 490f869.  I have yet to do a thorough investigation, but what I've
seen thus far does seem to fit the subtle-differences-in-generated-WAL
theory.  If no one is planning to pick up the fix soon, I will try to.

-- 
nathan
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On 13/05/2024 00:39, Tom Lane wrote:
> Hm.  It occurs to me that there *is* a system-specific component to
> the amount of WAL emitted during initdb: the number of locales that
> "locale -a" prints translates directly to the number of rows inserted
> into pg_collation.  [...]

Yes. Another system-specific circumstance affecting WAL position is the
name length of the unix user doing initdb. I've seen 039_end_of_wal
failing consistently under one username but working fine under another,
both on the same machine at the same time.

To be more precise, on one particular machine under those particular
circumstances (including the set of locales) it would work for any
username with length < 8 or >= 16, but would fail for lengths 8..15 (in
bytes, not characters, if non-ASCII usernames were used).

-- 
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru
Re: 039_end_of_wal: error in "xl_tot_len zero" test
David Rowley writes:
> On Mon, 6 May 2024 at 15:06, Tom Lane wrote:
>> My best guess is that that changed the amount of WAL generated by
>> initdb just enough to make the problem reproduce on this animal.
>> However, why's it *only* happening on this animal?  The amount of
>> WAL we generate isn't all that system-specific.

> I'd say that's a good theory as it's now passing again [1] after the
> recent system_views.sql change done in 521a7156ab.

Hm.  It occurs to me that there *is* a system-specific component to
the amount of WAL emitted during initdb: the number of locales that
"locale -a" prints translates directly to the number of rows inserted
into pg_collation.  So maybe skimmer has a locale set that's a bit
different from anybody else's, and that's what let it see this issue.

			regards, tom lane
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Mon, 6 May 2024 at 15:06, Tom Lane wrote:
> My best guess is that that changed the amount of WAL generated by
> initdb just enough to make the problem reproduce on this animal.
> However, why's it *only* happening on this animal?  The amount of
> WAL we generate isn't all that system-specific.

I'd say that's a good theory as it's now passing again [1] after the
recent system_views.sql change done in 521a7156ab.

David

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skimmer&dt=2024-05-06%2017%3A43%3A38
Re: 039_end_of_wal: error in "xl_tot_len zero" test
Thomas Munro writes:
> Oh, it looks like this new build farm animal "skimmer" might be
> reminding us about this issue:
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skimmer&br=HEAD
> I don't know why it changed,

At this point it seems indisputable that 7d2c7f08d9 is what broke
skimmer, but that didn't go anywhere near WAL-related code, so how?

My best guess is that that changed the amount of WAL generated by
initdb just enough to make the problem reproduce on this animal.
However, why's it *only* happening on this animal?  The amount of
WAL we generate isn't all that system-specific.

			regards, tom lane
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Thu, Feb 15, 2024 at 10:40 PM Anton Voloshin wrote:
> On 19/01/2024 01:35, Thomas Munro wrote:
> > I don't yet have an opinion on the best way to
> > do it though.  Would it be enough to add emit_message($node, 0) after
> > advance_out_of_record_splitting_zone()?
>
> Yes, indeed that seems to be enough. At least I could not produce any
> more "xl_tot_len zero" failures with that addition.
>
> I like this solution the best.

Oh, it looks like this new build farm animal "skimmer" might be
reminding us about this issue:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=skimmer&br=HEAD

I don't know why it changed, but since this is an LSN/page alignment
thing, it could be due to external things like an OS upgrade adding
more locales or something that affects initdb.  Will look soon and fix.
Re: 039_end_of_wal: error in "xl_tot_len zero" test
Hello, Thomas,

On 19/01/2024 01:35, Thomas Munro wrote:
> I don't yet have an opinion on the best way to
> do it though.  Would it be enough to add emit_message($node, 0) after
> advance_out_of_record_splitting_zone()?

Yes, indeed that seems to be enough. At least I could not produce any
more "xl_tot_len zero" failures with that addition.
I like this solution the best.

> Tolerating the two different messages would weaken the test.

I agree, I also don't really like this option.

-- 
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Fri, Jan 19, 2024 at 11:35:30AM +1300, Thomas Munro wrote:
> On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin wrote:
>> I believe there is a small problem in the 039_end_of_wal.pl's
>> "xl_tot_len zero" test. It supposes that after immediate shutdown of
>> the server, upon startup recovery, it should always produce a message
>> matching "invalid record length at .*: wanted 24, got 0". However, if
>> the circumstances are just right and we happened to hit exactly on the
>> edge of a WAL page, then the message on startup recovery would be
>> "invalid magic number in log segment .*, offset .*". The test does not
>> take that into account.
>
> Thanks for figuring this out!  Right, I see.  I will look more closely
> when I'm back from summer vacation in a few days, but first reaction:

Thomas, are you planning to look at that?
--
Michael
Re: 039_end_of_wal: error in "xl_tot_len zero" test
On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin wrote:
> I believe there is a small problem in the 039_end_of_wal.pl's
> "xl_tot_len zero" test. It supposes that after immediate shutdown of the
> server, upon startup recovery, it should always produce a message
> matching "invalid record length at .*: wanted 24, got 0". However, if
> the circumstances are just right and we happened to hit exactly on the
> edge of a WAL page, then the message on startup recovery would be
> "invalid magic number in log segment .*, offset .*". The test does not
> take that into account.

Hi Anton,

Thanks for figuring this out!  Right, I see.  I will look more closely
when I'm back from summer vacation in a few days, but first reaction:

> Now, reproducing this is somewhat tricky, because the exact position in
> WAL at test time depends on what exactly initdb did, and that not only
> differs between major versions, but also depends on e.g. username
> length, locales available, and, perhaps, more. Even though originally
> this problem was found "in the wild" on one particular system on one
> particular code branch, I've written a small helper patch to make
> reproduction on master easier, see
> 0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.
>
> This patch adds a single emit_message of (hopefully) the right size to
> make sure we hit the end of a WAL block right by the time we call
> $node->stop('immediate') in the "xl_tot_len zero" test. With this patch,
> the "xl_tot_len zero" test fails every time because the server writes
> "invalid magic number in log segment" while the test still only expects
> "invalid record length at .*: wanted 24, got 0". Of course, this 0001
> patch is *not* meant to be committed, but only as an issue reproduction
> helper.
>
> I can think of two possible fixes:
>
> 1. Update advance_out_of_record_splitting_zone to also avoid stopping at
> exactly the block end:
>
>  my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
> -while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
> +while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold ||
>         $page_offset <= $SizeOfXLogShortPHD)
>  {
>
> see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch
>
> We need to compare with $SizeOfXLogShortPHD (and not with zero) because
> at that point, even though we didn't actually write out the new WAL page
> yet, its header is already in place in memory and taken into account
> for LSN reporting.

I like the fact that this preserves the same end-of-WAL case that
we're trying to test.  I don't yet have an opinion on the best way to
do it though.  Would it be enough to add emit_message($node, 0) after
advance_out_of_record_splitting_zone()?

The thing about this one specific test that is different from the
later ones is that it doesn't actually write a record header at all,
it was relying purely on pre-existing trailing zeroes, but it assumed
the page header would be valid.  As you figured out, that isn't true
if we were right on the page boundary.  Perhaps
advance_out_of_record_splitting_zone() followed by emit_message(0)
would make that always true, even then?

> 2. Alternatively, amend the "xl_tot_len zero" test to expect the
> "invalid magic number in WAL segment" message as well:
>
>  $node->start;
>  ok( $node->log_contains(
> +        "invalid magic number in WAL segment|" .
>          "invalid record length at .*: expected at least 24, got 0",
>          $log_size
>          ),
>
> see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch

Tolerating the two different messages would weaken the test.

> I think it makes sense to backport whatever the final change would be to
> all branches with 039_end_of_wal (REL_12+).

+1
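[Editorial note: the two zones being avoided here — the record-splitting zone at a block's tail and the first $SizeOfXLogShortPHD bytes of the next block — can be sketched as a predicate. The Python below is an illustrative model of the proposed loop condition, not the test's Perl; the 8192-byte block size and 24-byte short page header match the thread, while the page_threshold value used here is hypothetical.]

```python
# Model of the stopping condition from fix option 1, in Python rather
# than the test's Perl.  Assumed constants: 8 kB WAL blocks, 24-byte
# short page header.  PAGE_THRESHOLD stands in for the test's
# record-splitting threshold; its value here is hypothetical.
WAL_BLOCK_SIZE = 8192
SIZE_OF_XLOG_SHORT_PHD = 24
PAGE_THRESHOLD = 48  # hypothetical; the real test derives this value


def must_keep_advancing(end_lsn: int) -> bool:
    """True while end_lsn is either in the record-splitting zone at the
    end of a block, or (the proposed extra clause) so close to the start
    of a block that only the in-memory page header precedes it."""
    page_offset = end_lsn % WAL_BLOCK_SIZE
    return (page_offset >= WAL_BLOCK_SIZE - PAGE_THRESHOLD
            or page_offset <= SIZE_OF_XLOG_SHORT_PHD)


assert must_keep_advancing(8192 - 10)   # tail of a block: keep going
assert must_keep_advancing(8192 + 24)   # just past a page header: keep going
assert not must_keep_advancing(4096)    # mid-block: safe to stop
```

Without the second clause, an LSN reported at exactly a page-header boundary would satisfy the old condition and let the test stop on the page edge — the failure mode described above.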
039_end_of_wal: error in "xl_tot_len zero" test
Hello, hackers,

I believe there is a small problem in the 039_end_of_wal.pl's
"xl_tot_len zero" test. It supposes that after immediate shutdown of the
server, upon startup recovery, it should always produce a message
matching "invalid record length at .*: wanted 24, got 0". However, if
the circumstances are just right and we happened to hit exactly on the
edge of a WAL page, then the message on startup recovery would be
"invalid magic number in log segment .*, offset .*". The test does not
take that into account.

Now, reproducing this is somewhat tricky, because the exact position in
WAL at test time depends on what exactly initdb did, and that not only
differs between major versions, but also depends on e.g. username
length, locales available, and, perhaps, more. Even though originally
this problem was found "in the wild" on one particular system on one
particular code branch, I've written a small helper patch to make
reproduction on master easier, see
0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.

This patch adds a single emit_message of (hopefully) the right size to
make sure we hit the end of a WAL block right by the time we call
$node->stop('immediate') in the "xl_tot_len zero" test. With this patch,
the "xl_tot_len zero" test fails every time because the server writes
"invalid magic number in log segment" while the test still only expects
"invalid record length at .*: wanted 24, got 0". Of course, this 0001
patch is *not* meant to be committed, but only as an issue reproduction
helper.

I can think of two possible fixes:

1. Update advance_out_of_record_splitting_zone to also avoid stopping at
exactly the block end:

 my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
+while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold ||
        $page_offset <= $SizeOfXLogShortPHD)
 {

see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch

We need to compare with $SizeOfXLogShortPHD (and not with zero) because
at that point, even though we didn't actually write out the new WAL page
yet, its header is already in place in memory and taken into account for
LSN reporting.

2. Alternatively, amend the "xl_tot_len zero" test to expect the
"invalid magic number in WAL segment" message as well:

 $node->start;
 ok( $node->log_contains(
+        "invalid magic number in WAL segment|" .
         "invalid record length at .*: expected at least 24, got 0",
         $log_size
         ),

see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch

I think it makes sense to backport whatever the final change would be to
all branches with 039_end_of_wal (REL_12+).

Any thoughts?

-- 
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

From 5f12139816f6c1bc7d625ba8007aedb8f5d04a71 Mon Sep 17 00:00:00 2001
From: Anton Voloshin
Date: Thu, 18 Jan 2024 12:45:12 +0300
Subject: [PATCH 1/3] repro for 039_end_of_wal's problem with page end

Tags: commitfest_hotfix
---
 src/test/recovery/t/039_end_of_wal.pl | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index f9acc83c7d0..ecf9b6089de 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -36,6 +36,8 @@ my $WAL_SEGMENT_SIZE;
 my $WAL_BLOCK_SIZE;
 my $TLI;
 
+my $SizeOfXLogShortPHD = 24;    # rounded up to 8 bytes
+
 # Build path of a WAL segment.
 sub wal_segment_path
 {
@@ -258,9 +260,27 @@ my $prev_lsn;
 note "Single-page end-of-WAL detection";
 ###
 
+my $lsn = get_insert_lsn($node);
+my $lsn_offset = $lsn % $WAL_BLOCK_SIZE;
+my $empty_msg_size = ((($ENV{EMPTY_MSG_SIZE} || 51) + 7) / 8) * 8;
+my $commit_msg_size = ((34 + 7) / 8) * 8;
+# empty logical msg and commit message take this many bytes of WAL:
+my $empty_msg_overhead = $empty_msg_size + $commit_msg_size;
+# count the overhead twice to account for two emit_message calls below:
+# we want to hit the page edge after the second call.
+my $target_lsn_offset = $WAL_BLOCK_SIZE * 2 - $empty_msg_overhead * 2;
+my $msg_size = ($target_lsn_offset - $lsn_offset) % $WAL_BLOCK_SIZE;
+# If we happen to switch to the next WAL block mid-message, reduce the message
+# by $SizeOfXLogShortPHD (minimal page header) to hit the same target.
+if ($lsn_offset + $msg_size >= $WAL_BLOCK_SIZE) { $msg_size -= $SizeOfXLogShortPHD; }
+print "QWE0: $lsn WAL_BLOCK_SIZE='$WAL_BLOCK_SIZE', lsn_offset='$lsn_offset' target_lsn_offset='$target_lsn_offset' msg_size='$msg_size'\n";
+emit_message($node, $msg_size);
+printf "QWE1: %s - after corrective msg\n",
+	$node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 # xl_tot_len is 0 (a common case, we hit trailing zeroes).
 emit_message($node, 0);
+printf "QWE2: %s - after zero-length message\n",
+	$node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 $end_lsn
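[Editorial note: the size arithmetic in the repro patch above leans on WAL records being rounded up to 8-byte alignment. The following Python restatement is offered as a sketch; the 51- and 34-byte record sizes are taken from the patch itself and are build-specific assumptions.]

```python
# Python restatement of the repro patch's size arithmetic.  The record
# sizes (51 bytes for an empty logical message, 34 for its commit) come
# from the patch and are assumptions about one particular build.
WAL_BLOCK_SIZE = 8192


def maxalign(n: int) -> int:
    """Round n up to the 8-byte alignment WAL records use."""
    return ((n + 7) // 8) * 8


empty_msg_size = maxalign(51)
commit_msg_size = maxalign(34)
empty_msg_overhead = empty_msg_size + commit_msg_size  # bytes per emit_message

# Two emit_message() calls must land the insert LSN exactly on a block
# edge, so the target offset sits two overheads short of a boundary:
target_lsn_offset = (WAL_BLOCK_SIZE * 2 - empty_msg_overhead * 2) % WAL_BLOCK_SIZE

# Sanity check: after both messages, the LSN is block-aligned again.
assert (target_lsn_offset + 2 * empty_msg_overhead) % WAL_BLOCK_SIZE == 0
```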