Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Jan 16, 2022 at 01:02:41PM -0800, Noah Misch wrote: > My next steps: > > - Report a Debian bug for the sparc64+ext4 zeros problem. Reported to Debian, then upstream: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006157 https://marc.info/?t=16453926991 Last week, someone confirmed the behavior on kernel 5.17.0-rc5. No other news.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Mon, Jan 24, 2022 at 12:02:43AM -0800, Noah Misch wrote: > For 003_cic_2pc.pl, I'm > fine using $TODO so we continue to run all test commands and quietly log their > results. For 027_stream_regress.pl, which would need deep changes to use > $TODO, it works to use any of todo_skip, skip, or skip_all. I prefer > skip_all, because it prints the skip reason to gmake's stdout. (If the number > of affected users is zero as theorized, the choice doesn't matter.) Any > objections? I pushed that on 2022-01-26 as ce6d793, but it wasn't enough: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2022-02-01%2013%3A29%3A58 In older Test::More versions, $TODO only works if tests know the right package in which to examine $that_pkg::TODO. In PostgreSQL v12+, all relevant functions set $Test::Builder::Level, so Test::More finds $main::TODO as intended. This would be one way to fix v11: --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -1869,4 +1869,6 @@ command_ok(...) sub command_checks_all { + local $Test::Builder::Level = $Test::Builder::Level + 1; + my $self = shift; diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm index 0c56156..2ab78b4 100644 --- a/src/test/perl/TestLib.pm +++ b/src/test/perl/TestLib.pm @@ -598,4 +598,6 @@ sub command_fails_like sub command_checks_all { + local $Test::Builder::Level = $Test::Builder::Level + 1; + my ($cmd, $expected_ret, $out, $err, $test_name) = @_; However, I'm inclined to use a less-fragile way, in all branches: --- a/src/bin/pgbench/t/023_cic_2pc.pl +++ b/src/bin/pgbench/t/023_cic_2pc.pl @@ -14 +14,2 @@ use Test::More tests => 6; -local $TODO = 'filesystem bug' if TestLib::has_wal_read_bug; +Test::More->builder->todo_start('filesystem bug') + if TestLib::has_wal_read_bug; None of this matters under newer Test::More. kittiwake's Test::More 1.302133 is new enough not to care, but tadarida's Test::More 1.001014 is not. The Test::More->builder->todo_start() syntax works in 0.82, if not earlier. (PostgreSQL v10 requires Test::More 0.87.)
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Jan 23, 2022 at 06:34:32PM -0800, Andres Freund wrote: > On 2022-01-23 18:10:07 -0800, Noah Misch wrote: > > On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote: > > > Test::more's description: "If it's something the programmer hasn't done > > > yet, > > > use TODO. This is for any code you haven't written yet, or bugs you have > > > yet > > > to fix, but want to put tests in your testing script (always a good > > > idea)." > > > > Could do that. Every run that doesn't get the flaky failure will print a > > message like "TODO passed: 3-5", though the test file could mitigate that > > by > > declaring the TODO only on configurations where we expect a failure. > > Yea, that's what I was thinking we'd do. > > > The > > 027_stream_regress.pl trouble involves reaching a die(), not failing a test > > in > > this sense, so that one would take more work. > > I'm really not a perl person... But my understanding is that todo_skip() would > address this? I.e. something like > > TODO: > { >$todo_skip "linux/sparc has unaddressed problems with partial page > overwrites" > if ($^O eq 'linux' and $Config{archname'} ~= 'sparc'); > >ok(whatever is broken); > } Yes. todo_skip() behaves much like regular skip(). The enclosed tests don't run. Hence, it prevents die() and BAIL_OUT() failures. $TODO is a different beast; tests still run, and it changes the reporting. For 003_cic_2pc.pl, I'm fine using $TODO so we continue to run all test commands and quietly log their results. For 027_stream_regress.pl, which would need deep changes to use $TODO, it works to use any of todo_skip, skip, or skip_all. I prefer skip_all, because it prints the skip reason to gmake's stdout. (If the number of affected users is zero as theorized, the choice doesn't matter.) Any objections? Here's the appearance of each strategy on gmake's stdout: === todo_skip [23:56:45] t/003_cic_2pc.pl .. ok 60 ms ( 0.00 usr 0.00 sys + 0.05 cusr 0.00 csys = 0.05 CPU) [23:56:45] All tests successful. Files=1, Tests=5, 1 wallclock secs ( 0.01 usr 0.00 sys + 0.05 cusr 0.00 csys = 0.06 CPU) Result: PASS === skip [23:55:47] t/003_cic_2pc.pl .. ok 59 ms ( 0.00 usr 0.00 sys + 0.05 cusr 0.00 csys = 0.05 CPU) [23:55:48] All tests successful. Files=1, Tests=5, 1 wallclock secs ( 0.01 usr 0.00 sys + 0.05 cusr 0.00 csys = 0.06 CPU) Result: PASS === skip_all [23:31:04] t/003_cic_2pc.pl .. skipped: filesystem bug [23:31:04] Files=1, Tests=0, 0 wallclock secs ( 0.02 usr 0.00 sys + 0.05 cusr 0.00 csys = 0.07 CPU) Result: NOTESTS === $TODO, test 1 is expected fail, tests 2-5 are unexpected pass [23:32:32] t/003_cic_2pc.pl .. ok 1371 ms ( 0.01 usr 0.00 sys + 0.51 cusr 0.24 csys = 0.76 CPU) [23:32:33] All tests successful. Test Summary Report --- t/003_cic_2pc.pl (Wstat: 0 Tests: 5 Failed: 0) TODO passed: 2-5 Files=1, Tests=5, 1 wallclock secs ( 0.02 usr 0.00 sys + 0.51 cusr 0.24 csys = 0.77 CPU) Result: PASS
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On 2022-01-23 21:25:04 -0500, Tom Lane wrote: > Michael Paquier writes: > > On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote: > >> Could do that. Every run that doesn't get the flaky failure will print a > >> message like "TODO passed: 3-5", though the test file could mitigate that > >> by > >> declaring the TODO only on configurations where we expect a failure. The > >> 027_stream_regress.pl trouble involves reaching a die(), not failing a > >> test in > >> this sense, so that one would take more work. > > > Using a TODO has the advantage to allow the tests to run on a periodic > > basis, even if they could fail in this unexpected way. > > I'm okay with this *if* the TODO marking can be constrained to platforms > where we know there's a problem. Otherwise I'm afraid it will mask > unexpected problems. Yep. Very weird syntax... TODO blocks only take effect if a local $TODO is set (to a text explaining the reason hopefully) or if there's a todo_skip. So something like TODO: { local $TODO = 'linux + sparc is borked' if $^O eq 'linux' and ...; tests... } would run the tests and accept failures if the if matches, and otherwise not. Greetings, Andres Freund
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Hi, On 2022-01-23 18:10:07 -0800, Noah Misch wrote: > On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote: > > Test::more's description: "If it's something the programmer hasn't done yet, > > use TODO. This is for any code you haven't written yet, or bugs you have yet > > to fix, but want to put tests in your testing script (always a good idea)." > > Could do that. Every run that doesn't get the flaky failure will print a > message like "TODO passed: 3-5", though the test file could mitigate that by > declaring the TODO only on configurations where we expect a failure. Yea, that's what I was thinking we'd do. > The > 027_stream_regress.pl trouble involves reaching a die(), not failing a test in > this sense, so that one would take more work. I'm really not a perl person... But my understanding is that todo_skip() would address this? I.e. something like TODO: { $todo_skip "linux/sparc has unaddressed problems with partial page overwrites" if ($^O eq 'linux' and $Config{archname'} ~= 'sparc'); ok(whatever is broken); } (no idea if the above todo condition even approximates something working) Greetings, Andres Freund
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Michael Paquier writes: > On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote: >> Could do that. Every run that doesn't get the flaky failure will print a >> message like "TODO passed: 3-5", though the test file could mitigate that by >> declaring the TODO only on configurations where we expect a failure. The >> 027_stream_regress.pl trouble involves reaching a die(), not failing a test >> in >> this sense, so that one would take more work. > Using a TODO has the advantage to allow the tests to run on a periodic > basis, even if they could fail in this unexpected way. I'm okay with this *if* the TODO marking can be constrained to platforms where we know there's a problem. Otherwise I'm afraid it will mask unexpected problems. regards, tom lane
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote: > Could do that. Every run that doesn't get the flaky failure will print a > message like "TODO passed: 3-5", though the test file could mitigate that by > declaring the TODO only on configurations where we expect a failure. The > 027_stream_regress.pl trouble involves reaching a die(), not failing a test in > this sense, so that one would take more work. Using a TODO has the advantage to allow the tests to run on a periodic basis, even if they could fail in this unexpected way. We've had our load of issues in the past proper to specific architectures, so this could help with the 2PC code paths and SPARC. Not to mention that we could just let the code be as-is, and it would fix itself once the kernel is updated. So that sounds much better to me than a skip phase. -- Michael signature.asc Description: PGP signature
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote: > On 2022-01-23 17:17:59 -0800, Noah Misch wrote: > > On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote: > > > On January 23, 2022 3:29:27 PM PST > > > >(a) Modify the tests so the affected animals can skip affected tests by > > > >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or > > > >similar. > > > > > > Why not just detect the problem in the tap test and skip, rather than > > > requiring multiple buildfarm configs to be changed as well as the test > > > itself? > > > > End users running PostgreSQL test suites to acceptance-test their stack > > should > > consider the affected stack unusable for PostgreSQL. > > I'd bet that that's zero users ;) Wouldn't surprise me. I'm attaching what I had written and discarded. If nobody else hates it, I can live with it. > > Hence, I ruled out that > > approach, despite having implemented it at one point. Under some plausible > > set of goals, it is optimal. > > It's not perfect due to the way we run our tests (seeing output is hard, it's > not aggregated), but marking the test as todo rather than SKIP seems like the > most appropriate test status. It's known to be a problem, we've not fixed it, > but we want to be able to run the tests. > > Test::more's description: "If it's something the programmer hasn't done yet, > use TODO. This is for any code you haven't written yet, or bugs you have yet > to fix, but want to put tests in your testing script (always a good idea)." Could do that. Every run that doesn't get the flaky failure will print a message like "TODO passed: 3-5", though the test file could mitigate that by declaring the TODO only on configurations where we expect a failure. The 027_stream_regress.pl trouble involves reaching a die(), not failing a test in this sense, so that one would take more work. Author: Noah Misch Commit: Noah Misch On sparc64+ext4, skip tests that expect flakes from WAL read failure. Buildfarm member kittiwake began to fail frequently when commits 3cd9c3b921977272e6650a5efbeade4203c4bca2 and f47ed79cc8a0cfa154dc7f01faaf59822552363f added tests of concurrency, but the problem was reachable before those commits. Back-patch to v10 (all supported versions). Reviewed by FIXME. Discussion: https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com diff --git a/contrib/amcheck/t/003_cic_2pc.pl b/contrib/amcheck/t/003_cic_2pc.pl index dfe7f0f..9ef3d49 100644 --- a/contrib/amcheck/t/003_cic_2pc.pl +++ b/contrib/amcheck/t/003_cic_2pc.pl @@ -9,7 +9,16 @@ use Config; use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; -use Test::More tests => 5; +use Test::More; + +if (PostgreSQL::Test::Utils::has_wal_read_bug) +{ + plan skip_all => 'filesystem bug'; +} +else +{ + plan tests => 5; +} my ($node, $result); diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm index 50be10f..cbd2cb3 100644 --- a/src/test/perl/PostgreSQL/Test/Utils.pm +++ b/src/test/perl/PostgreSQL/Test/Utils.pm @@ -351,6 +351,29 @@ sub perl2host =pod +=item has_wal_read_bug() + +Returns true if $tmp_check is subject to a sparc64+ext4 bug that causes WAL +readers to see zeros if another process simultaneously wrote the same offsets. +Use this to skip tests that fail frequently on affected configurations. The +bug has made streaming standbys fail to advance, reporting corrupt WAL. It +has made COMMIT PREPARED fail with "could not read two-phase state from WAL". +Non-WAL PostgreSQL reads haven't been affected, likely because those readers +and writers have buffering systems in common. See +https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com for details. + +=cut + +sub has_wal_read_bug +{ + return +$Config{osname} eq 'linux' + && $Config{archname} =~ /^sparc/ + && !run_log([ qw(df -x ext4), $tmp_check ], '>', '/dev/null', '2>&1'); +} + +=pod + =item system_log(@cmd) Run (via C) the command passed as argument; the return diff --git a/src/test/recovery/t/027_stream_regress.pl b/src/test/recovery/t/027_stream_regress.pl index 8c0a8b6..2bb83d3 100644 --- a/src/test/recovery/t/027_stream_regress.pl +++ b/src/test/recovery/t/027_stream_regress.pl @@ -3,9 +3,18 @@ use strict; use warnings; use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; -use Test::More tests => 4; +use Test::More; use File::Basename; +if (PostgreSQL::Test::Utils::has_wal_read_bug) +{ + plan skip_all => 'filesystem bug'; +} +else +{ + plan tests => 4; +} + # Initialize primary node my $node_primary = PostgreSQL::Test::Cluster->new('primary'); $node_primary->init(allows_streaming => 1);
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Hi, On 2022-01-23 17:17:59 -0800, Noah Misch wrote: > On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote: > > On January 23, 2022 3:29:27 PM PST > > >(a) Modify the tests so the affected animals can skip affected tests by > > >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar. > > > > Why not just detect the problem in the tap test and skip, rather than > > requiring multiple buildfarm configs to be changed as well as the test > > itself? > > End users running PostgreSQL test suites to acceptance-test their stack should > consider the affected stack unusable for PostgreSQL. I'd bet that that's zero users ;) > Hence, I ruled out that > approach, despite having implemented it at one point. Under some plausible > set of goals, it is optimal. It's not perfect due to the way we run our tests (seeing output is hard, it's not aggregated), but marking the test as todo rather than SKIP seems like the most appropriate test status. It's known to be a problem, we've not fixed it, but we want to be able to run the tests. Test::more's description: "If it's something the programmer hasn't done yet, use TODO. This is for any code you haven't written yet, or bugs you have yet to fix, but want to put tests in your testing script (always a good idea)." Greetings, Andres Freund
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote: > On January 23, 2022 3:29:27 PM PST > >(a) Modify the tests so the affected animals can skip affected tests by > >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar. > > Why not just detect the problem in the tap test and skip, rather than > requiring multiple buildfarm configs to be changed as well as the test > itself? End users running PostgreSQL test suites to acceptance-test their stack should consider the affected stack unusable for PostgreSQL. Hence, I ruled out that approach, despite having implemented it at one point. Under some plausible set of goals, it is optimal.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On January 23, 2022 3:29:27 PM PST >(a) Modify the tests so the affected animals can skip affected tests by >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar. Why not just detect the problem in the tap test and skip, rather than requiring multiple buildfarm configs to be changed as well as the test itself? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Noah Misch writes: > On Mon, Jan 24, 2022 at 12:49:16PM +1300, Thomas Munro wrote: >> Trying out a new idea: what if we could tell the buildfarm website >> that a certain test is currently expected to fail for reasons we can't >> fix yet (configuration change needed but owner not responding, or >> bugfix from another project needed, etc)? That could cause it to be >> displayed in a different shade of green, or grey, or whatever? Other >> kinds of failures would still show as red. Perhaps this would be >> configured with a file in a git repo that any committer can push to. > That would be a better capability to use if we had it, agreed. Is it feasible > to acquire that capability soon enough? It's not merely a website issue: you'd really rather that the buildfarm animal runs the rest of the tests rather than going belly-up after an expected failure. I think your suggestion about skipping problematic tests based on an environment variable is more practical in the near term. We already have some cases like that, too, eg in src/bin/psql/t/010_tab_completion.pl. In the long term I could get behind having some less ad-hoc way of skipping tests, but I don't think we can have that quickly. regards, tom lane
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Mon, Jan 24, 2022 at 12:49:16PM +1300, Thomas Munro wrote: > On Mon, Jan 24, 2022 at 12:29 PM Noah Misch wrote: > > On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote: > > > I'm less > > > sure it makes sense to do anything to support the presumed bogus > > > zeroes bug for (probably) no real users, especially before we've even > > > reported it and heard some analysis, for example acceptance that it's > > > broken and confirmation that this really is just a sparc problem. > > > > Got it. I've already done a bad thing leaving the buildfarm broken for > > three > > months, so I don't want to let the buildfarm wait for a kernel fix. These > > are > > the two main options I'm seeing now: > > > > (a) Modify the tests so the affected animals can skip affected tests by > > setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar. > > > > (b) Remove --enable-tap-tests from affected animals. > > > > Do you have a preference among those two or some other option that gets the > > buildfarm green on a predictable schedule? I somewhat prefer (a), since > > --enable-tap-tests is where most of the interesting buildfarm reports happen > > these days. > > Trying out a new idea: what if we could tell the buildfarm website > that a certain test is currently expected to fail for reasons we can't > fix yet (configuration change needed but owner not responding, or > bugfix from another project needed, etc)? That could cause it to be > displayed in a different shade of green, or grey, or whatever? Other > kinds of failures would still show as red. Perhaps this would be > configured with a file in a git repo that any committer can push to. That would be a better capability to use if we had it, agreed. Is it feasible to acquire that capability soon enough?
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Mon, Jan 24, 2022 at 12:29 PM Noah Misch wrote: > On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote: > > I'm less > > sure it makes sense to do anything to support the presumed bogus > > zeroes bug for (probably) no real users, especially before we've even > > reported it and heard some analysis, for example acceptance that it's > > broken and confirmation that this really is just a sparc problem. > > Got it. I've already done a bad thing leaving the buildfarm broken for three > months, so I don't want to let the buildfarm wait for a kernel fix. These are > the two main options I'm seeing now: > > (a) Modify the tests so the affected animals can skip affected tests by > setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar. > > (b) Remove --enable-tap-tests from affected animals. > > Do you have a preference among those two or some other option that gets the > buildfarm green on a predictable schedule? I somewhat prefer (a), since > --enable-tap-tests is where most of the interesting buildfarm reports happen > these days. Trying out a new idea: what if we could tell the buildfarm website that a certain test is currently expected to fail for reasons we can't fix yet (configuration change needed but owner not responding, or bugfix from another project needed, etc)? That could cause it to be displayed in a different shade of green, or grey, or whatever? Other kinds of failures would still show as red. Perhaps this would be configured with a file in a git repo that any committer can push to.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote: > I'm less > sure it makes sense to do anything to support the presumed bogus > zeroes bug for (probably) no real users, especially before we've even > reported it and heard some analysis, for example acceptance that it's > broken and confirmation that this really is just a sparc problem. Got it. I've already done a bad thing leaving the buildfarm broken for three months, so I don't want to let the buildfarm wait for a kernel fix. These are the two main options I'm seeing now: (a) Modify the tests so the affected animals can skip affected tests by setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar. (b) Remove --enable-tap-tests from affected animals. Do you have a preference among those two or some other option that gets the buildfarm green on a predictable schedule? I somewhat prefer (a), since --enable-tap-tests is where most of the interesting buildfarm reports happen these days.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Hi, On 2022-01-24 09:42:13 +1300, Thomas Munro wrote: > On Sun, Jan 23, 2022 at 7:52 AM Noah Misch wrote: > > Future work can benchmark the new behavior and, if it performs well, make > > it unconditional in v15+. I would expect performance to be unchanged or > > slightly better, because the new behavior requests less futile work from > > the OS. I doubt it'll be generally applicable. Turning a write operation into a read-write isn't free. Yes, often enough it's likely that the prior page is still in cache, but I don't think we can rely on that in general. It also just fundamentally locks us into never using O_DIRECT in anger. I don't think that's a good direction. > One detail is that wal_level=open_datasync, wal_senders=0, > wal_level=minimal will panic, because O_DIRECT requires fs > page-aligned access (and fails in various other ways on other OSes, eg > expensive read-before-write every time). That's an ultra-niche > concern likely affecting nobody, especially when multiplied by the > odds that anyone is using that stack at all (considering that > streaming rep has apparently been borked for years on linux/sparc/ext4 > and nobody told us). Seems like the patch should at least make this error out? Greetings, Andres Freund
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Jan 23, 2022 at 7:52 AM Noah Misch wrote: > Attached. With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl. Without > the patch, it failed many times, always within 1.3hr. For easier review, this > patch uses the new behavior on all platforms. Before commit and back-patch, I > plan to limit use of the new behavior to sparc Linux. Future work can > benchmark the new behavior and, if it performs well, make it unconditional in > v15+. I would expect performance to be unchanged or slightly better, because > the new behavior requests less futile work from the OS. One detail is that wal_level=open_datasync, wal_senders=0, wal_level=minimal will panic, because O_DIRECT requires fs page-aligned access (and fails in various other ways on other OSes, eg expensive read-before-write every time). That's an ultra-niche concern likely affecting nobody, especially when multiplied by the odds that anyone is using that stack at all (considering that streaming rep has apparently been borked for years on linux/sparc/ext4 and nobody told us). I was +1 for the control file locking change in that other thread (I view the atomicity stuff as Linux-realpolitik-vs-POSIX, not to mention that we run Windows too, which requires separate analysis). I'm less sure it makes sense to do anything to support the presumed bogus zeroes bug for (probably) no real users, especially before we've even reported it and heard some analysis, for example acceptance that it's broken and confirmation that this really is just a sparc problem.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Jan 16, 2022 at 01:02:41PM -0800, Noah Misch wrote: > My next steps: > > - Report a Debian bug for the sparc64+ext4 zeros problem. (Not done yet.) > - Try to falsify the idea that "write only the not-already-written portion of > a WAL block" is an effective workaround. Specifically, modify the test > program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k] > while the reader process reads offset N. If the reader sees a zero, that > workaround is ineffective. The reader did not see a zero. In addition to bytes outside the write being immune to the zeros bug, the first and last forty bytes of a write were immune to the zeros bug. > - Implement the workaround, if I didn't falsify its effectiveness. If it > doesn't hurt performance on x86_64, we can use it unconditionally. > Otherwise, limit its use to sparc64 Linux. Attached. With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl. Without the patch, it failed many times, always within 1.3hr. For easier review, this patch uses the new behavior on all platforms. Before commit and back-patch, I plan to limit use of the new behavior to sparc Linux. Future work can benchmark the new behavior and, if it performs well, make it unconditional in v15+. I would expect performance to be unchanged or slightly better, because the new behavior requests less futile work from the OS. Author: Noah Misch Commit: Noah Misch On sparc Linux, write each WAL byte just once. On sparc64, an ext4 filesystem bug can make readers see zeros if another process simultaneously wrote the same offsets. Per buildfarm member kittiwake, this could make a streaming standby fail to advance, reporting corrupt WAL. It could make COMMIT PREPARED fail with "could not read two-phase state from WAL". Non-WAL PostgreSQL reads haven't been affected, likely because those readers and writers have buffering systems in common. Fix this by writing each WAL byte just once, instead of rewriting every byte of a page whenever writing some new byte of that page. This problem is not new, but buildfarm failures became frequent when commits 3cd9c3b921977272e6650a5efbeade4203c4bca2 and f47ed79cc8a0cfa154dc7f01faaf59822552363f added tests with concurrency. Back-patch to v10 (all supported versions). Reviewed by FIXME. Discussion: https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e073121..68fe510 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -130,6 +130,19 @@ intwal_segment_size = DEFAULT_XLOG_SEG_SIZE; #define NUM_XLOGINSERT_LOCKS 8 /* + * Should XLogWrite() skip the previously-written prefix of the block when + * LogwrtResult falls in the middle of a block, or should it write whole + * blocks always? The latter is traditional PostgreSQL behavior, but the + * former avoids a sparc64+ext4 bug extant as of Linux 5.15.5: + * https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com + */ +#if defined(__sparc__) && defined(__linux__) +#define SKIP_PREVIOUSLY_WRITTEN 1 +#else +#define SKIP_PREVIOUSLY_WRITTEN 1 /* FIXME change to 0 before commit */ +#endif + +/* * Max distance from last checkpoint, before triggering a new xlog-based * checkpoint. */ @@ -2468,6 +2481,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) boollast_iteration; boolfinishing_seg; int curridx; + Sizeskip; int npages; int startidx; uint32 startoffset; @@ -2483,12 +2497,15 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) /* * Since successive pages in the xlog cache are consecutively allocated, * we can usually gather multiple pages together and issue just one -* write() call. npages is the number of pages we have determined can be -* written together; startidx is the cache block index of the first one, -* and startoffset is the file offset at which it should go. The latter -* two variables are only valid when npages > 0, but we must initialize -* all of them to keep the compiler quiet. -*/ +* write() call. skip is the already-written portion of the first page. +* npages is the number of pages we have determined can be written +* together; startidx is the cache block index of the first one, and +* startoffset is the file offset at which the first byte (possibly +* mid-page) should go. The latter two variables are only valid when +* npages > 0, but we must initialize all of them to keep the compiler +* quiet. +*/ + skip = 0; npages = 0; startidx = 0; startoffset =
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Fri, Jan 21, 2022 at 08:34:22AM +1300, Thomas Munro wrote: > On Mon, Jan 17, 2022 at 10:02 AM Noah Misch wrote: > > - Report a Debian bug for the sparc64+ext4 zeros problem. > > I suspect that 027_stream_regress.pl hits this kernel bug with high > probability[1]. I wonder if the owner of kittiwake and tadarida would > consider setting up an xfs file system? Or alternatively, since ext4 > didn't support concurrent writes until recently, I wonder if there is > an option somewhere to turn the new concurrency stuff off, or failing > that, if we could temporarily downgrade the kernel to an older version > that does inode-level read/write locking. If the write-only-new-bytes approach works, I think we'd want to revert those changes. Perhaps a cheaper stopgap is to make the affected tests skip on sparc Linux. Is that worth doing? (Could even limit the skip to ext4, e.g. by testing "df -x ext4 . >/dev/null".) > [1] > https://www.postgresql.org/message-id/CA%2BhUKG%2BeuZ%3Ddc27ZB%3Ds74x0q%3DzU%3D2%3Dvs8%2B6TkJoTUiCPUd2dQA%40mail.gmail.com
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Mon, Jan 17, 2022 at 10:02 AM Noah Misch wrote: > - Report a Debian bug for the sparc64+ext4 zeros problem. I suspect that 027_stream_regress.pl hits this kernel bug with high probability[1]. I wonder if the owner of kittiwake and tadarida would consider setting up an xfs file system? Or alternatively, since ext4 didn't support concurrent writes until recently, I wonder if there is an option somewhere to turn the new concurrency stuff off, or failing that, if we could temporarily downgrade the kernel to an older version that does inode-level read/write locking. [1] https://www.postgresql.org/message-id/CA%2BhUKG%2BeuZ%3Ddc27ZB%3Ds74x0q%3DzU%3D2%3Dvs8%2B6TkJoTUiCPUd2dQA%40mail.gmail.com
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Cancel that kernel upgrade idea. I no longer expect it to help... On Sun, Jan 16, 2022 at 10:19:30PM +1300, Thomas Munro wrote: > On Sun, Jan 16, 2022 at 8:12 PM Noah Misch wrote: > > For specifics of the kernel bug, see the attached test program. In brief, > > the > > bug arises if one process is write()ing or pwrite()ing a file at about the > > same time that another process is read()ing or pread()ing the same. POSIX > > says the reader should see the data as it existed before the write or the > > newly-written data. On this kernel, the reader can see zeros instead. That > > leads to the $SUBJECT failure. PostgreSQL processes write out a given WAL > > block 20-30 times in ~10ms, and COMMIT PREPARED reads that block. The > > writers > > aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from > > the kernel bug yield the failure. The difference between kittiwake and thorntail comes from thorntail using xfs and kittiwake using ext4. Running the io-rectitude.c tests on an ext4 partition on thorntail, I see the zeros bug just like I do on kittiwake. I don't see the zeros bug on ppc64 or x86_64, just sparc64 so far: * ext4, Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009): * pwrite/pread is non-atomic if count>16 (no -D switches) * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000) * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0) * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0) * * ext4, Linux 4.9.0-13-sparc64-smp (Debian): * pwrite/pread is non-atomic if count>4 (no -D switches) * write/read is non-atomic if count>4 (-DUSE_SEEK) * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000) * pwrite/pread has zeros bug for count>127 (-DCHANGE_CONTENT=0) * pwrite/pread w/ O_SYNC has zeros bug (-DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC) *far less frequent w/ O_SYNC, but it still happens * pwrite/pread w/o REOPEN also has zeros bug for count>127 (-DCHANGE_CONTENT=0 -DREOPEN=0) * write/read has zeros bug for count>127 (-DUSE_SEEK -DCHANGE_CONTENT=0) * write/read w/ O_SYNC has zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC) * write/read w/o REOPEN is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 -DREOPEN=0) * * ext4, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid): * [behaviors match the previous kernel exactly] * * ext4, Linux 5.15.0-2-powerpc64 (Debian bookworm/sid): * [atomicity matches previous kernel, but zeros bug does not] * pwrite/pread is non-atomic if count>4 (no -D switches) * write/read is non-atomic if count>4 (-DUSE_SEEK) * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000) * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0) * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0) * * ext4, Linux 5.15.5-0-virt x86_64 (Alpine): * [behaviors match the previous kernel exactly] * * xfs, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid): * pwrite/pread is atomic (-DXLOG_BLCKSZ=8192000) * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000) * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0) * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0) > > We could opt to work around that by writing > > only the not-already-written portion of a WAL block, but I doubt that's > > worthwhile unless it happens to be a performance win anyway. My next steps: - Report a Debian bug for the sparc64+ext4 zeros problem. - Try to falsify the idea that "write only the not-already-written portion of a WAL block" is an effective workaround. Specifically, modify the test program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k] while the reader process reads offset N. If the reader sees a zero, that workaround is ineffective. - Implement the workaround, if I didn't falsify its effectiveness. If it doesn't hurt performance on x86_64, we can use it unconditionally. Otherwise, limit its use to sparc64 Linux. > > Separately, while I don't know of relevance to PostgreSQL, I was interested > > to > > see that CentOS 7 pwrite()/pread() fail to have the POSIX-required > > atomicity. > > FWIW there was some related discussion over here: > > https://www.postgresql.org/message-id/flat/17064-bb0d7904ef72add3%40postgresql.org That gave me the idea to test different filesystems. Thanks. Incidentally, I find https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic is mistaken about POSIX requirements. There's no precedent for POSIX writing "two threads" when it means "two threads of the same process". Moreover, the part about "shall also apply whenever a file descriptor is successfully closed, however caused (for example [...] process termination)" would be superfluous in a requirement specific to threads of one process. Having said that, if the most-prominent POSIX regular file implementation (ext4 on x86_64) doesn't implement a POSIX requirement, that has the same practical
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Jan 16, 2022 at 8:12 PM Noah Misch wrote: > For specifics of the kernel bug, see the attached test program. In brief, the > bug arises if one process is write()ing or pwrite()ing a file at about the > same time that another process is read()ing or pread()ing the same. POSIX > says the reader should see the data as it existed before the write or the > newly-written data. On this kernel, the reader can see zeros instead. That > leads to the $SUBJECT failure. PostgreSQL processes write out a given WAL > block 20-30 times in ~10ms, and COMMIT PREPARED reads that block. The writers > aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from > the kernel bug yield the failure. We could opt to work around that by writing > only the not-already-written portion of a WAL block, but I doubt that's > worthwhile unless it happens to be a performance win anyway. > > Separately, while I don't know of relevance to PostgreSQL, I was interested to > see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity. FWIW there was some related discussion over here: https://www.postgresql.org/message-id/flat/17064-bb0d7904ef72add3%40postgresql.org
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Fri, Nov 19, 2021 at 09:18:23PM -0800, Noah Misch wrote: > On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote: > > On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote: > > > Noah Misch writes: > > > > Each of the three failures happened on a sparc64 Debian+gcc machine. I > > > > had > > > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal, > > > > without reproducing this. > > > > > # 'pgbench: error: client 0 script 1 aborted in command > > > 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: > > > unexpected pageaddr 0/0 in log segment 00010001, offset > > > 5890048 > > > [1] > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24 > > > > Two others: > > ERROR: could not read two-phase state from WAL at 0/16F1850: invalid > > record length at 0/16F1850: wanted 24, got 0 > > -- > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15 > > ERROR: could not read two-phase state from WAL at 0/1668020: incorrect > > resource manager data checksum in record at 0/1668020 > > -- > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52 > > I don't have a great theory, but here are candidates to examine next: > > > > - Run with wal_debug=on to confirm logged write location matches read > > location. > > - Run > > "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata > > pg_waldump -s 0/0100" at the end of the test. > > - Dump WAL page binary image at the point of failure. > > - Log which branches in XLogReadRecord() are taken. > > Tom Turelinckx, are you able to provide remote access to kittiwake or > tadarida? I'd use it to attempt the above things. All else being equal, > kittiwake is more relevant since it's still supported upstream. Thanks for setting up access. Summary: this kernel has a bug in I/O syscalls. How practical is it to update that kernel? (Userland differs across these animals, but the kernel does not.) The kernel on buildfarm member thorntail doesn't exhibit the bug. For specifics of the kernel bug, see the attached test program. In brief, the bug arises if one process is write()ing or pwrite()ing a file at about the same time that another process is read()ing or pread()ing the same. POSIX says the reader should see the data as it existed before the write or the newly-written data. On this kernel, the reader can see zeros instead. That leads to the $SUBJECT failure. PostgreSQL processes write out a given WAL block 20-30 times in ~10ms, and COMMIT PREPARED reads that block. The writers aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from the kernel bug yield the failure. We could opt to work around that by writing only the not-already-written portion of a WAL block, but I doubt that's worthwhile unless it happens to be a performance win anyway. Separately, while I don't know of relevance to PostgreSQL, I was interested to see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity. > The setup of your buildfarm animals provides a clue. I understand kittiwake > is the same as ibisbill except for build options, and tadarida is the same as > mussurana except for build options. ibisbill and mussurana haven't failed, so > one of these is likely needed to reproduce: > > absence of --enable-cassert > CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security ' > CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2' > LDFLAGS='-Wl,-z,relro -Wl,-z,now' That was a red herring. ibisbill and mussurana don't use --with-tap-tests. Adding --with-tap-tests has been enough to make their configurations witness the same kinds of failures. nm /* * Stress-test pread(), pwrite(), read(), and write() to detect a few problems * with their handling of regular files: * * - Lack of atomicity. * https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07 * requires atomicity. (An implementation may always return <= 1; if it * chooses to return higher values, it must maintain atomicity.) * * - Transiently making readers see zeros when they read concurrently with a * write, even if the file had no zero at that offset before or after the * write. * * By default, this program will print "mismatch" messages if pwrite() is * non-atomic. Build with other options to test other behaviors: * -DCHANGE_CONTENT=0 tests the zeros bug instead of plain atomicity * -DUSE_SEEK=1 tests write()/read() instead of pwrite()/pread() * -DREOPEN=0 reuses the same file descriptor across iterations * -DXLOG_BLCKSZ=32 tests a different byte count *high values may require "ulimit -Ss" changes * * * Observed behaviors: * * Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009): * pwrite/pread is non-atomic if count>16 (no -D switches) * write/read is atomic (-DUSE
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
I wrote: > snapper just exhibited the same failure, too: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2021-11-18%2016%3A09%3A49 I grepped the buildfarm logs for all recent (last 3 months) occurrences of 'could not read two-phase state'. Here's the results: sysname |branch | snapshot | stage | l ---+---+-+-+ kittiwake | REL9_6_STABLE | 2021-10-24 12:01:10 | pgbenchCheck| # 'client 1 aborted in state 3: ERROR: could not read two-phase state from xlog at 0/158F4E0 kittiwake | REL_13_STABLE | 2021-10-26 12:51:11 | ContribCheck-en_US.utf8 | # 'pgbench: error: client 3 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/168C8D8 kittiwake | REL_14_STABLE | 2021-11-08 15:42:35 | ContribCheck-en_US.utf8 | # 'pgbench: error: client 0 script 0 aborted in command 3 query 0: ERROR: could not read two-phase state from WAL at 0/17ABF48 kittiwake | REL_13_STABLE | 2021-11-16 15:00:52 | ContribCheck-en_US.utf8 | # 'pgbench: error: client 3 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at 0/1668020 snapper | REL_14_STABLE | 2021-11-18 16:09:49 | contrib-amcheckCheck| # 'pgbench: error: client 3 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/1770328: unexpected pageaddr 0/0 in log segment 00010001, offset 7798784 tadarida | REL_11_STABLE | 2021-11-11 13:29:58 | pgbenchCheck| # 'client 3 aborted in command 3 (SQL) of script 0; ERROR: could not read two-phase state from WAL at 0/1716C68 tadarida | REL_10_STABLE | 2021-11-12 13:01:15 | pgbenchCheck| # 'client 4 aborted in command 3 of script 0; ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0 tadarida | HEAD | 2021-11-17 13:01:24 | contrib-amcheckCheck| # 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 00010001, offset 5890048 So not all are exactly 'unexpected pageaddr 0/0', but they do all look like we read garbage data. regards, tom lane
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Noah Misch writes: > Tom Turelinckx, are you able to provide remote access to kittiwake or > tadarida? I'd use it to attempt the above things. All else being equal, > kittiwake is more relevant since it's still supported upstream. snapper just exhibited the same failure, too: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2021-11-18%2016%3A09%3A49 regards, tom lane
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote: > On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote: > > Noah Misch writes: > > > Each of the three failures happened on a sparc64 Debian+gcc machine. I > > > had > > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal, > > > without reproducing this. > > > # 'pgbench: error: client 0 script 1 aborted in command 4 > > query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: > > unexpected pageaddr 0/0 in log segment 00010001, offset > > 5890048 > > [1] > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24 > > Two others: > ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record > length at 0/16F1850: wanted 24, got 0 > -- > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15 > ERROR: could not read two-phase state from WAL at 0/1668020: incorrect > resource manager data checksum in record at 0/1668020 > -- > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52 > > > I suppose "unexpected pageaddr 0/0" is most easily explained by supposing > > that XlogReadTwoPhaseData tried to read a WAL page that hadn't been > > written out yet. Have we got any synchronization around that? > > If the WAL address isn't on disk, that error doesn't happen. Instead, > read_local_xlog_page() blocks waiting for the WAL to become available. It's > still possible that we make the WAL region exist, but it somehow doesn't > contain the right data until shortly later. FinishPreparedTransaction() takes > TwoPhaseStateLock and looks for an entry having gxact->valid. EndPrepare() > fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid > under TwoPhaseStateLock. All freelist (freeGXacts) interaction holds > TwoPhaseStateLock. I'm not seeing a gap in that synchronization. > > I don't have a great theory, but here are candidates to examine next: > > - Run with wal_debug=on to confirm logged write location matches read > location. > - Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata > pg_waldump -s 0/0100" at the end of the test. > - Dump WAL page binary image at the point of failure. > - Log which branches in XLogReadRecord() are taken. Tom Turelinckx, are you able to provide remote access to kittiwake or tadarida? I'd use it to attempt the above things. All else being equal, kittiwake is more relevant since it's still supported upstream. The setup of your buildfarm animals provides a clue. I understand kittiwake is the same as ibisbill except for build options, and tadarida is the same as mussurana except for build options. ibisbill and mussurana haven't failed, so one of these is likely needed to reproduce: absence of --enable-cassert CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security ' CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2' LDFLAGS='-Wl,-z,relro -Wl,-z,now' However, I can't reproduce this on thorntail, even if I use tadarida's ./configure options, CFLAGS, CPPFLAGS, and LDFLAGS.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Andrey Borodin writes: > Let's add more tests that check survival of 2PC through crash recovery? We do > now only one restart. Maybe it worth to do 4 or 8? That seems a little premature when we can't explain the failure we have. Also, buildfarm cycles aren't free. regards, tom lane
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
> 18 нояб. 2021 г., в 12:05, Noah Misch написал(а): > > What else might help? Let's add more tests that check survival of 2PC through crash recovery? We do now only one restart. Maybe it worth to do 4 or 8? Best regards, Andrey Borodin.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote: > Noah Misch writes: > > Each of the three failures happened on a sparc64 Debian+gcc machine. I had > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal, > > without reproducing this. > # 'pgbench: error: client 0 script 1 aborted in command 4 > query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: > unexpected pageaddr 0/0 in log segment 00010001, offset > 5890048 > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24 Two others: ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0 -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15 ERROR: could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at 0/1668020 -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52 > I suppose "unexpected pageaddr 0/0" is most easily explained by supposing > that XlogReadTwoPhaseData tried to read a WAL page that hadn't been > written out yet. Have we got any synchronization around that? If the WAL address isn't on disk, that error doesn't happen. Instead, read_local_xlog_page() blocks waiting for the WAL to become available. It's still possible that we make the WAL region exist, but it somehow doesn't contain the right data until shortly later. FinishPreparedTransaction() takes TwoPhaseStateLock and looks for an entry having gxact->valid. EndPrepare() fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid under TwoPhaseStateLock. All freelist (freeGXacts) interaction holds TwoPhaseStateLock. I'm not seeing a gap in that synchronization. I don't have a great theory, but here are candidates to examine next: - Run with wal_debug=on to confirm logged write location matches read location. - Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata pg_waldump -s 0/0100" at the end of the test. - Dump WAL page binary image at the point of failure. - Log which branches in XLogReadRecord() are taken. What else might help?
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
Noah Misch writes: > Tom Lane reported another instance today: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-11%2013%3A29%3A58 > Each of the three failures happened on a sparc64 Debian+gcc machine. I had > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal, > without reproducing this. >>> As a first step, let's report the actual XLogReadRecord() error message. >>> Attached. >> Good catch! This looks good. > Pushed. Well, we didn't have to wait too long [1]: # at t/003_cic_2pc.pl line 143. # 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 00010001, offset 5890048 # pgbench: error: client 2 script 3 aborted in command 2 query 0: ERROR: canceling statement due to lock timeout # pgbench: fatal: Run was aborted; the above results are incomplete. I suppose "unexpected pageaddr 0/0" is most easily explained by supposing that XlogReadTwoPhaseData tried to read a WAL page that hadn't been written out yet. Have we got any synchronization around that? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Mon, Nov 08, 2021 at 01:42:46PM +0900, Michael Paquier wrote: > On Sat, Nov 06, 2021 at 06:31:57PM -0700, Noah Misch wrote: > > On Sun, Oct 24, 2021 at 04:35:02PM -0700, Noah Misch wrote: > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-10-24%2012%3A01%3A10 > > > got an interesting v9.6 failure [...]: > > > > > > 2021-10-24 14:25:29.263 CEST [34569:175] pgbench ERROR: could not read > > > two-phase state from xlog at 0/158F4E0 > > > 2021-10-24 14:25:29.263 CEST [34569:176] pgbench STATEMENT: COMMIT > > > PREPARED 'c1'; Tom Lane reported another instance today: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-11%2013%3A29%3A58 Each of the three failures happened on a sparc64 Debian+gcc machine. I had tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal, without reproducing this. > > As a first step, let's report the actual XLogReadRecord() error message. > > Attached. > > Good catch! This looks good. Pushed.
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Mon, Nov 08, 2021 at 01:42:46PM +0900, Michael Paquier wrote: > Indeed. Looking closer, I think that we'd better improve > DecodingContextFindStartpoint(), > pg_logical_replication_slot_advance(), XLogSendLogical() as well as > pg_logical_slot_get_changes_guts() to follow a format closer to what > you have in your patch, with an error message that describes the > context where the problem has been found, instead of just elog()'ing > what XLogReadRecord() returns. FYI, I have just begun a new thread about those ones: https://www.postgresql.org/message-id/yynth6oyowqca...@paquier.xyz -- Michael signature.asc Description: PGP signature
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
> 7 нояб. 2021 г., в 06:31, Noah Misch написал(а): > > As a first step, let's report the actual XLogReadRecord() error message. > Attached. All the other sites that expect no error already do this. BTW some time ago I've spotted a good number of related unreported errors [0]. [0] https://www.postgresql.org/message-id/A0A36AEE-3476-4326-B877-EE2B55BAEEED%40yandex-team.ru
Re: XLogReadRecord() error in XlogReadTwoPhaseData()
On Sat, Nov 06, 2021 at 06:31:57PM -0700, Noah Misch wrote: > As a first step, let's report the actual XLogReadRecord() error message. > Attached. Good catch! This looks good. > All the other sites that expect no error already do this. Indeed. Looking closer, I think that we'd better improve DecodingContextFindStartpoint(), pg_logical_replication_slot_advance(), XLogSendLogical() as well as pg_logical_slot_get_changes_guts() to follow a format closer to what you have in your patch, with an error message that describes the context where the problem has been found, instead of just elog()'ing what XLogReadRecord() returns. -- Michael signature.asc Description: PGP signature
XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Oct 24, 2021 at 04:35:02PM -0700, Noah Misch wrote: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-10-24%2012%3A01%3A10 > got an interesting v9.6 failure [...]: > > 2021-10-24 14:25:29.263 CEST [34569:175] pgbench ERROR: could not read > two-phase state from xlog at 0/158F4E0 > 2021-10-24 14:25:29.263 CEST [34569:176] pgbench STATEMENT: COMMIT PREPARED > 'c1'; As a first step, let's report the actual XLogReadRecord() error message. Attached. All the other sites that expect no error already do this. Author: Noah Misch Commit: Noah Misch Report any XLogReadRecord() error in XlogReadTwoPhaseData(). Buildfarm member kittiwake has witnessed errors at this site. The site discarded key facts. Back-patch to 9.6 (all supported versions). Reviewed by FIXME. Discussion: https://postgr.es/m/FIXME diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index ef4b5f6..28b153a 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1397,10 +1397,18 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len) record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) - ereport(ERROR, - (errcode_for_file_access(), -errmsg("could not read two-phase state from WAL at %X/%X", - LSN_FORMAT_ARGS(lsn; + { + if (errormsg) + ereport(ERROR, + (errcode_for_file_access(), +errmsg("could not read two-phase state from WAL at %X/%X: %s", + LSN_FORMAT_ARGS(lsn), errormsg))); + else + ereport(ERROR, + (errcode_for_file_access(), +errmsg("could not read two-phase state from WAL at %X/%X", + LSN_FORMAT_ARGS(lsn; + } if (XLogRecGetRmid(xlogreader) != RM_XACT_ID || (XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != XLOG_XACT_PREPARE)