Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-03-03 Thread Noah Misch
On Sun, Jan 16, 2022 at 01:02:41PM -0800, Noah Misch wrote:
> My next steps:
> 
> - Report a Debian bug for the sparc64+ext4 zeros problem.

Reported to Debian, then upstream:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006157
https://marc.info/?t=16453926991

Last week, someone confirmed the behavior on kernel 5.17.0-rc5.  No other news.




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-02-01 Thread Noah Misch
On Mon, Jan 24, 2022 at 12:02:43AM -0800, Noah Misch wrote:
> For 003_cic_2pc.pl, I'm
> fine using $TODO so we continue to run all test commands and quietly log their
> results.  For 027_stream_regress.pl, which would need deep changes to use
> $TODO, it works to use any of todo_skip, skip, or skip_all.  I prefer
> skip_all, because it prints the skip reason to gmake's stdout.  (If the number
> of affected users is zero as theorized, the choice doesn't matter.)  Any
> objections?

I pushed that on 2022-01-26 as ce6d793, but it wasn't enough:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper=2022-02-01%2013%3A29%3A58

In older Test::More versions, $TODO only works if tests know the right package
in which to examine $that_pkg::TODO.  In PostgreSQL v12+, all relevant
functions set $Test::Builder::Level, so Test::More finds $main::TODO as
intended.  This would be one way to fix v11:

--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1869,4 +1869,6 @@ command_ok(...)
 sub command_checks_all
 {
+   local $Test::Builder::Level = $Test::Builder::Level + 1;
+
my $self = shift;
 
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 0c56156..2ab78b4 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -598,4 +598,6 @@ sub command_fails_like
 sub command_checks_all
 {
+   local $Test::Builder::Level = $Test::Builder::Level + 1;
+
my ($cmd, $expected_ret, $out, $err, $test_name) = @_;

However, I'm inclined to use a less-fragile way, in all branches:

--- a/src/bin/pgbench/t/023_cic_2pc.pl
+++ b/src/bin/pgbench/t/023_cic_2pc.pl
@@ -14 +14,2 @@ use Test::More tests => 6;
-local $TODO = 'filesystem bug' if TestLib::has_wal_read_bug;
+Test::More->builder->todo_start('filesystem bug')
+  if TestLib::has_wal_read_bug;

None of this matters under newer Test::More.  kittiwake's Test::More 1.302133
is new enough not to care, but tadarida's Test::More 1.001014 is not.  The
Test::More->builder->todo_start() syntax works in 0.82, if not earlier.
(PostgreSQL v10 requires Test::More 0.87.)




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-24 Thread Noah Misch
On Sun, Jan 23, 2022 at 06:34:32PM -0800, Andres Freund wrote:
> On 2022-01-23 18:10:07 -0800, Noah Misch wrote:
> > On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote:
> > > Test::more's description: "If it's something the programmer hasn't done 
> > > yet,
> > > use TODO. This is for any code you haven't written yet, or bugs you have 
> > > yet
> > > to fix, but want to put tests in your testing script (always a good 
> > > idea)."
> >
> > Could do that.  Every run that doesn't get the flaky failure will print a
> > message like "TODO passed:  3-5", though the test file could mitigate that 
> > by
> > declaring the TODO only on configurations where we expect a failure.
> 
> Yea, that's what I was thinking we'd do.
> 
> > The
> > 027_stream_regress.pl trouble involves reaching a die(), not failing a test 
> > in
> > this sense, so that one would take more work.
> 
> I'm really not a perl person... But my understanding is that todo_skip() would
> address this? I.e. something like
> 
> TODO:
> {
>$todo_skip "linux/sparc has unaddressed problems with partial page  
> overwrites"
>  if ($^O eq 'linux' and $Config{archname'} ~= 'sparc');
> 
>ok(whatever is broken);
> }

Yes.  todo_skip() behaves much like regular skip().  The enclosed tests don't
run.  Hence, it prevents die() and BAIL_OUT() failures.  $TODO is a different
beast; tests still run, and it changes the reporting.  For 003_cic_2pc.pl, I'm
fine using $TODO so we continue to run all test commands and quietly log their
results.  For 027_stream_regress.pl, which would need deep changes to use
$TODO, it works to use any of todo_skip, skip, or skip_all.  I prefer
skip_all, because it prints the skip reason to gmake's stdout.  (If the number
of affected users is zero as theorized, the choice doesn't matter.)  Any
objections?  Here's the appearance of each strategy on gmake's stdout:

=== todo_skip
[23:56:45] t/003_cic_2pc.pl .. ok   60 ms ( 0.00 usr  0.00 sys +  0.05 cusr 
 0.00 csys =  0.05 CPU)
[23:56:45]
All tests successful.
Files=1, Tests=5,  1 wallclock secs ( 0.01 usr  0.00 sys +  0.05 cusr  0.00 
csys =  0.06 CPU)
Result: PASS

=== skip
[23:55:47] t/003_cic_2pc.pl .. ok   59 ms ( 0.00 usr  0.00 sys +  0.05 cusr 
 0.00 csys =  0.05 CPU)
[23:55:48]
All tests successful.
Files=1, Tests=5,  1 wallclock secs ( 0.01 usr  0.00 sys +  0.05 cusr  0.00 
csys =  0.06 CPU)
Result: PASS

=== skip_all
[23:31:04] t/003_cic_2pc.pl .. skipped: filesystem bug
[23:31:04]
Files=1, Tests=0,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.05 cusr  0.00 
csys =  0.07 CPU)
Result: NOTESTS

=== $TODO, test 1 is expected fail, tests 2-5 are unexpected pass
[23:32:32] t/003_cic_2pc.pl .. ok 1371 ms ( 0.01 usr  0.00 sys +  0.51 cusr 
 0.24 csys =  0.76 CPU)
[23:32:33]
All tests successful.

Test Summary Report
---
t/003_cic_2pc.pl (Wstat: 0 Tests: 5 Failed: 0)
  TODO passed:   2-5
Files=1, Tests=5,  1 wallclock secs ( 0.02 usr  0.00 sys +  0.51 cusr  0.24 
csys =  0.77 CPU)
Result: PASS




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Andres Freund
On 2022-01-23 21:25:04 -0500, Tom Lane wrote:
> Michael Paquier  writes:
> > On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote:
> >> Could do that.  Every run that doesn't get the flaky failure will print a
> >> message like "TODO passed:  3-5", though the test file could mitigate that 
> >> by
> >> declaring the TODO only on configurations where we expect a failure.  The
> >> 027_stream_regress.pl trouble involves reaching a die(), not failing a 
> >> test in
> >> this sense, so that one would take more work.
> 
> > Using a TODO has the advantage to allow the tests to run on a periodic
> > basis, even if they could fail in this unexpected way.
> 
> I'm okay with this *if* the TODO marking can be constrained to platforms
> where we know there's a problem.  Otherwise I'm afraid it will mask
> unexpected problems.

Yep. Very weird syntax... TODO blocks only take effect if a local $TODO is set
(to a text explaining the reason hopefully) or if there's a todo_skip. So
something like

TODO:
{
   local $TODO = 'linux + sparc is borked' if $^O eq 'linux' and ...;

   tests...
}

would run the tests and accept failures if the if matches, and otherwise not.

Greetings,

Andres Freund




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Andres Freund
Hi,

On 2022-01-23 18:10:07 -0800, Noah Misch wrote:
> On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote:
> > Test::more's description: "If it's something the programmer hasn't done yet,
> > use TODO. This is for any code you haven't written yet, or bugs you have yet
> > to fix, but want to put tests in your testing script (always a good idea)."
>
> Could do that.  Every run that doesn't get the flaky failure will print a
> message like "TODO passed:  3-5", though the test file could mitigate that by
> declaring the TODO only on configurations where we expect a failure.

Yea, that's what I was thinking we'd do.


> The
> 027_stream_regress.pl trouble involves reaching a die(), not failing a test in
> this sense, so that one would take more work.

I'm really not a perl person... But my understanding is that todo_skip() would
address this? I.e. something like

TODO:
{
   $todo_skip "linux/sparc has unaddressed problems with partial page  
overwrites"
 if ($^O eq 'linux' and $Config{archname'} ~= 'sparc');

   ok(whatever is broken);
}

(no idea if the above todo condition even approximates something working)

Greetings,

Andres Freund




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote:
>> Could do that.  Every run that doesn't get the flaky failure will print a
>> message like "TODO passed:  3-5", though the test file could mitigate that by
>> declaring the TODO only on configurations where we expect a failure.  The
>> 027_stream_regress.pl trouble involves reaching a die(), not failing a test 
>> in
>> this sense, so that one would take more work.

> Using a TODO has the advantage to allow the tests to run on a periodic
> basis, even if they could fail in this unexpected way.

I'm okay with this *if* the TODO marking can be constrained to platforms
where we know there's a problem.  Otherwise I'm afraid it will mask
unexpected problems.

regards, tom lane




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Michael Paquier
On Sun, Jan 23, 2022 at 06:10:07PM -0800, Noah Misch wrote:
> Could do that.  Every run that doesn't get the flaky failure will print a
> message like "TODO passed:  3-5", though the test file could mitigate that by
> declaring the TODO only on configurations where we expect a failure.  The
> 027_stream_regress.pl trouble involves reaching a die(), not failing a test in
> this sense, so that one would take more work.

Using a TODO has the advantage to allow the tests to run on a periodic
basis, even if they could fail in this unexpected way.  We've had our
load of issues in the past proper to specific architectures, so this
could help with the 2PC code paths and SPARC.  Not to mention that we
could just let the code be as-is, and it would fix itself once the
kernel is updated.  So that sounds much better to me than a skip
phase.
--
Michael


signature.asc
Description: PGP signature


Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Noah Misch
On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote:
> On 2022-01-23 17:17:59 -0800, Noah Misch wrote:
> > On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote:
> > > On January 23, 2022 3:29:27 PM PST
> > > >(a) Modify the tests so the affected animals can skip affected tests by
> > > >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or 
> > > >similar.
> > > 
> > > Why not just detect the problem in the tap test and skip, rather than 
> > > requiring multiple buildfarm configs to be changed as well as the test 
> > > itself? 
> > 
> > End users running PostgreSQL test suites to acceptance-test their stack 
> > should
> > consider the affected stack unusable for PostgreSQL.
> 
> I'd bet that that's zero users ;)

Wouldn't surprise me.  I'm attaching what I had written and discarded.  If
nobody else hates it, I can live with it.

> > Hence, I ruled out that
> > approach, despite having implemented it at one point.  Under some plausible
> > set of goals, it is optimal.
> 
> It's not perfect due to the way we run our tests (seeing output is hard, it's
> not aggregated), but marking the test as todo rather than SKIP seems like the
> most appropriate test status. It's known to be a problem, we've not fixed it,
> but we want to be able to run the tests.
> 
> Test::more's description: "If it's something the programmer hasn't done yet,
> use TODO. This is for any code you haven't written yet, or bugs you have yet
> to fix, but want to put tests in your testing script (always a good idea)."

Could do that.  Every run that doesn't get the flaky failure will print a
message like "TODO passed:  3-5", though the test file could mitigate that by
declaring the TODO only on configurations where we expect a failure.  The
027_stream_regress.pl trouble involves reaching a die(), not failing a test in
this sense, so that one would take more work.
Author: Noah Misch 
Commit: Noah Misch 

On sparc64+ext4, skip tests that expect flakes from WAL read failure.

Buildfarm member kittiwake began to fail frequently when commits
3cd9c3b921977272e6650a5efbeade4203c4bca2 and
f47ed79cc8a0cfa154dc7f01faaf59822552363f added tests of concurrency, but
the problem was reachable before those commits.  Back-patch to v10 (all
supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com

diff --git a/contrib/amcheck/t/003_cic_2pc.pl b/contrib/amcheck/t/003_cic_2pc.pl
index dfe7f0f..9ef3d49 100644
--- a/contrib/amcheck/t/003_cic_2pc.pl
+++ b/contrib/amcheck/t/003_cic_2pc.pl
@@ -9,7 +9,16 @@ use Config;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 
-use Test::More tests => 5;
+use Test::More;
+
+if (PostgreSQL::Test::Utils::has_wal_read_bug)
+{
+   plan skip_all => 'filesystem bug';
+}
+else
+{
+   plan tests => 5;
+}
 
 my ($node, $result);
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm 
b/src/test/perl/PostgreSQL/Test/Utils.pm
index 50be10f..cbd2cb3 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -351,6 +351,29 @@ sub perl2host
 
 =pod
 
+=item has_wal_read_bug()
+
+Returns true if $tmp_check is subject to a sparc64+ext4 bug that causes WAL
+readers to see zeros if another process simultaneously wrote the same offsets.
+Use this to skip tests that fail frequently on affected configurations.  The
+bug has made streaming standbys fail to advance, reporting corrupt WAL.  It
+has made COMMIT PREPARED fail with "could not read two-phase state from WAL".
+Non-WAL PostgreSQL reads haven't been affected, likely because those readers
+and writers have buffering systems in common.  See
+https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com for details.
+
+=cut
+
+sub has_wal_read_bug
+{
+   return
+$Config{osname} eq 'linux'
+ && $Config{archname} =~ /^sparc/
+ && !run_log([ qw(df -x ext4), $tmp_check ], '>', '/dev/null', '2>&1');
+}
+
+=pod
+
 =item system_log(@cmd)
 
 Run (via C) the command passed as argument; the return
diff --git a/src/test/recovery/t/027_stream_regress.pl 
b/src/test/recovery/t/027_stream_regress.pl
index 8c0a8b6..2bb83d3 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -3,9 +3,18 @@ use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
-use Test::More tests => 4;
+use Test::More;
 use File::Basename;
 
+if (PostgreSQL::Test::Utils::has_wal_read_bug)
+{
+   plan skip_all => 'filesystem bug';
+}
+else
+{
+   plan tests => 4;
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);


Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Andres Freund
Hi,

On 2022-01-23 17:17:59 -0800, Noah Misch wrote:
> On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote:
> > On January 23, 2022 3:29:27 PM PST
> > >(a) Modify the tests so the affected animals can skip affected tests by
> > >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
> > 
> > Why not just detect the problem in the tap test and skip, rather than 
> > requiring multiple buildfarm configs to be changed as well as the test 
> > itself? 
> 
> End users running PostgreSQL test suites to acceptance-test their stack should
> consider the affected stack unusable for PostgreSQL.

I'd bet that that's zero users ;)


> Hence, I ruled out that
> approach, despite having implemented it at one point.  Under some plausible
> set of goals, it is optimal.

It's not perfect due to the way we run our tests (seeing output is hard, it's
not aggregated), but marking the test as todo rather than SKIP seems like the
most appropriate test status. It's known to be a problem, we've not fixed it,
but we want to be able to run the tests.

Test::more's description: "If it's something the programmer hasn't done yet,
use TODO. This is for any code you haven't written yet, or bugs you have yet
to fix, but want to put tests in your testing script (always a good idea)."

Greetings,

Andres Freund




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Noah Misch
On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote:
> On January 23, 2022 3:29:27 PM PST
> >(a) Modify the tests so the affected animals can skip affected tests by
> >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
> 
> Why not just detect the problem in the tap test and skip, rather than 
> requiring multiple buildfarm configs to be changed as well as the test 
> itself? 

End users running PostgreSQL test suites to acceptance-test their stack should
consider the affected stack unusable for PostgreSQL.  Hence, I ruled out that
approach, despite having implemented it at one point.  Under some plausible
set of goals, it is optimal.




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Andres Freund



On January 23, 2022 3:29:27 PM PST
>(a) Modify the tests so the affected animals can skip affected tests by
>setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.

Why not just detect the problem in the tap test and skip, rather than requiring 
multiple buildfarm configs to be changed as well as the test itself? 
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Tom Lane
Noah Misch  writes:
> On Mon, Jan 24, 2022 at 12:49:16PM +1300, Thomas Munro wrote:
>> Trying out a new idea: what if we could tell the buildfarm website
>> that a certain test is currently expected to fail for reasons we can't
>> fix yet (configuration change needed but owner not responding, or
>> bugfix from another project needed, etc)?  That could cause it to be
>> displayed in a different shade of green, or grey, or whatever?  Other
>> kinds of failures would still show as red.  Perhaps this would be
>> configured with a file in a git repo that any committer can push to.

> That would be a better capability to use if we had it, agreed.  Is it feasible
> to acquire that capability soon enough?

It's not merely a website issue: you'd really rather that the
buildfarm animal runs the rest of the tests rather than going belly-up
after an expected failure.  I think your suggestion about skipping
problematic tests based on an environment variable is more practical
in the near term.  We already have some cases like that, too, eg in
src/bin/psql/t/010_tab_completion.pl.

In the long term I could get behind having some less ad-hoc way
of skipping tests, but I don't think we can have that quickly.

regards, tom lane




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Noah Misch
On Mon, Jan 24, 2022 at 12:49:16PM +1300, Thomas Munro wrote:
> On Mon, Jan 24, 2022 at 12:29 PM Noah Misch  wrote:
> > On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote:
> > > I'm less
> > > sure it makes sense to do anything to support the presumed bogus
> > > zeroes bug for (probably) no real users, especially before we've even
> > > reported it and heard some analysis, for example acceptance that it's
> > > broken and confirmation that this really is just a sparc problem.
> >
> > Got it.  I've already done a bad thing leaving the buildfarm broken for 
> > three
> > months, so I don't want to let the buildfarm wait for a kernel fix.  These 
> > are
> > the two main options I'm seeing now:
> >
> > (a) Modify the tests so the affected animals can skip affected tests by
> > setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
> >
> > (b) Remove --enable-tap-tests from affected animals.
> >
> > Do you have a preference among those two or some other option that gets the
> > buildfarm green on a predictable schedule?  I somewhat prefer (a), since
> > --enable-tap-tests is where most of the interesting buildfarm reports happen
> > these days.
> 
> Trying out a new idea: what if we could tell the buildfarm website
> that a certain test is currently expected to fail for reasons we can't
> fix yet (configuration change needed but owner not responding, or
> bugfix from another project needed, etc)?  That could cause it to be
> displayed in a different shade of green, or grey, or whatever?  Other
> kinds of failures would still show as red.  Perhaps this would be
> configured with a file in a git repo that any committer can push to.

That would be a better capability to use if we had it, agreed.  Is it feasible
to acquire that capability soon enough?




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Thomas Munro
On Mon, Jan 24, 2022 at 12:29 PM Noah Misch  wrote:
> On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote:
> > I'm less
> > sure it makes sense to do anything to support the presumed bogus
> > zeroes bug for (probably) no real users, especially before we've even
> > reported it and heard some analysis, for example acceptance that it's
> > broken and confirmation that this really is just a sparc problem.
>
> Got it.  I've already done a bad thing leaving the buildfarm broken for three
> months, so I don't want to let the buildfarm wait for a kernel fix.  These are
> the two main options I'm seeing now:
>
> (a) Modify the tests so the affected animals can skip affected tests by
> setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.
>
> (b) Remove --enable-tap-tests from affected animals.
>
> Do you have a preference among those two or some other option that gets the
> buildfarm green on a predictable schedule?  I somewhat prefer (a), since
> --enable-tap-tests is where most of the interesting buildfarm reports happen
> these days.

Trying out a new idea: what if we could tell the buildfarm website
that a certain test is currently expected to fail for reasons we can't
fix yet (configuration change needed but owner not responding, or
bugfix from another project needed, etc)?  That could cause it to be
displayed in a different shade of green, or grey, or whatever?  Other
kinds of failures would still show as red.  Perhaps this would be
configured with a file in a git repo that any committer can push to.




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Noah Misch
On Mon, Jan 24, 2022 at 09:42:13AM +1300, Thomas Munro wrote:
> I'm less
> sure it makes sense to do anything to support the presumed bogus
> zeroes bug for (probably) no real users, especially before we've even
> reported it and heard some analysis, for example acceptance that it's
> broken and confirmation that this really is just a sparc problem.

Got it.  I've already done a bad thing leaving the buildfarm broken for three
months, so I don't want to let the buildfarm wait for a kernel fix.  These are
the two main options I'm seeing now:

(a) Modify the tests so the affected animals can skip affected tests by
setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or similar.

(b) Remove --enable-tap-tests from affected animals.

Do you have a preference among those two or some other option that gets the
buildfarm green on a predictable schedule?  I somewhat prefer (a), since
--enable-tap-tests is where most of the interesting buildfarm reports happen
these days.




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Andres Freund
Hi,

On 2022-01-24 09:42:13 +1300, Thomas Munro wrote:
> On Sun, Jan 23, 2022 at 7:52 AM Noah Misch  wrote:
> > Future work can benchmark the new behavior and, if it performs well, make
> > it unconditional in v15+.  I would expect performance to be unchanged or
> > slightly better, because the new behavior requests less futile work from
> > the OS.

I doubt it'll be generally applicable. Turning a write operation into a
read-write isn't free. Yes, often enough it's likely that the prior page is
still in cache, but I don't think we can rely on that in general.

It also just fundamentally locks us into never using O_DIRECT in anger. I
don't think that's a good direction.


> One detail is that wal_level=open_datasync, wal_senders=0,
> wal_level=minimal will panic, because O_DIRECT requires fs
> page-aligned access (and fails in various other ways on other OSes, eg
> expensive read-before-write every time).  That's an ultra-niche
> concern likely affecting nobody, especially when multiplied by the
> odds that anyone is using that stack at all (considering that
> streaming rep has apparently been borked for years on linux/sparc/ext4
> and nobody told us).

Seems like the patch should at least make this error out?

Greetings,

Andres Freund




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-23 Thread Thomas Munro
On Sun, Jan 23, 2022 at 7:52 AM Noah Misch  wrote:
> Attached.  With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl.  Without
> the patch, it failed many times, always within 1.3hr.  For easier review, this
> patch uses the new behavior on all platforms.  Before commit and back-patch, I
> plan to limit use of the new behavior to sparc Linux.  Future work can
> benchmark the new behavior and, if it performs well, make it unconditional in
> v15+.  I would expect performance to be unchanged or slightly better, because
> the new behavior requests less futile work from the OS.

One detail is that wal_level=open_datasync, wal_senders=0,
wal_level=minimal will panic, because O_DIRECT requires fs
page-aligned access (and fails in various other ways on other OSes, eg
expensive read-before-write every time).  That's an ultra-niche
concern likely affecting nobody, especially when multiplied by the
odds that anyone is using that stack at all (considering that
streaming rep has apparently been borked for years on linux/sparc/ext4
and nobody told us).

I was +1 for the control file locking change in that other thread (I
view the atomicity stuff as Linux-realpolitik-vs-POSIX, not to mention
that we run Windows too, which requires separate analysis).  I'm less
sure it makes sense to do anything to support the presumed bogus
zeroes bug for (probably) no real users, especially before we've even
reported it and heard some analysis, for example acceptance that it's
broken and confirmation that this really is just a sparc problem.




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-22 Thread Noah Misch
On Sun, Jan 16, 2022 at 01:02:41PM -0800, Noah Misch wrote:
> My next steps:
> 
> - Report a Debian bug for the sparc64+ext4 zeros problem.

(Not done yet.)

> - Try to falsify the idea that "write only the not-already-written portion of
>   a WAL block" is an effective workaround.  Specifically, modify the test
>   program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k]
>   while the reader process reads offset N.  If the reader sees a zero, that
>   workaround is ineffective.

The reader did not see a zero.  In addition to bytes outside the write being
immune to the zeros bug, the first and last forty bytes of a write were immune
to the zeros bug.

> - Implement the workaround, if I didn't falsify its effectiveness.  If it
>   doesn't hurt performance on x86_64, we can use it unconditionally.
>   Otherwise, limit its use to sparc64 Linux.

Attached.  With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl.  Without
the patch, it failed many times, always within 1.3hr.  For easier review, this
patch uses the new behavior on all platforms.  Before commit and back-patch, I
plan to limit use of the new behavior to sparc Linux.  Future work can
benchmark the new behavior and, if it performs well, make it unconditional in
v15+.  I would expect performance to be unchanged or slightly better, because
the new behavior requests less futile work from the OS.
Author: Noah Misch 
Commit: Noah Misch 

On sparc Linux, write each WAL byte just once.

On sparc64, an ext4 filesystem bug can make readers see zeros if another
process simultaneously wrote the same offsets.  Per buildfarm member
kittiwake, this could make a streaming standby fail to advance,
reporting corrupt WAL.  It could make COMMIT PREPARED fail with "could
not read two-phase state from WAL".  Non-WAL PostgreSQL reads haven't
been affected, likely because those readers and writers have buffering
systems in common.  Fix this by writing each WAL byte just once, instead
of rewriting every byte of a page whenever writing some new byte of that
page.  This problem is not new, but buildfarm failures became frequent
when commits 3cd9c3b921977272e6650a5efbeade4203c4bca2 and
f47ed79cc8a0cfa154dc7f01faaf59822552363f added tests with concurrency.
Back-patch to v10 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index e073121..68fe510 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -130,6 +130,19 @@ intwal_segment_size = 
DEFAULT_XLOG_SEG_SIZE;
 #define NUM_XLOGINSERT_LOCKS  8
 
 /*
+ * Should XLogWrite() skip the previously-written prefix of the block when
+ * LogwrtResult falls in the middle of a block, or should it write whole
+ * blocks always?  The latter is traditional PostgreSQL behavior, but the
+ * former avoids a sparc64+ext4 bug extant as of Linux 5.15.5:
+ * https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com
+ */
+#if defined(__sparc__) && defined(__linux__)
+#define SKIP_PREVIOUSLY_WRITTEN 1
+#else
+#define SKIP_PREVIOUSLY_WRITTEN 1  /* FIXME change to 0 before commit */
+#endif
+
+/*
  * Max distance from last checkpoint, before triggering a new xlog-based
  * checkpoint.
  */
@@ -2468,6 +2481,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool 
flexible)
boollast_iteration;
boolfinishing_seg;
int curridx;
+   Sizeskip;
int npages;
int startidx;
uint32  startoffset;
@@ -2483,12 +2497,15 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool 
flexible)
/*
 * Since successive pages in the xlog cache are consecutively allocated,
 * we can usually gather multiple pages together and issue just one
-* write() call.  npages is the number of pages we have determined can 
be
-* written together; startidx is the cache block index of the first one,
-* and startoffset is the file offset at which it should go. The latter
-* two variables are only valid when npages > 0, but we must initialize
-* all of them to keep the compiler quiet.
-*/
+* write() call.  skip is the already-written portion of the first page.
+* npages is the number of pages we have determined can be written
+* together; startidx is the cache block index of the first one, and
+* startoffset is the file offset at which the first byte (possibly
+* mid-page) should go. The latter two variables are only valid when
+* npages > 0, but we must initialize all of them to keep the compiler
+* quiet.
+*/
+   skip = 0;
npages = 0;
startidx = 0;
startoffset = 

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-20 Thread Noah Misch
On Fri, Jan 21, 2022 at 08:34:22AM +1300, Thomas Munro wrote:
> On Mon, Jan 17, 2022 at 10:02 AM Noah Misch  wrote:
> > - Report a Debian bug for the sparc64+ext4 zeros problem.
> 
> I suspect that 027_stream_regress.pl hits this kernel bug with high
> probability[1].  I wonder if the owner of kittiwake and tadarida would
> consider setting up an xfs file system?  Or alternatively, since ext4
> didn't support concurrent writes until recently, I wonder if there is
> an option somewhere to turn the new concurrency stuff off, or failing
> that, if we could temporarily downgrade the kernel to an older version
> that does inode-level read/write locking.

If the write-only-new-bytes approach works, I think we'd want to revert those
changes.  Perhaps a cheaper stopgap is to make the affected tests skip on
sparc Linux.  Is that worth doing?  (Could even limit the skip to ext4,
e.g. by testing "df -x ext4 . >/dev/null".)

> [1] 
> https://www.postgresql.org/message-id/CA%2BhUKG%2BeuZ%3Ddc27ZB%3Ds74x0q%3DzU%3D2%3Dvs8%2B6TkJoTUiCPUd2dQA%40mail.gmail.com




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-20 Thread Thomas Munro
On Mon, Jan 17, 2022 at 10:02 AM Noah Misch  wrote:
> - Report a Debian bug for the sparc64+ext4 zeros problem.

I suspect that 027_stream_regress.pl hits this kernel bug with high
probability[1].  I wonder if the owner of kittiwake and tadarida would
consider setting up an xfs file system?  Or alternatively, since ext4
didn't support concurrent writes until recently, I wonder if there is
an option somewhere to turn the new concurrency stuff off, or failing
that, if we could temporarily downgrade the kernel to an older version
that does inode-level read/write locking.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BeuZ%3Ddc27ZB%3Ds74x0q%3DzU%3D2%3Dvs8%2B6TkJoTUiCPUd2dQA%40mail.gmail.com




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-16 Thread Noah Misch
Cancel that kernel upgrade idea.  I no longer expect it to help...

On Sun, Jan 16, 2022 at 10:19:30PM +1300, Thomas Munro wrote:
> On Sun, Jan 16, 2022 at 8:12 PM Noah Misch  wrote:
> > For specifics of the kernel bug, see the attached test program.  In brief, 
> > the
> > bug arises if one process is write()ing or pwrite()ing a file at about the
> > same time that another process is read()ing or pread()ing the same.  POSIX
> > says the reader should see the data as it existed before the write or the
> > newly-written data.  On this kernel, the reader can see zeros instead.  That
> > leads to the $SUBJECT failure.  PostgreSQL processes write out a given WAL
> > block 20-30 times in ~10ms, and COMMIT PREPARED reads that block.  The 
> > writers
> > aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
> > the kernel bug yield the failure.

The difference between kittiwake and thorntail comes from thorntail using xfs
and kittiwake using ext4.  Running the io-rectitude.c tests on an ext4
partition on thorntail, I see the zeros bug just like I do on kittiwake.  I
don't see the zeros bug on ppc64 or x86_64, just sparc64 so far:

 * ext4, Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009):
 * pwrite/pread is non-atomic if count>16 (no -D switches)
 * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
 *
 * ext4, Linux 4.9.0-13-sparc64-smp (Debian):
 * pwrite/pread is non-atomic if count>4 (no -D switches)
 * write/read is non-atomic if count>4 (-DUSE_SEEK)
 * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
 * pwrite/pread has zeros bug for count>127 (-DCHANGE_CONTENT=0)
 * pwrite/pread w/ O_SYNC has zeros bug (-DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC)
 *far less frequent w/ O_SYNC, but it still happens
 * pwrite/pread w/o REOPEN also has zeros bug for count>127 (-DCHANGE_CONTENT=0 
-DREOPEN=0)
 * write/read has zeros bug for count>127 (-DUSE_SEEK -DCHANGE_CONTENT=0)
 * write/read w/ O_SYNC has zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 
-DOPEN_FLAGS=O_SYNC)
 * write/read w/o REOPEN is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 
-DREOPEN=0)
 *
 * ext4, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
 * [behaviors match the previous kernel exactly]
 *
 * ext4, Linux 5.15.0-2-powerpc64 (Debian bookworm/sid):
 * [atomicity matches previous kernel, but zeros bug does not]
 * pwrite/pread is non-atomic if count>4 (no -D switches)
 * write/read is non-atomic if count>4 (-DUSE_SEEK)
 * write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
 *
 * ext4, Linux 5.15.5-0-virt x86_64 (Alpine):
 * [behaviors match the previous kernel exactly]
 *
 * xfs, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
 * pwrite/pread is atomic (-DXLOG_BLCKSZ=8192000)
 * write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
 * pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
 * write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)

> > We could opt to work around that by writing
> > only the not-already-written portion of a WAL block, but I doubt that's
> > worthwhile unless it happens to be a performance win anyway.

My next steps:

- Report a Debian bug for the sparc64+ext4 zeros problem.

- Try to falsify the idea that "write only the not-already-written portion of
  a WAL block" is an effective workaround.  Specifically, modify the test
  program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k]
  while the reader process reads offset N.  If the reader sees a zero, that
  workaround is ineffective.

- Implement the workaround, if I didn't falsify its effectiveness.  If it
  doesn't hurt performance on x86_64, we can use it unconditionally.
  Otherwise, limit its use to sparc64 Linux.

> > Separately, while I don't know of relevance to PostgreSQL, I was interested 
> > to
> > see that CentOS 7 pwrite()/pread() fail to have the POSIX-required 
> > atomicity.
> 
> FWIW there was some related discussion over here:
> 
> https://www.postgresql.org/message-id/flat/17064-bb0d7904ef72add3%40postgresql.org

That gave me the idea to test different filesystems.  Thanks.  Incidentally, I
find https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic is
mistaken about POSIX requirements.  There's no precedent for POSIX writing
"two threads" when it means "two threads of the same process".  Moreover, the
part about "shall also apply whenever a file descriptor is successfully
closed, however caused (for example [...] process termination)" would be
superfluous in a requirement specific to threads of one process.  Having said
that, if the most-prominent POSIX regular file implementation (ext4 on x86_64)
doesn't implement a POSIX requirement, that has the same 

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-16 Thread Thomas Munro
On Sun, Jan 16, 2022 at 8:12 PM Noah Misch  wrote:
> For specifics of the kernel bug, see the attached test program.  In brief, the
> bug arises if one process is write()ing or pwrite()ing a file at about the
> same time that another process is read()ing or pread()ing the same.  POSIX
> says the reader should see the data as it existed before the write or the
> newly-written data.  On this kernel, the reader can see zeros instead.  That
> leads to the $SUBJECT failure.  PostgreSQL processes write out a given WAL
> block 20-30 times in ~10ms, and COMMIT PREPARED reads that block.  The writers
> aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
> the kernel bug yield the failure.  We could opt to work around that by writing
> only the not-already-written portion of a WAL block, but I doubt that's
> worthwhile unless it happens to be a performance win anyway.
>
> Separately, while I don't know of relevance to PostgreSQL, I was interested to
> see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.

FWIW there was some related discussion over here:

https://www.postgresql.org/message-id/flat/17064-bb0d7904ef72add3%40postgresql.org




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2022-01-15 Thread Noah Misch
On Fri, Nov 19, 2021 at 09:18:23PM -0800, Noah Misch wrote:
> On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote:
> > On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> > > Noah Misch  writes:
> > > > Each of the three failures happened on a sparc64 Debian+gcc machine.  I 
> > > > had
> > > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > > > without reproducing this.
> > 
> > > #   'pgbench: error: client 0 script 1 aborted in command 
> > > 4 query 0: ERROR:  could not read two-phase state from WAL at 0/159EF88: 
> > > unexpected pageaddr 0/0 in log segment 00010001, offset 
> > > 5890048
> > > [1] 
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-17%2013%3A01%3A24
> > 
> > Two others:
> > ERROR:  could not read two-phase state from WAL at 0/16F1850: invalid 
> > record length at 0/16F1850: wanted 24, got 0
> > -- 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-12%2013%3A01%3A15
> > ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect 
> > resource manager data checksum in record at 0/1668020
> > -- 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake=2021-11-16%2015%3A00%3A52

> > I don't have a great theory, but here are candidates to examine next:
> > 
> > - Run with wal_debug=on to confirm logged write location matches read 
> > location.
> > - Run 
> > "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
> >   pg_waldump -s 0/0100" at the end of the test.
> > - Dump WAL page binary image at the point of failure.
> > - Log which branches in XLogReadRecord() are taken.
> 
> Tom Turelinckx, are you able to provide remote access to kittiwake or
> tadarida?  I'd use it to attempt the above things.  All else being equal,
> kittiwake is more relevant since it's still supported upstream.

Thanks for setting up access.  Summary: this kernel has a bug in I/O syscalls.
How practical is it to update that kernel?  (Userland differs across these
animals, but the kernel does not.)  The kernel on buildfarm member thorntail
doesn't exhibit the bug.

For specifics of the kernel bug, see the attached test program.  In brief, the
bug arises if one process is write()ing or pwrite()ing a file at about the
same time that another process is read()ing or pread()ing the same.  POSIX
says the reader should see the data as it existed before the write or the
newly-written data.  On this kernel, the reader can see zeros instead.  That
leads to the $SUBJECT failure.  PostgreSQL processes write out a given WAL
block 20-30 times in ~10ms, and COMMIT PREPARED reads that block.  The writers
aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
the kernel bug yield the failure.  We could opt to work around that by writing
only the not-already-written portion of a WAL block, but I doubt that's
worthwhile unless it happens to be a performance win anyway.

Separately, while I don't know of relevance to PostgreSQL, I was interested to
see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.

> The setup of your buildfarm animals provides a clue.  I understand kittiwake
> is the same as ibisbill except for build options, and tadarida is the same as
> mussurana except for build options.  ibisbill and mussurana haven't failed, so
> one of these is likely needed to reproduce:
> 
>   absence of --enable-cassert
>   CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security '
>   CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2'
>   LDFLAGS='-Wl,-z,relro -Wl,-z,now'

That was a red herring.  ibisbill and mussurana don't use --with-tap-tests.
Adding --with-tap-tests has been enough to make their configurations witness
the same kinds of failures.

nm
/*
 * Stress-test pread(), pwrite(), read(), and write() to detect a few problems
 * with their handling of regular files:
 *
 * - Lack of atomicity.
 *   
https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
 *   requires atomicity.  (An implementation may always return <= 1; if it
 *   chooses to return higher values, it must maintain atomicity.)
 *
 * - Transiently making readers see zeros when they read concurrently with a
 *   write, even if the file had no zero at that offset before or after the
 *   write.
 *
 * By default, this program will print "mismatch" messages if pwrite() is
 * non-atomic.  Build with other options to test other behaviors:
 * -DCHANGE_CONTENT=0 tests the zeros bug instead of plain atomicity
 * -DUSE_SEEK=1 tests write()/read() instead of pwrite()/pread()
 * -DREOPEN=0 reuses the same file descriptor across iterations
 * -DXLOG_BLCKSZ=32 tests a different byte count
 *high values may require "ulimit -Ss" changes
 *
 *
 * Observed behaviors:
 *
 * Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009):
 * pwrite/pread is non-atomic if count>16 (no -D switches)
 * write/read is atomic (-DUSE_SEEK 

Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-19 Thread Tom Lane
I wrote:
> snapper just exhibited the same failure, too:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper=2021-11-18%2016%3A09%3A49

I grepped the buildfarm logs for all recent (last 3 months) occurrences of
'could not read two-phase state'.  Here's the results:

  sysname  |branch |  snapshot   |  stage  |

 l  

---+---+-+-+
 kittiwake | REL9_6_STABLE | 2021-10-24 12:01:10 | pgbenchCheck| #  
 'client 1 aborted in state 3: ERROR:  could not read two-phase 
state from xlog at 0/158F4E0
 kittiwake | REL_13_STABLE | 2021-10-26 12:51:11 | ContribCheck-en_US.utf8 | #  
 'pgbench: error: client 3 script 1 aborted in command 4 query 
0: ERROR:  could not read two-phase state from WAL at 0/168C8D8
 kittiwake | REL_14_STABLE | 2021-11-08 15:42:35 | ContribCheck-en_US.utf8 | #  
 'pgbench: error: client 0 script 0 aborted in command 3 query 
0: ERROR:  could not read two-phase state from WAL at 0/17ABF48
 kittiwake | REL_13_STABLE | 2021-11-16 15:00:52 | ContribCheck-en_US.utf8 | #  
 'pgbench: error: client 3 script 1 aborted in command 4 query 
0: ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect 
resource manager data checksum in record at 0/1668020
 snapper   | REL_14_STABLE | 2021-11-18 16:09:49 | contrib-amcheckCheck| #  
 'pgbench: error: client 3 script 1 aborted in command 4 query 
0: ERROR:  could not read two-phase state from WAL at 0/1770328: unexpected 
pageaddr 0/0 in log segment 00010001, offset 7798784
 tadarida  | REL_11_STABLE | 2021-11-11 13:29:58 | pgbenchCheck| #  
 'client 3 aborted in command 3 (SQL) of script 0; ERROR:  
could not read two-phase state from WAL at 0/1716C68
 tadarida  | REL_10_STABLE | 2021-11-12 13:01:15 | pgbenchCheck| #  
 'client 4 aborted in command 3 of script 0; ERROR:  could not 
read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: 
wanted 24, got 0
 tadarida  | HEAD  | 2021-11-17 13:01:24 | contrib-amcheckCheck| #  
 'pgbench: error: client 0 script 1 aborted in command 4 query 
0: ERROR:  could not read two-phase state from WAL at 0/159EF88: unexpected 
pageaddr 0/0 in log segment 00010001, offset 5890048

So not all are exactly 'unexpected pageaddr 0/0', but they do all
look like we read garbage data.

regards, tom lane




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-19 Thread Tom Lane
Noah Misch  writes:
> Tom Turelinckx, are you able to provide remote access to kittiwake or
> tadarida?  I'd use it to attempt the above things.  All else being equal,
> kittiwake is more relevant since it's still supported upstream.

snapper just exhibited the same failure, too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper=2021-11-18%2016%3A09%3A49

regards, tom lane




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-19 Thread Noah Misch
On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote:
> On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> > Noah Misch  writes:
> > > Each of the three failures happened on a sparc64 Debian+gcc machine.  I 
> > > had
> > > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > > without reproducing this.
> 
> > #   'pgbench: error: client 0 script 1 aborted in command 4 
> > query 0: ERROR:  could not read two-phase state from WAL at 0/159EF88: 
> > unexpected pageaddr 0/0 in log segment 00010001, offset 
> > 5890048
> > [1] 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-17%2013%3A01%3A24
> 
> Two others:
> ERROR:  could not read two-phase state from WAL at 0/16F1850: invalid record 
> length at 0/16F1850: wanted 24, got 0
> -- 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-12%2013%3A01%3A15
> ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect 
> resource manager data checksum in record at 0/1668020
> -- 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake=2021-11-16%2015%3A00%3A52
> 
> > I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
> > that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
> > written out yet.  Have we got any synchronization around that?
> 
> If the WAL address isn't on disk, that error doesn't happen.  Instead,
> read_local_xlog_page() blocks waiting for the WAL to become available.  It's
> still possible that we make the WAL region exist, but it somehow doesn't
> contain the right data until shortly later.  FinishPreparedTransaction() takes
> TwoPhaseStateLock and looks for an entry having gxact->valid.  EndPrepare()
> fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid
> under TwoPhaseStateLock.  All freelist (freeGXacts) interaction holds
> TwoPhaseStateLock.  I'm not seeing a gap in that synchronization.
> 
> I don't have a great theory, but here are candidates to examine next:
> 
> - Run with wal_debug=on to confirm logged write location matches read 
> location.
> - Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
>   pg_waldump -s 0/0100" at the end of the test.
> - Dump WAL page binary image at the point of failure.
> - Log which branches in XLogReadRecord() are taken.

Tom Turelinckx, are you able to provide remote access to kittiwake or
tadarida?  I'd use it to attempt the above things.  All else being equal,
kittiwake is more relevant since it's still supported upstream.

The setup of your buildfarm animals provides a clue.  I understand kittiwake
is the same as ibisbill except for build options, and tadarida is the same as
mussurana except for build options.  ibisbill and mussurana haven't failed, so
one of these is likely needed to reproduce:

  absence of --enable-cassert
  CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security '
  CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2'
  LDFLAGS='-Wl,-z,relro -Wl,-z,now'

However, I can't reproduce this on thorntail, even if I use tadarida's
./configure options, CFLAGS, CPPFLAGS, and LDFLAGS.




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-18 Thread Tom Lane
Andrey Borodin  writes:
> Let's add more tests that check survival of 2PC through crash recovery? We do 
> now only one restart. Maybe it worth to do 4 or 8?

That seems a little premature when we can't explain the failure
we have.  Also, buildfarm cycles aren't free.

regards, tom lane




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-18 Thread Andrey Borodin



> 18 нояб. 2021 г., в 12:05, Noah Misch  написал(а):
> 
> What else might help?

Let's add more tests that check survival of 2PC through crash recovery? We do 
now only one restart. Maybe it worth to do 4 or 8?

Best regards, Andrey Borodin.



Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-17 Thread Noah Misch
On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
> Noah Misch  writes:
> > Each of the three failures happened on a sparc64 Debian+gcc machine.  I had
> > tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> > without reproducing this.

> #   'pgbench: error: client 0 script 1 aborted in command 4 
> query 0: ERROR:  could not read two-phase state from WAL at 0/159EF88: 
> unexpected pageaddr 0/0 in log segment 00010001, offset 
> 5890048
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-17%2013%3A01%3A24

Two others:
ERROR:  could not read two-phase state from WAL at 0/16F1850: invalid record 
length at 0/16F1850: wanted 24, got 0
-- 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-12%2013%3A01%3A15
ERROR:  could not read two-phase state from WAL at 0/1668020: incorrect 
resource manager data checksum in record at 0/1668020
-- 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake=2021-11-16%2015%3A00%3A52

> I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
> that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
> written out yet.  Have we got any synchronization around that?

If the WAL address isn't on disk, that error doesn't happen.  Instead,
read_local_xlog_page() blocks waiting for the WAL to become available.  It's
still possible that we make the WAL region exist, but it somehow doesn't
contain the right data until shortly later.  FinishPreparedTransaction() takes
TwoPhaseStateLock and looks for an entry having gxact->valid.  EndPrepare()
fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid
under TwoPhaseStateLock.  All freelist (freeGXacts) interaction holds
TwoPhaseStateLock.  I'm not seeing a gap in that synchronization.

I don't have a great theory, but here are candidates to examine next:

- Run with wal_debug=on to confirm logged write location matches read location.
- Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
  pg_waldump -s 0/0100" at the end of the test.
- Dump WAL page binary image at the point of failure.
- Log which branches in XLogReadRecord() are taken.

What else might help?




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-17 Thread Tom Lane
Noah Misch  writes:
> Tom Lane reported another instance today:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-11%2013%3A29%3A58

> Each of the three failures happened on a sparc64 Debian+gcc machine.  I had
> tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
> without reproducing this.

>>> As a first step, let's report the actual XLogReadRecord() error message.
>>> Attached.

>> Good catch!  This looks good.

> Pushed.

Well, we didn't have to wait too long [1]:

#   at t/003_cic_2pc.pl line 143.
#   'pgbench: error: client 0 script 1 aborted in command 4 
query 0: ERROR:  could not read two-phase state from WAL at 0/159EF88: 
unexpected pageaddr 0/0 in log segment 00010001, offset 5890048
# pgbench: error: client 2 script 3 aborted in command 2 query 0: ERROR:  
canceling statement due to lock timeout
# pgbench: fatal: Run was aborted; the above results are incomplete.

I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
written out yet.  Have we got any synchronization around that?

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-17%2013%3A01%3A24




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-11 Thread Noah Misch
On Mon, Nov 08, 2021 at 01:42:46PM +0900, Michael Paquier wrote:
> On Sat, Nov 06, 2021 at 06:31:57PM -0700, Noah Misch wrote:
> > On Sun, Oct 24, 2021 at 04:35:02PM -0700, Noah Misch wrote:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake=2021-10-24%2012%3A01%3A10
> > > got an interesting v9.6 failure [...]:
> > > 
> > > 2021-10-24 14:25:29.263 CEST [34569:175] pgbench ERROR:  could not read 
> > > two-phase state from xlog at 0/158F4E0
> > > 2021-10-24 14:25:29.263 CEST [34569:176] pgbench STATEMENT:  COMMIT 
> > > PREPARED 'c1';

Tom Lane reported another instance today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida=2021-11-11%2013%3A29%3A58

Each of the three failures happened on a sparc64 Debian+gcc machine.  I had
tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
without reproducing this.

> > As a first step, let's report the actual XLogReadRecord() error message.
> > Attached.
> 
> Good catch!  This looks good.

Pushed.




Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-08 Thread Michael Paquier
On Mon, Nov 08, 2021 at 01:42:46PM +0900, Michael Paquier wrote:
> Indeed.  Looking closer, I think that we'd better improve
> DecodingContextFindStartpoint(),
> pg_logical_replication_slot_advance(), XLogSendLogical() as well as
> pg_logical_slot_get_changes_guts() to follow a format closer to what
> you have in your patch, with an error message that describes the
> context where the problem has been found, instead of just elog()'ing
> what XLogReadRecord() returns.

FYI, I have just begun a new thread about those ones:
https://www.postgresql.org/message-id/yynth6oyowqca...@paquier.xyz
--
Michael


signature.asc
Description: PGP signature


Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-08 Thread Andrey Borodin



> 7 нояб. 2021 г., в 06:31, Noah Misch  написал(а):
> 
> As a first step, let's report the actual XLogReadRecord() error message.
> Attached.  All the other sites that expect no error already do this.

BTW some time ago I've spotted a good number of related unreported errors [0].


[0] 
https://www.postgresql.org/message-id/A0A36AEE-3476-4326-B877-EE2B55BAEEED%40yandex-team.ru



Re: XLogReadRecord() error in XlogReadTwoPhaseData()

2021-11-07 Thread Michael Paquier
On Sat, Nov 06, 2021 at 06:31:57PM -0700, Noah Misch wrote:
> As a first step, let's report the actual XLogReadRecord() error message.
> Attached.

Good catch!  This looks good.

> All the other sites that expect no error already do this.

Indeed.  Looking closer, I think that we'd better improve
DecodingContextFindStartpoint(),
pg_logical_replication_slot_advance(), XLogSendLogical() as well as
pg_logical_slot_get_changes_guts() to follow a format closer to what
you have in your patch, with an error message that describes the
context where the problem has been found, instead of just elog()'ing
what XLogReadRecord() returns.
--
Michael


signature.asc
Description: PGP signature