On 12/02/2022 05:42, Andres Freund wrote:
On 2022-02-11 16:41:24 -0800, Andres Freund wrote:
FWIW, I've indeed reproduced this fairly easily with such a setup. A pgbench
r/w workload that's been modified to start 70 savepoints at the start shows

pgbench: error: client 22 script 0 aborted in command 12 query 0: ERROR:  t_xmin 3853739 
is uncommitted in tuple (2,159) to be updated in table "pgbench_branches"
pgbench: error: client 13 script 0 aborted in command 12 query 0: ERROR:  t_xmin 3954305 
is uncommitted in tuple (2,58) to be updated in table "pgbench_branches"
pgbench: error: client 7 script 0 aborted in command 12 query 0: ERROR:  t_xmin 4017908 
is uncommitted in tuple (3,44) to be updated in table "pgbench_branches"

after a few minutes of running with a local, not slowed down, syncrep. Without
any other artifical slowdowns or such.

And this can easily be triggered even without subtransactions, in a completely
reliable way.

The only reason I'm so far not succeeding in turning it into an
isolationtester spec is that a transaction waiting for SyncRep doesn't count
as waiting for isolationtester.

Basically

S1: BEGIN; $xid = txid_current(); UPDATE; COMMIT; <commit wait for syncrep>
S2: SELECT pg_xact_status($xid);
S2: UPDATE;

suffices, because the pg_xact_status() causes an xlog fetch, priming the xid
cache, which then causes the TransactionIdIsInProgress() to take the early
return path, despite the transaction still being in progress. Which then
allows the update to proceed, despite the S1 not having "properly committed"
yet.

I started to improve isolationtester, to allow the spec file to specify a custom query to check for whether the backend is blocked. But then I realized that it's not quite enough: there is currently no way write the "$xid = txid_current()" step either. Isolationtester doesn't have variables like that. Also, you need to set synchronous_standby_names to make it block, which seems a bit weird in an isolation test.

I wish we had an explicit way to inject delays or failures. We discussed it before (https://www.postgresql.org/message-id/flat/CANXE4TdxdESX1jKw48xet-5GvBFVSq%3D4cgNeioTQff372KO45A%40mail.gmail.com), but I don't want to pick up that task right now.

Anyway, I wrote a TAP test for this instead. See attached. I'm not sure if this is worth committing, the pg_xact_status() trick is quite special for this particular bug.

Simon's just_remove_TransactionIdIsKnownCompleted_call.v1.patch looks good to me. Replying to the discussion on that:

On 12/02/2022 23:50, Andres Freund wrote:
On 2022-02-12 13:25:58 +0000, Simon Riggs wrote:
I'm not sure it is possible to remove TransactionIdIsKnownCompleted()
in backbranches.

I think it'd be fine if we needed to. Or we could just make it always return
false or such.


just removes the known offending call, passes make check, but IMHO
leaves the same error just as likely by other callers.

Which other callers are you referring to?

I don't know of any, but we can't just remove a function in a
backbranch, can we?

We have in the past, if it's a "sufficiently internal" function.

I think we should remove it in v15, and leave it as it is in backbranches. Just add a comment to point out that the name is a bit misleading, because it checks the clog rather than the proc array. It's not inherently dangerous, and someone might have a legit use case for it. True, someone might also be doing this incorrect thing with it, but still seems like the right balance to me.

I think we also need to change pg_xact_status(), to also call TransactionIdIsInProgress() before TransactionIdDidCommit/Abort(). Currently, if it returns "committed", and you start a new transaction, the new transaction might not yet see the effects of that "committed" transaction. With that, you cannot reproduce the original bug with pg_xact_status() though, so there's no point in committing that test then.

In summary, I think we should:
- commit and backpatch Simon's just_remove_TransactionIdIsKnownCompleted_call.v1.patch
- fix pg_xact_status() to check TransactionIdIsInProgress()
- in v15, remove TransationIdIsKnownCompleted function altogether

I'll try to get around to that in the next few days, unless someone beats me to it.

- Heikki
diff --git a/src/test/modules/test_misc/t/004_bug_xact_inprogress.pl b/src/test/modules/test_misc/t/004_bug_xact_inprogress.pl
new file mode 100644
index 0000000000..8a651c0c60
--- /dev/null
+++ b/src/test/modules/test_misc/t/004_bug_xact_inprogress.pl
@@ -0,0 +1,86 @@
+# Test for visibility checks, when an updating transaction has already
+# written commit record but has not yet removed its entry from the
+# proc array, and thus isn't yet visible to other transactions.
+#
+# In particular, this reproduced an old bug in that, see:
+# https://www.postgresql.org/message-id/flat/4da7913d-398c-e2ad-d777-f752cf7f0bbb%40garret.ru
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init(allows_streaming => 1);
+$node->start;
+
+# Create test table
+$node->safe_psql('postgres', 'CREATE TABLE xact_test (t text)');
+$node->safe_psql('postgres', "INSERT INTO xact_test VALUES ('');");
+
+# Launch a psql session in the background that UPDATEs the row and
+# commits, but the COMMIT blocks waiting for synchronous replication.
+#
+# We don't have a replica configured, so it will hang forever.
+$node->safe_psql('postgres',
+				 "ALTER SYSTEM SET synchronous_standby_names = 'nonexistent';");
+$node->reload;
+
+my $in  = '';
+my $out = '';
+my $timeout = IPC::Run::timeout($PostgreSQL::Test::Utils::timeout_default);
+
+my $h = $node->background_psql('postgres', \$in, \$out, $timeout);
+
+# Read current xid, we'll need it later
+$out = '';
+$in = "BEGIN;\n";
+$in .= "show synchronous_standby_names;\n";
+$in .= "SELECT pg_backend_pid(), pg_current_xact_id();\n";
+pump_until($h, $timeout, \$out, qr/([[:digit:]]+)\|([[:digit:]]+)[\r\n]$/m);
+
+$out =~ /([[:digit:]]+)\|([[:digit:]]+)[\r\n]/;
+my ($pid, $xid) = ($1, $2);
+
+# Update and commit. The commit will block, waiting for the synchronous replica (which doesn't
+# exist).
+$in = "UPDATE xact_test SET t = 'first';\n";
+$in .= "COMMIT;\n";
+$h->pump_nb;
+
+# Make sure the COMMIT has progressed to the synchronous replication wait.
+my $wait_query = "SELECT true FROM pg_stat_activity WHERE pid = $pid AND wait_event = 'SyncRep'";
+$node->poll_query_until('postgres', $wait_query)
+  or die "Timed out while waiting for commit to reach SyncRep state";
+
+# Try to UPDATE the same row from another session.
+#
+# This is expected to block, waiting for the first updating transaction to finish.
+my ($ret, $stdout, $stderr) = $node->psql(
+	'postgres',
+	qq{
+\\set ECHO all
+SET lock_timeout=1;
+
+-- Prime the single-element CLOG lookup cache. (We used to have a bug where the
+-- single-element cache was used incorrectly in the visibility check)
+SELECT pg_xact_status(xid8 '$xid');
+
+-- This is expected to fail due to the lock timeout
+BEGIN;
+UPDATE xact_test SET t = t || ' second';
+
+-- Not reached, because the UPDATE should fail
+SELECT * from xact_test;
+ROLLBACK;
+});
+print "#### Begin standard output\n";
+print $stdout;
+print "\n#### End standard output\n";
+like($stderr, qr/canceling statement due to lock timeout/, "UPDATE fails due to lock timeout");
+
+$node->stop;
+$h->finish;
+
+done_testing();

Reply via email to