Re: Refactoring postmaster's code to cleanup after child exit

Andres Freund Fri, 07 Mar 2025 07:03:00 -0800

Hi,

On 2025-03-06 22:49:20 +0200, Heikki Linnakangas wrote:
> In short, all the 4 patches look good to me. Thanks for picking this up!
> 
> On 06/03/2025 22:16, Andres Freund wrote:
> > On 2025-03-05 20:49:33 -0800, Noah Misch wrote:
> > > > This behaviour makes it really hard to debug problems. It'd have been a 
> > > > lot
> > > > easier to understand the problem if we'd seen psql's stderr before the 
> > > > test
> > > > died.
> > > > 
> > > > I guess that mean at the very least we'd need to put an eval {} around 
> > > > the
> > > > ->pump() call., print $self->{stdout}, ->{stderr} and reraise an error?
> > > 
> > > That sounds right.
> > 
> > In the attached patch I did that for wait_connect().  I did verify that it
> > works by implementing the wait_connect() fix before fixing
> > 002_connection_limits.pl, which fails if a sleep(1) is added just before the
> > proc_exit(1) for FATAL.
> 
> +1. For the archives sake, I just want to clarify that this pump stuff is
> all about getting better error messages on a test failure. It doesn't help
> with the original issue.

Agreed.

> This is all annoyingly complicated, but getting good error messages is worth
> it.

Yea. I really look forward to having a way to write stuff like this that
doesn't involve hackily driving psql from 100m away using rubber bands.

> > On 2025-03-05 08:23:32 +0900, Michael Paquier wrote:>> Why not adding an
> > injection point with a WARNING or a LOG generated,
> then
> > > check the server logs for the code path taken based on the elog() 
> > > generated
> > > with the point name?
> > 
> > I think the log_min_messages approach is a lot simpler. If we need something
> > like this more widely we can reconsider injection points...
> 
> +1. It's a little annoying to depend on a detail like the "client backend
> process exited" debug message, but seems like the best fix for now.

We use the same message for LOG messages too, for other types of backends, so
I think it's not that likely to change.  But stilll not great.

> While we're at it, attached are a few more cleanups I noticed.

I assume you'll apply that yourself?

Commits with updated commit messages attached.

I wonder if we should apply the polishing of connect_ok()/connect_fails() and
the wait_connect() debuggability improvements to the backbranches? Keeping TAP
infrastructure as similar as possible between branches has proven worthwhile
IME.

Greetings,

Andres Freund

>From d6dbf4c4a1e723a27df8a08b7e75352b8fb29d05 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 7 Mar 2025 09:44:00 -0500
Subject: [PATCH v2 1/4] tests: Improve test names in
 connect_fails()/connect_ok()

connect_fails() didn't mention that stderr matched, whereas connect_ok() did.

Neither connect_fails() nor connect_ok() mentioned what they were checking
when checking psql's return status.

Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
Discussion: https://postgr.es/m/ggflhkciwdyotpoie323chu2c2idpjk5qimrn462encwx2io7s@thmcxl7i6dpw
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index b105cba05a6..883532e1cd3 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2554,7 +2554,7 @@ sub connect_ok
 		connstr => "$connstr",
 		on_error_stop => 0);
 
-	is($ret, 0, $test_name);
+	is($ret, 0, "$test_name: connect succeeds, as expected");
 
 	if (defined($params{expected_stdout}))
 	{
@@ -2619,11 +2619,11 @@ sub connect_fails
 		extra_params => ['-w'],
 		connstr => "$connstr");
 
-	isnt($ret, 0, $test_name);
+	isnt($ret, 0, "$test_name: connect fails, as expected");
 
 	if (defined($params{expected_stderr}))
 	{
-		like($stderr, $params{expected_stderr}, "$test_name: matches");
+		like($stderr, $params{expected_stderr}, "$test_name: stderr matches");
 	}
 
 	$self->log_check($test_name, $log_location, %params);
-- 
2.48.1.76.g4e746b1a31.dirty

>From e514fe32c2566c524f1f18410266a1e2efdc7644 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 7 Mar 2025 09:44:00 -0500
Subject: [PATCH v2 2/4] tests: Add note if BackgroundPsql::wait_connect()
 fails

If wait_connect() failed due to psql exiting, all that we'd see is a "process
ended prematurely" error thrown by IPC::Run, without ever seeing psql's error
message.

Address that by wrapping the pump() call in eval and taking note of stdout &
stderr in case of failure.

We might want to do that in pump_until() as well, but that seems to require
API changes, so let's do the easily achievable bit first.

Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
Discussion: https://postgr.es/m/ggflhkciwdyotpoie323chu2c2idpjk5qimrn462encwx2io7s@thmcxl7i6dpw
---
 .../perl/PostgreSQL/Test/BackgroundPsql.pm    | 26 ++++++++++++++++---
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index c611a61cf4e..1deb410c133 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -154,10 +154,28 @@ sub wait_connect
 	my $banner = "background_psql: ready";
 	my $banner_match = qr/(^|\n)$banner\r?\n/;
 	$self->{stdin} .= "\\echo $banner\n\\warn $banner\n";
-	$self->{run}->pump()
-	  until ($self->{stdout} =~ /$banner_match/
-		  && $self->{stderr} =~ /$banner\r?\n/)
-	  || $self->{timeout}->is_expired;
+
+	# IPC::Run throws in case psql exits while we're pumping. To make it
+	# easier to diagnose that, catch the error, report stdout/stderr at time
+	# of death and reraise.
+	eval {
+		$self->{run}->pump()
+		  until ($self->{stdout} =~ /$banner_match/
+			  && $self->{stderr} =~ /$banner\r?\n/)
+		  || $self->{timeout}->is_expired;
+	};
+	if ($@)
+	{
+		chomp(my $stdout = $self->{stdout});
+		chomp(my $stderr = $self->{stderr});
+		chomp(my $err = $@);
+		diag qq(psql died while connecting:
+  stdout: $stdout
+  stderr: $stderr
+  perl error: $err
+);
+		die "psql died while connecting";
+	}
 
 	note "connect output:\n",
 	  explain {
-- 
2.48.1.76.g4e746b1a31.dirty

>From 6bd317053557030d7d5b1818b1254aebf2230f08 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 7 Mar 2025 09:44:00 -0500
Subject: [PATCH v2 3/4] tests: Fix race condition in
 postmaster/002_connection_limits

The test occasionally failed due to unexpected connection limit errors being
encountered after having waited for FATAL errors on another connection. These
spurious failures were caused by the the backend reporting FATAL errors to the
client before detaching from the PGPROC entry. Adding a sleep(1) before
proc_exit() makes it easy to reproduce that problem.

To fix the issue, add a helper function that waits for postmaster to notice
the process having exited. For now this is implemented by waiting for the
DEBUG2 message that postmaster logs in that case. That's not the prettiest
fix, but simple. If we notice this problem elsewhere, it might be worthwhile
to make this more general, e.g. by adding an injection point.

Reported-by: Tomas Vondra <to...@vondra.me>
Diagnosed-by: Heikki Linnakangas <hlinn...@iki.fi>
Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
Discussion: https://postgr.es/m/ggflhkciwdyotpoie323chu2c2idpjk5qimrn462encwx2io7s@thmcxl7i6dpw
---
 .../postmaster/t/002_connection_limits.pl     | 35 +++++++++++++++++--
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/src/test/postmaster/t/002_connection_limits.pl b/src/test/postmaster/t/002_connection_limits.pl
index 8cfa6e0ced5..2c185eef6eb 100644
--- a/src/test/postmaster/t/002_connection_limits.pl
+++ b/src/test/postmaster/t/002_connection_limits.pl
@@ -20,6 +20,7 @@ $node->append_conf('postgresql.conf', "max_connections = 6");
 $node->append_conf('postgresql.conf', "reserved_connections = 2");
 $node->append_conf('postgresql.conf', "superuser_reserved_connections = 1");
 $node->append_conf('postgresql.conf', "log_connections = on");
+$node->append_conf('postgresql.conf', "log_min_messages=debug2");
 $node->start;
 
 $node->safe_psql(
@@ -45,13 +46,39 @@ sub background_psql_as_user
 		extra_params => [ '-U', $user ]);
 }
 
+# Like connect_fails(), except that we also wait for the failed backend to
+# have exited.
+#
+# This tests needs to wait for client processes to exit because the error
+# message for a failed connection is reported before the backend has detached
+# from shared memory. If we didn't wait, subsequent tests might hit connection
+# limits spuriously.
+#
+# This can't easily be generalized, as detecting process exit requires
+# log_min_messages to be at least DEBUG2 and is not concurrency safe, as we
+# can't easily be sure the right process exited. In this test that's not a
+# problem though, we only have one new connection at a time.
+sub connect_fails_wait
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($node, $connstr, $test_name, %params) = @_;
+
+	my $log_location = -s $node->logfile;
+
+	$node->connect_fails($connstr, $test_name, %params);
+	$node->wait_for_log(qr/DEBUG:  client backend.*exited with exit code 1/,
+		$log_location);
+	ok(1, "$test_name: client backend process exited");
+}
+
 my @sessions = ();
 my @raw_connections = ();
 
 push(@sessions, background_psql_as_user('regress_regular'));
 push(@sessions, background_psql_as_user('regress_regular'));
 push(@sessions, background_psql_as_user('regress_regular'));
-$node->connect_fails(
+connect_fails_wait(
+	$node,
 	"dbname=postgres user=regress_regular",
 	"reserved_connections limit",
 	expected_stderr =>
@@ -60,7 +87,8 @@ $node->connect_fails(
 
 push(@sessions, background_psql_as_user('regress_reserved'));
 push(@sessions, background_psql_as_user('regress_reserved'));
-$node->connect_fails(
+connect_fails_wait(
+	$node,
 	"dbname=postgres user=regress_regular",
 	"reserved_connections limit",
 	expected_stderr =>
@@ -68,7 +96,8 @@ $node->connect_fails(
 );
 
 push(@sessions, background_psql_as_user('regress_superuser'));
-$node->connect_fails(
+connect_fails_wait(
+	$node,
 	"dbname=postgres user=regress_superuser",
 	"superuser_reserved_connections limit",
 	expected_stderr => qr/FATAL:  sorry, too many clients already/);
-- 
2.48.1.76.g4e746b1a31.dirty

>From f8c5ee4adae736b275185c837a0971a5b2dbbc40 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 7 Mar 2025 09:44:00 -0500
Subject: [PATCH v2 4/4] tests: Don't fail due to high default timeout in
 postmaster/003_start_stop

Some BF animals use very high timeouts due to their slowness. Unfortunately
postmaster/003_start_stop fails if a high timeout is configured, due to
authentication_timeout having a fairly low max.

As this test is reasonably fast, the easiest fix seems to be to cap the
timeout to 600.

Per buildfarm animal skink.

Reviewed-by: Heikki Linnakangas <hlinn...@iki.fi>
Discussion: https://postgr.es/m/ggflhkciwdyotpoie323chu2c2idpjk5qimrn462encwx2io7s@thmcxl7i6dpw
---
 src/test/postmaster/t/003_start_stop.pl | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/test/postmaster/t/003_start_stop.pl b/src/test/postmaster/t/003_start_stop.pl
index 036b296f72b..4dc394139d9 100644
--- a/src/test/postmaster/t/003_start_stop.pl
+++ b/src/test/postmaster/t/003_start_stop.pl
@@ -20,6 +20,10 @@ use Test::More;
 # "pg_ctl stop" will error out before the authentication timeout kicks
 # in and cleans up the dead-end backends.
 my $authentication_timeout = $PostgreSQL::Test::Utils::timeout_default;
+
+# Don't fail due to hitting the max value allowed for authentication_timeout.
+$authentication_timeout = 600 unless $authentication_timeout < 600;
+
 my $stop_timeout = $authentication_timeout / 2;
 
 # Initialize the server with low connection limits, to test dead-end backends
-- 
2.48.1.76.g4e746b1a31.dirty

Re: Refactoring postmaster's code to cleanup after child exit

Reply via email to