On 2022-Oct-01, Andres Freund wrote:
> Perhaps the END{} routine should call $node->_update_pid(-1); if $exit_code !=
> 0 and _pid is undefined?
Yeah, that sounds reasonable.
> That does seem to reduce the incidence of "leftover" postgres
> instances. 001_start_stop.pl leaves some behind, but that makes sense, because
> it's bypassing the whole node management. But I still occasionally see some
> remaining processes if I crank up test concurrency.
>
> Ah! At least part of the problem is that sub stop() does BAIL_OUT, and of
> course it can fail as part of the shutdown.
I made teardown_node pass down fail_ok=>1 to avoid this problem, so we
no longer BAIL_OUT in that case.
> But there's still some that survive, where your perl.trace doesn't contain the
> node getting shut down...
Yeah, something's still unexplained. I'll get this pushed soon, which
already reduces the number of leftover instances a good amount.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
>From b43b3071abfd7bb83f7ec5942095aa031c3ed038 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <[email protected]>
Date: Tue, 18 Oct 2022 12:11:46 +0200
Subject: [PATCH v2] Better handle interrupting TAP tests
Set up a signal handler for INT/TERM so that we run our END block if we
get them. In END, if the exit status indicates a problem, call
_update_pid(-1) to improve chances of the stop working in case start()
hasn't returned yet.
Also, change END's teardown_node() so that it passes fail_ok=>1, so that
if a node fails to stop, we still stop the other nodes in the same test.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index d7c13318b0..17856d69c8 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1545,7 +1545,13 @@ END
foreach my $node (@all_nodes)
{
- $node->teardown_node;
+ # During unclean termination (which could be a signal or some
+ # other failure), we're not sure that the status of our nodes
+ # has been correctly set up already, so try and detect their
+ # status so that we can shut them down properly.
+ $node->_update_pid(-1) if $exit_code != 0;
+
+ $node->teardown_node(fail_ok => 1);
# skip clean if we are requested to retain the basedir
next if defined $ENV{'PG_TEST_NOCLEAN'};
@@ -1564,13 +1570,15 @@ END
Do an immediate stop of the node
+Any optional extra parameter is passed to ->stop.
+
=cut
sub teardown_node
{
- my $self = shift;
+ my ($self, %params) = @_;
- $self->stop('immediate');
+ $self->stop('immediate', %params);
return;
}
@@ -2922,6 +2930,13 @@ sub corrupt_page_checksum
return;
}
+#
+# Signal handlers
+#
+$SIG{TERM} = $SIG{INT} = sub {
+ die "death by signal";
+};
+
=pod
=back
--
2.30.2