Re: backup manifests and contemporaneous buildfarm failures

2020-04-08 Thread Tom Lane
Andrew Dunstan  writes:
> On 4/8/20 3:41 PM, Robert Haas wrote:
>> I don't understand what the local $ENV{MSYS2_ARG_CONV_EXCL} =
>> $source_ts_prefix does, 

> You don't want to know 
> See  for the
> gory details.

I don't want to know either, but maybe that reference should be cited
somewhere near where we use this sort of hack.

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-08 Thread Andrew Dunstan


On 4/8/20 3:41 PM, Robert Haas wrote:
> On Wed, Apr 8, 2020 at 1:59 PM Tom Lane  wrote:
>> I guess we could commit it and find out.  I'm all for the simpler
>> coding if it works.
> I don't understand what the local $ENV{MSYS2_ARG_CONV_EXCL} =
> $source_ts_prefix does, 


You don't want to know 


See  for the
gory details.


It's the tablespace map parameter that is upsetting it.
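
For the archives, a minimal sketch of the workaround, assuming the
$source_ts_prefix computed in the 003_corruption.pl patch further down
the thread. MSYS2_ARG_CONV_EXCL names argument prefixes that MSYS2 must
pass through verbatim instead of rewriting them as Windows paths:

    # Scope the exclusion to the enclosing block so other commands
    # still get normal MSYS2 path conversion.
    local $ENV{MSYS2_ARG_CONV_EXCL} = $source_ts_prefix;

    # The -T old=new tablespace mapping now reaches pg_basebackup
    # unmangled.
    command_ok(['pg_basebackup', '-D', $backup_path, '--no-sync',
                '-T', "${source_ts_path}=${backup_ts_path}"],
               "base backup ok");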



> but the remove/unlink condition was suggested
> by Amit Kapila on the basis of testing on his Windows development
> environment, so I suspect that's actually needed on at least some
> systems. I just work here, though.
>

Yeah, drongo doesn't like it, so we'll have to tweak the logic.


I'll update after some more testing.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: backup manifests and contemporaneous buildfarm failures

2020-04-08 Thread Robert Haas
On Wed, Apr 8, 2020 at 1:59 PM Tom Lane  wrote:
> I guess we could commit it and find out.  I'm all for the simpler
> coding if it works.

I don't understand what the local $ENV{MSYS2_ARG_CONV_EXCL} =
$source_ts_prefix does, but the remove/unlink condition was suggested
by Amit Kapila on the basis of testing on his Windows development
environment, so I suspect that's actually needed on at least some
systems. I just work here, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-08 Thread Tom Lane
Andrew Dunstan  writes:
> OK, tricky, but here's what I did to get this working on fairywren.
> First, on Msys2 there is a problem with name mangling. We've had to fix
> this before by telling it to ignore certain argument prefixes.
> Second, once that was fixed rmdir was failing on the tablespace. On
> Windows this is a junction, so unlink is the correct thing to do, I
> believe, just as it is on Unix where it's a symlink.

Hmm, no opinion about the name mangling business, but the other part
seems like it might break jacana and/or bowerbird, which are currently
happy with this test?  (AFAICS we only have four Windows animals
running the TAP tests, and the fourth (drongo) hasn't reported in
for awhile.)

I guess we could commit it and find out.  I'm all for the simpler
coding if it works.

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-08 Thread Andrew Dunstan

On 4/7/20 9:42 AM, Andrew Dunstan wrote:
> On Tue, Apr 7, 2020 at 12:37 AM Tom Lane  wrote:
>> Robert Haas  writes:
>>> Taking stock of the situation this morning, most of the buildfarm is
>>> now green. There are three failures, on eelpout (6 hours ago),
>>> fairywren (17 hours ago), and hyrax (3 days, 7 hours ago).
>> fairywren has now done this twice in the pg_validatebackupCheck step:
>>
>> exec failed: Bad address at 
>> /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm 
>> line 340.
>>  at 
>> /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm 
>> line 340.
>>
>> I'm a tad suspicious that it needs another perl2host()
>> somewhere, but the log isn't very clear as to where.
>>
>> More generally, I wonder if we ought to be trying to
>> centralize those perl2host() calls instead of sticking
>> them into individual test cases.
>>
>>
>
> Not sure about that. I'll see if I can run it by hand and get some
> more info. What's quite odd is that jacana (a very similar setup) is
> passing this happily.
>


OK, tricky, but here's what I did to get this working on fairywren.


First, on Msys2 there is a problem with name mangling. We've had to fix
this before by telling it to ignore certain argument prefixes.


Second, once that was fixed rmdir was failing on the tablespace. On
Windows this is a junction, so unlink is the correct thing to do, I
believe, just as it is on Unix where it's a symlink.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/bin/pg_validatebackup/t/003_corruption.pl b/src/bin/pg_validatebackup/t/003_corruption.pl
index 7a09d02e6c..fe717dfc73 100644
--- a/src/bin/pg_validatebackup/t/003_corruption.pl
+++ b/src/bin/pg_validatebackup/t/003_corruption.pl
@@ -16,6 +16,8 @@ $master->start;
 # Include a user-defined tablespace in the hopes of detecting problems in that
 # area.
 my $source_ts_path = TestLib::perl2host(TestLib::tempdir_short());
+my $source_ts_prefix = $source_ts_path;
+$source_ts_prefix =~ s!([^A-Z]:/[^/]*)/.*!$1!;
 $master->safe_psql('postgres', command_ok(['pg_basebackup', '-D', $backup_path, '--no-sync',
 			'-T', "${source_ts_path}=${backup_ts_path}"],
 			"base backup ok");
@@ -177,14 +180,7 @@ sub mutilate_missing_tablespace
 	my ($tsoid) = grep { $_ ne '.' && $_ ne '..' }
 		 slurp_dir("$backup_path/pg_tblspc");
 	my $pathname = "$backup_path/pg_tblspc/$tsoid";
-	if ($windows_os)
-	{
-		rmdir($pathname) || die "$pathname: $!";
-	}
-	else
-	{
-		unlink($pathname) || die "$pathname: $!";
-	}
+	unlink($pathname) || die "$pathname: $!";
 	return;
 }
 


Re: backup manifests and contemporaneous buildfarm failures

2020-04-07 Thread Andrew Dunstan
On Tue, Apr 7, 2020 at 12:37 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > Taking stock of the situation this morning, most of the buildfarm is
> > now green. There are three failures, on eelpout (6 hours ago),
> > fairywren (17 hours ago), and hyrax (3 days, 7 hours ago).
>
> fairywren has now done this twice in the pg_validatebackupCheck step:
>
> exec failed: Bad address at 
> /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm 
> line 340.
>  at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm 
> line 340.
>
> I'm a tad suspicious that it needs another perl2host()
> somewhere, but the log isn't very clear as to where.
>
> More generally, I wonder if we ought to be trying to
> centralize those perl2host() calls instead of sticking
> them into individual test cases.
>
>


Not sure about that. I'll see if I can run it by hand and get some
more info. What's quite odd is that jacana (a very similar setup) is
passing this happily.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: backup manifests and contemporaneous buildfarm failures

2020-04-07 Thread Andrew Dunstan
On Mon, Apr 6, 2020 at 1:18 AM Fabien COELHO  wrote:
>
>
> Hello,
>
> >> Do I need to precede those with some recursive chmod commands? Perhaps
> >> the client should refuse to run if there is still something left after
> >> these.
> >
> > I think the latter would be a very good idea, just so that this sort of
> > failure is less obscure.  Not sure about whether a recursive chmod is
> > really going to be worth the cycles.  (On the other hand, the normal
> > case should be that there's nothing there anyway, so maybe it's not
> > going to be costly.)
>
> Could it be a two-stage process to minimize cost but still be resilient?
>
>    rmtree
>    if (-d $DIR) {
>      emit warning
>      chmodtree
>      rmtree again
>      if (-d $DIR)
>        emit error
>    }
>


I thought about doing that. However, it's not really necessary. In the
normal course of events these directories should have been removed at
the end of the previous run, so we're only dealing with exceptional
cases here.

cheers

andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: backup manifests and contemporaneous buildfarm failures

2020-04-06 Thread Tom Lane
Robert Haas  writes:
> Taking stock of the situation this morning, most of the buildfarm is
> now green. There are three failures, on eelpout (6 hours ago),
> fairywren (17 hours ago), and hyrax (3 days, 7 hours ago).

fairywren has now done this twice in the pg_validatebackupCheck step:

exec failed: Bad address at 
/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 
340.
 at /home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/perl/TestLib.pm 
line 340.

I'm a tad suspicious that it needs another perl2host()
somewhere, but the log isn't very clear as to where.

More generally, I wonder if we ought to be trying to
centralize those perl2host() calls instead of sticking
them into individual test cases.
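
(As a hedged illustration only: centralizing could mean a TestLib
wrapper that applies the conversion itself, so individual tests never
call perl2host() directly. The helper name below is hypothetical, not
an existing TestLib function.)

    # Hypothetical helper: a short-lived temp directory whose path is
    # already converted to host (Windows) form.
    sub tempdir_short_host
    {
        return TestLib::perl2host(TestLib::tempdir_short());
    }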

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-06 Thread Andrew Dunstan


On 4/6/20 7:53 AM, Robert Haas wrote:
> On Sun, Apr 5, 2020 at 4:07 PM Andrew Dunstan
>  wrote:
>> Do I need to precede those with some recursive chmod commands?
> +1.
>
>> Perhaps
>> the client should refuse to run if there is still something left after
>> these.
> +1 to that, too.
>


See
https://github.com/PGBuildFarm/client-code/commit/0ef76bb1e2629713898631b9a3380d02d41c60ad


This will be in the next release, probably fairly soon.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: backup manifests and contemporaneous buildfarm failures

2020-04-06 Thread Robert Haas
On Sun, Apr 5, 2020 at 4:07 PM Andrew Dunstan
 wrote:
> Do I need to precede those with some recursive chmod commands?

+1.

> Perhaps
> the client should refuse to run if there is still something left after
> these.

+1 to that, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-05 Thread Fabien COELHO



Hello,


Do I need to precede those with some recursive chmod commands? Perhaps
the client should refuse to run if there is still something left after
these.


I think the latter would be a very good idea, just so that this sort of
failure is less obscure.  Not sure about whether a recursive chmod is
really going to be worth the cycles.  (On the other hand, the normal
case should be that there's nothing there anyway, so maybe it's not
going to be costly.)


Could it be a two-stage process to minimize cost but still be resilient?

   rmtree
   if (-d $DIR) {
     emit warning
     chmodtree
     rmtree again
     if (-d $DIR)
       emit error
   }
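
A runnable sketch of the same idea, assuming only core Perl's
File::Path and File::Find ($dir is the tree to remove):

   use File::Path qw(rmtree);
   use File::Find qw(find);

   sub robust_rmtree
   {
       my ($dir) = @_;
       rmtree($dir);
       if (-d $dir)
       {
           warn "rmtree left '$dir' behind; retrying after chmod\n";
           # Top-down walk: each directory is chmod'd before find()
           # descends into it, so mode-0 subdirectories open up.
           find(sub { chmod 0700, $_ if -d $_ }, $dir);
           rmtree($dir);
           die "could not remove '$dir'\n" if -d $dir;
       }
   }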

--
Fabien.




Re: backup manifests and contemporaneous buildfarm failures

2020-04-05 Thread Tom Lane
Andrew Dunstan  writes:
> Hmm, the buildfarm client does this at the beginning of each run to
> remove anything that might be left over from a previous run:

> rmtree("inst");
> rmtree("$pgsql") unless ($from_source && !$use_vpath);

Right, the point is precisely that some versions of rmtree() fail
to remove a mode-0 subdirectory.

> Do I need to precede those with some recursive chmod commands? Perhaps
> the client should refuse to run if there is still something left after
> these.

I think the latter would be a very good idea, just so that this sort of
failure is less obscure.  Not sure about whether a recursive chmod is
really going to be worth the cycles.  (On the other hand, the normal
case should be that there's nothing there anyway, so maybe it's not
going to be costly.)  

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-05 Thread Andrew Dunstan


On 4/5/20 9:10 AM, Mikael Kjellström wrote:
> On 2020-04-04 04:43, Robert Haas wrote:
>
>> I think I've done about as much as I can do for tonight, though. Most
>> things are green now, and the ones that aren't are failing because of
>> stuff that is at least plausibly fixed. By morning it should be
>> clearer how much broken stuff is left, although that will be somewhat
>> complicated by at least sidewinder and seawasp needing manual
>> intervention to get back on track.
>
> I fixed sidewinder I think.  Should clear up the next time it runs.
>
> It was the mode on the directory it couldn't handle.  A regular rm -rf
> didn't work; I had to do a chmod -R 700 on all directories to be able
> to remove it manually.
>
>


Hmm, the buildfarm client does this at the beginning of each run to
remove anything that might be left over from a previous run:


rmtree("inst");
rmtree("$pgsql") unless ($from_source && !$use_vpath);


Do I need to precede those with some recursive chmod commands? Perhaps
the client should refuse to run if there is still something left after
these.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: backup manifests and contemporaneous buildfarm failures

2020-04-05 Thread Mikael Kjellström

On 2020-04-04 04:43, Robert Haas wrote:


I think I've done about as much as I can do for tonight, though. Most
things are green now, and the ones that aren't are failing because of
stuff that is at least plausibly fixed. By morning it should be
clearer how much broken stuff is left, although that will be somewhat
complicated by at least sidewinder and seawasp needing manual
intervention to get back on track.


I fixed sidewinder I think.  Should clear up the next time it runs.

It was the mode on the directory it couldn't handle.  A regular rm -rf 
didn't work; I had to do a chmod -R 700 on all directories to be able to 
remove it manually.


/Mikael




Re: backup manifests and contemporaneous buildfarm failures

2020-04-04 Thread Thomas Munro
On Sun, Apr 5, 2020 at 2:36 AM Robert Haas  wrote:
> eelpout is unhappy because:
>
> +WARNING:  could not remove shared memory segment
> "/PostgreSQL.248989127": No such file or directory
> +WARNING:  could not remove shared memory segment
> "/PostgreSQL.1450751626": No such file or directory

Seems to have fixed itself while I was sleeping. I did happen to run
apt-get upgrade on that box some time yesterday-ish, but I don't
understand what mechanism would trash my /dev/shm in that process.
/me eyes systemd with suspicion




Re: backup manifests and contemporaneous buildfarm failures

2020-04-04 Thread Tom Lane
Robert Haas  writes:
> On Sat, Apr 4, 2020 at 10:57 AM Tom Lane  wrote:
>> What is odd is that
>> (AFAIR) we've never seen this before.  Maybe somebody recently added
>> an error cursor callback in a place that didn't have it before, and
>> is involved in SQL-function processing?  None of the commits leading
>> up to the earlier failure look promising for that, though.

> The relevant range of commits (e8b1774fc2 to a7b9d24e4e) includes an
> ereport change (bda6dedbea) and a couple of "simple expression"
> changes (8f59f6b9c0, fbc7a71608) but I don't know exactly why they
> would have caused this.

When I first noticed hyrax's failure, some days ago, I immediately
thought of the "simple expression" patch.  But that should not have
affected SQL-function processing in any way: the bulk of the changes
were in plpgsql, and even the changes in plancache could not be
relevant, because functions.c does not use the plancache.

As for ereport, you'd think that that would only matter once you were
already doing an ereport.  The point at which the stack overflow
check triggers should be in normal code, not error recovery.

> It seems at least possible, though, that
> changing the return type of functions involved in error reporting
> would slightly change the amount of stack space used;

Right, but if it's down to that sort of phase-of-the-moon codegen
difference, you'd think this failure would have been coming and
going for years.  I still suppose that some fairly recent change
must be contributing to this, but haven't had time to investigate.

> Other than experimenting on
> that machine, I'm not sure how we could really determine the relevant
> factors here.

We don't have a lot of CCA buildfarm machines, so I'm suspecting that
it's probably not that hard to repro if you build with CCA.

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-04 Thread Robert Haas
On Sat, Apr 4, 2020 at 10:57 AM Tom Lane  wrote:
> It's not so surprising that we could get a different result that way
> from a CLOBBER_CACHE_ALWAYS animal like hyrax, since CCA-forced
> cache reloads would cause extra stack expenditure at a lot of places.
> And it could vary depending on totally random details, like the number
> of local variables in seemingly unrelated code.

Oh, yeah. That's unfortunate.

> What is odd is that
> (AFAIR) we've never seen this before.  Maybe somebody recently added
> an error cursor callback in a place that didn't have it before, and
> is involved in SQL-function processing?  None of the commits leading
> up to the earlier failure look promising for that, though.

The relevant range of commits (e8b1774fc2 to a7b9d24e4e) includes an
ereport change (bda6dedbea) and a couple of "simple expression"
changes (8f59f6b9c0, fbc7a71608) but I don't know exactly why they
would have caused this. It seems at least possible, though, that
changing the return type of functions involved in error reporting
would slightly change the amount of stack space used; and the others
are related to SQL-function processing. Other than experimenting on
that machine, I'm not sure how we could really determine the relevant
factors here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-04 Thread Tom Lane
Robert Haas  writes:
> hyrax's last run was before any of this happened, so it seems to have
> an unrelated problem. The last two runs, three and six days ago, both
> failed like this:

> -ERROR:  stack depth limit exceeded
> +ERROR:  stack depth limit exceeded at character 8

> Not sure what that's about.

What it looks like is that hyrax is managing to detect stack overflow
at a point where an errcontext callback is active that adds an error
cursor to the failure.

It's not so surprising that we could get a different result that way
from a CLOBBER_CACHE_ALWAYS animal like hyrax, since CCA-forced
cache reloads would cause extra stack expenditure at a lot of places.
And it could vary depending on totally random details, like the number
of local variables in seemingly unrelated code.  What is odd is that
(AFAIR) we've never seen this before.  Maybe somebody recently added
an error cursor callback in a place that didn't have it before, and
is involved in SQL-function processing?  None of the commits leading
up to the earlier failure look promising for that, though.

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-04 Thread Robert Haas
On Fri, Apr 3, 2020 at 10:43 PM Robert Haas  wrote:
> I think I've done about as much as I can do for tonight, though. Most
> things are green now, and the ones that aren't are failing because of
> stuff that is at least plausibly fixed. By morning it should be
> clearer how much broken stuff is left, although that will be somewhat
> complicated by at least sidewinder and seawasp needing manual
> intervention to get back on track.

Taking stock of the situation this morning, most of the buildfarm is
now green. There are three failures, on eelpout (6 hours ago),
fairywren (17 hours ago), and hyrax (3 days, 7 hours ago).

eelpout is unhappy because:

+WARNING:  could not remove shared memory segment
"/PostgreSQL.248989127": No such file or directory
+WARNING:  could not remove shared memory segment
"/PostgreSQL.1450751626": No such file or directory
  multibatch
 
  f
@@ -861,22 +863,15 @@

 select length(max(s.t))
 from wide left join (select id, coalesce(t, '') || '' as t from wide)
s using (id);
- length
-
- 32
-(1 row)
-
+ERROR:  could not open shared memory segment "/PostgreSQL.605707657":
No such file or directory
+CONTEXT:  parallel worker

I'm not sure what caused that exactly, but it sorta looks like
operator intervention. Thomas, any ideas?

fairywren's last run was on 21dc488, and commit
460314db08e8688e1a54a0a26657941e058e45c5 was an attempt to fix what
broken there. I guess we'll find out whether that worked the next time
it runs.

hyrax's last run was before any of this happened, so it seems to have
an unrelated problem. The last two runs, three and six days ago, both
failed like this:

-ERROR:  stack depth limit exceeded
+ERROR:  stack depth limit exceeded at character 8

Not sure what that's about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-04 Thread Robert Haas
On Fri, Apr 3, 2020 at 11:06 PM Andres Freund  wrote:
> On 2020-04-03 20:48:09 -0400, Robert Haas wrote:
> > 'serinus' is also failing. This is less obviously related:
>
> Hm. Tests passed once since then.

Yeah, but conchuela also failed once in what I think was a similar
way. I suspect the fix I pushed last night
(3e0d80fd8d3dd4f999e0d3aa3e591f480d8ad1fd) may have been enough to
clear this up.

> That already seems suspicious. I checked the following (successful) run
> and I did not see that in the stage's logs.

Yeah, the behavior of the test case doesn't seem to be entirely deterministic.

> I, again, have to say that the amount of stuff that was done as part of
>
> commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
> Author: Peter Eisentraut 
> Date:   2017-03-23 08:36:36 -0400
>
> Logical replication support for initial data copy
>
> is insane. Adding support for running sql over replication connections
> and extending CREATE_REPLICATION_SLOT with new options (without even
> mentioning that in the commit message!) as part of a commit described as
> "Logical replication support for initial data copy" shouldn't happen.

I agreed then and still do.

> So I'm a bit confused here. The best approach is probably to try to
> reproduce this by adding an artificial delay into backend shutdown.

I was able to reproduce an assertion failure by starting a
transaction, running a replication command that failed, and then
exiting the backend. 3e0d80fd8d3dd4f999e0d3aa3e591f480d8ad1fd made
that go away. I had wrongly assumed that there was no other way for a
walsender to have a ResourceOwner, and in the face of SQL commands
also being executed by walsenders, that's clearly not true. I'm not
sure *precisely* how that led to the BF failures, but it was really
clear that it was wrong.

> > (I still really dislike the fact that we have this evil hack allowing
> > one connection to mix and match those sets of commands...)
>
> FWIW, I think the opposite. We should get rid of the difference as much
> as possible.

Well, that's another approach. It's OK to have one system and it's OK
to have two systems, but one and a half is not ideal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Petr Jelinek

On 04/04/2020 05:06, Andres Freund wrote:

Hi,

Peter, Petr, CCed you because it's probably a bug somewhere around the
initial copy code for logical replication.


On 2020-04-03 20:48:09 -0400, Robert Haas wrote:

'serinus' is also failing. This is less obviously related:


Hm. Tests passed once since then.



2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG:  received
replication command: CREATE_REPLICATION_SLOT
"tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR:  replication
slot "tap_sub_16390_sync_16384" already exists


That already seems suspicious. I checked the following (successful) run
and I did not see that in the stage's logs.

Looking at the failing log, it fails because for some reason there are
multiple rounds of copying the relation (once due to a refresh, once
due to an intentional replication failure). Each round creates its own
temporary slot.

first time:
2020-04-04 02:08:57.276 CEST [5e87d019.506bd:1] LOG:  connection received: 
host=[local]
2020-04-04 02:08:57.278 CEST [5e87d019.506bd:4] LOG:  received replication command: 
CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
2020-04-04 02:08:57.282 CEST [5e87d019.506bd:9] LOG:  statement: COPY 
public.tab_rep TO STDOUT
2020-04-04 02:08:57.284 CEST [5e87d019.506bd:10] LOG:  disconnection: session 
time: 0:00:00.007 user=bf database=postgres host=[local]

second time:
2020-04-04 02:08:57.288 CEST [5e87d019.506bf:1] LOG:  connection received: 
host=[local]
2020-04-04 02:08:57.289 CEST [5e87d019.506bf:4] LOG:  received replication command: 
CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
2020-04-04 02:08:57.293 CEST [5e87d019.506bf:9] LOG:  statement: COPY 
public.tab_rep TO STDOUT

third time:
2020-04-04 02:08:57.297 CEST [5e87d019.506c1:1] LOG:  connection received: 
host=[local]
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG:  received replication command: 
CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput 
USE_SNAPSHOT
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR:  replication slot 
"tap_sub_16390_sync_16384" already exists

Note that the connection from the second attempt has not yet
disconnected. Hence the error about the replication slot already
existing - it's a temporary replication slot that'd otherwise already
have been dropped.


Seems the logical rep code needs to do something about this race?



The downstream:


2020-04-04 02:08:57.275 CEST [5e87d019.506bc:1] LOG:  logical replication table synchronization 
worker for subscription "tap_sub", table "tab_rep" has started
2020-04-04 02:08:57.282 CEST [5e87d019.506bc:2] ERROR:  duplicate key value violates 
unique constraint "tab_rep_pkey"
2020-04-04 02:08:57.282 CEST [5e87d019.506bc:3] DETAIL:  Key (a)=(1) already 
exists.
2020-04-04 02:08:57.282 CEST [5e87d019.506bc:4] CONTEXT:  COPY tab_rep, line 1
2020-04-04 02:08:57.283 CEST [5e87d018.50689:5] LOG:  background worker "logical 
replication worker" (PID 329404) exited with exit code 1
2020-04-04 02:08:57.287 CEST [5e87d019.506be:1] LOG:  logical replication table synchronization 
worker for subscription "tap_sub", table "tab_rep" has started
2020-04-04 02:08:57.293 CEST [5e87d019.506be:2] ERROR:  duplicate key value violates 
unique constraint "tab_rep_pkey"
2020-04-04 02:08:57.293 CEST [5e87d019.506be:3] DETAIL:  Key (a)=(1) already 
exists.
2020-04-04 02:08:57.293 CEST [5e87d019.506be:4] CONTEXT:  COPY tab_rep, line 1
2020-04-04 02:08:57.295 CEST [5e87d018.50689:6] LOG:  background worker "logical 
replication worker" (PID 329406) exited with exit code 1
2020-04-04 02:08:57.297 CEST [5e87d019.506c0:1] LOG:  logical replication table synchronization 
worker for subscription "tap_sub", table "tab_rep" has started
2020-04-04 02:08:57.299 CEST [5e87d019.506c0:2] ERROR:  could not create replication slot 
"tap_sub_16390_sync_16384": ERROR:  replication slot "tap_sub_16390_sync_16384" 
already exists
2020-04-04 02:08:57.300 CEST [5e87d018.50689:7] LOG:  background worker "logical replication worker" (PID 329408) exited with exit code 


Looks like we are simply retrying so fast that upstream will not have 
finished cleanup after the second try by the time we run the third one.


The last_start_times is supposed to protect against that so I guess 
there is some issue with how that works.


--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Andres Freund
Hi,

Peter, Petr, CCed you because it's probably a bug somewhere around the
initial copy code for logical replication.


On 2020-04-03 20:48:09 -0400, Robert Haas wrote:
> 'serinus' is also failing. This is less obviously related:

Hm. Tests passed once since then.


> 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG:  received
> replication command: CREATE_REPLICATION_SLOT
> "tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
> 2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR:  replication
> slot "tap_sub_16390_sync_16384" already exists

That already seems suspicious. I checked the following (successful) run
and I did not see that in the stage's logs.

Looking at the failing log, it fails because for some reason there are
multiple rounds of copying the relation (once due to a refresh, once
due to an intentional replication failure). Each round creates its own
temporary slot.

first time:
2020-04-04 02:08:57.276 CEST [5e87d019.506bd:1] LOG:  connection received: 
host=[local]
2020-04-04 02:08:57.278 CEST [5e87d019.506bd:4] LOG:  received replication 
command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL 
pgoutput USE_SNAPSHOT
2020-04-04 02:08:57.282 CEST [5e87d019.506bd:9] LOG:  statement: COPY 
public.tab_rep TO STDOUT
2020-04-04 02:08:57.284 CEST [5e87d019.506bd:10] LOG:  disconnection: session 
time: 0:00:00.007 user=bf database=postgres host=[local]

second time:
2020-04-04 02:08:57.288 CEST [5e87d019.506bf:1] LOG:  connection received: 
host=[local]
2020-04-04 02:08:57.289 CEST [5e87d019.506bf:4] LOG:  received replication 
command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL 
pgoutput USE_SNAPSHOT
2020-04-04 02:08:57.293 CEST [5e87d019.506bf:9] LOG:  statement: COPY 
public.tab_rep TO STDOUT

third time:
2020-04-04 02:08:57.297 CEST [5e87d019.506c1:1] LOG:  connection received: 
host=[local]
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG:  received replication 
command: CREATE_REPLICATION_SLOT "tap_sub_16390_sync_16384" TEMPORARY LOGICAL 
pgoutput USE_SNAPSHOT
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR:  replication slot 
"tap_sub_16390_sync_16384" already exists

Note that the connection from the second attempt has not yet
disconnected. Hence the error about the replication slot already
existing - it's a temporary replication slot that'd otherwise already
have been dropped.


Seems the logical rep code needs to do something about this race?


About the assertion failure:

TRAP: FailedAssertion("owner->bufferarr.nitems == 0", File: 
"/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/../pgsql/src/backend/utils/resowner/resowner.c",
 Line: 718)
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(ExceptionalCondition+0x5c)[0x9a13ac]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(ResourceOwnerDelete+0x295)[0x9db8e5]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)[0x54c61f]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(AbortOutOfAnyTransaction+0x122)[0x550e32]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)[0x9b3bc9]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(shmem_exit+0x35)[0x80db45]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)[0x80dc77]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(proc_exit+0x8)[0x80dd08]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(PostgresMain+0x59f)[0x83bd0f]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)[0x7a0264]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(PostmasterMain+0xbfc)[0x7a2b8c]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(main+0x6fb)[0x49749b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7fc52d83]
postgres: publisher: walsender bf [local] idle in transaction 
(aborted)(_start+0x2a)[0x49753a]
2020-04-04 02:08:57.302 CEST [5e87d018.5066b:4] LOG:  server process (PID 
329409) was terminated by signal 6: Aborted

Due to the log_line_prefix used, I was at first not entirely sure the
backend that crashed was the one with the ERROR. But it appears we print
the pid as hex for '%c' (why?), so it indeed is the one.


I, again, have to say that the amount of stuff that was done as part of

commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920
Author: Peter Eisentraut 
Date:   2017-03-23 08:36:36 -0400

Logical replication support for initial data copy

is insane. Adding support for running sql over replication connections
and extending CREATE_REPLICATION_SLOT with new options (without even
mentioning that in the commit message!) as part of a commit described as
"Logical replication support for initial data copy" shouldn't happen.


It's not obvious to me what buffer pins could be held at this point. I
wonder if this could be somehow related to

commit 3cb646264e8ced9f25557ce271284da5

Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Robert Haas
On Fri, Apr 3, 2020 at 9:52 PM Tom Lane  wrote:
> Robert Haas  writes:
> > 'prairiedog' is also unhappy, and it looks related:
>
> Yeah, gaur also failed in the same place.  Both of those are
> alignment-picky 32-bit hardware, so I'm thinking the problem is
> pg_gmtime() trying to fetch a 64-bit pg_time_t from an insufficiently
> aligned address.  I'm trying to confirm that on gaur's host right now,
> but it's a slow machine ...

You might just want to wait until tomorrow and see whether it clears
up in newer runs. I just pushed yet another fix that might be
relevant.

I think I've done about as much as I can do for tonight, though. Most
things are green now, and the ones that aren't are failing because of
stuff that is at least plausibly fixed. By morning it should be
clearer how much broken stuff is left, although that will be somewhat
complicated by at least sidewinder and seawasp needing manual
intervention to get back on track.

I apologize to everyone who has been or will be inconvenienced by all
of this. So far I've pushed 4 test case fixes, 2 bug fixes, and 1
makefile fix, which I'm pretty sure is over quota for one patch. :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Tom Lane
Robert Haas  writes:
> Interestingly, on my machine, rmtree coped with a mode 0 directory
> just fine, but mode 0400 was more than its tiny brain could handle, so
> the originally committed fix had code to revert 0400 back to 0700, but
> I didn't add similar code to revert from 0 back to 0700 because that
> was working fine.

It seems really odd that an implementation could cope with mode-0
but not mode-400.  Not sure I care enough to dig into the Perl
library code, though.
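
A hedged way to probe the difference locally, assuming core Perl
(behavior presumably varies with the File::Path version, which is
rather the point):

    use File::Path qw(make_path rmtree);

    make_path('tmp_probe/child');
    chmod 0000, 'tmp_probe/child';   # swap in 0400 to compare the cases
    rmtree('tmp_probe');
    print((-d 'tmp_probe') ? "rmtree could not cope\n" : "cleaned up\n");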

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Tom Lane
Robert Haas  writes:
> 'prairiedog' is also unhappy, and it looks related:

Yeah, gaur also failed in the same place.  Both of those are
alignment-picky 32-bit hardware, so I'm thinking the problem is
pg_gmtime() trying to fetch a 64-bit pg_time_t from an insufficiently
aligned address.  I'm trying to confirm that on gaur's host right now,
but it's a slow machine ...

> 'serinus' is also failing. This is less obviously related:

Dunno about this one.

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Robert Haas
On Fri, Apr 3, 2020 at 8:12 PM Tom Lane  wrote:
> Yeah, so it would seem.  The buildfarm script uses rmtree to clean out
> the old build tree.  The man page for File::Path suggests (but can't
> quite bring itself to say in so many words) that by default, rmtree
> will adjust the permissions on target directories to allow the deletion
> to succeed.  But that's very clearly not happening on some platforms.
> (Maybe that represents a local patch on the part of some packagers
> who thought it was too unsafe?)

Interestingly, on my machine, rmtree coped with a mode 0 directory
just fine, but mode 0400 was more than its tiny brain could handle, so
the originally committed fix had code to revert 0400 back to 0700, but
I didn't add similar code to revert from 0 back to 0700 because that
was working fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Robert Haas
On Fri, Apr 3, 2020 at 6:13 PM Tom Lane  wrote:
> Locally, I observe that "make clean" in src/bin/pg_validatebackup fails
> to clean up the tmp_check directory left behind by "make check".

Fixed.

I also tried to fix 'lapwing', which was complaining about a
call to pg_gmtime, saying that it "expected 'const pg_time_t *' but
argument is of type 'time_t *'". I was thinking that the problem had
something to do with const, but Thomas pointed out to me that
pg_time_t != time_t, so I pushed a fix which assumes that was the
issue. (It was certainly *an* issue.)

'prairiedog' is also unhappy, and it looks related:

/bin/sh ../../../../config/install-sh -c -d
'/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts'/tmp_check
cd . && 
TESTDIR='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts'
PATH="/Users/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/Users/buildfarm/bf-data/HEAD/inst/bin:$PATH"
DYLD_LIBRARY_PATH="/Users/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/Users/buildfarm/bf-data/HEAD/inst/lib:$DYLD_LIBRARY_PATH"
 PGPORT='65678'
PG_REGRESS='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/../../../../src/test/regress/pg_regress'
REGRESS_SHLIB='/Users/buildfarm/bf-data/HEAD/pgsql.build/src/test/regress/regress.so'
/usr/local/perl5.8.3/bin/prove -I ../../../../src/test/perl/ -I .
t/*.pl
t/001_base.........ok
t/002_standby......FAILED--Further testing stopped: system pg_basebackup failed
make: *** [check] Error 25

Unfortunately, that error message is not very informative and for some
reason the TAP logs don't seem to be included in the buildfarm output
in this case, so it's hard to tell exactly what went wrong. This
appears to be another 32-bit critter, which may be related somehow,
but I don't know how exactly.

'serinus' is also failing. This is less obviously related:

[02:08:55] t/003_constraints.pl .. ok 2048 ms ( 0.01 usr  0.00 sys
+  1.28 cusr  0.38 csys =  1.67 CPU)
# poll_query_until timed out executing this query:
# SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN
('r', 's');
# expecting this output:
# t
# last actual query output:
# f
# with stderr:

But there's also this:

2020-04-04 02:08:57.297 CEST [5e87d019.506c1:1] LOG:  connection
received: host=[local]
2020-04-04 02:08:57.298 CEST [5e87d019.506c1:2] LOG:  replication
connection authorized: user=bf
application_name=tap_sub_16390_sync_16384
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:3] LOG:  statement: BEGIN
READ ONLY ISOLATION LEVEL REPEATABLE READ
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:4] LOG:  received
replication command: CREATE_REPLICATION_SLOT
"tap_sub_16390_sync_16384" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2020-04-04 02:08:57.299 CEST [5e87d019.506c1:5] ERROR:  replication
slot "tap_sub_16390_sync_16384" already exists
TRAP: FailedAssertion("owner->bufferarr.nitems == 0", File:
"/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/../pgsql/src/backend/utils/resowner/resowner.c",
Line: 718)
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(ExceptionalCondition+0x5c)[0x9a13ac]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(ResourceOwnerDelete+0x295)[0x9db8e5]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)[0x54c61f]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(AbortOutOfAnyTransaction+0x122)[0x550e32]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)[0x9b3bc9]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(shmem_exit+0x35)[0x80db45]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)[0x80dc77]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(proc_exit+0x8)[0x80dd08]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(PostgresMain+0x59f)[0x83bd0f]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)[0x7a0264]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(PostmasterMain+0xbfc)[0x7a2b8c]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(main+0x6fb)[0x49749b]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7fc52d83]
postgres: publisher: walsender bf [local] idle in transaction
(aborted)(_start+0x2a)[0x49753a]
2020-04-04 02:08:57.302 CEST [5e87d018.5066b:4] LOG:  server process
(PID 329409) was terminated by signal 6: Aborted
2020-04-04 02:08:57.302 CEST [5e87d018.5066b:5] DETAIL:  Failed
process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ

That might well be related. I note in passing that the DETAIL emitted
by the postmaster shows the previous SQL command rather than the
more-recent replication command, which seems like something to fix. (I
still really dislike the fact that we have this evil hack allowing one
connection to mix and match those sets of commands...)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Tom Lane
Robert Haas  writes:
> On Fri, Apr 3, 2020 at 6:48 PM Tom Lane  wrote:
>> I'm guessing that we're looking at a platform-specific difference in
>> whether "rm -rf" fails outright on an unreadable subdirectory, or
>> just tries to carry on by unlinking it anyway.

> My intention was that it would be cleaned by the TAP framework itself,
> since the temporary directories it creates are marked for cleanup. But
> it may be that there's a platform dependency in the behavior of Perl's
> File::Path::rmtree, too.

Yeah, so it would seem.  The buildfarm script uses rmtree to clean out
the old build tree.  The man page for File::Path suggests (but can't
quite bring itself to say in so many words) that by default, rmtree
will adjust the permissions on target directories to allow the deletion
to succeed.  But that's very clearly not happening on some platforms.
(Maybe that represents a local patch on the part of some packagers
who thought it was too unsafe?)

Anyway, the end state presumably is that the pgsql.build directory
is still there at the end of the buildfarm run, and the next run's
attempt to also rmtree it fares no better.  Then look what it does
to set up the new build:

system("cp -R -p $target $build_path 2>&1");

Of course, if $build_path already exists, then cp copies to a subdirectory
of the target not the target itself.  So that explains the symptom
"./configure does not exist" --- it exists all right, but in a
subdirectory below the one where the buildfarm expects it to be.
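
A hedged sketch of a guard at that spot, assuming the script's existing
$target and $build_path variables: refuse to run rather than let cp
nest a fresh tree inside a stale one.

    # If the previous run's tree is still there, bail out loudly
    # instead of producing pgsql.build/pgsql.build.
    die "stale build tree at $build_path; remove it manually\n"
        if -d $build_path;
    system("cp -R -p $target $build_path 2>&1");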

It looks to me like the same problem would occur with VPATH or no.
The lack of failures among the VPATH-using critters probably has
more to do with whether their rmtree is willing to deal with this
case than with VPATH.

Anyway, it's evident that the buildfarm critters that are busted
will need manual cleanup, because the script is not going to be
able to get out of this by itself.  I remain of the opinion that
the hazard of that happening again in the future (eg, if a buildfarm
animal loses power during the test) is sufficient reason to remove
this test case.

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Robert Haas
On Fri, Apr 3, 2020 at 5:58 PM Fabien COELHO  wrote:
> seawasp just failed the same way. Good news, I can see "configure" under
> "HEAD/pgsql".

Ah, good.

> The only strange thing under buildroot I found is:
>
> HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/

Huh. I wonder how that got left behind ... it should've been cleaned
up by the TAP test framework. But I pushed a commit to change the
permissions back explicitly before exiting. As Tom says, I probably
need to remove that entire test, but I'm going to try this first.
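
A hedged sketch of that fix, assuming the shape of the mutilation
helpers in 003_corruption.pl (the committed change may differ in
detail):

    # After the validation check has run against the unreadable
    # directory, make it deletable again so the TAP framework's
    # end-of-test cleanup can remove the backup tree.
    chmod(0700, "$backup_path/pg_subtrans")
        || die "chmod $backup_path/pg_subtrans: $!";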

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Robert Haas
On Fri, Apr 3, 2020 at 6:48 PM Tom Lane  wrote:
> I'm guessing that we're looking at a platform-specific difference in
> whether "rm -rf" fails outright on an unreadable subdirectory, or
> just tries to carry on by unlinking it anyway.

My intention was that it would be cleaned by the TAP framework itself,
since the temporary directories it creates are marked for cleanup. But
it may be that there's a platform dependency in the behavior of Perl's
File::Path::rmtree, too.

> A partial fix would be to have the test script put back normal
> permissions on that directory before it exits ... but any failure
> partway through the script would leave a time bomb requiring manual
> cleanup.

Yeah. I've pushed that fix for now, but as you say, it may not survive
contact with the enemy. That's kind of disappointing, because I put a
lot of work into trying to make the tests cover every line of code
that they possibly could, and there's no reason to suppose that
pg_validatebackup is the only tool that could benefit from having code
coverage of those kinds of scenarios. It's probably not even the tool
that is most in need of such testing; it must be far worse if, say,
pg_rewind can't cope with it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Tom Lane
I wrote:
> I'm guessing that we're looking at a platform-specific difference in
> whether "rm -rf" fails outright on an unreadable subdirectory, or
> just tries to carry on by unlinking it anyway.

Yeah... on my RHEL6 box, "make check" cleans up the working directories
under tmp_check, but on a FreeBSD 12.1 box, not so much: I'm left with

$ ls tmp_check/
log/  t_003_corruption_master_data/
tgl@oldmini$ ls -R tmp_check/t_003_corruption_master_data/
backup/

tmp_check/t_003_corruption_master_data/backup:
open_directory_fails/

tmp_check/t_003_corruption_master_data/backup/open_directory_fails:
pg_subtrans/

tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans:
ls: 
tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: 
Permission denied

I did not see any complaints printed to the terminal, but in
regress_log_003_corruption there's

...
ok 40 - corrupt backup fails validation: open_directory_fails: matches
cannot chdir to child for 
/usr/home/tgl/pgsql/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans:
 Permission denied at t/003_corruption.pl line 126.
cannot remove directory for 
/usr/home/tgl/pgsql/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails:
 Directory not empty at t/003_corruption.pl line 126.
# Running: pg_basebackup -D 
/usr/home/tgl/pgsql/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/search_directory_fails
 --no-sync -T /tmp/lxaL_sLcnr=/tmp/_fegwVjoDR
ok 41 - base backup ok
...

This may be more of a Perl version issue than a platform issue,
but either way it's a problem.

Also, on the FreeBSD box, "rm -rf" isn't happy either:

$ rm -rf tmp_check
rm: 
tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans: 
Permission denied
rm: tmp_check/t_003_corruption_master_data/backup/open_directory_fails: 
Directory not empty
rm: tmp_check/t_003_corruption_master_data/backup: Directory not empty
rm: tmp_check/t_003_corruption_master_data: Directory not empty
rm: tmp_check: Directory not empty


regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Tom Lane
Thomas Munro  writes:
> Same here, on elver.  I see pg_subtrans has been chmod(0)'d,
> presumably by the perl subroutine mutilate_open_directory_fails.  I
> see this in my inbox (the build farm wrote it to stderr or stdout
> rather than the log file):

> cannot chdir to child for
> pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans:
> Permission denied at ./run_build.pl line 1013.
> cannot remove directory for
> pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails:
> Directory not empty at ./run_build.pl line 1013.

I'm guessing that we're looking at a platform-specific difference in
whether "rm -rf" fails outright on an unreadable subdirectory, or
just tries to carry on by unlinking it anyway.

A partial fix would be to have the test script put back normal
permissions on that directory before it exits ... but any failure
partway through the script would leave a time bomb requiring manual
cleanup.

On the whole, I'd argue that testing that behavior is not valuable
enough to take risks of periodically breaking buildfarm members
in a way that will require manual recovery --- to say nothing of
annoying developers who trip over it.  So my vote is to remove
that part of the test and be satisfied with checking the behavior
for an unreadable file.

This doesn't directly explain the failure-at-next-configure behavior
that we're seeing in the buildfarm, but it wouldn't be too surprising
if it ends up being that the buildfarm client script doesn't manage
to fully recover from the situation.

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Stephen Frost
Greetings,

* Thomas Munro (thomas.mu...@gmail.com) wrote:
> On Sat, Apr 4, 2020 at 11:13 AM Tom Lane  wrote:
> > Fabien COELHO  writes:
> > > The only strange thing under buildroot I found is:
> >
> > > HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/
> >
> > > this last directory's perms are d--------- which seems to break cleanup.
> 
> Same here, on elver.  I see pg_subtrans has been chmod(0)'d,
> presumably by the perl subroutine mutilate_open_directory_fails.  I
> see this in my inbox (the build farm wrote it to stderr or stdout
> rather than the log file):

Yup, saw the same here.

chmod'ing it to 755 seemed to result it the next run cleaning it up, at
least.  Not sure how things will go on the next actual build tho.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Thomas Munro
On Sat, Apr 4, 2020 at 11:13 AM Tom Lane  wrote:
> Fabien COELHO  writes:
> > The only strange thing under buildroot I found is:
>
> > HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/
>
> > this last directory's perms are d--------- which seems to break cleanup.

Same here, on elver.  I see pg_subtrans has been chmod(0)'d,
presumably by the perl subroutine mutilate_open_directory_fails.  I
see this in my inbox (the build farm wrote it to stderr or stdout
rather than the log file):

cannot chdir to child for
pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans:
Permission denied at ./run_build.pl line 1013.
cannot remove directory for
pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails:
Directory not empty at ./run_build.pl line 1013.
cannot remove directory for
pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup:
Directory not empty at ./run_build.pl line 1013.
cannot remove directory for
pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data:
Directory not empty at ./run_build.pl line 1013.
cannot remove directory for
pgsql.build/src/bin/pg_validatebackup/tmp_check: Directory not empty
at ./run_build.pl line 1013.
cannot remove directory for pgsql.build/src/bin/pg_validatebackup:
Directory not empty at ./run_build.pl line 1013.
cannot remove directory for pgsql.build/src/bin: Directory not empty
at ./run_build.pl line 1013.
cannot remove directory for pgsql.build/src: Directory not empty at
./run_build.pl line 1013.
cannot remove directory for pgsql.build: Directory not empty at
./run_build.pl line 1013.
cannot chdir to child for
pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans:
Permission denied at ./run_build.pl line 589.
cannot remove directory for
pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails:
Directory not empty at ./run_build.pl line 589.
cannot remove directory for
pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup:
Directory not empty at ./run_build.pl line 589.
cannot remove directory for
pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data:
Directory not empty at ./run_build.pl line 589.
cannot remove directory for
pgsql.build/src/bin/pg_validatebackup/tmp_check: Directory not empty
at ./run_build.pl line 589.
cannot remove directory for pgsql.build/src/bin/pg_validatebackup:
Directory not empty at ./run_build.pl line 589.
cannot remove directory for pgsql.build/src/bin: Directory not empty
at ./run_build.pl line 589.
cannot remove directory for pgsql.build/src: Directory not empty at
./run_build.pl line 589.
cannot remove directory for pgsql.build: Directory not empty at
./run_build.pl line 589.




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Alvaro Herrera
On 2020-Apr-03, Tom Lane wrote:

> I wonder if VPATH versus not-VPATH might be a relevant factor ...

Oh, absolutely.  The ones that failed show, in the last successful run,
the configure line invoked as "./configure", while the animals that are
still running are invoking configure from some other directory.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Tom Lane
Fabien COELHO  writes:
> The only strange thing under buildroot I found is:

> HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/

> this last directory's perms are d--------- which seems to break cleanup.

Locally, I observe that "make clean" in src/bin/pg_validatebackup fails
to clean up the tmp_check directory left behind by "make check".
So the new makefile is not fully plugged into its standard
responsibilities.  I don't see any unreadable subdirectories though.

I wonder if VPATH versus not-VPATH might be a relevant factor ...

regards, tom lane




Re: backup manifests and contemporaneous buildfarm failures

2020-04-03 Thread Fabien COELHO



Hello Robert,


Done now. Meanwhile, two more machines have reported the mysterious message:

sh: ./configure: not found

...that first appeared on spurfowl a few hours ago. The other two
machines are eelpout and elver, both of which list Thomas Munro as a
maintainer. spurfowl lists Stephen Frost. Thomas, Stephen, can one of
you check and see what's going on? spurfowl has failed this way four
times now, and eelpout and elver have each failed the last two runs,
but since there's no helpful information in the logs, it's hard to
guess what went wrong.

I'm sort of afraid that something in the new TAP tests accidentally
removed way too many files during the cleanup phase - e.g. it decided
the temporary directory was / and removed every file it could access,
or something like that. It doesn't do that here, or I, uh, would've
noticed by now. But sometimes strange things happen on other people's
machines. Hopefully one of those strange things is not that my test
code is single-handedly destroying the entire buildfarm, but it's
possible.


seawasp just failed the same way. Good news, I can see "configure" under 
"HEAD/pgsql".


The only strange thing under buildroot I found is:

HEAD/pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans/

this last directory's perms are d--------- which seems to break cleanup.

It may be a leftover from a previous run which failed (possibly
21dc488?). I cannot see how this would be related to configure, though. Maybe
something else fails silently and the message is about a consequence of 
the prior silent failure.


I commented out the cron job and will try to look into it tomorrow if 
the status has not changed by then.


--
Fabien.