Re: [HACKERS] Idea for improving buildfarm robustness

2015-10-03 Thread Tom Lane
Josh Berkus writes: > On 10/02/2015 09:39 PM, Tom Lane wrote: >> I wrote: >>> Here's a rewritten patch that looks at postmaster.pid instead of >>> pg_control. It should be effectively the same as the prior patch in terms >>> of response to directory-removal cases, and it should also catch many >>

Re: [HACKERS] Idea for improving buildfarm robustness

2015-10-03 Thread Josh Berkus
On 10/02/2015 09:39 PM, Tom Lane wrote: > I wrote: >> Here's a rewritten patch that looks at postmaster.pid instead of >> pg_control. It should be effectively the same as the prior patch in terms >> of response to directory-removal cases, and it should also catch many >> overwrite cases. > > BTW,

Re: [HACKERS] Idea for improving buildfarm robustness

2015-10-03 Thread Tom Lane
Michael Paquier writes: > On Sat, Oct 3, 2015 at 1:39 PM, Tom Lane wrote: >> BTW, my thought at the moment is to wait till after next week's releases >> to push this in. I think it's probably solid, but it doesn't seem like >> it's worth taking the risk of pushing shortly before a wrap date. >

Re: [HACKERS] Idea for improving buildfarm robustness

2015-10-02 Thread Michael Paquier
On Sat, Oct 3, 2015 at 1:39 PM, Tom Lane wrote: > I wrote: > > Here's a rewritten patch that looks at postmaster.pid instead of > > pg_control. It should be effectively the same as the prior patch in > terms > > of response to directory-removal cases, and it should also catch many > > overwrite

Re: [HACKERS] Idea for improving buildfarm robustness

2015-10-02 Thread Tom Lane
I wrote: > Here's a rewritten patch that looks at postmaster.pid instead of > pg_control. It should be effectively the same as the prior patch in terms > of response to directory-removal cases, and it should also catch many > overwrite cases. BTW, my thought at the moment is to wait till after ne

Re: [HACKERS] Idea for improving buildfarm robustness

2015-10-01 Thread Tom Lane
I wrote: > It strikes me that a different approach that might be of value would > be to re-read postmaster.pid and make sure that (a) it's still there > and (b) it still contains the current postmaster's PID. This would > be morally equivalent to what Jim suggests above, and it would dodge > the c
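
The check described above can be illustrated with a minimal sketch. This is Python for illustration only, not the C code from the patch, and the function name `pidfile_still_ours` is hypothetical. It relies on the first line of postmaster.pid holding the postmaster's PID. Treating unexpected I/O errors as "cannot tell, keep running" is an assumption made here to avoid false positives, not something stated in the message.

```python
import os


def pidfile_still_ours(data_dir, my_pid):
    """Return True if data_dir/postmaster.pid still names my_pid.

    Hypothetical helper: (a) the file must still exist and
    (b) its first line must still contain our PID.
    """
    path = os.path.join(data_dir, "postmaster.pid")
    try:
        with open(path, "r") as f:
            first_line = f.readline().strip()
    except (FileNotFoundError, NotADirectoryError):
        # The file or the whole data directory is gone.
        return False
    except OSError:
        # Assumption: other errors (e.g. too many open files) are treated
        # as transient, so we do not shut down on them.
        return True
    try:
        return int(first_line) == my_pid
    except ValueError:
        # The file was overwritten with something that is not a PID.
        return False
```

A caller would invoke this periodically with its own PID (for example `os.getpid()`) and begin shutdown when it returns False.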

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-30 Thread Josh Berkus
So, testing: 1. I tested running an AWS instance (Ubuntu 14.04) into 100% IOWAIT, and the shutdown didn't kick in even when storage went full "d" state. It's possible that other kinds of remote storage failures would cause a shutdown, but don't we want them to? 2. I tested deleting /pgdata/* sev

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-30 Thread Tom Lane
Jim Nasby writes: > Ouch. So it sounds like there's value to seeing if pg_control isn't what > we expect it to be. > Instead of looking at the inode (portability problem), what if > pg_control contained a random number that was created at initdb time? On > startup postmaster would read that va
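
As a rough sketch of the random-number idea quoted above: a token is generated when the cluster is initialized, remembered at startup, and compared again later. This is illustrative Python, not PostgreSQL code. The file name `cluster_token` and all function names are hypothetical, and a separate text file stands in for pg_control so that the sketch does not have to touch its binary format. The later comparison step is an assumption about where the truncated suggestion is heading.

```python
import os
import secrets

TOKEN_FILE = "cluster_token"  # hypothetical stand-in for a field in pg_control


def init_cluster(data_dir):
    """Simulate initdb: write a fresh random token into the data directory."""
    token = secrets.token_hex(16)
    with open(os.path.join(data_dir, TOKEN_FILE), "w") as f:
        f.write(token)
    return token


def read_token(data_dir):
    """Return the token stored in data_dir, or None if it cannot be read."""
    try:
        with open(os.path.join(data_dir, TOKEN_FILE), "r") as f:
            return f.read().strip()
    except OSError:
        return None


def data_dir_unchanged(data_dir, remembered_token):
    """True only if the directory still carries the token read at startup."""
    return read_token(data_dir) == remembered_token
```

Unlike an inode comparison, this does not depend on filesystem behavior: a data directory recreated by a new initdb run gets a different token, so the comparison fails even though the file exists again.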

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-30 Thread Andrew Dunstan
On 09/30/2015 01:18 AM, Michael Paquier wrote: On Wed, Sep 30, 2015 at 7:19 AM, Tom Lane wrote: I wrote: Josh Berkus writes: Give me source with the change, and I'll put it on a cheap, low-bandwidth AWS instance and hammer the heck out of it. That should raise any false positives we can ex

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Jim Nasby
On 9/29/15 4:13 PM, Alvaro Herrera wrote: Joe Conway wrote: On 09/29/2015 01:48 PM, Alvaro Herrera wrote: I remember it, but I'm not sure it would have helped you. As I recall, your trouble was that after a reboot the init script decided to initdb the mount point -- postmaster wouldn't have

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Michael Paquier
On Wed, Sep 30, 2015 at 7:19 AM, Tom Lane wrote: > I wrote: >> Josh Berkus writes: >>> Give me source with the change, and I'll put it on a cheap, low-bandwidth >>> AWS instance and hammer the heck out of it. That should raise any false >>> positives we can expect. > >> Here's a draft patch again

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Josh Berkus
On 09/29/2015 12:47 PM, Tom Lane wrote: > Josh Berkus writes: >> In general, having the postmaster survive deletion of PGDATA is >> suboptimal. In rare cases of having it survive installation of a new >> PGDATA (via PITR restore, for example), I've even seen the zombie >> postmaster corrupt the d

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Tom Lane
I wrote: > Josh Berkus writes: >> Give me source with the change, and I'll put it on a cheap, low-bandwidth >> AWS instance and hammer the heck out of it. That should raise any false >> positives we can expect. > Here's a draft patch against HEAD (looks like it will work on 9.5 or > 9.4 without m

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Tom Lane
Josh Berkus writes: > Give me source with the change, and I'll put it on a cheap, low-bandwidth > AWS instance and hammer the heck out of it. That should raise any false > positives we can expect. Here's a draft patch against HEAD (looks like it will work on 9.5 or 9.4 without modifications, too)

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Alvaro Herrera
Joe Conway wrote: > On 09/29/2015 01:48 PM, Alvaro Herrera wrote: > > I remember it, but I'm not sure it would have helped you. As I recall, > > your trouble was that after a reboot the init script decided to initdb > > the mount point -- postmaster wouldn't have been running at all ... > > Righ

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Joe Conway
On 09/29/2015 01:48 PM, Alvaro Herrera wrote: > Joe Conway wrote: >> On 09/29/2015 12:47 PM, Tom Lane wrote: >>> We could possibly add additional checks, like trying to verify that >>> pg_control has the same inode number it used to. But I'm afraid that >>> would add portability issues and false-p

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Alvaro Herrera
Joe Conway wrote: > On 09/29/2015 12:47 PM, Tom Lane wrote: > > We could possibly add additional checks, like trying to verify that > > pg_control has the same inode number it used to. But I'm afraid that > > would add portability issues and false-positive hazards that would > > outweigh the value

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Joe Conway
On 09/29/2015 12:47 PM, Tom Lane wrote: > We could possibly add additional checks, like trying to verify that > pg_control has the same inode number it used to. But I'm afraid that > would add portability issues and false-positive hazards that would > outweigh the value. Not sure you remember the

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Alvaro Herrera
Tom Lane wrote: > Testing accessibility of "global/pg_control" would be enough to catch this > case, but only if we do it before you create a new one. So that seems > like an argument for making the test relatively often. The once-a-minute > option is sounding better and better. If we weren't a
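
The point that an accessibility test only helps if it runs before a new pg_control is created can be shown with a small sketch. This is illustrative Python rather than postmaster code, and the function name `pg_control_accessible` is hypothetical. Treating errors other than "not found" as inconclusive is an assumption made here.

```python
import os


def pg_control_accessible(data_dir):
    """Return False only if global/pg_control is definitely missing."""
    path = os.path.join(data_dir, "global", "pg_control")
    try:
        os.stat(path)
    except (FileNotFoundError, NotADirectoryError):
        return False
    except OSError:
        # Assumption: any other error is inconclusive, so report "accessible".
        return True
    return True
```

If the data directory is removed and then recreated between two checks, this test passes both times and the replacement goes unnoticed, which is why a shorter interval between checks makes the test more useful.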

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Tom Lane
Josh Berkus writes: > On 09/29/2015 11:48 AM, Tom Lane wrote: >> But today I thought of another way: suppose that we teach the postmaster >> to commit hara-kiri if the $PGDATA directory goes away. Since the >> buildfarm script definitely does remove all the temporary data directories >> it create

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Josh Berkus
On 09/29/2015 12:18 PM, Tom Lane wrote: > Andrew Dunstan writes: >> On 09/29/2015 02:48 PM, Tom Lane wrote: >>> Also, perhaps we'd only enable this behavior in --enable-cassert builds, >>> to avoid any risk of a postmaster incorrectly choosing to suicide in a >>> production scenario. Or maybe tha

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Tom Lane
Andrew Dunstan writes: > On 09/29/2015 02:48 PM, Tom Lane wrote: >> Also, perhaps we'd only enable this behavior in --enable-cassert builds, >> to avoid any risk of a postmaster incorrectly choosing to suicide in a >> production scenario. Or maybe that's overly conservative. > Not every buildfar

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Stephen Frost writes: > > * Tom Lane (t...@sss.pgh.pa.us) wrote: > >> I wouldn't want to do this every time through the postmaster's main loop, > >> but we could do this once an hour for no added cost by adding the check > >> where it does TouchSocketLockFi

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Andrew Dunstan
On 09/29/2015 02:48 PM, Tom Lane wrote: A problem the buildfarm has had for a long time is that if for some reason the scripts fail to stop a test postmaster, the postmaster process will hang around and cause subsequent runs to fail because of socket conflicts. This seems to have gotten a lot w

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Tom Lane
Stephen Frost writes: > * Tom Lane (t...@sss.pgh.pa.us) wrote: >> I wouldn't want to do this every time through the postmaster's main loop, >> but we could do this once an hour for no added cost by adding the check >> where it does TouchSocketLockFiles; or once every few minutes if we >> carried a
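
The idea of running a check at a fixed interval, rather than on every pass through the main loop, can be sketched as follows. This is illustrative Python, not the postmaster's actual loop. The class name `PeriodicCheck` is hypothetical, and the interval is left as a parameter because the thread considers several values.

```python
class PeriodicCheck:
    """Run a check at most once per interval, driven by the caller's clock."""

    def __init__(self, interval_seconds, check, start_time):
        self.interval = interval_seconds
        self.check = check
        self.last_run = start_time

    def maybe_run(self, now):
        """Call the check if the interval has elapsed; otherwise return None."""
        if now - self.last_run < self.interval:
            return None
        self.last_run = now
        return self.check()
```

The main loop would call `maybe_run(time.monotonic())` on each iteration. Most iterations then cost one subtraction and one comparison, and the file system is touched only when the interval expires.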

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Josh Berkus
On 09/29/2015 11:48 AM, Tom Lane wrote: > But today I thought of another way: suppose that we teach the postmaster > to commit hara-kiri if the $PGDATA directory goes away. Since the > buildfarm script definitely does remove all the temporary data directories > it creates, this ought to get the jo

Re: [HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote: > But today I thought of another way: suppose that we teach the postmaster > to commit hara-kiri if the $PGDATA directory goes away. Since the > buildfarm script definitely does remove all the temporary data directories > it creates, this ought to get the job

[HACKERS] Idea for improving buildfarm robustness

2015-09-29 Thread Tom Lane
A problem the buildfarm has had for a long time is that if for some reason the scripts fail to stop a test postmaster, the postmaster process will hang around and cause subsequent runs to fail because of socket conflicts. This seems to have gotten a lot worse lately due to the influx of very slow b