Re: Return codes for archive and restore commands

2018-12-03 Thread Олег Самойлов


> If you were to rewrite those paragraphs or make them more precise, how
> would you actually shape your suggestions?  I personally quite like the
> current formulations, but I am rather used to it to be honest.
> --
> Michael

Yes, I am in favor of making them more precise. Right now these paragraphs
describe PostgreSQL and bash behavior for users of PostgreSQL, and maybe they
are good for that. But for a script or application programmer, the
documentation must describe not only the behavior of PostgreSQL but also the
precise program interface. For instance, the aws cli utility, which I use for
my archive and restore commands, sometimes returns code 255, for example when a
network fault prevents it from connecting to S3 (object get command). I was
surprised that PostgreSQL suddenly stopped in that case; there was nothing in
the documentation about this. So explicitly describing the behavior of
PostgreSQL in terms of script return codes would be useful for script
programmers.
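
To illustrate the kind of workaround a script author ends up writing, here is a
minimal sketch in Python. It assumes the threshold reported elsewhere in this
thread (a restore command status of 125 or lower is retried, 126 or higher
stops recovery); the names MAX_RETRYABLE, GENERIC_FAILURE, clamp_exit_status
and run_clamped are made up for this example and are not part of PostgreSQL or
of the aws cli.

```python
import subprocess
import sys

# Assumption from the report in this thread: statuses up to 125 are retried,
# 126 and above make recovery abort.
MAX_RETRYABLE = 125
GENERIC_FAILURE = 1


def clamp_exit_status(returncode):
    """Map a child's return code into the retryable range.

    subprocess reports death by signal N as -N; that case is translated to
    the shell convention 128 + N and passed through on purpose, so a real
    signal death is still visible to the caller. Any ordinary exit status
    above MAX_RETRYABLE (for example 255 after a network fault) becomes
    GENERIC_FAILURE.
    """
    if returncode < 0:
        return 128 - returncode
    if returncode > MAX_RETRYABLE:
        return GENERIC_FAILURE
    return returncode


def run_clamped(command):
    """Run command (a list of arguments) and return the clamped status."""
    completed = subprocess.run(command, check=False)
    return clamp_exit_status(completed.returncode)
```

A wrapper script built on this sketch could end with
sys.exit(run_clamped(sys.argv[1:])) and be given the real copy command as its
arguments. Whether a status such as 255 should really be treated as retryable
is a decision for the script author; the sketch only shows the mechanism.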


Re: Return codes for archive and restore commands

2018-11-29 Thread Oleg Bartunov
On Thu, Nov 29, 2018 at 5:40 AM Stephen Frost  wrote:
>
> Greetings,
>
> * Michael Paquier (mich...@paquier.xyz) wrote:
> > On Wed, Nov 28, 2018 at 11:00:31AM +, PG Doc comments form wrote:
> > > For the archive command:
> > > <=128 There are no errors in the PostgreSQL log (messages with severity
> > > equal to or higher than ERROR). First there are 3 messages of type LOG
> > > about the failure, then a WARNING about it and a pause for 1 minute, then
> > > this repeats.
> > > >=129 FATAL error in the PostgreSQL log. The message is about stopping the
> > > archiver process, but not the database. Repeated after roughly 16 seconds.
> >
> > This code is around for some time, and comes from this commit:
> > commit: 3ad0728c817bf8abd2c76bd11d856967509b307c
> > author: Tom Lane 
> > date: Tue, 21 Nov 2006 20:59:53 +
> > committer: Tom Lane 
> > date: Tue, 21 Nov 2006 20:59:53 +
> > > On systems that have setsid(2) (which should be just about everything except
> > > Windows), arrange for each postmaster child process to be its own process
> > > group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
> > > process group not only the direct child process.  This provides saner
> > > behavior for archive and recovery scripts; in particular, it's possible to
> > > shut down a warm-standby recovery server using "pg_ctl stop -m immediate",
> > > since delivery of SIGQUIT to the startup subprocess will result in killing
> > > the waiting recovery_command.  Also, this makes Query Cancel and
> > > statement_timeout apply to scripts being run from backends via system().
> > > (There is no support in the core backend for that, but it's widely done
> > > using untrusted PLs.)  Per gripe from Stephen Harris and subsequent
> > > discussion.
> >
> > > The relevant part is pgarch_archiveXlog() in pgarch.c, and this comment
> > > is most relevant:
> > * Per the Single Unix Spec, shells report exit status > 128 when a
> > * called command died on a signal.
> >
> > > > In this case PostgreSQL tries to conform to the rules for return codes of
> > > > a Unix shell. A Unix shell returns 126 in the case of "command not
> > > > executable", 127 in the case of "command not found", and 128 + the signal
> > > > number if the application was interrupted by an uncaught signal.
> >
> > If you were to rewrite those paragraphs or make them more precise, how
> > would you actually shape your suggestions?  I personally quite like the
> > current formulations, but I am rather used to it to be honest.
>
> This is another example, at least imv, of why we really need to move
> away from archive_command as an interface for doing WAL archiving.

+1

>
> Having discussed this quite a bit lately with David Steele and Magnus,
> it's pretty clear that we need to completely rip out how this works
> today and rewrite it based around an extension model where a background
> worker can start up and essentially take the place of the archiver
> process, with flexibility to jump forward through the WAL stream,
> communicate clearly with other processes, handle failure to do so
> gracefully based on the specific cases, etc.
>
> We could then possibly write an extension to be included that mimics
> what archive_command does today, but imv we should immediately consider
> it deprecated and encourage people to move off of it.
>
> Thanks!
>
> Stephen



-- 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Return codes for archive and restore commands

2018-11-28 Thread Michael Paquier
On Wed, Nov 28, 2018 at 10:27:31PM -0500, Stephen Frost wrote:
> Yes, it couldn't be exactly the same as a generic background worker,
> that's a good point.  We definitely need to make sure that the
> postmaster waits for the archiver to shut down, as it does for the WAL
> senders.

Just to be clear, please note that I don't think removing the archiver
code from the core code is a bad idea, quite the contrary actually.  But
I doubt that it would be acceptable to rip out this code without
something which has the same properties and guarantees for any users
depending on it.  And archive_command is used a lot.
--
Michael


Re: Return codes for archive and restore commands

2018-11-28 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Wed, Nov 28, 2018 at 09:39:58PM -0500, Stephen Frost wrote:
> > Having discussed this quite a bit lately with David Steele and Magnus,
> > it's pretty clear that we need to completely rip out how this works
> > today and rewrite it based around an extension model where a background
> > worker can start up and essentially take the place of the archiver
> > process, with flexibility to jump forward through the WAL stream,
> > communicate clearly with other processes, handle failure to do so
> > gracefully based on the specific cases, etc.
> 
> Hm.  When an instance state is in PM_SHUTDOWN_2, the postmaster
> explicitly waits for the WAL senders and the archiver to shut down.  So
> I think that you would need more control regarding the timing a bgworker
> should be shut down first to be completely correct.

Yes, it couldn't be exactly the same as a generic background worker,
that's a good point.  We definitely need to make sure that the
postmaster waits for the archiver to shut down, as it does for the WAL
senders.

Thanks!

Stephen


Re: Return codes for archive and restore commands

2018-11-28 Thread Michael Paquier
On Wed, Nov 28, 2018 at 09:39:58PM -0500, Stephen Frost wrote:
> Having discussed this quite a bit lately with David Steele and Magnus,
> it's pretty clear that we need to completely rip out how this works
> today and rewrite it based around an extension model where a background
> worker can start up and essentially take the place of the archiver
> process, with flexibility to jump forward through the WAL stream,
> communicate clearly with other processes, handle failure to do so
> gracefully based on the specific cases, etc.

Hm.  When an instance state is in PM_SHUTDOWN_2, the postmaster
explicitly waits for the WAL senders and the archiver to shut down.  So
I think that you would need more control regarding the timing a bgworker
should be shut down first to be completely correct.
--
Michael


Re: Return codes for archive and restore commands

2018-11-28 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Wed, Nov 28, 2018 at 11:00:31AM +, PG Doc comments form wrote:
> > For the archive command:
> > <=128 There are no errors in the PostgreSQL log (messages with severity
> > equal to or higher than ERROR). First there are 3 messages of type LOG
> > about the failure, then a WARNING about it and a pause for 1 minute, then
> > this repeats.
> > >=129 FATAL error in the PostgreSQL log. The message is about stopping the
> > archiver process, but not the database. Repeated after roughly 16 seconds.
> 
> This code is around for some time, and comes from this commit:
> commit: 3ad0728c817bf8abd2c76bd11d856967509b307c
> author: Tom Lane 
> date: Tue, 21 Nov 2006 20:59:53 +
> committer: Tom Lane 
> date: Tue, 21 Nov 2006 20:59:53 +
> On systems that have setsid(2) (which should be just about everything except
> Windows), arrange for each postmaster child process to be its own process
> group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
> process group not only the direct child process.  This provides saner behavior
> for archive and recovery scripts; in particular, it's possible to shut down a
> warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
> of SIGQUIT to the startup subprocess will result in killing the waiting
> recovery_command.  Also, this makes Query Cancel and statement_timeout apply
> to scripts being run from backends via system().  (There is no support in the
> core backend for that, but it's widely done using untrusted PLs.)  Per gripe
> from Stephen Harris and subsequent discussion.
> 
> The relevant part is pgarch_archiveXlog() in pgarch.c, and this comment
> is most relevant:
> * Per the Single Unix Spec, shells report exit status > 128 when a
> * called command died on a signal.
> 
> > In this case PostgreSQL tries to conform to the rules for return codes of
> > a Unix shell. A Unix shell returns 126 in the case of "command not
> > executable", 127 in the case of "command not found", and 128 + the signal
> > number if the application was interrupted by an uncaught signal.
> 
> If you were to rewrite those paragraphs or make them more precise, how
> would you actually shape your suggestions?  I personally quite like the
> current formulations, but I am rather used to it to be honest.

This is another example, at least imv, of why we really need to move
away from archive_command as an interface for doing WAL archiving.

Having discussed this quite a bit lately with David Steele and Magnus,
it's pretty clear that we need to completely rip out how this works
today and rewrite it based around an extension model where a background
worker can start up and essentially take the place of the archiver
process, with flexibility to jump forward through the WAL stream,
communicate clearly with other processes, handle failure to do so
gracefully based on the specific cases, etc.

We could then possibly write an extension to be included that mimics
what archive_command does today, but imv we should immediately consider
it deprecated and encourage people to move off of it.

Thanks!

Stephen


Return codes for archive and restore commands

2018-11-28 Thread PG Doc comments form
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/11/archive-recovery-settings.html
Description:

For instance, for the restore command the documentation says:

It is important for the command to return a zero exit status only if it
succeeds. The command will be asked for file names that are not present in
the archive; it must return nonzero when so asked. Examples:
...
An exception is that if the command was terminated by a signal (other than
SIGTERM, which is used as part of a database server shutdown) or an error by
the shell (such as command not found), then recovery will abort and the
server will not start up.
(end of quote)

This is not correct. I think that how the behavior of PostgreSQL depends on
the return codes of the restore and archive commands must be explained more
exactly; this is important for those who write scripts and applications for
these commands. For instance, if the AWS command line interface (awscli) is
used as the restore command, aws returns code 255 on some commands (for
instance in case of a network fault), and this leads to an unexpected result
with PostgreSQL.

For the archive command:
<=128 There are no errors in the PostgreSQL log (messages with severity
equal to or higher than ERROR). First there are 3 messages of type LOG about
the failure, then a WARNING about it and a pause for 1 minute, then this
repeats.
>=129 FATAL error in the PostgreSQL log. The message is about stopping the
archiver process, but not the database. Repeated after roughly 16 seconds.

For the restore command:
<=125 There are no errors in the PostgreSQL log; the command is repeated after
several seconds. Good to return for a network failure or in case of an absent
file.
>=126 FATAL error in the PostgreSQL log; the startup process stops and the
database shuts down. Good for a fatal error, for instance a misconfiguration.
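
The two tables above can be written down as a small lookup. The following
Python sketch only restates the thresholds observed in this report for
PostgreSQL 11; they are observations, not values quoted from the
documentation, and the names ARCHIVE_FATAL_FROM, RESTORE_FATAL_FROM and
classify_exit_status are invented for the example.

```python
# Lowest exit status that was observed to produce a FATAL message.
ARCHIVE_FATAL_FROM = 129
RESTORE_FATAL_FROM = 126


def classify_exit_status(status, command):
    """Return "success", "retry" or "fatal" for an exit status in 0..255.

    command is "archive" or "restore". "retry" means the command is simply
    run again later; "fatal" means a FATAL message is logged (archiver
    restart for archive, recovery abort for restore, as reported above).
    """
    if not 0 <= status <= 255:
        raise ValueError("exit status must be in 0..255")
    if command == "archive":
        fatal_from = ARCHIVE_FATAL_FROM
    elif command == "restore":
        fatal_from = RESTORE_FATAL_FROM
    else:
        raise ValueError("command must be 'archive' or 'restore'")
    if status == 0:
        return "success"
    return "fatal" if status >= fatal_from else "retry"
```

With this table, the status 255 returned by aws on a network fault is
classified as "fatal" for both commands, which is the surprising result
described above.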

In this case PostgreSQL tries to conform to the rules for return codes of a
Unix shell. A Unix shell returns 126 in the case of "command not executable",
127 in the case of "command not found", and 128 + the signal number if the
application was interrupted by an uncaught signal.
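
As a rough illustration of that shell convention, the following Python sketch
decodes an exit status the way the rules above describe. The function name
describe_shell_status is made up for this example. Note that the convention
cannot tell a program that itself exits with a status above 128 (such as 255)
apart from one that was killed by a signal, which is exactly the ambiguity
described in this report.

```python
import signal


def describe_shell_status(status):
    """Describe an exit status according to the Unix shell convention."""
    if status == 0:
        return "success"
    if status == 126:
        return "command not executable"
    if status == 127:
        return "command not found"
    if status > 128:
        number = status - 128
        try:
            # Look up the symbolic name, e.g. 15 -> SIGTERM.
            name = signal.Signals(number).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "signal %d" % number
        return "terminated by " + name
    return "command failed with status %d" % status
```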