We should stop telling users to "vacuum that database in single-user mode"
It looks like we are unnecessarily instructing our users to vacuum their
databases in single-user mode when just vacuuming would be enough. We
should fix the error message to be less misleading.

== The story

I think most of us have at some time seen the following message, if not
in our own database, then at some client's:

ERROR: database is not accepting commands to avoid wraparound data loss in database ""
HINT: Stop the postmaster and vacuum that database in single-user mode. You might also need to commit or roll back old prepared transactions.

But "vacuum that database in single-user mode" is the very last thing
one wants to do, because:

* it is single-user mode, so nothing else works
* CHECKPOINTs are not running, so WAL segments cannot be rotated and
  reused
* replication does not work, so after the vacuum is done and the
  database is started in normal mode, there is a huge backlog to
  replicate
* pg_stat_progress_vacuum is not available, so you have no idea when
  the command is going to complete
* VACUUM VERBOSE isn't verbose - there is absolutely no output from a
  single-user-mode vacuum, with or without VERBOSE, so you *really* do
  not know what is going on and how much progress is being made (if you
  are lucky you can guess something from I/O and CPU monitoring, but
  that is inexact at best)

When I started looking at improving the situation, I discovered that
there is already no need to run VACUUM in single-user mode in any
currently supported PostgreSQL version, as you can run VACUUM perfectly
well while the wraparound protection is active. It worked in all PG
versions from v9.6 to v13. I also tested v8.3, as that is where we
added virtual transactions, but there VACUUM really does fail to run
successfully without single-user mode.

So my proposal is to change the error message [*] to something that
does not suggest single-user mode as a requirement for running VACUUM.
Also, COMMIT PREPARED still works fine in this situation.
Single-user mode may still be needed in case one has to drop a
replication slot or something similar.

[*] The message is in src/backend/access/transam/varsup.c, around line
120.

=== How to test

The following instructions let you run into wraparound in about an
hour, depending on your setup (it was 1.2 hours on my laptop).

First, set some flags. To allow PREPARE TRANSACTION to block VACUUM
cleanup:

```
alter system set max_prepared_transactions = 10;
```

Also set *_min_messages to error, unless you want 10M of WARNINGs
(~4GB) in the logs and the same amount sent to the client, slowing down
the last 10M transactions significantly:

```
alter system set log_min_messages = error;
alter system set client_min_messages = error;
```

Restart the server to activate the settings.

Block VACUUM from cleaning up transactions: create a database
`wraptest` and connect to it, then

```
create table t (i int);
BEGIN;
insert into t values(1);
PREPARE TRANSACTION 'trx_id_pin';
```

Now you have a prepared transaction, which makes sure that even a
well-tuned autovacuum cannot prevent running into the wraparound
protection.
```
[local]:5096 hannu@wraptest=# SELECT * FROM pg_prepared_xacts;
 transaction |    gid     |           prepared            | owner | database
-------------+------------+-------------------------------+-------+----------
         593 | trx_id_pin | 2021-03-01 08:57:27.024777+01 | hannu | wraptest
(1 row)
```

Create a function to consume transaction ids as fast as possible:

```
CREATE OR REPLACE FUNCTION trx_eater(n int) RETURNS void
LANGUAGE plpgsql AS $plpgsql$
BEGIN
  FOR i IN 0..n LOOP
    BEGIN
      INSERT INTO t values(i);
    EXCEPTION WHEN OTHERS THEN
      RAISE;  -- raise it again, so that we actually err out on wraparound
    END;
  END LOOP;
END;
$plpgsql$;
```

Use pgbench to drive this function. Make a pgbench command file:

$ echo 'select trx_eater(10);' > trx_eater.pgbench

and start pgbench to run this function in a few backends in parallel:

$ pgbench -c 16 -T 2 -P 60 -n wraptest -f trx_eater.pgbench

Wait 1-2 hours. In about an hour or two this should error out with

ERROR: database is not accepting commands to avoid wraparound data loss in database "postgres"
HINT: Stop the postmaster and vacuum that database in single-user mode. You might also need to commit or roll back old prepared transactions.

After this, just do

COMMIT PREPARED 'trx_id_pin';

to release the blocking 2PC transaction, and you can verify for
yourself that
* you can run VACUUM on any table, and
* autovacuum is working and will eventually clear up the situation.

If you have not tuned autovacuum_vacuum_cost_* at all - especially in
earlier versions, where the cost delay defaults to 20ms - the
autovacuum-started vacuum runs really slowly, and it will take about 8
hours to clean up the table. This can be sped up if you set
autovacuum_vacuum_cost_delay=0 and then either restart the database or
just kill the running autovacuum worker.
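While pgbench is eating through transaction ids, you can watch how
close each database is getting to the wraparound protection with a
standard catalog query (this monitoring query is my addition, not part
of the original test recipe; the exact stop limit kicks in a few
million XIDs before the 2^31 wraparound point):

```sql
-- Age of the oldest unfrozen XID per database.  The hard stop
-- (xidStopLimit) triggers roughly 3 million XIDs before the ~2.1
-- billion wraparound point, so treat this as an approximation.
SELECT datname,
       age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```

Run it every few minutes; the `wraptest` row's xid_age should climb
steadily while the prepared transaction holds back freezing.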
Re: We should stop telling users to "vacuum that database in single-user mode"
On Mon, Mar 01, 2021 at 04:32:23PM +0100, Hannu Krosing wrote:
> It looks like we are unnecessarily instructing our users to vacuum
> their databases in single-user mode when just vacuuming would be
> enough.

> When I started looking at improving the situation I discovered, that
> there already is no need to run VACUUM in single user mode in any
> currently supported PostgreSQL version as you can run VACUUM
> perfectly well when the wraparound protection is active.

A comment in SetTransactionIdLimit() says, "VACUUM requires an XID if
it truncates at wal_level!=minimal." Hence, I think plain VACUUM will
fail some of the time; VACUUM (TRUNCATE false) should be reliable.

In general, I like your goal of replacing painful error message advice
with less-painful advice. At the same time, it's valuable for the
advice to reliably get the user out of the bad situation.
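[For reference: the TRUNCATE option Noah mentions is a parenthesized
VACUUM option available since PostgreSQL 12; the table name below is a
placeholder.]

```sql
-- Skip the end-of-table truncation step, which is the part of VACUUM
-- that can require an XID at wal_level != minimal.
VACUUM (TRUNCATE false, VERBOSE) some_table;
```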
Re: We should stop telling users to "vacuum that database in single-user mode"
On Tue, 2 Mar 2021 at 04:32, Hannu Krosing wrote:
>
> It looks like we are unnecessarily instructing our users to vacuum
> their databases in single-user mode when just vacuuming would be
> enough.
>
> We should fix the error message to be less misleading.

It would be good to change the message as it's pretty outdated. Back
in 8ad3965a1 (2005), when the message was added, SELECT and VACUUM
would have called GetNewTransactionId(). That's no longer the case;
we only do that when we actually need an XID.

However, I wonder if it's worth going a few steps further to try and
reduce the chances of that message being seen in the first place.
Maybe it's worth considering ditching any (auto)vacuum cost limits for
any table which is within X transactions of wrapping around. Likewise
for "VACUUM;" when the database's datfrozenxid is getting dangerously
high.

Such "emergency" vacuums could be noted in the autovacuum log, and a
NOTICE or WARNING sent to the user during manual VACUUMs. Maybe the
value of X could be xidStopLimit minus a hundred million or so.

I have seen it happen that an instance had vacuum_cost_limit set and
someone did start the database in single-user mode, per the advice of
the error message, only to find that the VACUUM took a very long time
due to the restrictive cost limit. I struggle to imagine why anyone
wouldn't want the vacuum to run as quickly as possible in that
situation.

(Ideally, the speed of autovacuum would be expressed as a percentage
of time spent working vs sleeping rather than an absolute speed
limit... that way, faster servers would get faster vacuums, assuming
the same settings. Vacuums may also get more work done per unit of
time during off-peak hours, which seems like it might be a thing that
people might want.)

David
Re: We should stop telling users to "vacuum that database in single-user mode"
On Tue, Mar 2, 2021 at 7:52 AM David Rowley wrote:
>
> On Tue, 2 Mar 2021 at 04:32, Hannu Krosing wrote:
> >
> > It looks like we are unnecessarily instructing our users to vacuum
> > their databases in single-user mode when just vacuuming would be
> > enough.
> >
> > We should fix the error message to be less misleading.
>
> It would be good to change the message as it's pretty outdated. Back
> in 8ad3965a1 (2005), when the message was added, SELECT and VACUUM
> would have called GetNewTransactionId(). That's no longer the case;
> we only do that when we actually need an XID.
>
> However, I wonder if it's worth going a few steps further to try and
> reduce the chances of that message being seen in the first place.
> Maybe it's worth considering ditching any (auto)vacuum cost limits
> for any table which is within X transactions of wrapping around.
> Likewise for "VACUUM;" when the database's datfrozenxid is getting
> dangerously high.
>
> Such "emergency" vacuums could be noted in the autovacuum log, and a
> NOTICE or WARNING sent to the user during manual VACUUMs. Maybe the
> value of X could be xidStopLimit minus a hundred million or so.
>
> I have seen it happen that an instance had vacuum_cost_limit set and
> someone did start the database in single-user mode, per the advice
> of the error message, only to find that the VACUUM took a very long
> time due to the restrictive cost limit. I struggle to imagine why
> anyone wouldn't want the vacuum to run as quickly as possible in
> that situation.

Multiple instances running on the same hardware and only one of them
being in trouble?

But it would probably be worthwhile throwing a WARNING if vacuum is
run with cost delay enabled in single-user mode -- so that the user is
at least aware of the choice (and can cancel and try again). Maybe
even a warning directly when starting up a single-user session, to let
them know?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Re: We should stop telling users to "vacuum that database in single-user mode"
On Wed, 3 Mar 2021 at 01:12, Magnus Hagander wrote:
>
> On Tue, Mar 2, 2021 at 7:52 AM David Rowley wrote:
> > I have seen it happen that an instance had vacuum_cost_limit set
> > and someone did start the database in single-user mode, per the
> > advice of the error message, only to find that the VACUUM took a
> > very long time due to the restrictive cost limit. I struggle to
> > imagine why anyone wouldn't want the vacuum to run as quickly as
> > possible in that situation.
>
> Multiple instances running on the same hardware and only one of them
> being in trouble?

You might be right. I'm not saying it's a great idea, but I thought it
was worth considering.

We could turn to POLA and ask: what would you be more surprised at?
1) Your database suddenly using more I/O than it had been previously,
or
2) Your database no longer accepting DML.

David
Re: We should stop telling users to "vacuum that database in single-user mode"
On Tue, Mar 2, 2021 at 10:07 PM David Rowley wrote:
>
> On Wed, 3 Mar 2021 at 01:12, Magnus Hagander wrote:
> >
> > On Tue, Mar 2, 2021 at 7:52 AM David Rowley wrote:
> > > I have seen it happen that an instance had vacuum_cost_limit set
> > > and someone did start the database in single-user mode, per the
> > > advice of the error message, only to find that the VACUUM took a
> > > very long time due to the restrictive cost limit. I struggle to
> > > imagine why anyone wouldn't want the vacuum to run as quickly as
> > > possible in that situation.
> >
> > Multiple instances running on the same hardware and only one of
> > them being in trouble?
>
> You might be right. I'm not saying it's a great idea, but I thought
> it was worth considering.
>
> We could turn to POLA and ask: what would you be more surprised at?
> 1) Your database suddenly using more I/O than it had been
> previously, or
> 2) Your database no longer accepting DML.

I think we misunderstand each other. I meant this only as a comment
about the idea of ignoring the cost limit in single-user mode -- that
is, it's a reason to *want* vacuum to not run as quickly as possible
in single-user mode. I should've trimmed the email better.

I agree with your other idea, that of kicking in a more aggressive
autovacuum if it's not dealing with things fast enough. Maybe even in
an incremental way - that is, run with the default limits, then at
another threshold drop them to half, and at yet another threshold drop
them to 0. I agree that pretty much anything is better than forcing
the user into single-user mode.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Re: We should stop telling users to "vacuum that database in single-user mode"
On Wed, 3 Mar 2021 at 21:44, Magnus Hagander wrote:
>
> On Tue, Mar 2, 2021 at 10:07 PM David Rowley wrote:
> >
> > On Wed, 3 Mar 2021 at 01:12, Magnus Hagander wrote:
> > >
> > > On Tue, Mar 2, 2021 at 7:52 AM David Rowley wrote:
> > > > I have seen it happen that an instance had vacuum_cost_limit
> > > > set and someone did start the database in single-user mode,
> > > > per the advice of the error message, only to find that the
> > > > VACUUM took a very long time due to the restrictive cost
> > > > limit. I struggle to imagine why anyone wouldn't want the
> > > > vacuum to run as quickly as possible in that situation.
> > >
> > > Multiple instances running on the same hardware and only one of
> > > them being in trouble?
> >
> > You might be right. I'm not saying it's a great idea, but I
> > thought it was worth considering.
> >
> > We could turn to POLA and ask: what would you be more surprised
> > at? 1) Your database suddenly using more I/O than it had been
> > previously, or 2) Your database no longer accepting DML.
>
> I think we misunderstand each other. I meant this only as a comment
> about the idea of ignoring the cost limit in single-user mode --
> that is, it's a reason to *want* vacuum to not run as quickly as
> possible in single-user mode. I should've trimmed the email better.

I meant to ignore the cost limits if we're within a hundred million or
so of the stopLimit. Per what Hannu mentioned, there does not seem to
be a great need with current versions of PostgreSQL to restart the
instance in single-user mode. VACUUM still works once we're beyond the
stopLimit. It's just commands that need to generate a new XID that'll
fail with the error message mentioned by Hannu.

> I agree with your other idea, that of kicking in a more aggressive
> autovacuum if it's not dealing with things fast enough. Maybe even
> in an incremental way - that is, run with the default limits, then
> at another threshold drop them to half, and at yet another threshold
> drop them to 0. I agree that pretty much anything is better than
> forcing the user into single-user mode.

OK cool. I wondered if it should be reduced incrementally or just
switch off the cost limit completely once we're beyond
ShmemVariableCache->xidStopLimit.

If we did want it to be incremental then, if we had say
ShmemVariableCache->xidFastVacLimit, which was about 100 million XIDs
before xidStopLimit, the code could adjust the sleep delay down by the
percentage of the way we are from xidFastVacLimit to xidStopLimit.

However, if we want to keep adjusting the sleep delay, then we need to
make that work for vacuums that are already running. We don't want to
call ReadNextTransactionId() too often, but maybe we could do it once
per 10 seconds' worth of vacuum_delay_point()s. That way we'd never do
it for vacuums already going at full speed.

David
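[The linear scaling David sketches above can be written down concretely.
The following is a toy model, not backend code: xidFastVacLimit is a
hypothetical field that does not exist in PostgreSQL today, and real XID
comparisons are modular, which this sketch deliberately ignores.]

```python
def scaled_vacuum_delay(next_xid: int, fast_vac_limit: int,
                        stop_limit: int, base_delay_ms: float) -> float:
    """Scale the vacuum cost delay linearly from base_delay_ms down to 0
    as next_xid moves from fast_vac_limit toward stop_limit.

    fast_vac_limit stands in for the hypothetical
    ShmemVariableCache->xidFastVacLimit (~100M XIDs before the stop
    limit); stop_limit for ShmemVariableCache->xidStopLimit.
    """
    if next_xid <= fast_vac_limit:
        return base_delay_ms        # not in the danger zone: normal throttling
    if next_xid >= stop_limit:
        return 0.0                  # at the stop limit: vacuum at full speed
    # Fraction of the way through the danger zone, in [0, 1).
    frac = (next_xid - fast_vac_limit) / (stop_limit - fast_vac_limit)
    return base_delay_ms * (1.0 - frac)
```

With a 20ms base delay and a 100M-XID danger zone, halfway through the
zone the delay would be 10ms, reaching 0 exactly at the stop limit --
the "adjust down by the percentage" behaviour described above.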
Re: We should stop telling users to "vacuum that database in single-user mode"
On Wed, Mar 3, 2021 at 11:33 AM David Rowley wrote:
>
> On Wed, 3 Mar 2021 at 21:44, Magnus Hagander wrote:
...
> > I think we misunderstand each other. I meant this only as a
> > comment about the idea of ignoring the cost limit in single-user
> > mode -- that is, it's a reason to *want* vacuum to not run as
> > quickly as possible in single-user mode. I should've trimmed the
> > email better.
>
> I meant to ignore the cost limits if we're within a hundred million
> or so of the stopLimit. Per what Hannu mentioned, there does not
> seem to be a great need with current versions of PostgreSQL to
> restart the instance in single-user mode. VACUUM still works once
> we're beyond the stopLimit. It's just commands that need to generate
> a new XID that'll fail with the error message mentioned by Hannu.

I am investigating the possibility of introducing a special
"Restricted Maintenance Mode" to let the admin mitigate after
xidStopLimit, maybe for another 0.5M txids, by doing things like:

* dropping an index - to make vacuum faster
* dropping a table - sometimes it is better to drop a table in order
  to get the production database functional again instead of waiting
  hours for the vacuum to finish, and then later restore it from
  backup, or maybe access it from a read-only clone of the database
  via FDW
* dropping a stale replication slot which is holding back vacuum

To make sure that this will not accidentally just move xidStopLimit by
0.5M for users who run their main workloads as a superuser (they do
exist!), this mode should:

* be restricted to superusers only
* allow only a subset of commands / functions
* be heavily throttled to avoid running out of TXIDs, maybe 1-10 XIDs
  per second
* maybe also require setting a GUC, to be very explicit

> > I agree with your other idea, that of kicking in a more aggressive
> > autovacuum if it's not dealing with things fast enough. Maybe even
> > in an incremental way - that is, run with the default limits, then
> > at another threshold drop them to half, and at yet another
> > threshold drop them to 0. I agree that pretty much anything is
> > better than forcing the user into single-user mode.
>
> OK cool. I wondered if it should be reduced incrementally or just
> switch off the cost limit completely once we're beyond
> ShmemVariableCache->xidStopLimit.

An abrupt change is something that is more likely to make the user/DBA
notice that something is going on. I have even been thinking about
deliberate throttling to make the user notice / pay attention.

> If we did want it to be incremental then, if we had say
> ShmemVariableCache->xidFastVacLimit, which was about 100 million
> XIDs before xidStopLimit, the code could adjust the sleep delay down
> by the percentage of the way we are from xidFastVacLimit to
> xidStopLimit.
>
> However, if we want to keep adjusting the sleep delay, then we need
> to make that work for vacuums that are already running. We don't
> want to call ReadNextTransactionId() too often, but maybe we could
> do it once per 10 seconds' worth of vacuum_delay_point()s. That way
> we'd never do it for vacuums already going at full speed.

There are already examples of this in the code; for instance, the
decision to force-start a disabled autovacuum is reconsidered every
64k transactions.

There is a related item in https://commitfest.postgresql.org/32/2983/ .
When that gets done, we could drive the adjustments from autovacuum.c
by adding the remaining-XID-range adjustment to the existing worker
delay adjustment mechanism in autovac_balance_cost() and signalling
the autovacuum backend to run the adjustment every few seconds once we
are in the danger zone.

Cheers
Hannu
Re: We should stop telling users to "vacuum that database in single-user mode"
On Wed, Mar 3, 2021 at 7:10 AM Hannu Krosing wrote:
> On Wed, Mar 3, 2021 at 11:33 AM David Rowley wrote:
> > On Wed, 3 Mar 2021 at 21:44, Magnus Hagander wrote:
> >
> > I meant to ignore the cost limits if we're within a hundred
> > million or so of the stopLimit. Per what Hannu mentioned, there
> > does not seem to be a great need with current versions of
> > PostgreSQL to restart the instance in single-user mode. VACUUM
> > still works once we're beyond the stopLimit. It's just commands
> > that need to generate a new XID that'll fail with the error
> > message mentioned by Hannu.
>
> I am investigating the possibility of introducing a special
> "Restricted Maintenance Mode" to let the admin mitigate after
> xidStopLimit, maybe for another 0.5M txids, by doing things like:
>
> * dropping an index - to make vacuum faster
> * dropping a table - sometimes it is better to drop a table in order
>   to get the production database functional again instead of waiting
>   hours for the vacuum to finish, and then later restore it from
>   backup, or maybe access it from a read-only clone of the database
>   via FDW
> * dropping a stale replication slot which is holding back vacuum

I've talked with a few people about modifying wraparound and XID
emergency vacuums to be more efficient, i.e. running them without
indexes, and possibly some other options. That seems like low-hanging
fruit if it's not already a thing.

> To make sure that this will not accidentally just move xidStopLimit
> by 0.5M for users who run their main workloads as a superuser (they
> do exist!), this mode should:
> * be restricted to superusers only
> * allow only a subset of commands / functions
> * be heavily throttled to avoid running out of TXIDs, maybe 1-10
>   XIDs per second
> * maybe also require setting a GUC, to be very explicit
>
> > > I agree with your other idea, that of kicking in a more
> > > aggressive autovacuum if it's not dealing with things fast
> > > enough. Maybe even in an incremental way - that is, run with the
> > > default limits, then at another threshold drop them to half, and
> > > at yet another threshold drop them to 0. I agree that pretty
> > > much anything is better than forcing the user into single-user
> > > mode.
> >
> > OK cool. I wondered if it should be reduced incrementally or just
> > switch off the cost limit completely once we're beyond
> > ShmemVariableCache->xidStopLimit.
>
> An abrupt change is something that is more likely to make the
> user/DBA notice that something is going on. I have even been
> thinking about deliberate throttling to make the user notice / pay
> attention.

I worry that we're walking down the path of trying to find "clever"
solutions in a situation where the variety of production environments
(and therefore the right way to handle this issue) is nearly endless.
That said... I think at the point we're talking about, subtlety is not
an absolute requirement... if people were paying attention they'd have
noticed the anti-wraparound autovacuum running or the warnings in the
logs; at some point you do need to be a bit in-your-face that there is
a real possibility of disaster around the corner.

> > If we did want it to be incremental then, if we had say
> > ShmemVariableCache->xidFastVacLimit, which was about 100 million
> > XIDs before xidStopLimit, the code could adjust the sleep delay
> > down by the percentage of the way we are from xidFastVacLimit to
> > xidStopLimit.
> >
> > However, if we want to keep adjusting the sleep delay, then we
> > need to make that work for vacuums that are already running. We
> > don't want to call ReadNextTransactionId() too often, but maybe we
> > could do it once per 10 seconds' worth of vacuum_delay_point()s.
> > That way we'd never do it for vacuums already going at full speed.
>
> There are already examples of this in the code; for instance, the
> decision to force-start a disabled autovacuum is reconsidered every
> 64k transactions.
>
> There is a related item in https://commitfest.postgresql.org/32/2983/ .
> When that gets done, we could drive the adjustments from autovacuum.c
> by adding the remaining-XID-range adjustment to the existing worker
> delay adjustment mechanism in autovac_balance_cost() and signalling
> the autovacuum backend to run the adjustment every few seconds once
> we are in the danger zone.

That patch certainly looks interesting; many, many times I've had to
have people kick off manual vacuums to use more I/O and kill the
wraparound vacuum. Reading the discussion there, I wonder if we should
think about weighting the most urgent vacuum at the expense of other
potential autovacuums, although I feel like they often come in bunches
in these scenarios.

Robert Treat
https://xzilla.net
Re: We should stop telling users to "vacuum that database in single-user mode"
On 2021-Mar-02, David Rowley wrote:

> However, I wonder if it's worth going a few steps further to try and
> reduce the chances of that message being seen in the first place.
> Maybe it's worth considering ditching any (auto)vacuum cost limits
> for any table which is within X transactions of wrapping around.
> Likewise for "VACUUM;" when the database's datfrozenxid is getting
> dangerously high.

Yeah, I like this kind of approach, and I think one change we can make
that would have a very large effect is to disable index cleanup when
in emergency situations. That way, the XID limit is advanced as much
as possible with as little effort as possible; once the system is back
in normal conditions, indexes can be cleaned up at a leisurely pace.

--
Álvaro Herrera        Valdivia, Chile
"El destino baraja y nosotros jugamos" ("Destiny shuffles the cards
and we play") (A. Schopenhauer)
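[For reference: the index-skipping behaviour Álvaro describes is
already exposed to manual VACUUM as the INDEX_CLEANUP option, added in
PostgreSQL 12; the table name below is a placeholder.]

```sql
-- Advance relfrozenxid as cheaply as possible: freeze tuples but skip
-- index vacuuming.  Dead-tuple cleanup in indexes is deferred to a
-- later, normal vacuum run at a leisurely pace.
VACUUM (INDEX_CLEANUP off, FREEZE) some_table;
```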