Re: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

Jeff Janes Fri, 11 Dec 2015 10:10:58 -0800

On Thu, Dec 10, 2015 at 1:55 PM, Andres Freund <[email protected]> wrote:
> Hi,
>
> I recently started a pgbench benchmark (to evaluate a piece of hardware,
> not postgres) with master. Unfortunately, by accident, I started
> postgres in a shell, not screen like pgbench.
>
> Just logged back in and saw:
> client 71 aborted in state 8: ERROR:  database is not accepting commands to 
> avoid wraparound data loss in database "postgres"
> HINT:  Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions.
> transaction type: TPC-B (sort of)
> scaling factor: 300
> query mode: prepared
> number of clients: 97
> number of threads: 97
> duration: 300000 s
> number of transactions actually processed: 2566862424
> latency average: 3.214 ms
> latency stddev: 7.336 ms
> tps = 30169.374133 (including connections establishing)
> tps = 30169.378406 (excluding connections establishing)
>
> Hm. Bad news. We apparently didn't keep up vacuuming. But worse news is
> that even now, days later, autovacuum hasn't progressed:
> postgres=# select txid_current();
> ERROR:  database is not accepting commands to avoid wraparound data loss in 
> database "postgres"
> HINT:  Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions.


This is still in regular mode, correct?

I don't think this has ever worked.  Vacuum needs to start a
transaction in order to record its update of datfrozenxid and
relfrozenxid to the catalogs (or at least, starts one for something).
Once you are within 1,000,000 of wraparound, you have to do the vacuum
in single-user mode, you can no longer just wait for autovacuum to do
its thing.  Otherwise the vacuum will do all the work of the vacuum,
but then fail to clear the error condition.



>
> Looking at datfrozenxid:
> postgres=# select datname, datfrozenxid, age(datfrozenxid) FROM pg_database ;
>   datname  | datfrozenxid |    age
> -----------+--------------+-----------
>  template1 |   3357685367 |         0
>  template0 |   3357685367 |         0
>  postgres  |   3159867733 | 197817634
> (3 rows)
> reveals that the launcher doesn't do squat because it doesn't think it
> needs to do anything.
>
> (gdb) p *ShmemVariableCache
> $3 = {nextOid = 24576, oidCount = 0, nextXid = 3357685367, oldestXid = 
> 1211201715, xidVacLimit = 1411201715, xidWarnLimit = 3347685362,
>   xidStopLimit = 3357685362, xidWrapLimit = 3358685362, oldestXidDB = 12380, 
> oldestCommitTs = 0, newestCommitTs = 0,
>   latestCompletedXid = 3357685366}
>
> 'oldestXid' shows the problem: We're indeed pretty short before a
> wraparound.
>
>
> The question is, how did we get here?

Could the database have undergone a crash and recovery cycle?

Since changes to datfrozenxid are WAL logged at the time they occur,
but the supposedly-synchronous change to ShmemVariableCache is not WAL
logged until the next checkpoint, a well timed crash can leave you in
the state where the system is in a tizzy about wraparound but each
database says "Nope, not me".

Since with default settings each database/table gets frozen 10 times
per real wrap-around, this is usually not going to be a problem as
having 10 consecutive well timed crashes is very unlikely.

But if you increase autovacuum_freeze_max_age a lot, or if the freeze
scan takes so long that there is only time to complete one and a
fraction of them during a single real wrap-around interval, then just
a single crash can you leave you destined for trouble.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

Reply via email to