Andres Freund wrote:
> On 2014-06-20 17:38:16 -0400, Alvaro Herrera wrote:
> > It seems to me that we need to keep the offsets files around until a
> > checkpoint has written the "oldest" number to WAL.  In other words we
> > need additional state in shared memory: (a) what we currently store
> > which is the oldest number as computed by vacuum (not safe to delete,
> > but it's the number that the next checkpoint must write), and (b) the
> > oldest number that the last checkpoint wrote (the safe deletion point).
>
> Why not just WAL log truncations? If we'd emit the WAL record after
> determining the offsets page we should be safe I think? That seems like
> easier and more robust fix? And it's what e.g. the clog does.

Yes, I think this whole thing would be simpler if we just WAL-logged the
truncations, like pg_clog does.  But I would like to avoid doing that for
now, and do it only in 9.5, in the future.  As a backpatchable (to
9.4/9.3) fix, I propose we do the following:

1. Have vacuum update MultiXactState->oldestMultiXactId based on the
   minimum value of pg_database->datminmxid.  Since this value is saved
   in pg_control, it is restored from checkpoint replay during recovery.

2. Keep track of a new value, MultiXactState->lastCheckpointedOldest.
   This value is updated by CreateCheckPoint in a primary server after
   the checkpoint record has been flushed, and by xlog_redo in a hot
   standby, to be the MultiXactState->oldestMultiXactId value that was
   last flushed.

3. TruncateMultiXact() no longer receives a parameter.  Files are removed
   based on MultiXactState->lastCheckpointedOldest instead.

4. Call TruncateMultiXact at checkpoint time, after the checkpoint WAL
   record has been flushed, and at restartpoint time (just like today).

This means we only remove files that a prior checkpoint has already
registered as being no longer necessary.  Also, if a recovery is
interrupted before end of WAL (recovery target), the files are still
present.  As a consequence, we no longer truncate during vacuum.
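To make the ordering in steps 1 to 4 concrete, here is a toy model in C.  It is only a sketch, not a patch: every Sketch*/sketch_* name is hypothetical, the segment count and size are made-up toy values, and multixact ID wraparound is ignored entirely (real code would need wraparound-aware comparisons).  The only point is that vacuum advances the in-memory value without deleting anything, and that files are removed based solely on the value a checkpoint has already recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real PostgreSQL definitions. */
typedef uint32_t MultiXactId;

#define SKETCH_NSEGMENTS          8u    /* toy number of offsets segment files */
#define SKETCH_MXIDS_PER_SEGMENT  100u  /* toy value, not the real segment size */

typedef struct SketchMultiXactState
{
    MultiXactId oldestMultiXactId;       /* (a) computed by vacuum; not yet safe to delete */
    MultiXactId lastCheckpointedOldest;  /* (b) value the last checkpoint wrote */
    bool        segmentExists[SKETCH_NSEGMENTS];
} SketchMultiXactState;

/* Start with every toy segment present and both counters at zero. */
static void
sketch_init(SketchMultiXactState *s)
{
    s->oldestMultiXactId = 0;
    s->lastCheckpointedOldest = 0;
    for (uint32_t i = 0; i < SKETCH_NSEGMENTS; i++)
        s->segmentExists[i] = true;
}

/* Step 1: vacuum only advances the in-memory value; it removes no files. */
static void
sketch_vacuum_set_oldest(SketchMultiXactState *s, MultiXactId min_datminmxid)
{
    if (min_datminmxid > s->oldestMultiXactId)  /* wraparound ignored in this sketch */
        s->oldestMultiXactId = min_datminmxid;
}

/*
 * Step 3: takes no cutoff parameter; removes segments that lie wholly
 * before lastCheckpointedOldest.  Returns the number of segments removed.
 */
static int
sketch_truncate(SketchMultiXactState *s)
{
    uint32_t cutoff = s->lastCheckpointedOldest / SKETCH_MXIDS_PER_SEGMENT;
    int      removed = 0;

    for (uint32_t i = 0; i < cutoff && i < SKETCH_NSEGMENTS; i++)
    {
        if (s->segmentExists[i])
        {
            s->segmentExists[i] = false;
            removed++;
        }
    }
    return removed;
}

/*
 * Steps 2 and 4: once the checkpoint record carrying the oldest value has
 * been flushed (not modelled here), remember that value and truncate.
 */
static int
sketch_checkpoint(SketchMultiXactState *s)
{
    MultiXactId flushed = s->oldestMultiXactId;  /* value in the checkpoint record */

    assert(flushed >= s->lastCheckpointedOldest);  /* never moves backwards here */
    s->lastCheckpointedOldest = flushed;
    return sketch_truncate(s);
}
```

In this model, calling sketch_vacuum_set_oldest(&s, 250) leaves all segments in place; only the following sketch_checkpoint(&s) removes the two segments that lie entirely below 250.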
Another consideration for (4) is that right now we're only invoking
multixact truncation in a primary when we're able to advance
pg_database.datminmxid (see vac_update_datfrozenxid).  The problem is
that after a crash and subsequent recovery, pg_database might be updated
without removing pg_multixact files; this would mean that the next
opportunity to remove files would be far in the future, when the minimum
datminmxid is advanced again.  One way to fix that would be to have every
single call to vac_update_datfrozenxid() attempt multixact truncation,
but that seems wasteful since I expect vacuuming is more frequent than
checkpointing.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services