We just discussed this in detail with Simon, and it looks like we have 5 (!) different but related problems:

1) The original problem of freeze then crash, leaving too high values in relminxid and datminxid. If you then run vacuum, it might truncate CLOG and you lose the commit status of the records that were supposed to be frozen.

To fix this, we need to WAL log freezing as already discussed.

2) vactuple_get_minxid doesn't take into account xmax's of tuples that have HEAP_XMAX_INVALID set. That's a problem:

transaction 1001 - BEGIN; DELETE FROM foo where key = 1;
transaction 1001 - ROLLBACK;
transaction 1002 - VACUUM foo;
crash

VACUUM foo will set relminxid to 1002, because HEAP_XMAX_INVALID was set on the tuple (possibly by vacuum itself) that the deletion that rolled back touched. However, that hint-bit update hasn't hit the disk yet, so after recovery, the tuple will have an xmax of 1001 with no hint-bit, and relminxid is 1002.

The simplest fix for this issue is to ignore the HEAP_XMAX_INVALID hint bit, and take any xmax other than InvalidXid into account when calculating the relminxid.

3) If you recover from a PITR backup (or have a hot stand-by), with base backup that's more than 4 billion transactions older than the newest WAL record, the clog entries of old transactions in the base backup will overlap with the clog entries of new transactions that are in the WAL records. This is the problem you also pointed out below.

To fix this, we need to emit a WAL record when truncating the clog. We must also make sure that recovery of any WAL record type doesn't rely on clog, because if we truncate the clog and then crash, recovery won't have the clog available for the old transactions. At the moment, TruncateCLog issues a checkpoint to protect from that but that's not going to work when rolling forward logs in PITR, right?

4) If we fix issue 2 so that vactuple_get_minxid always takes xmax into account, even if HEAP_XMAX_INVALID is set, a tuple with an aborted xmax will keep us from advancing relminxid and truncating clog etc. That doesn't lead to data corruption, but you will eventually hit the transaction wrap-around limit. We don't have the same problem with xmin, because we freeze tuples that are older than FreezeLimit to avoid it, but we don't do that for xmax.

To fix this, replace any xmax older than FreezeLimit with InvalidXid during vacuum. That also needs to be WAL logged.

5) We don't freeze tuples that are in RECENTLY_DEAD or DELETE_IN_PROGRESS state. That doesn't lead to data corruption, but it might make you hit the transaction wrap-around limit. That can happen if you have a transaction that deletes or updates a very old, but not yet frozen tuple. If you run vacuum while the deleting transaction is in progress, vacuum won't freeze the tuple, and won't advance the wrap-around limit because of the old tuple. That's not serious if the deleting transaction commits, because the next vacuum will then remove the tuple, but if it aborts, we might run into the same problem on the next vacuum, and the next one, and the next one, until we reach the wrap-around.

To fix this, simply do the freezing for tuples in RECENTLY_DEAD and DELETE_IN_PROGRESS states as well,

Am I missing something? Finding this many bugs makes me nervous...

Simon volunteered to make the clog changes for 3 because it's a PITR related issue. I can write a patch/patches for the other changes if it helps.

Tom Lane wrote:
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
Tom Lane wrote:
I think it's premature to start writing
patches until we've decided how this really needs to work.

Not logging hint-bit updates seems safe to me. As long as we have the clog, the hint-bit is just a hint. The problem with freezing is that after freezing tuples, the corresponding clog page can go away.

Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog.  Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation.  Maybe we should somehow not xlog updates of datvacuumxid?

Another thing I'm concerned about is the scenario where a PITR
hot-standby machine tracks a master over a period of more than 4 billion
transactions.  I'm not sure what will happen in the slave's pg_clog
directory, but I'm afraid it won't be good :-(

                        regards, tom lane



--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Reply via email to