Simon Riggs wrote:
I suggest we handle this on the recovery side, not on the master, by
deriving the xmin at the point the WAL record arrives. We would
calculate it by looking at recovery procs only. That will likely give us
a later value than we would get from the master, but that can't be
helped.

Hmm, that's an interesting idea. It presumes that we see an abort/commit WAL record at the right moment for every transaction that we have a recovery proc for. We just concluded in the other thread that we do always emit abort records when the database is running normally; I think that's good enough for this purpose.
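To make the idea concrete, here is a rough sketch of deriving xmin on the standby from the recovery procs alone. It is not code from any patch: RecoveryProcSketch, derive_recovery_xmin and the next_xid argument are made-up names for illustration, and the xid comparison is a simplified version of the usual modulo-2^32 test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;     /* 32-bit xid, compared modulo 2^32 */

/*
 * Hypothetical stand-in for a recovery proc: one entry per transaction
 * seen in WAL, marked finished once its commit/abort record is replayed.
 */
typedef struct RecoveryProcSketch
{
    TransactionId xid;
    bool          in_progress;
} RecoveryProcSketch;

/* True if xid a is logically older than xid b (wraparound-aware). */
bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Derive xmin at the moment a WAL record arrives: the oldest xid that
 * still has an in-progress recovery proc, or next_xid if there is none.
 */
TransactionId
derive_recovery_xmin(const RecoveryProcSketch *procs, size_t nprocs,
                     TransactionId next_xid)
{
    TransactionId xmin = next_xid;

    for (size_t i = 0; i < nprocs; i++)
    {
        if (!procs[i].in_progress)
            continue;           /* commit/abort already replayed */
        if (xid_precedes(procs[i].xid, xmin))
            xmin = procs[i].xid;
    }
    return xmin;
}
```

Note how a transaction whose abort/commit record never shows up would stay in_progress and hold the derived xmin back indefinitely, which is why the presumption above matters.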

A few other random ideas I had:

- In btree delete redo, follow the index pointers and look at the xids on the heap tuples. That requires some random I/O, but will give the exact value we need. Since it's quite expensive, I think we'd only want to do it after using some more conservative but quicker test to determine that there might be a conflict.

- Add latestRemovedXid to b-tree page header, and update it as tuples are killed. Need to tolerate the fact that tuple kills are not WAL-logged.
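A rough sketch of the first idea, the cheap test followed by the exact one. Everything here is an assumption for illustration: the heap_xmax array stands in for the xmax values we would fetch from the heap with random I/O, conservative_bound is some cheaply available upper bound on latestRemovedXid (the mail doesn't say what that test would be), and a conflict is taken to mean that the oldest standby snapshot xmin is not newer than the latest removed xid.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* True if xid a is logically older than xid b (wraparound-aware). */
bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Exact value: the newest xmax among the heap tuples that the deleted
 * index items pointed to. Returns InvalidTransactionId if n is 0.
 */
TransactionId
exact_latest_removed_xid(const TransactionId *heap_xmax, size_t n)
{
    TransactionId latest = InvalidTransactionId;

    for (size_t i = 0; i < n; i++)
    {
        if (latest == InvalidTransactionId ||
            xid_precedes(latest, heap_xmax[i]))
            latest = heap_xmax[i];
    }
    return latest;
}

/*
 * Two-step conflict check. The cheap test only rules conflicts out;
 * the expensive heap lookup runs when the cheap test cannot.
 */
bool
delete_conflicts(TransactionId conservative_bound,
                 const TransactionId *heap_xmax, size_t n,
                 TransactionId oldest_snapshot_xmin)
{
    TransactionId latest;

    /* Quick test: bound is older than every snapshot, so no conflict. */
    if (xid_precedes(conservative_bound, oldest_snapshot_xmin))
        return false;

    /* Might conflict: pay for the exact answer. */
    latest = exact_latest_removed_xid(heap_xmax, n);
    if (latest == InvalidTransactionId)
        return false;
    return !xid_precedes(latest, oldest_snapshot_xmin);
}
```

The second idea would replace the heap lookup with a value read straight from the page header, at the cost of having to cope with kills that never reached WAL.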

Btree deletes were an important optimisation when it first went in, but
now we have HOT it is much less important.

If HOT is working well for your application, there won't be many btree deletes anyway, and the whole issue is moot.

Another route might be to put
an option to turn off btree delete on the master, default = on. We
probably should consider turning it off entirely when it doesn't yield
significant benefit.

I'd rather put in a generic mechanism to prevent vacuuming of recent tuples that might still be needed in the standby. Like always subtracting a fixed amount of xids from OldestXmin/RecentGlobalXmin, or having a feedback loop from the standby to the master, allowing the standby to tell the master what its oldest xmin is. But that's a fair amount of work; I'd rather leave that as a future enhancement, and just figure out something simple for this specific issue. We'll need to handle it gracefully even if we try to avoid it by retaining dead tuples longer.
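For the "subtract a fixed amount of xids" variant, something along these lines is what I have in mind. This is only a sketch: hold_back_xmin and holdback_xids are made-up names, and clamping to the first normal xid when the result lands on one of the reserved values is an assumption about how we'd want to treat that case.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Xids 0, 1 and 2 are reserved; 3 is the first normal xid. */
#define FirstNormalTransactionId ((TransactionId) 3)

/*
 * Return an xmin that lags oldest_xmin by holdback_xids transactions,
 * so that vacuum on the master keeps tuples a bit longer than it
 * strictly needs to.
 */
TransactionId
hold_back_xmin(TransactionId oldest_xmin, uint32_t holdback_xids)
{
    /* Unsigned subtraction wraps modulo 2^32, like xids themselves. */
    TransactionId result = oldest_xmin - holdback_xids;

    /* Never hand back one of the reserved xids. */
    if (result < FirstNormalTransactionId)
        result = FirstNormalTransactionId;

    return result;
}
```

A fixed lag like this only makes conflicts less likely; a standby query older than the lag can still run into removed tuples.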

Lots of scanning to remove the odd row is probably
pretty wasteful and likely adds contention at the very point we don't
want it - index splits.

Remember that if you can remove enough dead tuples from the index page, you've just made room on the page and don't need to split. Splitting is pretty expensive compared to scanning a few line pointers.
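Roughly, the trade-off looks like this. The page layout below is a toy, not the real nbtree page format: IndexPageSketch, remove_dead_items and insert_needs_split are names I made up, and free space is tracked as a plain byte count.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITEMS_SKETCH 64

typedef struct IndexItemSketch
{
    uint16_t size;      /* bytes the item occupies on the page */
    bool     dead;      /* set when the tuple was killed */
} IndexItemSketch;

typedef struct IndexPageSketch
{
    IndexItemSketch items[MAX_ITEMS_SKETCH];
    size_t          nitems;
    size_t          free_space;     /* bytes currently free */
} IndexPageSketch;

/* One pass over the items: drop dead ones, return bytes reclaimed. */
size_t
remove_dead_items(IndexPageSketch *page)
{
    size_t kept = 0;
    size_t reclaimed = 0;

    for (size_t i = 0; i < page->nitems; i++)
    {
        if (page->items[i].dead)
            reclaimed += page->items[i].size;
        else
            page->items[kept++] = page->items[i];
    }
    page->nitems = kept;
    page->free_space += reclaimed;
    return reclaimed;
}

/* Try the cheap scan first; report a split only if it wasn't enough. */
bool
insert_needs_split(IndexPageSketch *page, size_t needed)
{
    if (page->free_space >= needed)
        return false;
    remove_dead_items(page);
    return page->free_space < needed;
}
```

The scan is a single pass over the items already on the page, whereas a split has to allocate a new page and move roughly half of the items.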

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
