On Mon, 2009-01-19 at 15:47 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I suggest we handle this on the recovery side, not on the master, by
> > deriving the xmin at the point the WAL record arrives. We would
> > calculate it by looking at recovery procs only. That will likely give us
> > a later value than we would get from the master, but that can't be
> > helped.
> 
> Hmm, that's an interesting idea. It presumes that we see an abort/commit 
> WAL record at the right moment for every transaction that we have a 
> recovery proc for. We just concluded in the other thread that we do 
> always emit abort records when the database is running normally; I 
> think that's good enough for this purpose.

But not perfect.
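
To make the recovery-side idea concrete, here is a minimal sketch of
deriving xmin from the recovery procs alone. It is a toy model, not
code from the patch: derive_xmin, its arguments, and the use of a plain
list of xids are assumptions made for illustration. The comparison
follows the usual modulo-2^32 rule for normal xids.

```python
XID_MASK = 0xFFFFFFFF


def xid_precedes(a, b):
    """True if normal xid a is logically older than normal xid b.

    Equivalent to checking that the signed 32-bit difference (a - b)
    is negative, so the comparison survives xid wraparound.
    """
    return ((a - b) & XID_MASK) >= 0x80000000


def derive_xmin(recovery_proc_xids, next_xid):
    """Hypothetical helper: oldest xid still tracked by a recovery proc.

    recovery_proc_xids: xids of transactions the standby believes are
    still in progress on the master (assumed input).
    next_xid: the next xid the standby expects; used when no recovery
    procs exist, since nothing older can still be running.
    """
    xmin = next_xid
    for xid in recovery_proc_xids:
        if xid_precedes(xid, xmin):
            xmin = xid
    return xmin


xmin_normal = derive_xmin([105, 101, 110], 120)
xmin_empty = derive_xmin([], 120)
xmin_wrapped = derive_xmin([0xFFFFFFF0, 5], 10)
```

If an abort record is missed, the stale xid stays in the list and holds
xmin back, which is the imperfection in question.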

> A few other random ideas I had:
> 
> - in btree delete redo, follow the index pointers, and look at the xids 
> on the heap tuples. That requires some random I/O, but will give the 
> exact value we need. Since it's quite expensive, I think we'd only want 
> to do it after using some more conservative but quicker test to 
> determine that there might be a conflict.

Ouch.

> - Add latestRemovedXid to b-tree page header, and update it as tuples 
> are killed. Need to tolerate the fact that tuple kills are not WAL-logged.

Sounds easy-ish. 

If tuple kills aren't WAL-logged, then after a crash latestRemovedXid
will remain as it was at the time of the last write. A delete scan will
then remove only the index tuples that had hint bits set at the time of
that write, so the value would always be correct, no?

I'm somehow uncomfortable with this idea though. Care to persuade me
further?
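
The crash argument above can be walked through with a toy model. This
is only a sketch under stated assumptions: BtreePage, IndexItem,
heap_xmax, kill_item and delete_killed are hypothetical names, the
killed flags and the header field are assumed to live on the same page
and reach disk together, and xid wraparound is ignored for brevity.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class IndexItem:
    heap_xmax: int        # xid that removed the heap tuple (assumed known)
    killed: bool = False  # hint bit; setting it is not WAL-logged


@dataclass
class BtreePage:
    items: list = field(default_factory=list)
    latest_removed_xid: int = 0  # hypothetical page-header field


def kill_item(page, idx):
    """Set the hint bit and advance the header field on the same page."""
    item = page.items[idx]
    item.killed = True
    page.latest_removed_xid = max(page.latest_removed_xid, item.heap_xmax)


def delete_killed(page):
    """Remove killed items; return them plus the xid to report in WAL."""
    removed = [it for it in page.items if it.killed]
    page.items = [it for it in page.items if not it.killed]
    return removed, page.latest_removed_xid


disk = BtreePage(items=[IndexItem(100), IndexItem(105), IndexItem(110)])
mem = copy.deepcopy(disk)

kill_item(mem, 0)          # kill the xid-100 item
disk = copy.deepcopy(mem)  # page is written out: flag and field together
kill_item(mem, 2)          # kill the xid-110 item; never reaches disk
mem = copy.deepcopy(disk)  # crash: only the on-disk image survives

removed, conflict_xid = delete_killed(mem)
```

In this model the flag and the field are lost together, so the delete
scan never removes an item newer than the value it reports. Whether the
real page write gives the same guarantee is the part that would need
checking.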

> > Btree deletes were an important optimisation when it first went in, but
> > now we have HOT it is much less important. 
> 
> If HOT is working well for your application, there won't be many btree 
> deletes anyway, and the whole issue is moot.

That was my point.

> > Another route might be to put
> > an option to turn off btree delete on the master, default = on. We
> > probably should consider turning it off entirely when it doesn't yield
> > significant benefit.
> 
> I'd rather put in a generic mechanism to prevent vacuuming of recent 
> tuples that might still be needed in the standby. Like always 
> subtracting a fixed amount of xids from OldestXmin/RecentGlobalXmin, or 
> having a feedback loop from the standby to the master, allowing the 
> master to say what its oldest xmin is. But that's a fair amount of 
> work; I'd rather leave that as a future enhancement, and just figure out 
> something simple for this specific issue. We'll need to handle it 
> gracefully even if we try to avoid it by retaining dead tuples longer.

Yeh, looked at both of those also. Definitely after sync rep goes in
though.
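
For reference, the "subtract a fixed amount of xids" variant is mostly
an exercise in wraparound arithmetic. The sketch below is an assumption
about how it could look, not an existing function: retreat_xid and
defer_xids are made-up names. It relies only on the fact that xids
below 3 are reserved, so normal xids form a ring of size 2^32 - 3.

```python
FIRST_NORMAL_XID = 3                      # xids 0, 1 and 2 are reserved
NUM_NORMAL_XIDS = 2**32 - FIRST_NORMAL_XID


def retreat_xid(xid, defer_xids):
    """Step a normal xid backwards by defer_xids normal xids.

    Hypothetical helper: maps the xid onto a ring that excludes the
    reserved values, subtracts, and maps back, so the result is always
    a normal xid even when the subtraction wraps around.
    """
    offset = (xid - FIRST_NORMAL_XID - defer_xids) % NUM_NORMAL_XIDS
    return offset + FIRST_NORMAL_XID


deferred_simple = retreat_xid(1000, 100)
deferred_wrapped = retreat_xid(3, 1)
```

The master would use the retreated value in place of OldestXmin when
deciding what is removable, at the cost of keeping dead tuples around
for roughly defer_xids transactions longer.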

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

