On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote:

> 1. You should avoid useless posix_fadvise() calls. In the naive 
> implementation, where you simply call posix_fadvise() for every page 
> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls 
> per WAL record, and that's a lot of overhead. We face the same design 
> question as with Greg's patch to use posix_fadvise() to prefetch index 
> and bitmap scans: what should the interface to the buffer manager look 
> like? The simplest approach would be a new function call like 
> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the 
> page if it's not in the buffer cache, but is a no-op otherwise. But that 
> means more overhead, since for every page access, we need to find the 
> page twice in the buffer cache: once for the AdviseBuffer() call, and 
> a second time for the actual ReadBuffer(). 

That's a much smaller overhead than waiting for an I/O. The CPU overhead
isn't really a problem if we're I/O bound.
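
For concreteness, a minimal sketch of the AdviseBuffer() shape you
describe, with stand-in helpers (PageIsCached(), RelationFd()) for the
real buffer-table lookup and smgr plumbing:

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <fcntl.h>

#define BLCKSZ 8192                     /* PostgreSQL's default block size */

typedef uint32_t BlockNumber;
typedef struct RelationData *Relation;  /* opaque here */

/* Stand-ins for the real buffer-table lookup and smgr file handle. */
extern bool PageIsCached(Relation rel, BlockNumber blkno);
extern int  RelationFd(Relation rel);

/*
 * Hint the kernel about an upcoming read, but only when the page is not
 * already in shared buffers.  This is the "find the page twice" cost:
 * one lookup here, another in the later ReadBuffer().
 */
static void
AdviseBuffer(Relation rel, BlockNumber blkno)
{
    if (PageIsCached(rel, blkno))
        return;                         /* already resident: no-op */

    (void) posix_fadvise(RelationFd(rel),
                         (off_t) blkno * BLCKSZ, (off_t) BLCKSZ,
                         POSIX_FADV_WILLNEED);
}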

> It would be more efficient to pin 
> the buffer in the AdviseBuffer() call already, but that requires much 
> more changes to the callers.

That would be hard to clean up safely, and it raises a timing problem:
is there enough buffer space for all the prefetched blocks to live in
cache at once? If not, pinning them that early would cause problems.
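
One illustrative way to bound that risk (the names and the 1/4 figure
are placeholders, not a worked-out design) is to cap how many
advised-but-not-yet-read blocks are outstanding:

#include <stdbool.h>

static int PrefetchInFlight = 0;

/* Never keep more prefetches in flight than a fraction of shared
 * buffers, so advised pages aren't evicted before recovery's
 * ReadBuffer() reaches them. */
static bool
CanIssuePrefetch(int NBuffers)
{
    return PrefetchInFlight < NBuffers / 4;
}

static void
PrefetchIssued(void)            /* call after each AdviseBuffer() */
{
    PrefetchInFlight++;
}

static void
PrefetchConsumed(void)          /* call when ReadBuffer() uses the page */
{
    PrefetchInFlight--;
}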

> 2. The format of each WAL record is different, so you need a "readahead 
> handler" for every resource manager, for every record type. It would be 
> a lot simpler if there was a standardized way to store that information 
> in the WAL records.

I would prefer a new rmgr API call that returns a list of blocks. That's
better than trying to make everything fit one pattern. If an rmgr doesn't
provide the call, it simply gets no prefetching.
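
Something along these lines, say (the name and signature are
hypothetical, and simplified typedefs stand in for the real xlog.h
declarations):

#include <stdint.h>

typedef uint32_t BlockNumber;
typedef struct { uint32_t spcNode, dbNode, relNode; } RelFileNode;
struct XLogRecord;                      /* opaque here */

typedef struct
{
    RelFileNode node;                   /* which relation */
    BlockNumber blkno;                  /* which block within it */
} PrefetchBlock;

/*
 * Each rmgr that can support prefetch fills blocks[] (up to max) with
 * the blocks a WAL record will touch and returns the count.  Rmgrs
 * that leave this entry NULL in the rmgr table get no prefetching.
 */
typedef int (*rm_prefetch_blocks_fn) (struct XLogRecord *record,
                                      PrefetchBlock *blocks,
                                      int max);

The recovery loop would then call the function, where set, and hand
each returned block to AdviseBuffer() before redo.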

> 3. IIRC I tried to handle just a few of the most important WAL record 
> types at first, but it turned out that you really need to handle all WAL 
> records (that are used at all) before you see any benefit. Otherwise, 
> every time you hit a WAL record that you haven't done posix_fadvise() 
> on, the recovery "stalls", and it doesn't take many of those to diminish 
> the gains.
> 
> Not sure how these apply to your approach; it's very different. You seem 
> to handle 1. by collecting all the page references for the WAL file, and 
> sorting and removing the duplicates. I wonder how much CPU time is spent 
> on that?

Removing duplicates seems like it will save CPU rather than cost it:
each duplicate we drop is a posix_fadvise() call, and a buffer lookup,
that we never have to make.
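
A rough sketch of that sort-and-dedup pass, with a deliberately
simplified key (the real key would carry the full RelFileNode):

#include <stdlib.h>
#include <stdint.h>

typedef struct
{
    uint32_t relNode;           /* simplified relation identifier */
    uint32_t blkno;
} PageRef;

static int
pageref_cmp(const void *a, const void *b)
{
    const PageRef *pa = a;
    const PageRef *pb = b;

    if (pa->relNode != pb->relNode)
        return (pa->relNode < pb->relNode) ? -1 : 1;
    if (pa->blkno != pb->blkno)
        return (pa->blkno < pb->blkno) ? -1 : 1;
    return 0;
}

/*
 * Sort the collected references and squeeze out adjacent duplicates;
 * returns the new count.  One O(n log n) qsort plus a linear pass,
 * versus one avoided syscall per duplicate.
 */
static size_t
dedup_page_refs(PageRef *refs, size_t n)
{
    size_t  keep = 0;

    if (n == 0)
        return 0;

    qsort(refs, n, sizeof(PageRef), pageref_cmp);
    for (size_t i = 1; i < n; i++)
    {
        if (pageref_cmp(&refs[i], &refs[keep]) != 0)
            refs[++keep] = refs[i];
    }
    return keep + 1;
}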

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

