On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote:

> 1. You should avoid useless posix_fadvise() calls. In the naive
> implementation, where you simply call posix_fadvise() for every page
> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
> per WAL record, and that's a lot of overhead. We face the same design
> question as with Greg's patch to use posix_fadvise() to prefetch index
> and bitmap scans: what should the interface to the buffer manager look
> like? The simplest approach would be a new function call like
> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for
> the page if it's not in the buffer cache, but is a no-op otherwise.
> But that means more overhead, since for every page access, we need to
> find the page twice in the buffer cache; once for the AdviseBuffer()
> call, and a 2nd time for the actual ReadBuffer().

That's a much smaller overhead than waiting for an I/O. The CPU
overhead isn't really a problem if we're I/O bound.

> It would be more efficient to pin the buffer in the AdviseBuffer()
> call already, but that requires many more changes to the callers.

That would be hard to clean up safely, and we'd also have a timing
problem: is there enough buffer space for all of the prefetched blocks
to live in cache at once? If not, pinning would cause problems of its
own.

> 2. The format of each WAL record is different, so you need a
> "readahead handler" for every resource manager, for every record
> type. It would be a lot simpler if there was a standardized way to
> store that information in the WAL records.

I would prefer a new rmgr API call that returns a list of blocks,
along the lines of the sketch below. That's better than trying to make
everything fit one pattern. If an rmgr doesn't provide the call, it
simply doesn't get prefetch.
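Roughly this shape, say. (Untested sketch only: rm_prefetch_blocks_fn,
BlockRef, PrefetchBlock, BufferInCache and BlockToFile are all names
invented here for illustration, not existing code.)

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <fcntl.h>              /* posix_fadvise() */

#define BLCKSZ 8192

typedef uint32_t BlockNumber;

/* Minimal stand-ins for the real types, just to show the shape. */
typedef struct RelFileRef
{
    uint32_t    spcNode;        /* tablespace */
    uint32_t    dbNode;         /* database */
    uint32_t    relNode;        /* relation */
} RelFileRef;

typedef struct BlockRef
{
    RelFileRef  rnode;
    BlockNumber blkno;
} BlockRef;

typedef struct XLogRecordStub
{
    void       *data;           /* decoded by the owning rmgr */
} XLogRecordStub;

/*
 * Proposed per-rmgr callback: fill refs[] with the blocks this record
 * will touch and return how many.  Optional; an rmgr that doesn't
 * provide it just doesn't get prefetch.
 */
typedef int (*rm_prefetch_blocks_fn) (XLogRecordStub *record,
                                      BlockRef *refs, int maxrefs);

/* Assumed to exist elsewhere: a hash probe of shared buffers. */
extern bool BufferInCache(RelFileRef rnode, BlockNumber blkno);

/* Assumed to exist elsewhere: map a block to an open fd plus the
 * offset within that file segment; returns -1 on failure. */
extern int  BlockToFile(RelFileRef rnode, BlockNumber blkno,
                        off_t *offset);

/*
 * Start the read for a block we'll need soon.  No-op if the block is
 * already cached, so the only extra cost is one buffer-table probe.
 */
static void
PrefetchBlock(RelFileRef rnode, BlockNumber blkno)
{
    off_t       offset;
    int         fd;

    if (BufferInCache(rnode, blkno))
        return;

    fd = BlockToFile(rnode, blkno, &offset);
    if (fd >= 0)
        (void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
}

/* Called while reading ahead of the redo pointer. */
static void
PrefetchForRecord(XLogRecordStub *record, rm_prefetch_blocks_fn callback)
{
    BlockRef    refs[8];
    int         nrefs;
    int         i;

    if (callback == NULL)
        return;                 /* rmgr opted out, skip it */

    nrefs = callback(record, refs, 8);
    for (i = 0; i < nrefs; i++)
        PrefetchBlock(refs[i].rnode, refs[i].blkno);
}

Note that the buffer probe in PrefetchBlock is exactly the duplicate
lookup from point 1: one extra probe per prefetched block, which is
cheap next to the read we're overlapping.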
> 3. IIRC I tried to handle just a few of the most important WAL
> records at first, but it turned out that you really need to handle
> all WAL records (that are used at all) before you see any benefit.
> Otherwise, every time you hit a WAL record that you haven't done
> posix_fadvise() on, the recovery "stalls", and you don't need many of
> those to diminish the gains.
>
> Not sure how these apply to your approach, it's very different. You
> seem to handle 1. by collecting all the page references for the WAL
> file, and sorting and removing the duplicates. I wonder how much CPU
> time is spent on that?

Removing duplicates seems like it will save CPU rather than cost it:
every duplicate we drop is one less buffer probe and one less
posix_fadvise() call, and the sort itself is cheap next to the I/O
we're trying to overlap.

-- 
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
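P.S. On the cost of the sort-and-dedup pass itself: something as dumb
as this (again an untested sketch, reusing the invented BlockRef
struct from above) should disappear into the noise next to the reads.

#include <stdlib.h>
#include <string.h>

/*
 * Any total order works here.  BlockRef is four uint32 fields, so
 * there is no padding on common ABIs and memcmp() is a valid
 * comparator as long as every entry is fully initialized.
 */
static int
blockref_cmp(const void *a, const void *b)
{
    return memcmp(a, b, sizeof(BlockRef));
}

/* Sort, then squeeze out adjacent duplicates; returns the new count. */
static int
sort_and_dedup(BlockRef *refs, int nrefs)
{
    int         i;
    int         nout = 0;

    qsort(refs, nrefs, sizeof(BlockRef), blockref_cmp);
    for (i = 0; i < nrefs; i++)
    {
        if (nout == 0 ||
            memcmp(&refs[nout - 1], &refs[i], sizeof(BlockRef)) != 0)
            refs[nout++] = refs[i];
    }
    return nout;
}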