Gregory Stark wrote:
> "Koichi Suzuki" <[EMAIL PROTECTED]> writes:

>> This is my first proposal of a PITR performance improvement for
>> PostgreSQL 8.4 development.  The proposal includes a readahead
>> mechanism for the data pages that will be read by redo() routines
>> during recovery.  This is especially effective in recovery without
>> full page writes.  Readahead is done with posix_fadvise(), as
>> proposed for the index scan improvement.

> Incidentally, a bit of background for anyone who wasn't around when this
> last came up: prefetching is especially important for our recovery code
> because it's single-threaded. If you have a RAID array, you're effectively
> limited to using a single drive at a time. This is a major problem because
> the logs could have been written by many processes hammering the RAID
> array concurrently. In other words, your warm standby database might not
> be able to keep up with the logs from the master database even on
> identical (or even better) hardware.
>
> Simon (I think?) proposed allowing our recovery code to be multi-threaded.
> Heikki suggested using prefetching.

I actually played around with the prefetching, and even wrote a quick prototype of it, about a year ago. It read ahead a fixed number of WAL records in xlog.c, calling posix_fadvise() for all pages that were referenced in them. I never got around to finishing it, as I wanted to see Greg's posix_fadvise() patch get done first and rely on the same infrastructure, but here are some lessons I learned:

1. You should avoid useless posix_fadvise() calls. In the naive implementation, where you simply call posix_fadvise() for every page referenced in every WAL record, you make 1-2 posix_fadvise() syscalls per WAL record, and that's a lot of overhead. We face the same design question as with Greg's patch to use posix_fadvise() to prefetch index and bitmap scans: what should the interface to the buffer manager look like? The simplest approach would be a new function like AdviseBuffer(Relation, BlockNumber) that calls posix_fadvise() for the page if it's not in the buffer cache, but is a no-op otherwise. But that means more overhead, since for every page access we need to look up the page in the buffer cache twice: once for the AdviseBuffer() call, and a second time for the actual ReadBuffer(). It would be more efficient to pin the buffer already in the AdviseBuffer() call, but that requires many more changes to the callers.

2. The format of each WAL record is different, so you need a "readahead handler" for every resource manager and every record type. It would be a lot simpler if there were a standardized way to store that information in the WAL records.

3. IIRC I tried to handle just the few most important WAL record types at first, but it turned out that you really need to handle all WAL records (that are used at all) before you see any benefit. Otherwise, every time you hit a WAL record that you haven't called posix_fadvise() for, recovery stalls, and it doesn't take many of those stalls to diminish the gains.

I'm not sure how these apply to your approach, since it's very different. You seem to handle point 1 by collecting all the page references for the WAL file, then sorting them and removing the duplicates. I wonder how much CPU time is spent on that?

>> Details of the implementation can be found in the README file in the material.

> I've read through this and I think I disagree with the idea of using a
> separate program. It's a lot of extra code -- and duplicated code from
> the normal recovery, too.

Agreed, this belongs in core. The nice thing about a separate process is that you could hook it into recovery_command with no changes to the server, but as you note in the README, we'd want to use this in crash recovery as well, and the interaction between the external command and the startup process seems overly complex for that. Besides, we want to use the posix_fadvise() stuff in the backend anyway, so we should use the same infrastructure during WAL replay as well.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers