Gregory Stark wrote:
> "Koichi Suzuki" <[EMAIL PROTECTED]> writes:

>> This is my first proposal of a PITR performance improvement for
>> PostgreSQL 8.4 development.  The proposal includes a readahead
>> mechanism for the data pages that will be read by redo() routines
>> during recovery.  This is especially effective in recovery without
>> full page writes.  Readahead is done with posix_fadvise(), as
>> proposed for the index scan improvement.

> Incidentally, a bit of background for anyone who wasn't around when this
> last came up: prefetching is especially important for our recovery code
> because it's single-threaded. If you have a RAID array, you're effectively
> limited to using a single drive at a time. This is a major problem because
> the logs could have been written by many processes hammering the RAID
> array concurrently. In other words, your warm standby database might not
> be able to keep up with the logs from the master database even on
> identical (or even better) hardware.
>
> Simon (I think?) proposed allowing our recovery code to be multi-threaded.
> Heikki suggested using prefetching.

I actually played around with the prefetching, and even wrote a quick prototype of it, about a year ago. It read ahead a fixed number of WAL records in xlog.c, calling posix_fadvise() for all pages that were referenced in them. I never got around to finishing it, as I wanted to see Greg's posix_fadvise() patch get done first and rely on the same infrastructure, but here are some lessons I learned:

1. You should avoid useless posix_fadvise() calls. In the naive implementation, where you simply call posix_fadvise() for every page referenced in every WAL record, you make 1-2 posix_fadvise() syscalls per WAL record, and that's a lot of overhead. We face the same design question as with Greg's patch to use posix_fadvise() to prefetch index and bitmap scans: what should the interface to the buffer manager look like? The simplest approach would be a new function like AdviseBuffer(Relation, BlockNumber) that calls posix_fadvise() for the page if it's not in the buffer cache, but is a no-op otherwise. But that means more overhead, since for every page access we need to look up the page in the buffer cache twice: once for the AdviseBuffer() call, and a second time for the actual ReadBuffer(). It would be more efficient to pin the buffer already in the AdviseBuffer() call, but that requires many more changes to the callers.

2. The format of each WAL record is different, so you need a "readahead handler" for every resource manager and every record type. It would be a lot simpler if there were a standardized way to store that information in the WAL records.

3. IIRC I tried to handle just the few most important WAL record types at first, but it turned out that you really need to handle all WAL records (that are used at all) before you see any benefit. Otherwise, every time you hit a WAL record that you haven't called posix_fadvise() for, recovery stalls, and it doesn't take many of those stalls to diminish the gains.

I'm not sure how these apply to your approach, since it's very different. You seem to handle point 1 by collecting all the page references for the WAL file, then sorting them and removing the duplicates. I wonder how much CPU time is spent on that?

>> Details of the implementation can be found in the README file in the material.

> I've read through this and I think I disagree with the idea of using a
> separate program. It's a lot of extra code -- and duplicated code from
> the normal recovery, too.

Agreed, this belongs in core. The nice thing about a separate process is that you could hook it into recovery_command with no changes to the server, but as you note in the README, we'd want to use this in crash recovery as well, and the interaction between the external command and the startup process seems overly complex for that. Besides, we want to use the posix_fadvise() stuff in the backend anyway, so we should use the same infrastructure during WAL replay as well.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers