Gregory Stark wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> writes:
>
>> We would have readbuffers in shared memory, like wal_buffers in reverse.
>> Each worker would read the next WAL record and check there is no
>> conflict with other concurrent WAL records. If not, it will apply the
>> record immediately, otherwise wait for the conflicting worker to
>> complete.
>
> Well I guess you would have to bring up the locking infrastructure and lock
> any blocks in the record you're applying (sorted first to avoid deadlocks). As
> I understand it we don't use locks during recovery now but I'm not sure if
> that's just because we don't have to or if there are practical problems which
> would have to be solved to do so.

We do use locks during recovery: XLogReadBuffer takes an exclusive lock on the buffer. According to the comments there, that wouldn't be strictly necessary, but I believe we do actually need it to protect against bgwriter writing out a buffer while it's being modified. We only lock one page at a time, which is good enough for WAL replay, but not enough to protect things like a b-tree split from concurrent access.

I hacked together a quick & dirty prototype of using posix_fadvise in recovery a while ago. First of all, there are the changes to the buffer manager, which we'd need anyway if we wanted to use posix_fadvise to speed up other things like index scans. Then there are the changes to xlog.c to buffer a number of WAL records, so that you can read ahead the data pages needed by WAL records beyond the one you're actually replaying.

I added a new function, readahead, to the rmgr API. It's similar to the redo function, but instead of actually replaying the WAL record, it just issues the fadvise calls to the buffer manager for the pages needed to replay the record. This needs to be implemented for each resource manager we want to do readahead for. If the list of blocks in a WAL record were in an rmgr-independent format, we could do this in a more generic way, like we do backup block restoration.

The multiple-process approach seems a lot more complex to me. You need a lot of bookkeeping to keep the processes from stepping on each other's toes, and to choose the next WAL record to replay. You have the same problem that you need an rmgr-specific function to extract the block numbers required to replay a WAL record, or you have to add that list to the WAL record header in a generic format. The multi-process approach is nice because it also lets you parallelize the CPU work of replaying the records, but I wonder how well that really scales given all the locking required. Besides, I don't think replaying WAL records is very expensive CPU-wise; you'd need a pretty impressive RAID array to read WAL from to saturate a single CPU.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

