On Mon, 2008-09-08 at 17:40 -0400, Bruce Momjian wrote:
> Fujii Masao wrote:
> > On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <[EMAIL PROTECTED]> wrote:
> > >>        Merge into WAL writer?
> > >
> > > Uh.. that would mean you'd lose parallelism between WAL writing to disk
> > > and WAL shipping via network. That does not sound appealing to me.
> > 
> > That depends on the order of WAL writing and WAL shipping.
> > How about the following order?
> > 
> > 1. A backend writes WAL to disk.
> > 2. The backend wakes up WAL sender process and sleeps.
> > 3. WAL sender process does WAL shipping and wakes up the backend.
> > 4. The backend issues sync command.
> 
> I am confused why this is considered so complicated.  Having individual
> backends doing the wal transfer to the slave is never going to work
> well.

Agreed.

> I figured we would have a single WAL streamer that continues advancing
> forward in the WAL file, streaming to the standby.  Backends would
> update a shared memory variable specifying how far they want the wal
> streamer to advance and send a signal to the wal streamer if necessary. 
> Backends would monitor another shared memory variable that specifies how
> far the wal streamer has advanced.

Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for
the send operation. The Write and Send operations can then continue
independently of one another. XLogInsert() cannot advance to a new page
while we are waiting to send or write. Notice that the Send process
might be the bottleneck - that is the price of synchronous replication.
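
To make that concrete, here is a minimal, self-contained sketch (not
actual xlog.c code; the struct and field names such as XLogSharedState,
SendRqst and SendResult are invented for illustration) of shared state
carrying separate request/result pointers for Write and Send, so the two
operations can advance independently:

#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* WAL position, modelled as a byte offset */

typedef struct XLogSharedState
{
    /* how far backends have asked each operation to advance */
    XLogRecPtr  WriteRqst;      /* flush-to-disk request */
    XLogRecPtr  SendRqst;       /* ship-to-standby request */

    /* how far each dedicated process has actually got */
    XLogRecPtr  WriteResult;    /* advanced only by the WAL writer */
    XLogRecPtr  SendResult;     /* advanced only by the WAL sender */
} XLogSharedState;

/*
 * A remote synchronous commit is complete only when both the Write and
 * the Send results have passed the LSN of the commit record.
 */
static int
RemoteCommitIsComplete(const XLogSharedState *state, XLogRecPtr commitLSN)
{
    return state->WriteResult >= commitLSN &&
           state->SendResult  >= commitLSN;
}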

Backends then wait
* not at all for async commit
* just for Write for local synchronous commit
* for both Write and Send for remote synchronous commit
(various additional options for what happens to confirm Send)
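
As a rough sketch of that decision, assuming hypothetical wait primitives
(XLogWaitForWrite and XLogWaitForSend are invented names, not real APIs):

#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef enum
{
    COMMIT_ASYNC,        /* wait for nothing */
    COMMIT_LOCAL_SYNC,   /* wait for the local Write only */
    COMMIT_REMOTE_SYNC   /* wait for both Write and Send */
} CommitMode;

/* assumed primitives: block until the WAL writer / WAL sender has
 * reached at least the given LSN */
extern void XLogWaitForWrite(XLogRecPtr upto);
extern void XLogWaitForSend(XLogRecPtr upto);

static void
WaitForCommitCompletion(CommitMode mode, XLogRecPtr commitLSN)
{
    if (mode == COMMIT_ASYNC)
        return;                     /* asynchronous commit: no wait at all */

    XLogWaitForWrite(commitLSN);    /* local synchronous commit waits here */

    if (mode == COMMIT_REMOTE_SYNC)
        XLogWaitForSend(commitLSN); /* remote synchronous commit also waits here */
}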

So normal backends neither write nor send. We have two dedicated
processes, one for write, one for send. We need to put an extra test
into the WALWriter loop so that it will continue immediately (with no
wait) if there is an outstanding request for synchronous operation.
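
Something along these lines, purely as a sketch; sync_rqst_pending and
WalWriterSleep are assumed names, not the actual walwriter.c code, and
XLogBackgroundFlush() is declared here only to keep the sketch standalone:

#include <stdbool.h>

extern volatile int sync_rqst_pending;    /* assumed shared-memory counter,
                                             bumped by waiting backends */
extern void XLogBackgroundFlush(void);    /* declared for the sketch */
extern void WalWriterSleep(int ms);       /* assumed sleep primitive */

static void
WalWriterLoop(void)
{
    for (;;)
    {
        XLogBackgroundFlush();

        /*
         * Extra test: if a backend is already waiting on a synchronous
         * request, skip the sleep and loop again immediately so the
         * waiter is serviced with no added latency.
         */
        if (sync_rqst_pending > 0)
            continue;

        WalWriterSleep(200);        /* the usual wal_writer_delay nap */
    }
}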

This gives us the Group Commit feature also, even if we are not using
replication. So we can drop the commit_delay stuff.

XLogBackgroundFlush() processes data a page at a time if it can. That may
not be the correct batch size for XLogBackgroundSend(), so we may need a
tunable for the MTU. Under heavy load we need the Write and Send to act
in a way that maximises throughput rather than minimises response time, as
we do now.
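
Purely illustrative, a sketch of how a send-side batch size might be
applied; wal_sender_mtu and XLogSendChunk are assumed names, not existing
settings or functions:

#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogRecPtr;

static int wal_sender_mtu = 8192;    /* assumed tunable, in bytes */

/* assumed primitive that ships WAL bytes [start, start + len) to the standby */
extern void XLogSendChunk(XLogRecPtr start, size_t len);

static void
XLogBackgroundSend(XLogRecPtr sent_upto, XLogRecPtr send_rqst)
{
    /*
     * Under heavy load, push the outstanding WAL in MTU-sized batches to
     * favour throughput; a waiting synchronous commit is satisfied as soon
     * as the chunk containing its LSN has gone out.
     */
    while (sent_upto < send_rqst)
    {
        size_t  chunk = (size_t) (send_rqst - sent_upto);

        if (chunk > (size_t) wal_sender_mtu)
            chunk = (size_t) wal_sender_mtu;

        XLogSendChunk(sent_upto, chunk);
        sent_upto += chunk;
    }
}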

If wal_buffers overflows, we continue to hold WALInsertLock while we
wait for WALWriter and WALSender to complete.

We should increase default wal_buffers to 64.

After (or during) XLogInsert, backends will sleep in a proc queue,
similar to LWLocks and protected by a spinlock. When preparing to
write/send, the WAL process should read the proc at the *tail* of the
queue to see what the next LogwrtRqst should be. Then it performs its
action and wakes procs up starting with the head of the queue. We would
add an LSN field to PGPROC, so the WAL processes can check whether a
backend should be woken. The LSN field can be accessed without spinlocks
since it is only ever set by the backend itself and only read while the
backend is sleeping. So we acquire the spinlock, find the tail, drop the
spinlock, then read the LSN of the backend that was the tail.
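
A simplified, self-contained model of that protocol (the structs and the
LockQueue/UnlockQueue/WakeProc primitives are stand-ins, not the real
PGPROC or spinlock code):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogRecPtr;

typedef struct WaitProc
{
    XLogRecPtr        lsn;      /* set once by the backend before it sleeps */
    struct WaitProc  *next;     /* towards the tail of the queue */
} WaitProc;

typedef struct WaitQueue
{
    /* a spinlock protecting head/tail lives here in the real thing */
    WaitProc   *head;           /* oldest sleeper, woken first */
    WaitProc   *tail;           /* newest sleeper, holds the furthest request */
} WaitQueue;

extern void LockQueue(WaitQueue *q);     /* assumed spinlock primitives */
extern void UnlockQueue(WaitQueue *q);
extern void WakeProc(WaitProc *p);       /* assumed wakeup primitive */

/* Before a write/send: the tail of the queue gives the next LogwrtRqst. */
static XLogRecPtr
NextLogwrtRqst(WaitQueue *q)
{
    WaitProc   *tail;

    LockQueue(q);
    tail = q->tail;
    UnlockQueue(q);

    /*
     * The LSN is written only by the backend itself before it sleeps and
     * read only while it sleeps, so it can be read without the spinlock.
     */
    return tail ? tail->lsn : 0;
}

/* After the action has completed up to "result": wake from the head. */
static void
WakeCompleted(WaitQueue *q, XLogRecPtr result)
{
    WaitProc   *wake = NULL;
    WaitProc  **wake_tail = &wake;

    LockQueue(q);
    while (q->head != NULL && q->head->lsn <= result)
    {
        WaitProc   *p = q->head;

        q->head = p->next;
        if (q->head == NULL)
            q->tail = NULL;

        p->next = NULL;
        *wake_tail = p;
        wake_tail = &p->next;
    }
    UnlockQueue(q);

    /* wake the detached procs in queue order, head first */
    while (wake != NULL)
    {
        WaitProc   *next = wake->next;

        WakeProc(wake);
        wake = next;
    }
}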

Another thought: we might measure the time a Send takes and specify a
limit on how long we are prepared to wait for confirmation.
Limit = 0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit.
This would give better user behaviour across a highly variable network
connection.
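
A rough sketch of how that bounded wait could look; replication_wait_limit_ms
and XLogWaitForSendConfirm are invented names, not existing settings or
functions:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static int replication_wait_limit_ms = 0;   /* assumed tunable:
                                                0  => asynchronous (never wait)
                                                >0 => wait at most this long */

/* assumed primitive: wait until the WAL sender has confirmed up to "upto"
 * or until "timeout_ms" elapses; returns true if confirmed in time */
extern bool XLogWaitForSendConfirm(XLogRecPtr upto, int timeout_ms);

static bool
WaitForSendWithinLimit(XLogRecPtr commitLSN)
{
    if (replication_wait_limit_ms == 0)
        return true;    /* asynchronous: don't wait for the standby at all */

    /* synchronous up to the limit: stop waiting (but still commit locally)
     * if the standby has not confirmed within the limit */
    return XLogWaitForSendConfirm(commitLSN, replication_wait_limit_ms);
}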

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

