On Mon, 2008-09-08 at 17:40 -0400, Bruce Momjian wrote:
> Fujii Masao wrote:
> > On Mon, Sep 8, 2008 at 8:44 PM, Markus Wanner <[EMAIL PROTECTED]> wrote:
> > >> Merge into WAL writer?
> > >
> > > Uh.. that would mean you'd lose parallelism between WAL writing to disk
> > > and WAL shipping via network. That does not sound appealing to me.
> >
> > That depends on the order of WAL writing and WAL shipping.
> > How about the following order?
> >
> > 1. A backend writes WAL to disk.
> > 2. The backend wakes up WAL sender process and sleeps.
> > 3. WAL sender process does WAL shipping and wakes up the backend.
> > 4. The backend issues sync command.
>
> I am confused why this is considered so complicated. Having individual
> backends doing the wal transfer to the slave is never going to work
> well.

Agreed.

> I figured we would have a single WAL streamer that continues advancing
> forward in the WAL file, streaming to the standby. Backends would
> update a shared memory variable specifying how far they want the wal
> streamer to advance and send a signal to the wal streamer if necessary.
> Backends would monitor another shared memory variable that specifies how
> far the wal streamer has advanced.

Yes. We should have a LogwrtRqst pointer and LogwrtResult pointer for the
send operation. The Write and Send operations can then continue
independently of one another. XLogInsert() cannot advance to a new page
while we are waiting to send or write. Notice that the Send process
might be the bottleneck - that is the price of synchronous replication.

Backends then wait
* not at all for asynch commit
* just for Write for local synch commit
* for both Write and Send for remote synch commit
(various additional options for what happens to confirm Send)

So normal backends neither write nor send. We have two dedicated
processes, one for write, one for send. We need to put an extra test
into the WALWriter loop so that it will continue immediately (with no
wait) if there is an outstanding request for synchronous operation. This
gives us the Group Commit feature also, even if we are not using
replication. So we can drop the commit_delay stuff.

XLogBackgroundFlush() processes data a page at a time if it can. That
may not be the correct batch size for XLogBackgroundSend(), so we may
need a tunable for the MTU. Under heavy load we need the Write and Send
to act in a way to maximise throughput rather than minimise response
time, as we do now.

If wal_buffers overflows, we continue to hold WALInsertLock while we
wait for WALWriter and WALSender to complete. We should increase default
wal_buffers to 64.

After (or during) XLogInsert backends will sleep in a proc queue,
similar to LWlocks and protected by a spinlock.
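
To make the wait rules above concrete, here is a rough sketch, not actual PostgreSQL code. It assumes an LSN can be treated as a single 64-bit byte position, and the names WalProgress, CommitMode, request_advance() and backend_must_wait() are hypothetical, invented only for illustration. Locking of the shared state is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified LSN: one 64-bit byte position (an assumption of this sketch). */
typedef uint64_t Lsn;

/* Hypothetical commit modes, one per case in the list above. */
typedef enum
{
    COMMIT_ASYNC,       /* wait for nothing */
    COMMIT_LOCAL_SYNC,  /* wait for Write only */
    COMMIT_REMOTE_SYNC  /* wait for both Write and Send */
} CommitMode;

/* Hypothetical shared state: a request/result pair per operation. */
typedef struct
{
    Lsn write_rqst;     /* how far backends want WAL written */
    Lsn write_result;   /* how far the write process has got */
    Lsn send_rqst;      /* how far backends want WAL sent */
    Lsn send_result;    /* how far the send process has got */
} WalProgress;

/*
 * A backend asks both processes to advance at least to commit_lsn.
 * Requests only ever move forward.
 */
static void
request_advance(WalProgress *p, Lsn commit_lsn)
{
    assert(p != NULL);
    if (commit_lsn > p->write_rqst)
        p->write_rqst = commit_lsn;
    if (commit_lsn > p->send_rqst)
        p->send_rqst = commit_lsn;
}

/* Does a backend committing at commit_lsn still have to sleep? */
static bool
backend_must_wait(const WalProgress *p, CommitMode mode, Lsn commit_lsn)
{
    assert(p != NULL);
    switch (mode)
    {
        case COMMIT_ASYNC:
            return false;
        case COMMIT_LOCAL_SYNC:
            return p->write_result < commit_lsn;
        case COMMIT_REMOTE_SYNC:
            return p->write_result < commit_lsn ||
                   p->send_result < commit_lsn;
    }
    return false;               /* not reached for valid modes */
}
```

In this sketch the commit mode only changes what a backend waits for; the request pointers advance in every mode, so the Write and Send processes can each chase their own request independently of one another.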

When preparing to write/send, the WAL process should read the proc at
the *tail* of the queue to see what the next LogwrtRqst should be. Then
it performs its action and wakes procs up starting with the head of the
queue. We would add LSN into PGPROC, so WAL processes can check whether
the backend should be woken. The LSN field can be accessed without
spinlocks since it is only ever set by the backend itself and only read
while a backend is sleeping. So we acquire the spinlock, find the tail,
drop the spinlock, then read the LSN of the backend that (was) the tail.

Another thought occurs that we might measure the time a Send takes and
specify a limit on how long we are prepared to wait for confirmation.
Limit=0 => asynchronous. Limit > 0 implies synchronous-up-to-the-limit.
This would give better user behaviour across a highly variable network
connection.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support