On 16.01.2012 00:42, Peter Geoghegan wrote:
I've also attached the results of a pgbench-tools driven benchmark,
which are quite striking (Just the most relevant image - e-mail me
privately if you'd like a copy of the full report, as I don't want to
send a large PDF file to the list as a courtesy to others). Apart from
the obvious improvement in throughput, there is also a considerable
improvement in average and worst latency at all client counts.

Impressive results. How about uploading the PDF to the community wiki?

The main way that I've added value here is by refactoring and fixing
bugs. There were some tricky race conditions that caused the
regression tests to fail for that early draft patch, but there were a
few other small bugs too. There is an unprecedented latch pattern
introduced by this patch: Two processes (the WAL Writer and any given
connection backend) have a mutual dependency. Only one may sleep, in
which case the other is responsible for waking it. Much of the
difficulty in debugging this patch, and much of the complexity that
I've added, came from preventing both from simultaneously sleeping,
even in the face of various known failure modes like postmaster death.
If this happens, it does of course represent a denial-of-service, so
that's something that reviewers are going to want to heavily
scrutinise. I suppose that sync rep should be fine here, as it waits
on walsenders, but I haven't actually comprehensively tested the
patch, so there may be undiscovered unpleasant interactions with other
xlog related features. I can report that it passes the regression
tests successfully, and on an apparently consistently basis - I
battled with intermittent failures for a time.

I think it might be simpler if it wasn't the background writer that's responsible for "driving" the group commit queue, but the backends themselves. When a flush request comes in, you join the queue, and if someone else is already doing the flush, sleep until the driver wakes you up. If no-one is doing the flush yet (ie. the queue is empty), start doing it yourself. You'll need a state variable to keep track who's driving the queue, but otherwise I think it would be simpler as there would be no dependency on WAL writer.

I tend think of the group commit facility as a bus. Passengers can hop on board at any time, and they take turns on who drives the bus. When the first passengers hops in, there is no driver so he takes the driver seat. When the next passenger hops in, he sees that someone is driving the bus already, so he sits down, and places a big sign on his forehead stating the bus stop where he wants to get off, and goes to sleep. When the driver has reached his own bus stop, he wakes up all the passengers who wanted to get off at the same stop or any of the earlier stops [1]. He also wakes up the passenger whose bus stop is the farthest from the current stop, and gets off the bus. The woken-up passengers who have already reached their stops can immediately get off the bus, and the one who has not notices that no-one is driving the bus anymore, so he takes the driver seat.

[1] in a real bus, a passenger would not be happy if he's woken up too late and finds himself at the next stop instead of the one where he wanted to go, but for group commit, that is fine.

In this arrangement, you could use the per-process semaphore for the sleep/wakeups, instead of latches. I'm not sure if there's any difference, but semaphores are more tried and tested, at least.

The benefits (and, admittedly, the controversies) of this patch go
beyond mere batching of commits: it also largely, though not entirely,
obviates the need for user backends to directly write/flush WAL, and
the WAL Writer may never sleep if it continually finds work to do -
wal_writer_delay is obsoleted, as are commit_siblings and
commit_delay.

wal_writer_delay is still needed for controlling how often asynchronous commits are flushed to disk.

Auxiliary processes cannot use group commit. The changes made prevent
them from availing of commit_siblings/commit_delay parallelism,
because it doesn't exist anymore.

Auxiliary processes have PGPROC entries too. Why can't they participate?

Group commit is sometimes throttled, which seems appropriate - if a
backend requests that the WAL Writer flush an LSN deemed too far from
the known flushed point, that request is rejected and the backend goes
through another path, where XLogWrite() is called.

Hmm, if the backend doing the big flush gets the WALWriteLock before a bunch of group committers, the group committers will have to wait until the big flush is finished, anyway. I presume the idea of the throttling is to avoid the situation where a bunch of small commits need to wait for a huge flush to finish.

Perhaps the big flusher should always join the queue, but use some heuristic to first flush up to the previous commit request, to wake up others quickly, and do another flush to flush its own request after that.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to