"Curtis Faith" <[EMAIL PROTECTED]> writes: > Even the theoretical limit you mention of one transaction per revolution > per committing process seem like a significant bottleneck.
Well, too bad. If you haven't gotten your commit record down to disk, then *you have not committed*. This is not negotiable. (If you think it is, then turn off fsync and quit worrying ;-))

An application that is willing to have multiple transactions in flight at the same time can open up multiple backend connections to issue those transactions, and thereby perhaps beat the theoretical limit. But for serial transactions, there is not anything we can do to beat that limit. (At least not with the log structure we have now. One could imagine dropping a commit record into the nearest one of multiple buckets that are carefully scattered around the disk. But exploiting that would take near-perfect knowledge about disk head positioning; it's even harder to solve than the problem we're considering now.)

> I still think that it would be much faster to just keep writing the WAL log
> blocks when they fill up and have a separate process wake the commiting
> process when the write completes. This would eliminate WAL writing as a
> bottleneck.

You're failing to distinguish total throughput to the WAL drive from response time seen by any one transaction. Yes, a policy of writing each WAL block once when it fills would maximize potential throughput, but it would also mean a potentially very large delay for a transaction waiting to commit. The lower the system load, the worse the performance on that scale.

The scheme we now have (with my recent patch) essentially says that the commit delay seen by any one transaction is at most two disk rotations. Unfortunately it's also at least one rotation :-(, except in the case where there is no contention, ie, no already-scheduled WAL write when the transaction reaches the commit stage. It would be nice to be able to say "at most one disk rotation" instead --- but I don't see how to do that in the absence of detailed information about disk head position.

Something I was toying with this afternoon: assume we have a background process responsible for all WAL writes --- not only filled buffers, but the currently active buffer. It periodically checks to see if there are unwritten commit records in the active buffer, and if so schedules a write for them. If this could be done during each disk rotation, "just before" the disk reaches the active WAL log block, we'd have an ideal solution. And it would not be too hard for such a process to determine the right time: it could measure the drive rotational speed by observing the completion times of successive writes to the same sector, and it wouldn't take much logic to empirically find the latest time at which a write can be issued and have a good probability of hitting the disk on time. (At least, this would work pretty well given a dedicated WAL drive, else there'd be too much interference from other I/O requests.)
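Just to be concrete about the calibration step, here is a rough and completely untested sketch of how the writer process might measure the rotation period. It assumes a dedicated, otherwise-idle WAL drive, a scratch file path supplied by the caller, and that O_SYNC writes really do wait for the platter rather than being absorbed by a drive cache:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define BLCKSZ      8192
#define NSAMPLES    32

/*
 * Estimate the drive's rotation period, in microseconds, by timing
 * repeated synchronous overwrites of the same block.  An overwrite of an
 * already-written block can't complete until the platter comes back
 * around, so the average interval should approximate one revolution.
 */
double
estimate_rotation_usec(const char *path)
{
	char		buf[BLCKSZ];
	struct timeval start,
				stop;
	double		total = 0.0;
	int			fd,
				i;

	memset(buf, 0, sizeof(buf));
	fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
	if (fd < 0)
		return -1.0;

	/* prime the block so the timed writes are pure overwrites */
	if (write(fd, buf, BLCKSZ) != BLCKSZ)
	{
		close(fd);
		return -1.0;
	}

	for (i = 0; i < NSAMPLES; i++)
	{
		lseek(fd, 0, SEEK_SET);	/* hit the same block every time */
		gettimeofday(&start, NULL);
		if (write(fd, buf, BLCKSZ) != BLCKSZ)
			break;
		gettimeofday(&stop, NULL);
		total += (stop.tv_sec - start.tv_sec) * 1000000.0 +
			(stop.tv_usec - start.tv_usec);
	}
	close(fd);

	return (i > 0) ? total / i : -1.0;
}

Finding the latest moment at which a write can safely be issued would then just be a matter of nudging the issue time later and later within the revolution until writes start slipping a full turn.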
However, this whole scheme falls down on the same problem we've run into before: user processes can't schedule themselves with millisecond accuracy. The writer process might be able to determine the ideal time to wake up and make the check, but it can't get the Unix kernel to dispatch it then, at least not on most Unixen. The typical scheduling slop is one time slice, which is comparable to if not more than the disk rotation time.

ISTM aio_write only improves the picture if there's some magic in-kernel processing that makes this same kind of judgment as to when to issue the "ganged" write for real, and is able to do it on time because it's in the kernel. I haven't heard anything to make me think that that feature actually exists. AFAIK the kernel isn't much more enlightened about physical head positions than we are.

			regards, tom lane