It appears the fsync problem is pervasive. Here's Linux 2.4.19's version from fs/buffer.c:
    down(&inode->i_sem);                          <- lock
    ret = filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 1);
    if (err && !ret)
            ret = err;
    err = filemap_fdatawait(inode->i_mapping);
    if (err && !ret)
            ret = err;
    up(&inode->i_sem);                            <- unlock

But this is probably not a big factor, for the reasons you outline below: the WALWriteLock causes the same kind of contention anyway.

Tom Lane wrote:
> This is kind of ugly in general terms but I'm not sure that it really
> hurts Postgres. In our present scheme, the only files we ever fsync()
> are WAL log files, not data files. And in normal operation there is
> only one WAL writer at a time, and *no* WAL readers. So an exclusive
> kernel-level lock on a WAL file while we fsync really shouldn't create
> any problem for us. (Unless this indirectly blocks other operations
> that I'm missing?)

I hope you're right, but I see very similar contention problems in the case of many small transactions because of the WALWriteLock. Assume transaction A writes a lot of buffers and XLog entries, so its commit forces a relatively lengthy fsync. Transactions B through E then block, not on the kernel lock taken by fsync, but on the WALWriteLock. When A finishes the fsync and subsequently releases the WALWriteLock, B unblocks and acquires the WALWriteLock for the fsync of its own flush. C blocks on the WALWriteLock waiting to write its XLOG_XACT_COMMIT record; only after B releases can C write its XLOG_XACT_COMMIT. There is evidently a lot of contention on the WALWriteLock. This is a shame for a system that has no locking at the logical level and therefore seems like it could be very, very fast and offer incredible concurrency.

> As I commented before, I think we could do with an extra process to
> issue WAL writes in places where they're not in the critical path for
> a foreground process. But that seems to be orthogonal from this issue.

It's only orthogonal to the fsync-specific contention issue. We still have to worry about the WALWriteLock semantics causing the same contention.
Your idea of a separate LogWriter process could very nicely solve this problem, and with a few enhancements it would accomplish a few other things at the same time.

Back-end servers would not issue fsync calls. They would simply block until the LogWriter had written their record to the disk, i.e. until the sync'd block number was greater than the block containing their XLOG_XACT_COMMIT record. The LogWriter could wake committed back-ends after its log write returns.

The log file would be opened O_DSYNC, O_APPEND every time. The LogWriter would issue writes of the optimal size when enough data was present, or smaller chunks if enough time had elapsed since the last write.

The nice part is that the WALWriteLock semantics could be changed to allow the LogWriter to write to disk while WALWriteLocks are acquired by back-end servers. WALWriteLocks would only be held for the brief time needed to copy the entries into the log buffer. The LogWriter would only need to grab a lock to determine the current end of the log buffer. Since it would be writing blocks that occur earlier in the cache than those of the XLogInsert log writers, it wouldn't need to hold a WALWriteLock while writing the cache buffers.

Many transactions would commit on the same fsync (now really a write with O_DSYNC), and we would get optimal write throughput for the log system.

This would handle all the issues I had, and it doesn't sound like a huge change. In fact, it ends up being almost semantically identical to the aio_write suggestion I made originally, except that the LogWriter does the background writing instead of the OS, and we don't have to worry about aio implementations and portability.

- Curtis

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org