Zeugswetter Andreas SB <[EMAIL PROTECTED]> writes:
>> This seems odd. As near as I can tell, O_SYNC is simply a command to do
>> fsync implicitly during each write call. It cannot save any I/O unless
>> I'm missing something significant. Where is the performance difference
>> coming from?

> Yes, odd, but sure very reproducible here.

I tried this on HPUX 10.20, which has not only O_SYNC but also O_DSYNC (defined to do the equivalent of fdatasync()), and got truly fascinating results. Apparently, on this platform these flags change the kernel's buffering behavior! Observe:

$ gcc -Wall -O tfsync.c
$ time a.out
real    1m0.32s
user    0m0.02s
sys     0m16.16s

$ gcc -Wall -O -DINIT_WRITE tfsync.c
$ time a.out
real    1m15.11s
user    0m0.04s
sys     0m32.76s

Note the large amount of system time here, and the fact that the extra time in the INIT_WRITE case is all system time. I have previously observed that fsync() on HPUX 10.20 appears to iterate through every kernel disk buffer belonging to the file, presumably checking their dirty bits one by one. The INIT_WRITE form loses because each fsync in the second loop has to iterate through a full 16MB worth of buffers, whereas without INIT_WRITE there are only as many buffers as the amount of file we've filled so far. (On this platform, it'd probably be a win to use log segments smaller than 16MB...) It's interesting that there's no visible I/O cost for the extra write pass --- the extra I/O must be completely overlapped with the extra system time.

$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC tfsync.c
$ time a.out
real    0m45.04s
user    0m0.02s
sys     0m0.83s

We just bought back almost all the system time. The only possible explanation is that this way either doesn't keep the buffers from prior blocks, or doesn't scan them for dirty bits. I note that the open(2) man page is phrased so that O_SYNC is actually defined not to fsync the whole file, but only the part you just wrote --- I wonder if it's actually implemented that way?

$ gcc -Wall -O -DINIT_WRITE -DUSE_SPARSE tfsync.c
$ time a.out
real    1m2.96s
user    0m0.02s
sys     0m27.11s

$ gcc -Wall -O -DINIT_WRITE -DUSE_OSYNC -DUSE_SPARSE tfsync.c
$ time a.out
real    1m1.34s
user    0m0.01s
sys     0m0.59s

Sparse initialization wins a little in the non-O_SYNC case, but loses compared with O_SYNC on. Not sure why; perhaps it alters the amount of I/O that has to be done for indirect blocks?

$ gcc -Wall -O -DINIT_WRITE -DUSE_ODSYNC tfsync.c
$ time a.out
real    0m21.40s
user    0m0.02s
sys     0m0.60s

And the pièce de résistance: O_DSYNC *actually works* here, even though I previously found that the fdatasync() call is stubbed to fsync() in libc! This old HP box is built like a tank and has a similar lack of attention to noise level ;-) so I can very easily tell by ear that I am not getting back-and-forth seeks in this last case, even if the timing didn't prove it.

$ gcc -Wall -O -DUSE_ODSYNC tfsync.c
$ time a.out
real    1m1.56s
user    0m0.02s
sys     0m0.67s

Without INIT_WRITE, we are back to essentially the performance of fsync even though we use O_DSYNC. This is expected, since the inode must be written to change the EOF value. Interestingly, the system time is small here, whereas in my first example it was large; yet the elapsed time is the same. Evidently the system time is nearly all overlapped with I/O in the first example.

At least on this platform, it would definitely be worthwhile to use O_DSYNC even if that meant a sync per write rather than per transaction.

Can anyone else reproduce these results? I attach my modified version of Andreas' program.
Note that I do not believe his assertion that close() implies fsync() --- on the machines I've used, it demonstrably does not sync. You'll also note that I made the lseek optional in the second loop. It appears to make no real difference, so I didn't include timings with the lseek enabled.

			regards, tom lane
#include <stdio.h>
#include <stdlib.h>		/* for exit() */
#include <string.h>		/* for memset() */
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>		/* for S_IRUSR, S_IWUSR */

/* #define USE_SPARSE */
/* #define INIT_WRITE */
/* #define USE_OSYNC */
/* #define USE_ODSYNC */
/* #define USE_LSEEK */

int
main()
{
	char	zbuffer[8192];
	int		nbytes;
	int		fd;
	char   *tpath = "xlog_testfile";

	fd = open(tpath, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
	if (fd < 0)
		exit(1);

#ifdef INIT_WRITE
#ifdef USE_SPARSE
	/* the sparse file makes things worse here */
	lseek(fd, 16*1024*1024 - 1, SEEK_SET);
	write(fd, "", 1);			/* write one zero byte at the end */
#else
	/* preallocate the full segment with real writes of zeroes */
	memset(zbuffer, '\0', sizeof(zbuffer));
	for (nbytes = 0; nbytes < 16*1024*1024; nbytes += sizeof(zbuffer))
	{
		write(fd, zbuffer, sizeof(zbuffer));
	}
#endif
	fsync(fd);
#endif

	/* no fsync here, since close does it for us */
	/* You think so? */
	close(fd);

	/* reopen with the sync behavior under test */
#ifdef USE_OSYNC
	fd = open(tpath, O_RDWR | O_SYNC, S_IRUSR | S_IWUSR);
#else
#ifdef USE_ODSYNC
	fd = open(tpath, O_RDWR | O_DSYNC, S_IRUSR | S_IWUSR);
#else
	fd = open(tpath, O_RDWR, S_IRUSR | S_IWUSR);
#endif
#endif
	if (fd < 0)
		exit(1);

	/* write the segment, syncing after each block one way or another */
	memset(zbuffer, 'X', sizeof(zbuffer));
	nbytes = 0;
	do
	{
#ifdef USE_LSEEK
		lseek(fd, nbytes, SEEK_SET);
#endif
		write(fd, zbuffer, sizeof(zbuffer));
#ifndef USE_OSYNC
#ifndef USE_ODSYNC
		fsync(fd);
#endif
#endif
		nbytes += sizeof(zbuffer);
	} while (nbytes < 16*1024*1024);

	close(fd);
	return 0;
}
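To make the conclusion above concrete, here is a minimal sketch (mine, not part of Andreas' program or the one above; the file name and size constants are illustrative) of how a log writer might pick the cheapest sync method at compile time: preallocate the segment with real writes, then prefer O_DSYNC where the platform defines it, fall back to O_SYNC, and failing both, open normally and fsync() after each write.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define SEG_SIZE	(16 * 1024 * 1024)	/* one log segment, as above */
#define BLCKSZ		8192

/* Prefer O_DSYNC (data-only sync, which can skip the inode write),
 * fall back to O_SYNC, else open normally and fsync() per write. */
#ifdef O_DSYNC
#define SYNC_FLAG	O_DSYNC
#define NEED_FSYNC	0
#else
#ifdef O_SYNC
#define SYNC_FLAG	O_SYNC
#define NEED_FSYNC	0
#else
#define SYNC_FLAG	0
#define NEED_FSYNC	1
#endif
#endif

int
main()
{
	char	buffer[BLCKSZ];
	int		nbytes;
	int		fd;
	char   *path = "xlog_segment";		/* illustrative name */

	/*
	 * Preallocate the whole segment with real (non-sparse) writes and
	 * sync it, so that the later synchronous writes never change the
	 * EOF and thus never force an inode write.
	 */
	fd = open(path, O_RDWR | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
	if (fd < 0)
		exit(1);
	memset(buffer, '\0', sizeof(buffer));
	for (nbytes = 0; nbytes < SEG_SIZE; nbytes += sizeof(buffer))
		write(fd, buffer, sizeof(buffer));
	fsync(fd);
	close(fd);

	/* Reopen with the best available sync flag and do the log writes. */
	fd = open(path, O_RDWR | SYNC_FLAG, S_IRUSR | S_IWUSR);
	if (fd < 0)
		exit(1);
	memset(buffer, 'X', sizeof(buffer));
	for (nbytes = 0; nbytes < SEG_SIZE; nbytes += sizeof(buffer))
	{
		write(fd, buffer, sizeof(buffer));
#if NEED_FSYNC
		fsync(fd);					/* no sync flag available */
#endif
	}
	close(fd);
	return 0;
}

Built the same way as above (gcc -Wall -O), this should take the fast O_DSYNC path on HPUX 10.20 and degrade gracefully on platforms without O_DSYNC; whether the O_DSYNC win carries over elsewhere is exactly the open question.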