Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
Ok, I finally got a couple of tests done against CVS from Aug 3, 2005. I'm not sure if I'm showing anything insightful, though. I've learned that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on Linux, which you guys may have already known. There appears to be a fair performance decrease in using open_sync. Just to double check, am I correct in understanding that only open_sync uses O_DIRECT?

fdatasync  http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/  5462 notpm
open_sync  http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/  4860 notpm

Mark

---(end of broadcast)---
TIP 4: Have you searched our list archives?
       http://archives.postgresql.org
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
Mark Wong wrote:
> Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
> I'm not sure if I'm showing anything insightful though. I've learned
> that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
> Linux, which you guys may have already known.

That is not what we thought for Linux, but many other OS's behave that way.

> There appears to be a fair performance decrease in using open_sync.
> Just to double check, am I correct in understanding only open_sync
> uses O_DIRECT?

Right.

> fdatasync  http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/  5462 notpm
> open_sync  http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/  4860 notpm

Right now open_sync is our last choice, which seems to still be valid for Linux, at least.

--
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
Mark Wong wrote:
> O_DIRECT + fsync() can make sense. It avoids the copying of data
> to the page cache before being written and will also guarantee
> that the file's metadata is also written to disk. It also
> prevents the page cache from filling up with write data that
> will never be read (I assume it is only read if a recovery
> is necessary - which should be rare). It can also help disks
> with write back cache when using a journaling file system
> that uses i/o barriers. You would want to use large writes,
> since the kernel page cache won't be writing multiple pages for you.

Right, but it seems O_DIRECT is pretty much the same as O_DIRECT with O_DSYNC because the data is always written to disk on write(). Our logic is that there is nothing for fdatasync to do in most cases after using O_DIRECT, so the O_DIRECT/fdatasync() combination doesn't make sense.

And FreeBSD, and perhaps others, need O_SYNC or fdatasync with O_DIRECT because O_DIRECT doesn't force stuff to disk in all cases.

> I need to look at the kernel code more to comment on O_DIRECT with O_SYNC.
>
> Questions:
>
> Does the database transaction logger preallocate the log file?

Yes.

> Does the logger care about the order in which each write hits the disk?

Not really.

--
  Bruce Momjian
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
Here are comments that Daniel McNeil made earlier, which I neglected to forward until now. I've cc'ed him and Mark Havercamp, whom some of you got to meet the other day.

Mark -

With O_DIRECT on Linux, when the write() returns, the i/o has been transferred to the disk.

Normally, this i/o will be DMAed directly from user-space to the device. The current exception is when doing an O_DIRECT write to a hole in a file. (If a program does a truncate() or lseek()/write() that makes a file larger, the file system does not allocate space between the old end of file and the new end of file.) An O_DIRECT write to a hole like this requires the file system to allocate space, but there is a race condition between the O_DIRECT write, which allocates the space and then writes to initialize the newly allocated data, and any other process that attempts a buffered (page cache) read of the same area in the file -- it was possible for the read to see data from the allocated region before the O_DIRECT write(). The fix in Linux is for the O_DIRECT write() to fall back to buffered i/o to do the write() and then flush the data from the page cache to the disk.

A write() with O_DIRECT only means the data has been transferred to the disk. Depending on the file system and mount options, it does not mean the metadata for the file has been written to disk (see the fsync man page). fsync() will guarantee that the data and metadata have been written to disk.

Lastly, if a disk has a write back cache, an O_DIRECT write() does not guarantee that the disk has put the data on the physical media. I think some of the journaling file systems now support i/o barriers on commit, which will flush the disk write back cache. (I'm still looking at the kernel code to see how this is done.)

Conclusion:

O_DIRECT + fsync() can make sense. It avoids the copying of data to the page cache before being written and will also guarantee that the file's metadata is also written to disk. It also prevents the page cache from filling up with write data that will never be read (I assume it is only read if a recovery is necessary - which should be rare). It can also help disks with write back cache when using a journaling file system that uses i/o barriers. You would want to use large writes, since the kernel page cache won't be writing multiple pages for you.

I need to look at the kernel code more to comment on O_DIRECT with O_SYNC.

Questions:

Does the database transaction logger preallocate the log file?

Does the logger care about the order in which each write hits the disk?

Now someone else can comment on my comments.

Daniel
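As an illustration of the O_DIRECT + fsync() combination with one large write, here is a minimal C sketch. It is not taken from any patch in this thread: the function name write_log_blocks, the constants LOG_BLOCK_SIZE and LOG_ALIGNMENT, and the fallback to buffered I/O are all assumptions made for the example, and the alignment that O_DIRECT really requires depends on the kernel, file system, and device.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LOG_BLOCK_SIZE 8192     /* hypothetical log block size */
#define LOG_ALIGNMENT  4096     /* assumed to satisfy O_DIRECT alignment rules */

/*
 * Write nblocks contiguous log blocks with ONE write() call, then fsync().
 * O_DIRECT is tried first; if the platform or file system rejects it, the
 * function falls back to ordinary buffered I/O.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t
write_log_blocks(const char *path, int nblocks, char fill)
{
    size_t      len;
    void       *buf = NULL;
    ssize_t     written = -1;
    int         fd = -1;

    assert(nblocks > 0);
    len = (size_t) nblocks * LOG_BLOCK_SIZE;

    /* O_DIRECT needs an aligned user buffer; plain malloc() may not do. */
    if (posix_memalign(&buf, LOG_ALIGNMENT, len) != 0)
        return -1;
    memset(buf, fill, len);

#ifdef O_DIRECT
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
#endif
    if (fd < 0)                 /* no O_DIRECT here: use buffered I/O */
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd >= 0)
    {
        /* One large write instead of nblocks small ones. */
        written = write(fd, buf, len);

#ifdef O_DIRECT
        if (written < 0 && errno == EINVAL)
        {
            /* Some file systems accept the flag at open() but not at write(). */
            int         flags = fcntl(fd, F_GETFL);

            if (flags >= 0 && fcntl(fd, F_SETFL, flags & ~O_DIRECT) == 0)
                written = write(fd, buf, len);
        }
#endif
        /* A short write is treated as a failure to keep the sketch small. */
        if (written != (ssize_t) len)
            written = -1;

        /* fsync() covers file metadata and any buffered fallback path. */
        if (fsync(fd) != 0)
            written = -1;
        if (close(fd) != 0)
            written = -1;
    }

    free(buf);
    return written;
}
```

Note that this sketch extends a freshly created file, which is the hole-filling case described above where Linux falls back to buffered i/o; a logger that preallocates its file would avoid that path. The fsync() also does not address the disk's own write back cache.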
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
Patch applied. Thanks.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Thanks for reviewing!
> But the patch does not work on HEAD, because of the changes in BootStrapXLOG().
> I am sending the patch with a fix for it.
>
> Bruce Momjian pgman@candle.pha.pa.us wrote:
> > If you are doing fsync(), I don't see how O_DIRECT
> > makes any sense because O_DIRECT is writing to disk on every write, and
> > then what is the fsync() actually doing.
>
> It depends on the OS. The Linux man page says,
> http://linux.com.hk/PenguinWeb/manpage.jsp?name=opensection=2
>     File I/O is done directly to/from user space buffers. The I/O is
>     synchronous, i.e., at the completion of the read(2) or write(2)
>     system call, data is **guaranteed to have been transferred**.
> But the FreeBSD man page says,
> http://www.manpages.info/freebsd/open.2.html
>     O_DIRECT may be used to minimize or eliminate the cache effects of
>     reading and writing. The system will attempt to avoid caching the
>     data you read or write. If it cannot avoid caching the data, it will
>     **minimize the impact the data has on the cache**.
>
> In my understanding, the completion of write() with O_DIRECT does not always
> assure an actual write. So there may be a difference between O_DIRECT+O_SYNC
> and O_DIRECT+fsync(), but I think that is not very common.
>
> > What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
> > and O_DSYNC, so it is automatically used in those cases. And of course,
> > if your operating system doesn't support O_DIRECT, it isn't used.
>
> I agree with your way, where O_DIRECT is automatically used.
> I bet the combination of O_DIRECT and O_SYNC is always better than
> using O_SYNC alone.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>
> [ Attachment, skipping... ]

--
  Bruce Momjian
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
ITAGAKI Takahiro wrote:
> Thanks for reviewing!
> But the patch does not work on HEAD, because of the changes in BootStrapXLOG().
> I am sending the patch with a fix for it.

Thanks.

> > If you are doing fsync(), I don't see how O_DIRECT
> > makes any sense because O_DIRECT is writing to disk on every write, and
> > then what is the fsync() actually doing.
>
> It depends on the OS. The Linux man page says,
> http://linux.com.hk/PenguinWeb/manpage.jsp?name=opensection=2
>     File I/O is done directly to/from user space buffers. The I/O is
>     synchronous, i.e., at the completion of the read(2) or write(2)
>     system call, data is **guaranteed to have been transferred**.
> But the FreeBSD man page says,
> http://www.manpages.info/freebsd/open.2.html
>     O_DIRECT may be used to minimize or eliminate the cache effects of
>     reading and writing. The system will attempt to avoid caching the
>     data you read or write. If it cannot avoid caching the data, it will
>     **minimize the impact the data has on the cache**.
>
> In my understanding, the completion of write() with O_DIRECT does not always
> assure an actual write. So there may be a difference between O_DIRECT+O_SYNC
> and O_DIRECT+fsync(), but I think that is not very common.

Yes, I do remember that. I know we _need_ fsync when using O_DIRECT, but the downside of O_DIRECT (forcing every write to disk) is the same as O_SYNC, so it seems that if we are using O_DIRECT, we might as well use O_SYNC too and skip the fsync(). I will add a comment mentioning this.

> > What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
> > and O_DSYNC, so it is automatically used in those cases. And of course,
> > if your operating system doesn't support O_DIRECT, it isn't used.
>
> I agree with your way, where O_DIRECT is automatically used.
> I bet the combination of O_DIRECT and O_SYNC is always better than
> using O_SYNC alone.

OK.

--
  Bruce Momjian
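The approach discussed above, adding O_DIRECT to every O_SYNC/O_DSYNC open when the platform defines it and silently omitting it otherwise, can be sketched roughly as follows. This is an illustrative assumption, not the applied patch: the names WAL_DIRECT_FLAG, wal_sync_method, and wal_open_flags are hypothetical and do not come from xlog.c.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <fcntl.h>

/* If the platform has no O_DIRECT, the flag simply contributes nothing. */
#ifdef O_DIRECT
#define WAL_DIRECT_FLAG O_DIRECT
#else
#define WAL_DIRECT_FLAG 0
#endif

/* Hypothetical stand-ins for the four sync methods named in the thread. */
enum wal_sync_method
{
    WAL_FSYNC,
    WAL_FDATASYNC,
    WAL_OPEN_SYNC,
    WAL_OPEN_DATASYNC
};

/*
 * Return the open(2) flags for a log file under the given sync method.
 * Only the open_* methods sync on every write(), so only they get
 * O_DIRECT; the fsync/fdatasync methods keep plain buffered opens.
 */
int
wal_open_flags(enum wal_sync_method method)
{
    int         flags = O_RDWR;

    assert(method >= WAL_FSYNC && method <= WAL_OPEN_DATASYNC);

    switch (method)
    {
        case WAL_OPEN_SYNC:
            flags |= O_SYNC | WAL_DIRECT_FLAG;
            break;
        case WAL_OPEN_DATASYNC:
#ifdef O_DSYNC
            flags |= O_DSYNC | WAL_DIRECT_FLAG;
#else
            flags |= O_SYNC | WAL_DIRECT_FLAG;  /* no O_DSYNC: use O_SYNC */
#endif
            break;
        case WAL_FSYNC:
        case WAL_FDATASYNC:
            break;              /* synced later by fsync()/fdatasync() */
    }
    return flags;
}
```

Keeping the platform test in one macro means callers never need their own #ifdef, which matches the stated intent that O_DIRECT "isn't used" where the operating system lacks it.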
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
I have modified and attached your patch for your review. I didn't see any value in adding new fsync_method values because, to me, O_DIRECT is basically just like O_SYNC except it doesn't keep a copy of the buffer in the kernel cache. If you are doing fsync(), I don't see how O_DIRECT makes any sense because O_DIRECT is writing to disk on every write, and then what is the fsync() actually doing. This might explain why your fsync/direct and open/direct performance numbers are almost identical. Basically, if you are going to use O_DIRECT, why not use open_sync.

What I did was to add O_DIRECT unconditionally for all uses of O_SYNC and O_DSYNC, so it is automatically used in those cases. And of course, if your operating system doesn't support O_DIRECT, it isn't used.

With your posted performance numbers, perhaps we should favor fsync_method O_SYNC on platforms that have O_DIRECT even if we don't support OPEN_DATASYNC, but I bet most platforms that have O_DIRECT also have O_DATASYNC. Perhaps some folks can run tests once the patch is applied.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Tom Lane wrote:
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.
>
> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in the WAL module will show up as a bad result. O_DIRECT is similar
> to O_DSYNC (at least on Linux), so the benefit of it will disappear
> behind the slow disk revolution.
>
> In the current source, WAL is written as:
>     for (i = 0; i < N; i++) { write(buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
>     write(buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so either or both can be applied.
>
> I tested them on my machine, with the results as follows. They show that
> direct-io and gather-write together are the best choice when writeback-cache
> is off. Are these two patches worth trying if they are used together?
>
>              | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch        | cache     | false  | sync  | sync  | direct | direct
> -------------+-----------+--------+-------+-------+--------+--------
> direct io    | off       | 124.2  | 105.7 |  48.3 |   48.3 |   48.2
> direct io    | on        | 129.1  | 112.3 | 114.1 |  142.9 |  144.5
> gather-write | off       | 124.3  | 108.7 | 105.4 |  (N/A) |  (N/A)
> both         | off       | 131.5  | 115.5 | 114.4 |  145.4 |  145.2
>
> - 20runs * pgbench -s 100 -c 50 -t 200
>   - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
>   - hda(reiserfs) includes system and wal.
>   - hdc(jfs) includes database files. writeback-cache is always on.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>
> [ Attachment, skipping... ]
> [ Attachment, skipping... ]

--
  Bruce Momjian

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.210
diff -c -c -r1.210 xlog.c
*** src/backend/access/transam/xlog.c	23 Jul 2005 15:31:16 -0000	1.210
--- src/backend/access/transam/xlog.c	23 Jul 2005 16:09:12 -0000
***************
*** 48,77 ****
  /*
   * This chunk of hackery attempts to determine which file sync methods
   * are available on the current platform, and to choose an appropriate
   * default method.  We assume that fsync() is always available, and that
   * configure determined whether fdatasync() is.
   */
  #if defined(O_SYNC)
! #define OPEN_SYNC_FLAG		O_SYNC
  #else
  #if defined(O_FSYNC)
! #define OPEN_SYNC_FLAG		O_FSYNC
  #endif
  #endif

  #if defined(O_DSYNC)
  #if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG		O_DSYNC
  #endif
  #else /* !defined(OPEN_SYNC_FLAG) */
  /* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG		O_DSYNC
  #endif
  #endif

  #if defined(OPEN_DATASYNC_FLAG)
  #define DEFAULT_SYNC_METHOD_STR	"open_datasync"
  #define DEFAULT_SYNC_METHOD		SYNC_METHOD_OPEN
--- 48,114 ----
  /*
+  *	Because O_DIRECT bypasses the kernel buffers, and because we never
+  *	read those buffers except during