Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Josh Berkus wrote:
> On 12/6/10 6:10 PM, Tom Lane wrote:
> > Robert Haas writes:
> >> On Mon, Dec 6, 2010 at 9:04 PM, Josh Berkus wrote:
> >>> Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
> >>> From my run, it looks like even so regular fsync might be better than
> >>> open_sync.
> >
> >> But I think you need to use fsync_writethrough if you actually want
> >> durability.
> >
> > Yeah. Unless your laptop contains an SSD, those numbers are garbage on
> > their face. So that's another problem with test_fsync: it omits
> > fsync_writethrough.
>
> Yeah, the issue with test_fsync appears to be that it's designed to work
> without os-specific switches no matter what, not to accurately reflect
> how we access wal.

I have now modified pg_test_fsync to use O_DIRECT for O_SYNC/O_FSYNC,
and O_DSYNC, if supported, so it now matches how we use WAL (except we
don't use O_DIRECT when in 'archive' and 'hot standby' mode). Applied
patch attached.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB   http://enterprisedb.com

  + It's impossible for everything to be true. +

diff --git a/contrib/pg_test_fsync/pg_test_fsync.c b/contrib/pg_test_fsync/pg_test_fsync.c
new file mode 100644
index d075483..49a7b3c
*** a/contrib/pg_test_fsync/pg_test_fsync.c
--- b/contrib/pg_test_fsync/pg_test_fsync.c
***************
*** 23,29 ****
  #define XLOG_BLCKSZ_K	(XLOG_BLCKSZ / 1024)
  
  #define LABEL_FORMAT	"%-32s"
! #define NA_FORMAT		LABEL_FORMAT "%18s"
  #define OPS_FORMAT		"%9.3f ops/sec"
  
  static const char *progname;
--- 23,29 ----
  #define XLOG_BLCKSZ_K	(XLOG_BLCKSZ / 1024)
  
  #define LABEL_FORMAT	"%-32s"
! #define NA_FORMAT		"%18s"
  #define OPS_FORMAT		"%9.3f ops/sec"
  
  static const char *progname;
*************** handle_args(int argc, char *argv[])
*** 134,139 ****
--- 134,144 ----
  	}
  
  	printf("%d operations per test\n", ops_per_test);
+ #if PG_O_DIRECT != 0
+ 	printf("O_DIRECT supported on this platform for open_datasync and open_sync.\n");
+ #else
+ 	printf("Direct I/O is not supported on this platform.\n");
+ #endif
  }
  
  static void
*************** test_sync(int writes_per_op)
*** 184,226 ****
  	/*
  	 * Test open_datasync if available
  	 */
! #ifdef OPEN_DATASYNC_FLAG
! 	printf(LABEL_FORMAT, "open_datasync"
! #if PG_O_DIRECT != 0
! 		" (non-direct I/O)*"
! #endif
! 		);
  	fflush(stdout);
  
! 	if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
! 		die("could not open output file");
! 	gettimeofday(&start_t, NULL);
! 	for (ops = 0; ops < ops_per_test; ops++)
! 	{
! 		for (writes = 0; writes < writes_per_op; writes++)
! 			if (write(tmpfile, buf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
! 				die("write failed");
! 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
! 			die("seek failed");
! 	}
! 	gettimeofday(&stop_t, NULL);
! 	close(tmpfile);
! 	print_elapse(start_t, stop_t);
! 
! 	/*
! 	 * If O_DIRECT is enabled, test that with open_datasync
! 	 */
! #if PG_O_DIRECT != 0
  	if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
  	{
! 		printf(NA_FORMAT, "o_direct", "n/a**\n");
  		fs_warning = true;
  	}
  	else
  	{
! 		printf(LABEL_FORMAT, "open_datasync (direct I/O)");
! 		fflush(stdout);
! 
  		gettimeofday(&start_t, NULL);
  		for (ops = 0; ops < ops_per_test; ops++)
  		{
--- 189,207 ----
  	/*
  	 * Test open_datasync if available
  	 */
! 	printf(LABEL_FORMAT, "open_datasync");
  	fflush(stdout);
  
! #ifdef OPEN_DATASYNC_FLAG
  	if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
  	{
! 		printf(NA_FORMAT, "n/a*\n");
  		fs_warning = true;
  	}
  	else
  	{
! 		if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
! 			die("could not open output file");
  		gettimeofday(&start_t, NULL);
  		for (ops = 0; ops < ops_per_test; ops++)
  		{
*************** test_sync(int writes_per_op)
*** 234,252 ****
  		close(tmpfile);
  		print_elapse(start_t, stop_t);
  	}
- #endif
- 
  #else
! 	printf(NA_FORMAT, "open_datasync", "n/a\n");
  #endif
  
  /*
   * Test fdatasync if available
   */
- #ifdef HAVE_FDATASYNC
  	printf(LABEL_FORMAT, "fdatasync");
  	fflush(stdout);
  
  	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
  		die("could not open output file");
  	gettimeofday(&start_t, NULL);
--- 215,231 ----
  		close(tmpfile);
  		print_elapse(start_t, stop_t);
  	}
  #else
! 	printf(NA_FORMAT, "n/a\n");
  #endif
  
  /*
   * Test fdatasync if available
   */
  	printf(LABEL_FORMAT, "fdatasync");
  	fflush(stdout);
  
+ #ifdef HAVE_FDATASYNC
  	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
  		die("could not open output file");
  	gettimeofday(&start_t, NULL);
*************** test_sync(int writes_per_op)
*** 263,269 ****
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
  #else
! 	printf(NA_FORMAT, "fdatasync", "n/a\n");
  #endif
  
  /*
--- 242,248 ----
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
  #else
! 	printf(NA_FORMAT, "n/a\n");
  #endif
  
  /*
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Marti Raudsepp writes:
> On Tue, Dec 7, 2010 at 03:34, Tom Lane wrote:
>> To my mind, O_DIRECT is not really the key issue here, it's whether to
>> prefer O_DSYNC or fdatasync.

> Since different platforms implement these primitives differently, and
> it's not always clear from the header file definitions which options
> are actually implemented, how about simply hard-coding a default value
> for each platform?

There's not a fixed finite list of "platforms we support". In general
we prefer to avoid designing things that way at all. If we have to have
specific exceptions for specific platforms, we grin and bear it, but for
the most part behavioral differences ought to be driven by configure's
probes for platform features.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On Tue, Dec 7, 2010 at 03:34, Tom Lane wrote:
> To my mind, O_DIRECT is not really the key issue here, it's whether to
> prefer O_DSYNC or fdatasync.

Since different platforms implement these primitives differently, and
it's not always clear from the header file definitions which options
are actually implemented, how about simply hard-coding a default value
for each platform?

1. This would be quite straightforward to code and document (a table of
   platforms and their default wal_sync_method setting).

2. The best performing (or safest) method can be chosen on every
   platform. From the above discussion it seems that Windows and OSX
   should default to fsync_writethrough even if other methods are
   available.

3. It would pre-empt similar surprises if other platforms change their
   header files, like what happened on Linux now.

Sounds like the simple and foolproof solution.

Regards,
Marti
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 10-12-06 09:00 PM, Josh Berkus wrote:
> Steve,
>
>> If you tell me which options to pgbench and which .conf file settings
>> you'd like to see I can probably arrange to run some tests on AIX.
>
> Compile and run test_fsync in PGSRC/src/tools/fsync.

Attached are runs against two different disk sub-systems from a server
running AIX 5.3.

The first one is against the local disks:

Loops = 1

Simple write:
    8k write                        60812.454/second

Compare file sync methods using one write:
    open_datasync 8k write            162.160/second
    open_sync 8k write                158.472/second
    8k write, fdatasync               158.157/second
    8k write, fsync                    45.382/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes          79.472/second
    2 open_sync 8k writes              80.095/second
    8k write, 8k write, fdatasync     159.268/second
    8k write, 8k write, fsync          44.725/second

Compare open_sync with different sizes:
    open_sync 16k write               162.017/second
    2 open_sync 8k writes              79.709/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close             45.361/second
    8k write, close, fsync             36.311/second

The below profile is from the same machine using an IBM DS 6800 SAN for
storage:

Loops = 1

Simple write:
    8k write                        75933.027/second

Compare file sync methods using one write:
    open_datasync 8k write           2762.801/second
    open_sync 8k write               2453.822/second
    8k write, fdatasync              2867.331/second
    8k write, fsync                  1094.048/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        1287.845/second
    2 open_sync 8k writes            1332.084/second
    8k write, 8k write, fdatasync    1966.411/second
    8k write, 8k write, fsync        1048.354/second

Compare open_sync with different sizes:
    open_sync 16k write              2281.425/second
    2 open_sync 8k writes            1401.561/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close           1298.404/second
    8k write, close, fsync           1188.582/second
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
All,

Geirth's results from his FreeBSD 7.1 server using 8.4's test_fsync:

Simple write timing:
    write                    0.007081

Compare fsync times on write() and non-write() descriptor:
If the times are similar, fsync() can sync data written
on a different descriptor.
    write, fsync, close      5.937933
    write, close, fsync      8.056394

Compare one o_sync write to two:
    one 16k o_sync write     7.366927
    two 8k o_sync writes    15.299300

Compare file sync methods with one 8k write:
    (o_dsync unavailable)
    open o_sync, write       7.512682
    (fdatasync unavailable)
    write, fsync             5.856480

Compare file sync methods with two 8k writes:
    (o_dsync unavailable)
    open o_sync, write      15.472910
    (fdatasync unavailable)
    write, fsync             5.880319

... again, open_sync does not look very impressive.

-- 
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 12/6/10 6:10 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Mon, Dec 6, 2010 at 9:04 PM, Josh Berkus wrote:
>>> Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
>>> From my run, it looks like even so regular fsync might be better than
>>> open_sync.
>
>> But I think you need to use fsync_writethrough if you actually want
>> durability.
>
> Yeah. Unless your laptop contains an SSD, those numbers are garbage on
> their face. So that's another problem with test_fsync: it omits
> fsync_writethrough.

Yeah, the issue with test_fsync appears to be that it's designed to work
without os-specific switches no matter what, not to accurately reflect
how we access wal. I'll see if I can do better.

-- 
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Robert Haas writes:
> On Mon, Dec 6, 2010 at 9:04 PM, Josh Berkus wrote:
>> Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
>> From my run, it looks like even so regular fsync might be better than
>> open_sync.

> But I think you need to use fsync_writethrough if you actually want
> durability.

Yeah. Unless your laptop contains an SSD, those numbers are garbage on
their face. So that's another problem with test_fsync: it omits
fsync_writethrough.

regards, tom lane
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On Mon, Dec 6, 2010 at 9:04 PM, Josh Berkus wrote:
>
>> Mac OS X: Like Solaris, there's a similar mechanism but it's not
>> O_DIRECT; see
>> http://stackoverflow.com/questions/2299402/how-does-one-do-raw-io-on-mac-os-x-ie-equivalent-to-linuxs-o-direct-flag
>> for notes about the F_NOCACHE feature used. Same basic situation as
>> Solaris; there's an API, but PostgreSQL doesn't use it yet.
>
> Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
> From my run, it looks like even so regular fsync might be better than
> open_sync.

But I think you need to use fsync_writethrough if you actually want
durability.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
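To illustrate why a plain fsync() and a write-through sync can differ on Mac OS X, here is a minimal sketch, assuming the platform's fcntl(F_FULLFSYNC) request is what flushes the drive's write cache. The function name sync_writethrough_sketch is hypothetical and this is not the backend's implementation; on platforms without F_FULLFSYNC the sketch can only fall back to fsync(), which gives no extra guarantee.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush "fd" as far down the stack as this platform
 * allows.  Returns 0 on success and -1 on failure.
 */
int
sync_writethrough_sketch(int fd)
{
#ifdef F_FULLFSYNC
	/* Mac OS X: ask the drive to flush its own write cache as well */
	return (fcntl(fd, F_FULLFSYNC, 0) == -1) ? -1 : 0;
#else
	/* no write-through request available here; plain fsync() only */
	return fsync(fd);
#endif
}
```

On a laptop drive with its write cache enabled, the F_FULLFSYNC path would be expected to report far fewer operations per second than the fsync figures quoted in this thread, which is the point being made about durability.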
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
> Mac OS X: Like Solaris, there's a similar mechanism but it's not
> O_DIRECT; see
> http://stackoverflow.com/questions/2299402/how-does-one-do-raw-io-on-mac-os-x-ie-equivalent-to-linuxs-o-direct-flag
> for notes about the F_NOCACHE feature used. Same basic situation as
> Solaris; there's an API, but PostgreSQL doesn't use it yet.

Actually, on OSX 10.5.8, o_dsync and fdatasync aren't even available.
From my run, it looks like even so regular fsync might be better than
open_sync.

Results from a MacBook:

Sidney-Stratton:fsync josh$ ./test_fsync
Loops = 1

Simple write:
    8k write                         2121.004/second

Compare file sync methods using one write:
    (open_datasync unavailable)
    open_sync 8k write               1993.833/second
    (fdatasync unavailable)
    8k write, fsync                  1878.154/second

Compare file sync methods using two writes:
    (open_datasync unavailable)
    2 open_sync 8k writes            1005.009/second
    (fdatasync unavailable)
    8k write, 8k write, fsync        1709.862/second

Compare open_sync with different sizes:
    open_sync 16k write              1728.803/second
    2 open_sync 8k writes             969.416/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close           1772.572/second
    8k write, close, fsync           1939.897/second

-- 
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Steve,

> If you tell me which options to pgbench and which .conf file settings
> you'd like to see I can probably arrange to run some tests on AIX.

Compile and run test_fsync in PGSRC/src/tools/fsync.

-- 
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Greg Smith writes:
> So my guess is that some small percentage of Windows users might notice
> a change here, and some testing on FreeBSD would be useful too. That's
> about it for platforms that I think anybody needs to worry about.

To my mind, O_DIRECT is not really the key issue here, it's whether to
prefer O_DSYNC or fdatasync. I looked back in the archives, and I think
that the main reason we prefer O_DSYNC when available is the results I
got here:

http://archives.postgresql.org/pgsql-hackers/2001-03/msg00381.php

which demonstrated a performance benefit on HPUX 10.20, though with a
test tool much more primitive than test_fsync. I still have that
machine, although the disk that was in it at the time died awhile back.
What's in there now is a Seagate ST336607LW spinning at 10000 RPM
(166 rev/sec) and today I get numbers like this from test_fsync:

Simple write:
    8k write                        28331.020/second

Compare file sync methods using one write:
    open_datasync 8k write            161.190/second
    open_sync 8k write                156.478/second
    8k write, fdatasync                54.302/second
    8k write, fsync                    51.810/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes          81.702/second
    2 open_sync 8k writes              80.172/second
    8k write, 8k write, fdatasync      40.829/second
    8k write, 8k write, fsync          39.836/second

Compare open_sync with different sizes:
    open_sync 16k write                80.192/second
    2 open_sync 8k writes              78.018/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close             52.527/second
    8k write, close, fsync             54.092/second

So *on that rather ancient platform* there's a measurable performance
benefit to O_DSYNC, but this seems to be largely because fdatasync is
stubbed to fsync in userspace rather than because fdatasync wouldn't
be a better idea in the abstract. Also, a lot of the argument against
fsync at the time was that it forced the kernel to iterate through all
the buffers for the WAL file to see if any were dirty. I would imagine
that modern kernels are a tad smarter about that; and even if they
aren't, the CPU speed versus disk speed tradeoff has changed enough
since 2001 that iterating through 16MB of buffers isn't as interesting
as it was then.

So to my mind, switching to the preference order fdatasync,
fsync_writethrough, fsync seems like the thing to do. Since we assume
fsync is always available, that means that O_DSYNC/O_SYNC will not be
the defaults on any platform.

regards, tom lane
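The proposed preference order can be sketched as a compile-time selection driven by feature tests rather than by a per-platform table. This is only an illustration under stated assumptions: the function name default_sync_method_sketch is hypothetical, and the standard macros _POSIX_SYNCHRONIZED_IO and F_FULLFSYNC are used here as stand-ins for whatever configure actually probes.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: name the default sync method under the proposed
 * preference order fdatasync, fsync_writethrough, fsync.  The feature
 * tests below are stand-ins for configure's probes, not the real ones.
 */
const char *
default_sync_method_sketch(void)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
	return "fdatasync";			/* platform advertises fdatasync() */
#elif defined(F_FULLFSYNC)
	return "fsync_writethrough";	/* e.g. Mac OS X */
#else
	return "fsync";				/* assumed to be available everywhere */
#endif
}
```

Note that open_datasync and open_sync never appear as results, which mirrors the statement above that O_DSYNC/O_SYNC would not be the default on any platform.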
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 10-12-06 06:56 PM, Greg Smith wrote:
> Tom Lane wrote:
>> The various testing that's been reported so far is all for Linux and
>> thus doesn't directly address the question of whether other kernels
>> will have similar performance properties.
>
> Survey of some popular platforms:
>
> So my guess is that some small percentage of Windows users might notice
> a change here, and some testing on FreeBSD would be useful too. That's
> about it for platforms that I think anybody needs to worry about.

If you tell me which options to pgbench and which .conf file settings
you'd like to see I can probably arrange to run some tests on AIX.

> -- 
> Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
> PostgreSQL Training, Services and Support        www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Tom Lane wrote:
> The various testing that's been reported so far is all for Linux and
> thus doesn't directly address the question of whether other kernels
> will have similar performance properties.

Survey of some popular platforms:

Linux: don't want O_DIRECT by default for reliability reasons, and
there's no clear performance win in the default config with small
wal_buffers

Solaris: O_DIRECT doesn't work; there's another API that support has
never been added for; see
http://blogs.sun.com/jkshah/entry/postgresql_wal_sync_method_and

Windows: Small reported gains for O_DIRECT, i.e. 10% at
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01615.php

FreeBSD: It probably works there, but I've never seen good performance
tests of it on this platform.

Mac OS X: Like Solaris, there's a similar mechanism but it's not
O_DIRECT; see
http://stackoverflow.com/questions/2299402/how-does-one-do-raw-io-on-mac-os-x-ie-equivalent-to-linuxs-o-direct-flag
for notes about the F_NOCACHE feature used. Same basic situation as
Solaris; there's an API, but PostgreSQL doesn't use it yet.

So my guess is that some small percentage of Windows users might notice
a change here, and some testing on FreeBSD would be useful too. That's
about it for platforms that I think anybody needs to worry about.

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Greg Smith writes:
> Regardless, I'm now leaning heavily toward the idea of avoiding
> open_datasync by default given this bug, and backpatching that change to
> at least 8.4. I'll do some more database-level performance tests here
> just as a final sanity check on that. My gut feel is now that we'll
> eventually be taking something like Marti's patch, adding some more
> documentation around it, and applying that to HEAD as well as some
> number of back branches.

I think we have got consensus that (1) open_datasync should not be the
default on Linux, and (2) this change needs to be back-patched. What
is not clear to me is whether we have consensus to change the option
preference order globally, or restrict the change to just be effective
on Linux. The various testing that's been reported so far is all for
Linux and thus doesn't directly address the question of whether other
kernels will have similar performance properties. However, it seems
reasonable to me to suppose that open_datasync could only be a win in
very restricted scenarios and thus shouldn't be a preferred default.
Also, I dread trying to document the behavior if the preference order
becomes platform-dependent.

With the holidays fast approaching, our window to do something about
this in a timely fashion grows short. If we don't schedule update
releases to be made this week, I think we're looking at not getting the
updates out till after New Year's. Do we want to wait that long? Is
anyone actually planning to do performance testing that would prove
anything about non-Linux platforms?

regards, tom lane
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 03.12.2010 21:55, Josh Berkus wrote:
> All,
>
> So, I've been doing some reading about this issue, and I think
> regardless of what other changes we make we should never enable O_DIRECT
> automatically on Linux, and it was a mistake for us to do so in the
> first place.
>
> First, in the Linux docs for open():

The quote on that man page is hilarious:

"The thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey
on some serious mind-controlling substances." -- Linus

I agree we should not enable it by default. If it's faster in some
circumstances, the admin is free to do the research and enable it, but
defaults need to be safe above all.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
All,

So, I've been doing some reading about this issue, and I think
regardless of what other changes we make we should never enable O_DIRECT
automatically on Linux, and it was a mistake for us to do so in the
first place.

First, in the Linux docs for open():

=================

In summary, O_DIRECT is a potentially powerful tool that should be used
with caution. It is recommended that applications treat use of O_DIRECT
as a performance option which is disabled by default.

=================

Second, Linus has a quote about O_DIRECT that I think should serve as an
indicator to us that directIO will never be beneficial-by-default on
Linux, and might even someday be desupported:

=================

The right way to do it is to just not use O_DIRECT.

The whole notion of "direct IO" is totally braindamaged. Just say no.

    This is your brain: O
    This is your brain on O_DIRECT: .

    Any questions?

I should have fought back harder. There really is no valid reason for
EVER using O_DIRECT. You need a buffer whatever IO you do, and it might
as well be the page cache. There are better ways to control the page
cache than play games and think that a page cache isn't necessary.

So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
instead.

Linus

=================

-- 
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com
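As a rough illustration of the posix_fadvise() route mentioned in the quote, the sketch below writes through the page cache, forces the data out with fdatasync(), and then tells the kernel the cached pages are no longer needed. It is a sketch under assumptions, not anything PostgreSQL does: the function name write_sync_drop is hypothetical, and it assumes a POSIX system that provides both fdatasync() and posix_fadvise().

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600		/* for posix_fadvise() and fdatasync() */
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "len" bytes to "fd", make them durable, then
 * hint that the cached copy can be dropped.  Returns 0 on success and -1
 * on failure.
 */
int
write_sync_drop(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t		left = len;

	while (left > 0)
	{
		ssize_t		n = write(fd, p, left);

		if (n < 0)
			return -1;
		p += n;
		left -= (size_t) n;
	}

	if (fdatasync(fd) != 0)
		return -1;

	/* posix_fadvise() returns an error number directly, not -1/errno */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
		return -1;

	return 0;
}
```

The advice is only a hint, so the kernel may keep the pages anyway; unlike O_DIRECT, though, nothing here can make open() fail on a filesystem that rejects direct I/O.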
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Andrew Dunstan wrote:
>
>
> On 11/30/2010 11:17 PM, Tom Lane wrote:
> > Andrew Dunstan writes:
> >> On 11/30/2010 10:09 PM, Tom Lane wrote:
> >>> We should wait for the outcome of the discussion about whether to change
> >>> the default wal_sync_method before worrying about this.
> >> we've just had a significant PGX customer encounter this with the latest
> >> Postgres on Redhat's freshly released flagship product. Presumably the
> >> default wal_sync_method will only change prospectively.
> > I don't think so. The fact that Linux is changing underneath us is a
> > compelling reason for back-patching a change here. Our older branches
> > still have to be able to run on modern OS versions. I'm also fairly
> > unclear on what you think a fix would look like if it's not effectively
> > a change in the default.
> >
> > (Hint: this *will* be changing, one way or another, in Red Hat's version
> > of 8.4, since that's what RH is shipping in RHEL6.)
> >
> >
>
> Well, my initial idea was that if PG_O_DIRECT is non-zero, we should
> test at startup time if we can use it on the WAL file system and inhibit
> its use if not.
>
> Incidentally, I notice it's not used at all in test_fsync.c - should it
> not be?

test_fsync certainly should be using PG_O_DIRECT in the same places the
backend does. Once we decide how to handle PG_O_DIRECT, I will modify
test_fsync to match.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB   http://enterprisedb.com

  + It's impossible for everything to be true. +
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Tom Lane wrote:
> I think the best answer is to get out of the business of using
> O_DIRECT by default, especially seeing that available evidence
> suggests it might not be a performance win anyway.

I was concerned that open_datasync might be doing a better job of
forcing data out of drive write caches. But the tests I've done on
RHEL6 so far suggest that's not true; the write guarantees seem to be
the same as when using fdatasync. And there's certainly one performance
regression possible going from fdatasync to open_datasync, the case
where you're overflowing wal_buffers before you actually commit.

Below is a test of the troublesome behavior on the same RHEL6 system I
gave test_fsync performance test results from at
http://archives.postgresql.org/message-id/4ce2ebf8.4040...@2ndquadrant.com

This confirms that the kernel now defining O_DSYNC behavior as being
available, but not actually supporting it when running the filesystem in
journaled mode, is the problem here. That's clearly a kernel bug and no
fault of PostgreSQL, it's just never been exposed in a default
configuration before. The RedHat bugzilla report seems a bit unclear
about what's going on here, may be worth updating that to note the
underlying cause.

Regardless, I'm now leaning heavily toward the idea of avoiding
open_datasync by default given this bug, and backpatching that change to
at least 8.4. I'll do some more database-level performance tests here
just as a final sanity check on that. My gut feel is now that we'll
eventually be taking something like Marti's patch, adding some more
documentation around it, and applying that to HEAD as well as some
number of back branches.

$ mount | head -n 1
/dev/sda7 on / type ext4 (rw)
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync        # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 17:20:16 EST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 open_datasync

[Edit /etc/fstab, change mount options to be "data=journal" and reboot]

$ mount | grep journal
/dev/sda7 on / type ext4 (rw,data=journal)
$ cat postgresql.conf | grep wal_sync_method
#wal_sync_method = fdatasync        # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 12:14:50 EST
PANIC: could not open file "pg_xlog/00010001" (log file 0, segment 1): Invalid argument
LOG: startup process (PID 2690) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
$ pg_ctl stop

$ vi $PGDATA/postgresql.conf
$ cat $PGDATA/postgresql.conf | grep wal_sync_method
wal_sync_method = fdatasync        # the default is the first option
$ pg_ctl start
server starting
LOG: database system was shut down at 2010-12-01 12:14:40 EST
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 12/01/2010 01:41 PM, Andres Freund wrote:
> On Wednesday 01 December 2010 19:09:05 Tom Lane wrote:
>> Josh Berkus writes:
>>> It's a bug and it's our bug.
>> No, it's a filesystem bug that this particular filesystem doesn't
>> support a perfectly reasonable combination of options, and doesn't
>> even fail gracefully as it could easily do. But assigning blame
>> doesn't help much.
> I wouldn't call it a reasonable combination - promising fs-level
> data-journaling (data=journal) and O_DIRECT are not really compatible
> with each other...

OK, but how is an application supposed to know that data journaling is
set? Postgres doesn't even look at the FS type, let alone the mount
options. From the app's POV it's perfectly reasonable. If the OS is
going to provide the API, it should expect people to use it.

cheers

andrew
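To make concrete what "looking at the mount options" would involve on Linux, here is a minimal sketch that reads /proc/mounts with the glibc getmntent() family and checks for data=journal. Nothing like this exists in PostgreSQL; both function names are hypothetical, the code is Linux-specific, and the answer can change on the next remount, so it would be a diagnostic at best.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if the comma-separated option string
 * contains exactly "data=journal", else 0.
 */
int
opts_have_data_journal(const char *opts)
{
	const char *key = "data=journal";
	size_t		klen = strlen(key);
	const char *p = opts;

	while (p != NULL && *p != '\0')
	{
		const char *end = strchr(p, ',');
		size_t		len = end ? (size_t) (end - p) : strlen(p);

		if (len == klen && strncmp(p, key, klen) == 0)
			return 1;
		p = end ? end + 1 : NULL;
	}
	return 0;
}

/*
 * Hypothetical helper: look up "mount_point" in /proc/mounts.  Returns 1
 * or 0 as above, or -1 if the mount point (or /proc/mounts) was not found.
 * If the mount point is listed more than once, the last entry wins.
 */
int
mount_point_is_data_journal(const char *mount_point)
{
	FILE	   *fp = setmntent("/proc/mounts", "r");
	struct mntent *m;
	int			result = -1;

	if (fp == NULL)
		return -1;
	while ((m = getmntent(fp)) != NULL)
	{
		if (strcmp(m->mnt_dir, mount_point) == 0)
			result = opts_have_data_journal(m->mnt_opts);
	}
	endmntent(fp);
	return result;
}
```

The option string "rw,data=journal" is the one shown by mount in Greg Smith's transcript elsewhere in this thread; a caller would still have to map the data directory to its mount point first, which this sketch does not attempt.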
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
> However, this doesn't really address the question of what a sensible
> choice of default is. If there's little evidence about whether the
> current flavor of open_datasync is really the fastest way, there's
> none whatsoever that establishes open_datasync_without_o_direct
> being a sane choice of default.

No, I'd switch to fdatasync. That's the performance that most people
are familiar with anyway, since it was all Linux supported before.

-- 
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Josh Berkus writes:
> It might be nice to add new sync_method options, "osync_odirect" and
> "odatasync_odirect" for DBAs who think they know enough to tune with
> non-defaults.

That would have the benefit that we'd not have to argue with people who
liked the current behavior (assuming there are any). I'm not sure
there's much technical advantage, but from a political standpoint it
might be the easiest sort of change to push through.

However, this doesn't really address the question of what a sensible
choice of default is. If there's little evidence about whether the
current flavor of open_datasync is really the fastest way, there's
none whatsoever that establishes open_datasync_without_o_direct
being a sane choice of default.

regards, tom lane
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On Wednesday 01 December 2010 19:09:05 Tom Lane wrote:
> Josh Berkus writes:
> > It's a bug and it's our bug.
>
> No, it's a filesystem bug that this particular filesystem doesn't support a perfectly reasonable combination of options, and doesn't even fail gracefully as it could easily do. But assigning blame doesn't help much.

I wouldn't call it a reasonable combination - promising fs-level data-journaling (data=journal) and O_DIRECT are not really compatible with each other...

Andres
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
> I think the best answer is to get out of the business of using O_DIRECT by default, especially seeing that available evidence suggests it might not be a performance win anyway.

Well, we don't have any performance evidence ... there's an issue with the fsync-test script which causes it not to use O_DIRECT. However, we haven't seen any evidence for benefits on any production filesystem, either.

So given the lack of evidence of performance benefit, combined with the definite evidence of related failures, I agree that simply disabling O_DIRECT by default would be a good way to solve this.

It might be nice to add new sync_method options, "osync_odirect" and "odatasync_odirect" for DBAs who think they know enough to tune with non-defaults.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
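The proposal above separates "sync at open time" from "bypass the page cache". A minimal sketch of how such method names could translate into open() flags follows. The function name `open_flags_for_method` and the `SKETCH_` macros are hypothetical, "osync_odirect" and "odatasync_odirect" are only the names floated in this thread and not existing settings, and the mapping is an illustration rather than PostgreSQL's actual code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#endif
#include <fcntl.h>
#include <string.h>

/* Fall back to 0 (no extra flag) where the platform lacks O_DIRECT. */
#ifdef O_DIRECT
#define SKETCH_O_DIRECT O_DIRECT
#else
#define SKETCH_O_DIRECT 0
#endif

/* Some platforms lack O_DSYNC; O_SYNC is the stricter substitute. */
#ifdef O_DSYNC
#define SKETCH_O_DSYNC O_DSYNC
#else
#define SKETCH_O_DSYNC O_SYNC
#endif

/*
 * Hypothetical mapping from a sync-method name to extra open() flags.
 * Returns 0 for methods that sync with a separate call after write(),
 * and -1 for an unknown name.
 */
static int
open_flags_for_method(const char *method)
{
    if (strcmp(method, "fsync") == 0 || strcmp(method, "fdatasync") == 0)
        return 0;
    if (strcmp(method, "open_sync") == 0)
        return O_SYNC;
    if (strcmp(method, "open_datasync") == 0)
        return SKETCH_O_DSYNC;
    if (strcmp(method, "osync_odirect") == 0)          /* proposed name */
        return O_SYNC | SKETCH_O_DIRECT;
    if (strcmp(method, "odatasync_odirect") == 0)      /* proposed name */
        return SKETCH_O_DSYNC | SKETCH_O_DIRECT;
    return -1;
}
```

With this split, the plain open_sync and open_datasync names never request direct I/O, so a file system that rejects O_DIRECT only affects users who opted in explicitly.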
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Josh Berkus writes:
> It's a bug and it's our bug.

No, it's a filesystem bug that this particular filesystem doesn't support a perfectly reasonable combination of options, and doesn't even fail gracefully as it could easily do. But assigning blame doesn't help much.

> Back when we added O_DIRECT, we assumed that support for O_DIRECT/opensync could be determined on an OS/kernel basis, because that was the information we had. Now it turns out that support can vary *by filesystem* and *between remounts*. We didn't have any way of knowing different back in 2004, but that doesn't mean we don't need to fix our mistaken assumption now.

> Ideally, we would change our code to test support for O_DIRECT on startup, rather than at compile time, and backport *that*.

I'm not convinced that a startup-time test would be enough either, since as you note a remount might be enough to change the situation. I think the best answer is to get out of the business of using O_DIRECT by default, especially seeing that available evidence suggests it might not be a performance win anyway.

regards, tom lane
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Tom,

> Well, no, actually it's the same (only) argument. We'd never consider back-patching such a change if our hand weren't being forced by kernel changes :-(

I think we have to back-patch the change. The way it is now, a DBA who thinks they are doing normal sensible configuration can cause PostgreSQL to fail to restart. Imagine this scenario, for example:

1) DBA, using PostgreSQL 8.3, gets worried about possible disk issues
2) DBA changes their single Ext3/4 partition to "data=journal"
3) DBA restarts system
4) PostgreSQL won't start
5) DBA thrashes around for a few hours while the site is down
6) DBA gets fired and the new DBA migrates to some other DBMS.

I simply can't think of *anywhere* we could put the information about opensync and Linux/Ext which would be prominent enough to avoid the above scenario. And per replies, a lot of people have hit this issue already.

It's a bug and it's our bug. Back when we added O_DIRECT, we assumed that support for O_DIRECT/opensync could be determined on an OS/kernel basis, because that was the information we had. Now it turns out that support can vary *by filesystem* and *between remounts*. We didn't have any way of knowing different back in 2004, but that doesn't mean we don't need to fix our mistaken assumption now.

Ideally, we would change our code to test support for O_DIRECT on startup, rather than at compile time, and backport *that*.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 11/30/2010 11:17 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> On 11/30/2010 10:09 PM, Tom Lane wrote:
>>> We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this.
>> we've just had a significant PGX customer encounter this with the latest Postgres on Redhat's freshly released flagship product. Presumably the default wal_sync_method will only change prospectively.
> I don't think so. The fact that Linux is changing underneath us is a compelling reason for back-patching a change here. Our older branches still have to be able to run on modern OS versions. I'm also fairly unclear on what you think a fix would look like if it's not effectively a change in the default.
>
> (Hint: this *will* be changing, one way or another, in Red Hat's version of 8.4, since that's what RH is shipping in RHEL6.)

Well, my initial idea was that if PG_O_DIRECT is non-zero, we should test at startup time if we can use it on the WAL file system and inhibit its use if not.

Incidentally, I notice it's not used at all in test_fsync.c - should it not be?

cheers

andrew
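The startup-time test suggested above could look roughly like the following sketch. It assumes a POSIX system; `probe_o_direct` and `PROBE_O_DIRECT` are hypothetical names standing in for whatever PostgreSQL would actually use, and the scratch-file naming is invented for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for PG_O_DIRECT: 0 where the platform has no O_DIRECT. */
#ifdef O_DIRECT
#define PROBE_O_DIRECT O_DIRECT
#else
#define PROBE_O_DIRECT 0
#endif

/*
 * Hypothetical helper: returns 1 if a scratch file in "dir" can be opened
 * with O_DIRECT, 0 if that open is rejected or the flag does not exist,
 * and -1 if the scratch file cannot be created at all.
 */
static int
probe_o_direct(const char *dir)
{
    char        path[1024];
    int         fd;
    int         usable;

    assert(dir != NULL);
    if (PROBE_O_DIRECT == 0)
        return 0;

    snprintf(path, sizeof(path), "%s/odirect_probe.%ld", dir, (long) getpid());

    /* Create the file with plain flags, so a later failure is about O_DIRECT. */
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    close(fd);

    /* Reopen with O_DIRECT; rejection here is the case reported in this thread. */
    fd = open(path, O_RDWR | PROBE_O_DIRECT);
    usable = (fd != -1);
    if (fd != -1)
        close(fd);
    unlink(path);
    return usable;
}
```

A caller would run this against the WAL directory and drop O_DIRECT from the open flags when the result is not 1. As noted elsewhere in the thread, a remount can change the answer after startup, so a probe like this narrows the problem without fully closing it.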
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On Wed, Dec 1, 2010 at 12:31 AM, Tom Lane wrote:
> Josh Berkus writes:
>> On 11/30/10 7:09 PM, Tom Lane wrote:
>>> Josh Berkus writes:
>>>> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?
>>>
>>> We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this.
>
>> Are we considering backporting that change?
>
>> If so, this would be another argument in favor of changing the default.
>
> Well, no, actually it's the same (only) argument. We'd never consider back-patching such a change if our hand weren't being forced by kernel changes :-(
>
> As things stand, though, I think the only thing that's really open for discussion is how wide to make the scope of the default-change: should we just do it across the board, or try to limit it to some subset of the platforms where open_datasync is currently the default. And that's a decision that ought to be informed by some performance testing.

If we could get a clear idea of what performance testing needs to be done, I suspect we could find some people willing to do it. What do you think would be useful?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
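One possible shape for such a test is a timing loop like the one in pg_test_fsync: write a WAL-sized block, sync it, seek back, and report operations per second. The sketch below times only the fdatasync method. `time_fdatasync` and `SKETCH_BLCKSZ` are hypothetical names, the 8192-byte block size is an assumption, and fdatasync() is assumed to be available (it is on Linux, while the thread notes it is missing on some other platforms).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192      /* assumed WAL block size */

/*
 * Hypothetical benchmark helper: performs "ops" rounds of
 * write + fdatasync + rewind on "path" and returns operations per
 * second, or -1.0 on any error.
 */
static double
time_fdatasync(const char *path, int ops)
{
    char        buf[SKETCH_BLCKSZ];
    struct timeval start, stop;
    double      elapsed;
    int         fd;
    int         i;

    memset(buf, 'a', sizeof(buf));
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1.0;

    gettimeofday(&start, NULL);
    for (i = 0; i < ops; i++)
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf) ||
            fdatasync(fd) != 0 ||
            lseek(fd, 0, SEEK_SET) == -1)
        {
            close(fd);
            unlink(path);
            return -1.0;
        }
    }
    gettimeofday(&stop, NULL);
    close(fd);
    unlink(path);

    elapsed = (stop.tv_sec - start.tv_sec) +
        (stop.tv_usec - start.tv_usec) / 1e6;
    return elapsed > 0.0 ? ops / elapsed : -1.0;
}
```

Running the same loop with the file opened O_DSYNC, with and without O_DIRECT, on the file systems and mount options people actually deploy would give the comparison the thread is asking for. Numbers from a drive with a volatile write cache should be treated with suspicion.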
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On Wed, Dec 1, 2010 at 12:35, Dimitri Fontaine wrote:
> PANIC: could not open file "pg_xlog/00010001" (log file 0, segment 1): Invalid argument

+1 I got the same error when trying to get PostgreSQL working on tmpfs and gave up.

> Now I understand that you want to test the other alternatives before choosing among those which work, but my opinion is that it should be fixed in HEAD before next alpha, or even ASAP.

It's queued for this month's commitfest, so things are moving. https://commitfest.postgresql.org/action/patch_view?id=432

Regards, Marti
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Tom Lane writes:
> As things stand, though, I think the only thing that's really open for discussion is how wide to make the scope of the default-change: should we just do it across the board, or try to limit it to some subset of the platforms where open_datasync is currently the default. And that's a decision that ought to be informed by some performance testing.

Maybe I have a distorted view of the situation for having hit the problem with an ubuntu upgrade, but it really does not look like a performance item to me.

PANIC: could not open file "pg_xlog/00010001" (log file 0, segment 1): Invalid argument

It took me quite some time to be able to start my development cluster again and validate some new patch to send to the list.

Now I understand that you want to test the other alternatives before choosing among those which work, but my opinion is that it should be fixed in HEAD before next alpha, or even ASAP.

It could be that a HINT here would be enough for contributors not to lose too much time. It would be

HINT: if you're running linux, please try to change wal_sync_method, open_datasync is not reliable anymore in recent kernels. An example of trustworthy setting is fdatasync.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
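The HINT idea can be sketched as a small decision function: attach extra guidance only when the WAL open failed with EINVAL ("Invalid argument") under one of the open-time sync methods. `wal_open_hint` is a hypothetical name and the hint wording is illustrative; this is not how PostgreSQL's error reporting is actually wired.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given the errno from a failed WAL-file open() and
 * the active wal_sync_method name, return hint text, or NULL when no
 * hint applies.
 */
static const char *
wal_open_hint(int open_errno, const char *sync_method)
{
    /* "Invalid argument" in the PANIC message corresponds to EINVAL. */
    if (open_errno != EINVAL)
        return NULL;

    /* Only the open-time sync methods pass extra flags to open(). */
    if (strcmp(sync_method, "open_datasync") != 0 &&
        strcmp(sync_method, "open_sync") != 0)
        return NULL;

    return "The file system may not support this wal_sync_method; "
           "consider setting wal_sync_method to fdatasync.";
}
```

Keeping the check narrow avoids printing the hint for unrelated failures such as a missing directory (ENOENT) or a permissions problem (EACCES).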
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Josh Berkus writes:
> On 11/30/10 7:09 PM, Tom Lane wrote:
>> Josh Berkus writes:
>>> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?
>>
>> We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this.

> Are we considering backporting that change?

> If so, this would be another argument in favor of changing the default.

Well, no, actually it's the same (only) argument. We'd never consider back-patching such a change if our hand weren't being forced by kernel changes :-(

As things stand, though, I think the only thing that's really open for discussion is how wide to make the scope of the default-change: should we just do it across the board, or try to limit it to some subset of the platforms where open_datasync is currently the default. And that's a decision that ought to be informed by some performance testing.

regards, tom lane
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Andrew Dunstan writes:
> On 11/30/2010 10:09 PM, Tom Lane wrote:
>> We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this.

> we've just had a significant PGX customer encounter this with the latest Postgres on Redhat's freshly released flagship product. Presumably the default wal_sync_method will only change prospectively.

I don't think so. The fact that Linux is changing underneath us is a compelling reason for back-patching a change here. Our older branches still have to be able to run on modern OS versions. I'm also fairly unclear on what you think a fix would look like if it's not effectively a change in the default.

(Hint: this *will* be changing, one way or another, in Red Hat's version of 8.4, since that's what RH is shipping in RHEL6.)

regards, tom lane
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 11/30/2010 10:09 PM, Tom Lane wrote:
> Josh Berkus writes:
>> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?
> We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this.

Tom,

we've just had a significant PGX customer encounter this with the latest Postgres on Redhat's freshly released flagship product. Presumably the default wal_sync_method will only change prospectively. But this will feel to every user out there who encounters it like a bug in our code, and it needs attention. It was darn difficult to diagnose, and many people will just give up in disgust if they encounter it.

cheers

andrew
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
On 11/30/10 7:09 PM, Tom Lane wrote:
> Josh Berkus writes:
>> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?
>
> We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this.

Are we considering backporting that change?

If so, this would be another argument in favor of changing the default.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Re: [HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Josh Berkus writes:
> Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

We should wait for the outcome of the discussion about whether to change the default wal_sync_method before worrying about this.

regards, tom lane
[HACKERS] We really ought to do something about O_DIRECT and data=journalled on ext4
Hackers,

Some of you might already be aware that this combination produces a fatal startup crash in PostgreSQL:

1. Create an Ext3 or Ext4 partition and mount it with data=journal on a server with linux kernel 2.6.30 or later.
2. Initdb a PGDATA on that partition
3. Start PostgreSQL with the default config from that PGDATA

This was reported a ways back: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=567113

To explain: opening a file with O_DIRECT on an ext3 or ext4 partition mounted with data=journal fails, and PostgreSQL crashes at startup as a result. However, recent Linux kernels now report support for O_DIRECT when we compile PostgreSQL, so we use it by default. This results in a "crash by default" situation with new Linuxes if anyone sets data=journal.

We just encountered this again with another user. With RHEL6 out now, this seems likely to become a fairly common crash report.

Apparently, testing for O_DIRECT at compile time isn't adequate. Ideas?

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com