Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Monday 08 February 2010 05:53:23 Robert Haas wrote: On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Andres Freund escribió: I personally think the fsync on the directory should be added to the stable branches - other opinions? If wanted I can prepare patches for that. Yeah, it seems there are two patches here -- one is the addition of fsync_fname() and the other is the fsync_prepare stuff. Andres, you want to take a crack at splitting this up? I hope I didnt duplicate Gregs work, but I didnt hear back from him, so... Everything 8.1 is hopeless because cp is used there... I didnt see it worth to replace that. The patch applies cleanly for 8.1 to 8.4 and survives the regression tests Given pg's heavy commit model I didnt see a point to split the patch for 9.0 as well... Andres diff --git a/src/port/copydir.c b/src/port/copydir.c index 72fbf36..b057ffa 100644 *** a/src/port/copydir.c --- b/src/port/copydir.c *** copydir(char *fromdir, char *todir, bool *** 50,55 --- 50,56 { DIR *xldir; struct dirent *xlde; + int dirfd; char fromfile[MAXPGPATH]; char tofile[MAXPGPATH]; *** copydir(char *fromdir, char *todir, bool *** 91,96 --- 92,116 } FreeDir(xldir); + + /* + * fsync the directory to make sure data has reached the + * disk. While needed by most filesystems, the window got bigger + * with newer ones like ext4. + */ + dirfd = BasicOpenFile(todir, + O_RDONLY | PG_BINARY, + S_IRUSR | S_IWUSR); + if(dirfd == -1) + ereport(ERROR, + (errcode_for_file_access(), + errmsg(could not open directory for fsync \%s\: %m, todir))); + + if(pg_fsync(dirfd) == -1) + ereport(ERROR, + (errcode_for_file_access(), + errmsg(could not fsync directory \%s\: %m, todir))); + close(dirfd); } /* -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Monday 08 February 2010 19:34:01 Greg Stark wrote: On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera Yeah, it seems there are two patches here -- one is the addition of fsync_fname() and the other is the fsync_prepare stuff. Sorry, I'm just catching up on my mail from FOSDEM this past weekend. I had come to the same conclusion as Greg that I might as well just commit it with Tom's pg_flush_data() name and we can decide later if and when we have pg_fsync_start()/pg_fsync_finish() whether it's worth keeping two apis or not. So I was just going to commit it like that but I discovered last week that I don't have cvs write access set up yet. I'll commit it as soon as I generate a new ssh key and Dave installs it, etc. I intentionally picked a small simple patch that nobody was waiting on because I knew there was a risk of delays like this and the paperwork. I'm nearly there. Do you still want me to split the patches into two or do you want to do it yourself? One in multiple versions for the directory fsync and another one for 9.0? Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Greg Smith g...@2ndquadrant.com writes: This is turning into yet another one of those situations where something simple and useful is being killed by trying to generalize it way more than it needs to be, given its current goals and its lack of external interfaces. There's no catversion bump or API breakage to hinder future refactoring if this isn't optimally designed internally from day one. I agree that it's too late in the cycle for any major redesign of the patch. But is it too much to ask to use a less confusing name for the function? regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Sunday 07 February 2010 19:23:10 Robert Haas wrote: On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Smith g...@2ndquadrant.com writes: This is turning into yet another one of those situations where something simple and useful is being killed by trying to generalize it way more than it needs to be, given its current goals and its lack of external interfaces. There's no catversion bump or API breakage to hinder future refactoring if this isn't optimally designed internally from day one. I agree that it's too late in the cycle for any major redesign of the patch. But is it too much to ask to use a less confusing name for the function? +1. Let's just rename the thing, add some comments, and call it good. Will post a updated patch in the next hours unless somebody beats me too it. Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Sunday 07 February 2010 19:27:02 Andres Freund wrote: On Sunday 07 February 2010 19:23:10 Robert Haas wrote: On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote: Greg Smith g...@2ndquadrant.com writes: This is turning into yet another one of those situations where something simple and useful is being killed by trying to generalize it way more than it needs to be, given its current goals and its lack of external interfaces. There's no catversion bump or API breakage to hinder future refactoring if this isn't optimally designed internally from day one. I agree that it's too late in the cycle for any major redesign of the patch. But is it too much to ask to use a less confusing name for the function? +1. Let's just rename the thing, add some comments, and call it good. Will post a updated patch in the next hours unless somebody beats me too it. Here we go. I left the name at my suggestion pg_fsync_prepare instead of Tom's prepare_for_fsync because it seemed more consistend with the naming in the rest of the file. Obviously feel free to adjust. I personally think the fsync on the directory should be added to the stable branches - other opinions? If wanted I can prepare patches for that. Andres diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 7ffa2eb..bc5753a 100644 *** a/src/backend/storage/file/fd.c --- b/src/backend/storage/file/fd.c *** pg_fdatasync(int fd) *** 320,325 --- 320,361 } /* + * pg_fsync_prepare --- try to make a later fsync on the same file faster + * + * A call to this function does not guarantee anything! + * + * The idea is to tell the kernel to write out its cache so that a + * fsync later on has less to write out synchronously. This allows + * that write requests get reordered more freely. + * + * In the current implementation this has the additional effect of + * dropping the cache in that region and thus can be used to avoid + * cache poisoning. This may or may not be wanted. + * + * XXX: Ideally this API would use sync_file_range (or similar on + * platforms other than linux) and a seperate one for cache + * control. We are not there yet. + * + * Look at the thread below 200912282354.51892.and...@anarazel.de in + * pgsql-hackers for a longer discussion. + */ + int + pg_fsync_prepare(int fd, off_t offset, off_t amount) + { + if (enableFsync) + { + #if defined(USE_POSIX_FADVISE) defined(POSIX_FADV_DONTNEED) + return posix_fadvise(fd, offset, amount, POSIX_FADV_DONTNEED); + #else + return 0; + #endif + } + else + return 0; + } + + + /* * InitFileAccess --- initialize this module during backend startup * * This is called during either normal or standalone backend start. diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 21cb024..b1a4b49 100644 *** a/src/include/storage/fd.h --- b/src/include/storage/fd.h *** extern int pg_fsync_no_writethrough(int *** 99,104 --- 99,106 extern int pg_fsync_writethrough(int fd); extern int pg_fdatasync(int fd); + extern int prepare_for_fsync(int fd, off_t offset, off_t amount); + /* Filename components for OpenTemporaryFile */ #define PG_TEMP_FILES_DIR pgsql_tmp #define PG_TEMP_FILE_PREFIX pgsql_tmp diff --git a/src/port/copydir.c b/src/port/copydir.c index 72fbf36..eef3cfb 100644 *** a/src/port/copydir.c --- b/src/port/copydir.c *** *** 37,42 --- 37,43 static void copy_file(char *fromfile, char *tofile); + static void fsync_fname(char *fname); /* *** copydir(char *fromdir, char *todir, bool *** 64,69 --- 65,73 (errcode_for_file_access(), errmsg(could not open directory \%s\: %m, fromdir))); + /* + * Copy all the files + */ while ((xlde = ReadDir(xldir, fromdir)) != NULL) { struct stat fst; *** copydir(char *fromdir, char *todir, bool *** 89,96 --- 93,127 else if (S_ISREG(fst.st_mode)) copy_file(fromfile, tofile); } + FreeDir(xldir); + + /* + * Be paranoid here and fsync all files to ensure we catch problems. + */ + xldir = AllocateDir(fromdir); + if (xldir == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg(could not open directory \%s\: %m, fromdir))); + + while ((xlde = ReadDir(xldir, fromdir)) != NULL) + { + if (strcmp(xlde-d_name, .) == 0 || + strcmp(xlde-d_name, ..) == 0) + continue; + snprintf(tofile, MAXPGPATH, %s/%s, todir, xlde-d_name); + fsync_fname(tofile); + } FreeDir(xldir); + + /* It's important to fsync the destination directory itself as + * individual file fsyncs don't guarantee that the directory entry + * for the file is synced. Recent versions of ext4 have made the + * window much wider but it's been true for ext3 and other + * filesyetems in the past + */ + fsync_fname(todir); } /* *** copy_file(char
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Monday 08 February 2010 05:53:23 Robert Haas wrote: On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Andres Freund escribió: I personally think the fsync on the directory should be added to the stable branches - other opinions? If wanted I can prepare patches for that. Yeah, it seems there are two patches here -- one is the addition of fsync_fname() and the other is the fsync_prepare stuff. Andres, you want to take a crack at splitting this up? Will do. Later today or tomorrow morning. Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Saturday 06 February 2010 06:03:30 Greg Smith wrote: Andres Freund wrote: On 02/03/10 14:42, Robert Haas wrote: Well, maybe we should start with a discussion of what kernel calls you're aware of on different platforms and then we could try to put an API around it. In linux there is sync_file_range. On newer Posixish systems one can emulate that with mmap() and msync() (in batches obviously). No idea about windows. The effective_io_concurrency feature had proof of concept test programs that worked using AIO, but actually following through on that implementation would require a major restructuring of how the database interacts with the OS in terms of reads and writes of blocks. It looks to me like doing something similar to sync_file_range on Windows would be similarly difficult. Looking a bit arround it seems one could achieve something approximediately similar to pg_prepare_fsync() by using CreateFileMapping MapViewOfFile FlushViewOfFile If I understand it correctly that will flush, but not wait. Unfortunately you cant event make it wait, so its not possible to implement sync_file_range or similar fully. Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Andres Freund and...@anarazel.de writes: On Tuesday 02 February 2010 18:36:12 Robert Haas wrote: I took a look at this patch today and I agree with Tom that pg_fsync_start() is a very confusing name. I don't know what the right name is, but this doesn't fsync so I don't think it shuld have fsync in the name. Maybe something like pg_advise_abandon() or pg_abandon_cache(). The current name is really wishful thinking: you're hoping that it will make the kernel start the fsync, but it might not. I think pg_start_data_flush() is similarly optimistic. What about: pg_fsync_prepare(). prepare_for_fsync()? regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 02 February 2010 18:36:12 Robert Haas wrote: On Fri, Jan 29, 2010 at 1:56 PM, Greg Stark gsst...@mit.edu wrote: On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote: That function *seriously* needs documentation, in particular the fact that it's a no-op on machines without the right kernel call. The name you've chosen is very bad for those semantics. I'd pick something else myself. Maybe pg_start_data_flush or something like that? I would like to make one token argument in favour of the name I picked. If it doesn't convince I'll change it since we can always revisit the API down the road. I envision having two function calls, pg_fsync_start() and pg_fsync_finish(). The latter will wait until the data synced in the first call is actually synced. The fall-back if there's no implementation of this would be for fsync_start() to be a noop (or something unreliable like posix_fadvise) and fsync_finish() to just be a regular fsync. I think we can accomplish this with sync_file_range() but I need to read up on how it actually works a bit more. In this case it doesn't make a difference since when we call fsync_finish() it's going to be for the entire file and nothing else will have been writing to these files. But for wal writing and checkpointing it might have very different performance characteristics. The big objection to this is that then we don't really have an api for FADV_DONT_NEED which is more about cache policy than about syncing to disk. So for example a sequential scan might want to indicate that it isn't planning on reading the buffers it's churning through but doesn't want to force them to be written sooner than otherwise and is never going to call fsync_finish(). I took a look at this patch today and I agree with Tom that pg_fsync_start() is a very confusing name. I don't know what the right name is, but this doesn't fsync so I don't think it shuld have fsync in the name. Maybe something like pg_advise_abandon() or pg_abandon_cache(). The current name is really wishful thinking: you're hoping that it will make the kernel start the fsync, but it might not. I think pg_start_data_flush() is similarly optimistic. What about: pg_fsync_prepare(). That gives the reason why were doing that and doesnt promise that it is actually doing an fsync. I dislike really having cache in the name, because the primary aim is not to discard the cache... Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 02 February 2010 19:14:40 Robert Haas wrote: On Tue, Feb 2, 2010 at 12:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andres Freund and...@anarazel.de writes: On Tuesday 02 February 2010 18:36:12 Robert Haas wrote: I took a look at this patch today and I agree with Tom that pg_fsync_start() is a very confusing name. I don't know what the right name is, but this doesn't fsync so I don't think it shuld have fsync in the name. Maybe something like pg_advise_abandon() or pg_abandon_cache(). The current name is really wishful thinking: you're hoping that it will make the kernel start the fsync, but it might not. I think pg_start_data_flush() is similarly optimistic. What about: pg_fsync_prepare(). prepare_for_fsync()? It still seems mis-descriptive to me. Couldn't the same routine be used simply to abandon undirtied data that we no longer care about caching? For now it could - but it very well might be converted to sync_file_range or similar, which would have different sideeffects. As the potential code duplication is rather small I would prefer to describe the prime effect not the sideeffects... Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 02 February 2010 20:06:32 Robert Haas wrote: On Tue, Feb 2, 2010 at 1:34 PM, Andres Freund and...@anarazel.de wrote: For now it could - but it very well might be converted to sync_file_range or similar, which would have different sideeffects. As the potential code duplication is rather small I would prefer to describe the prime effect not the sideeffects... Hmm, in that case, I think the problem is that this function has no comment explaining its intended charter. I agree there. Greg, do you want to update the patch with some comments or shall I? Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Robert Haas robertmh...@gmail.com writes: Hmm, in that case, I think the problem is that this function has no comment explaining its intended charter. That's certainly a big problem, but a comment won't fix the fact that the name is misleading. We need both a comment and a name change. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 19 January 2010 15:52:25 Greg Stark wrote: On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark gsst...@mit.edu wrote: Looking at this patch for the commitfest I have a few questions. So I've touched this patch up a bit: 1) moved the posix_fadvise call to a new fd.c function pg_fsync_start(fd,offset,nbytes) which initiates an fsync without waiting on it. Currently it's only implemented with posix_fadvise(DONT_NEED) but I want to look into using sync_file_range in the future -- it looks like this call might be good enough for our checkpoints. 2) advised each 64k chunk as we write it which should avoid poisoning the cache if you do a large create database on an active system. 3) added the promised but afaict missing fsync of the directory -- i think we should actually backpatch this. Yes, that was a bit stupid from me - I added the fsync for directories which get recursed into (by not checking if its a file) but not for the uppermost level. So all directories should get fsynced right now but the topmost one. I will review the patch later when I finally will have some time off again... ~4h. Thanks! Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Greg Stark gsst...@mit.edu writes: 1) moved the posix_fadvise call to a new fd.c function pg_fsync_start(fd,offset,nbytes) which initiates an fsync without waiting on it. Currently it's only implemented with posix_fadvise(DONT_NEED) but I want to look into using sync_file_range in the future -- it looks like this call might be good enough for our checkpoints. That function *seriously* needs documentation, in particular the fact that it's a no-op on machines without the right kernel call. The name you've chosen is very bad for those semantics. I'd pick something else myself. Maybe pg_start_data_flush or something like that? Other than that quibble it seems basically sane. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 19 January 2010 15:57:14 Greg Stark wrote: On Tue, Jan 19, 2010 at 2:52 PM, Greg Stark gsst...@mit.edu wrote: Barring any objections shall I commit it like this? Actually before we get there could someone who demonstrated the speedup verify that this patch still gets that same speedup? At least on the three machines I tested last time the result is still in the same ballpark. Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 29 December 2009 01:27:29 Greg Stark wrote: On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund and...@anarazel.de wrote: fsync everything in that pass. Including the directory - which was not done before and actually might be necessary in some cases. Er. Yes. At least on ext4 this is pretty important. I wish it weren't, but it doesn't look like we're going to convince the ext4 developers they're crazy any day soon and it would really suck for a database created from a template to have files in it go missin. Actually it was necessary on ext3 as well - the window to hit the problem just was much smaller, wasnt it? Actually that part should possibly get backported. Andres -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance