Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Hi Greg,

On Tuesday 19 January 2010 15:52:25 Greg Stark wrote:
> On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark wrote:
> > Looking at this patch for the commitfest I have a few questions.
>
> So I've touched this patch up a bit:
>
> 1) moved the posix_fadvise call to a new fd.c function
> pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
> waiting on it. Currently it's only implemented with
> posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
> in the future -- it looks like this call might be good enough for our
> checkpoints.
Why exactly should that depend on fsync? Sure, that's where most of the pain comes from now, but avoiding that cache poisoning wouldn't hurt otherwise either.
I would rather have it called pg_flush_cache_range or such...

> 2) advised each 64k chunk as we write it which should avoid poisoning
> the cache if you do a large create database on an active system.
>
> 3) added the promised but afaict missing fsync of the directory -- i
> think we should actually backpatch this.
I think so as well. You need it while recursing as well though (where I had added it) and not only for the final directory.

> Barring any objections shall I commit it like this?
Other than the two things above it looks fine to me.

Thanks,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
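To make points 1) and 2) above concrete, here is a minimal sketch of a per-chunk flush hint. It is not the code from Greg's touched-up patch: the names flush_hint_range and copy_with_hints are made up for illustration, and preferring sync_file_range() over posix_fadvise() where glibc exposes it is an assumption about how such a helper might be written.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define COPY_CHUNK (64 * 1024)  /* hint after every 64 KiB written */

/*
 * Hypothetical helper: ask the kernel to start writing back a byte range
 * without waiting for completion. Returns 0 on success, an error number
 * otherwise. The hint is advisory, so callers may ignore the result.
 */
static int
flush_hint_range(int fd, off_t offset, off_t nbytes)
{
#if defined(SYNC_FILE_RANGE_WRITE)
    /* Linux only: initiate writeback of the range, do not wait for it. */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) == 0)
        return 0;
    /* On failure, fall through to the posix_fadvise() hint. */
#endif
#if defined(POSIX_FADV_DONTNEED)
    /* posix_fadvise() returns the error number directly, not -1. */
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#else
    (void) fd; (void) offset; (void) nbytes;
    return 0;
#endif
}

/*
 * Copy src_fd to dst_fd in 64 KiB chunks, hinting after each chunk so
 * that dirty pages do not pile up in the page cache.
 * Returns the number of bytes copied, or -1 on a read/write error.
 */
static long
copy_with_hints(int src_fd, int dst_fd)
{
    char    buf[COPY_CHUNK];
    off_t   offset = 0;
    ssize_t nread;

    assert(src_fd >= 0 && dst_fd >= 0);

    while ((nread = read(src_fd, buf, sizeof(buf))) > 0)
    {
        ssize_t done = 0;

        while (done < nread)
        {
            ssize_t n = write(dst_fd, buf + done, (size_t) (nread - done));

            if (n < 0)
            {
                if (errno == EINTR)
                    continue;   /* interrupted before writing: retry */
                return -1;
            }
            done += n;
        }

        /* Advisory only: a failed hint must not fail the copy. */
        (void) flush_hint_range(dst_fd, offset, (off_t) nread);
        offset += nread;
    }
    return (nread < 0) ? -1 : (long) offset;
}
```

A caller would still need a real fsync() on the destination afterwards; the hint only makes it more likely that most of the data is already on its way to disk by then.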
Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Monday 28 December 2009 23:59:43 Andres Freund wrote:
> On Monday 28 December 2009 23:54:51 Andres Freund wrote:
> > On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> > > On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > > > If ppl think its worth it I'll create a ticket
> > >
> > > Thanks, no need. I will post a patch tomorrow or so.
> >
> > Well. It was a long day...
> >
> > Anyway.
> > In this patch I delay the fsync done in copy_file and simply do a second
> > pass over the directory in copy_dir and fsync everything in that pass.
> > Including the directory - which was not done before and actually might be
> > necessary in some cases.
> > I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that
> > the copied file reaches storage before the fsync. Without the speed
> > benefits were quite a bit smaller and essentially random (which seems
> > sensible).
> >
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop. Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
> >
> > The benefits are obviously bigger if the template database includes
> > anything added.
>
> Obviously the patch would be helpfull.
And it would also be helpful not to have annoying oversights in there. A FreeDir(xldir); is missing at the end of copydir(), after the second (fsync) pass over the directory.

Andres
Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 29 December 2009 11:48:10 Greg Stark wrote:
> On Tue, Dec 29, 2009 at 2:05 AM, Andres Freund wrote:
> >  Reads Completed:   2,     8KiB   Writes Completed:  2362, 29672KiB
> > New:
> >  Reads Completed:   0,     0KiB   Writes Completed:   550,  5960KiB
>
> It looks like the new method is only doing 1/6th as much i/o. Do you
> know what's going on there?
While I was surprised by the size of the difference, I am not surprised at all that there is a significant one: currently the fsync will write out a whole bunch of useless stuff every time it's called (all metadata, directory structure and so on).
This is reproducible...

6MB sounds sensible for the operation btw - the template database is around 5MB.

Will try to analyze later what exactly causes the additional io.

Andres
Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tue, Dec 29, 2009 at 2:05 AM, Andres Freund wrote:
>  Reads Completed:   2,     8KiB   Writes Completed:  2362, 29672KiB
> New:
>  Reads Completed:   0,     0KiB   Writes Completed:   550,  5960KiB

It looks like the new method is only doing 1/6th as much i/o. Do you
know what's going on there?

-- 
greg
Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 29 December 2009 04:04:06 Michael Clemmons wrote:
> Maybe not crash out but in this situation.
> N=0
> while(N>=0):
>     CREATE DATABASE new_db_N;
> Since the fsync is the part which takes the memory and time but is
> happening in the background want the fsyncs pile up in the background
> faster than can be run filling up the memory and stack.
> This is very likely a mistake on my part about how postgres/processes
The difference should not be visible outside the "CREATE DATABASE ..." at all.
Currently the process, simplified, works like:

for file in source directory:
    copy_file(source/file, target/file);
    fsync(target/file);

I changed it to:

for file in source directory:
    copy_file(source/file, target/file);

    /* please dear kernel, write this out, but don't block */
    posix_fadvise(target/file, FADV_DONTNEED);

for file in source directory:
    fsync(target/file);

If at any point in time there is not enough cache available to cache anything, copy_file() will just have to wait for the kernel to write out the data. fsync() does not use memory itself.

Andres
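The pseudocode above can be turned into a small standalone program. What follows is only a sketch under stated assumptions, not the patch itself: it uses plain POSIX calls instead of PostgreSQL's AllocateDir/ReadDir wrappers, skips subdirectories instead of recursing, treats a short write as an error, and the function names (copy_file_hinted, fsync_path, copy_dir_two_pass) are invented for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy one regular file, then hint (not force) writeback. 0 on success. */
static int
copy_file_hinted(const char *from, const char *to)
{
    char    buf[8192];
    ssize_t n = 0;
    int     in = open(from, O_RDONLY);
    int     out = (in < 0) ? -1 : open(to, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (out < 0)
    {
        if (in >= 0)
            close(in);
        return -1;
    }
    while ((n = read(in, buf, sizeof(buf))) > 0)
    {
        if (write(out, buf, (size_t) n) != n)   /* short write = error here */
        {
            n = -1;
            break;
        }
    }
#ifdef POSIX_FADV_DONTNEED
    /* "please dear kernel, write this out, but don't block" */
    (void) posix_fadvise(out, 0, 0, POSIX_FADV_DONTNEED);
#endif
    close(in);
    return (close(out) == 0 && n == 0) ? 0 : -1;
}

/* Open a path read-only and fsync it (works for files and dirs on Linux). */
static int
fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/* Pass 1 copies and hints every file; pass 2 fsyncs them, then the dir. */
static int
copy_dir_two_pass(const char *fromdir, const char *todir)
{
    char           from[4096];
    char           to[4096];
    struct dirent *de;
    struct stat    st;
    DIR           *dir;
    int            rc = 0;

    assert(fromdir != NULL && todir != NULL);
    if (mkdir(todir, 0700) != 0 || (dir = opendir(fromdir)) == NULL)
        return -1;

    /* Pass 1: copy all regular files, no fsync yet. */
    while (rc == 0 && (de = readdir(dir)) != NULL)
    {
        snprintf(from, sizeof(from), "%s/%s", fromdir, de->d_name);
        snprintf(to, sizeof(to), "%s/%s", todir, de->d_name);
        if (stat(from, &st) == 0 && S_ISREG(st.st_mode))
            rc = copy_file_hinted(from, to);
    }

    /* Pass 2: fsync every copied file; most data should be on disk by now. */
    rewinddir(dir);
    while (rc == 0 && (de = readdir(dir)) != NULL)
    {
        snprintf(to, sizeof(to), "%s/%s", todir, de->d_name);
        if (stat(to, &st) == 0 && S_ISREG(st.st_mode))
            rc = fsync_path(to);
    }
    closedir(dir);

    /* Finally make the new directory entries themselves durable. */
    return (rc == 0) ? fsync_path(todir) : -1;
}
```

Note that memory use is bounded by the copy buffer in both variants; the only thing that changes is when the process blocks waiting for the disk.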
Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Maybe not crash out but in this situation.
N=0
while(N>=0):
    CREATE DATABASE new_db_N;
Since the fsync is the part which takes the memory and time but is happening in the background, won't the fsyncs pile up in the background faster than they can be run, filling up the memory and stack?
This is very likely a mistake on my part about how postgres/processes actually work.
-Michael

On Mon, Dec 28, 2009 at 9:55 PM, Andres Freund wrote:
> On Tuesday 29 December 2009 03:53:12 Michael Clemmons wrote:
> > Andres,
> > Great job. Looking through the emails and thinking about why this works I
> > think this patch should significantly speedup 8.4 on most any file
> > system(obviously some more than others) unless the system has significantly
> > reduced memory or a slow single core. On a Celeron with 256 memory I
> > suspect it'll crash out or just hit the swap and be a worse bottleneck.
> > Anyone have something like this to test on?
> Why should it crash? The kernel should just block on writing and write out
> the dirty memory before continuing?
> Pg is not caching anything here...
>
> Andres
Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 29 December 2009 03:53:12 Michael Clemmons wrote:
> Andres,
> Great job. Looking through the emails and thinking about why this works I
> think this patch should significantly speedup 8.4 on most any file
> system(obviously some more than others) unless the system has significantly
> reduced memory or a slow single core. On a Celeron with 256 memory I
> suspect it'll crash out or just hit the swap and be a worse bottleneck.
> Anyone have something like this to test on?
Why should it crash? The kernel should just block on writing and write out the dirty memory before continuing?
Pg is not caching anything here...

Andres
Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Andres,
Great job. Looking through the emails and thinking about why this works, I think this patch should significantly speed up 8.4 on most any file system (obviously some more than others) unless the system has significantly reduced memory or a slow single core. On a Celeron with 256 memory I suspect it'll crash out or just hit the swap and be a worse bottleneck. Anyone have something like this to test on?
-Michael

On Mon, Dec 28, 2009 at 9:05 PM, Andres Freund wrote:
> On Tuesday 29 December 2009 01:46:21 Greg Smith wrote:
> > Andres Freund wrote:
> > > As I said the real benefit only occurred after adding posix_fadvise(..,
> > > FADV_DONTNEED) which is somewhat plausible, because i.e. the directory
> > > entries don't need to get scheduled for every file and because the kernel
> > > can reorder a whole directory nearly sequentially. Without the advice it
> > > the kernel doesn't know in time that it should write that data back and
> > > it wont do it for 5 seconds by default on linux or such...
> > It would be interesting to graph the "Dirty" and "Writeback" figures in
> > /proc/meminfo over time with and without this patch in place. That
> > should make it obvious what the kernel is doing differently in the two
> > cases.
> I did some analysis using blktrace (usefull tool btw) and the results show
> that the io pattern is *significantly* different.
>
> For one with the direct fsyncing nearly no hardware queuing is used and for
> another nearly no requests are merged on software side.
>
> Short stats:
>
> OLD:
>
> Total (8,0):
>  Reads Queued:        2,     8KiB   Writes Queued:     7854, 29672KiB
>  Read Dispatches:     2,     8KiB   Write Dispatches:  1926, 29672KiB
>  Reads Requeued:      0             Writes Requeued:      0
>  Reads Completed:     2,     8KiB   Writes Completed:  2362, 29672KiB
>  Read Merges:         0,     0KiB   Write Merges:      5492, 21968KiB
>  PC Reads Queued:     0,     0KiB   PC Writes Queued:     0,     0KiB
>  PC Read Disp.:     436,     0KiB   PC Write Disp.:       0,     0KiB
>  PC Reads Req.:       0             PC Writes Req.:       0
>  PC Reads Compl.:     0             PC Writes Compl.:  2362
>  IO unplugs:       2395             Timer unplugs:      557
>
> New:
>
> Total (8,0):
>  Reads Queued:        0,     0KiB   Writes Queued:     1716,  5960KiB
>  Read Dispatches:     0,     0KiB   Write Dispatches:   324,  5960KiB
>  Reads Requeued:      0             Writes Requeued:      0
>  Reads Completed:     0,     0KiB   Writes Completed:   550,  5960KiB
>  Read Merges:         0,     0KiB   Write Merges:      1166,  4664KiB
>  PC Reads Queued:     0,     0KiB   PC Writes Queued:     0,     0KiB
>  PC Read Disp.:     226,     0KiB   PC Write Disp.:       0,     0KiB
>  PC Reads Req.:       0             PC Writes Req.:       0
>  PC Reads Compl.:     0             PC Writes Compl.:   550
>  IO unplugs:        503             Timer unplugs:       30
>
> Andres
Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 29 December 2009 01:46:21 Greg Smith wrote:
> Andres Freund wrote:
> > As I said the real benefit only occurred after adding posix_fadvise(..,
> > FADV_DONTNEED) which is somewhat plausible, because i.e. the directory
> > entries don't need to get scheduled for every file and because the kernel
> > can reorder a whole directory nearly sequentially. Without the advice it
> > the kernel doesn't know in time that it should write that data back and
> > it wont do it for 5 seconds by default on linux or such...
> It would be interesting to graph the "Dirty" and "Writeback" figures in
> /proc/meminfo over time with and without this patch in place. That
> should make it obvious what the kernel is doing differently in the two
> cases.
I did some analysis using blktrace (useful tool btw) and the results show that the io pattern is *significantly* different.

For one, with the direct fsyncing nearly no hardware queuing is used, and for another, nearly no requests are merged on the software side.

Short stats:

OLD:

Total (8,0):
 Reads Queued:        2,     8KiB   Writes Queued:     7854, 29672KiB
 Read Dispatches:     2,     8KiB   Write Dispatches:  1926, 29672KiB
 Reads Requeued:      0             Writes Requeued:      0
 Reads Completed:     2,     8KiB   Writes Completed:  2362, 29672KiB
 Read Merges:         0,     0KiB   Write Merges:      5492, 21968KiB
 PC Reads Queued:     0,     0KiB   PC Writes Queued:     0,     0KiB
 PC Read Disp.:     436,     0KiB   PC Write Disp.:       0,     0KiB
 PC Reads Req.:       0             PC Writes Req.:       0
 PC Reads Compl.:     0             PC Writes Compl.:  2362
 IO unplugs:       2395             Timer unplugs:      557

New:

Total (8,0):
 Reads Queued:        0,     0KiB   Writes Queued:     1716,  5960KiB
 Read Dispatches:     0,     0KiB   Write Dispatches:   324,  5960KiB
 Reads Requeued:      0             Writes Requeued:      0
 Reads Completed:     0,     0KiB   Writes Completed:   550,  5960KiB
 Read Merges:         0,     0KiB   Write Merges:      1166,  4664KiB
 PC Reads Queued:     0,     0KiB   PC Writes Queued:     0,     0KiB
 PC Read Disp.:     226,     0KiB   PC Write Disp.:       0,     0KiB
 PC Reads Req.:       0             PC Writes Req.:       0
 PC Reads Compl.:     0             PC Writes Compl.:   550
 IO unplugs:        503             Timer unplugs:       30

Andres
Re: [PERFORM] [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Andres Freund wrote:
> As I said the real benefit only occurred after adding posix_fadvise(..,
> FADV_DONTNEED) which is somewhat plausible, because i.e. the directory
> entries don't need to get scheduled for every file and because the kernel
> can reorder a whole directory nearly sequentially. Without the advice it
> the kernel doesn't know in time that it should write that data back and
> it wont do it for 5 seconds by default on linux or such...

I know they just fiddled with the logic in the last release, but for most of the Linux kernels out there now pdflush wakes up every 5 seconds by default. But typically it only worries about writing things that have been in the queue for 30 seconds or more until you've filled quite a bit of memory, so that's also an interesting number. I tried to document the main tunables here and describe how they fit together at http://www.westnet.com/~gsmith/content/linux-pdflush.htm

It would be interesting to graph the "Dirty" and "Writeback" figures in /proc/meminfo over time with and without this patch in place. That should make it obvious what the kernel is doing differently in the two cases.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com
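For anyone wanting to collect the "Dirty" and "Writeback" figures suggested above, a rough sampling sketch follows. It assumes the usual Linux /proc/meminfo layout of "Key:   value kB" lines; the helper names meminfo_value_kb and print_dirty_writeback_sample are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the line starting with "<key>:" in meminfo-style text and return
 * its numeric value (in kB), or -1 if the key is absent or unparsable.
 * Requiring the ':' right after the key keeps "Writeback" from matching
 * "WritebackTmp".
 */
static long
meminfo_value_kb(const char *text, const char *key)
{
    size_t      keylen = strlen(key);
    const char *line = text;

    assert(text != NULL && key != NULL);

    while (line != NULL && *line != '\0')
    {
        if (strncmp(line, key, keylen) == 0 && line[keylen] == ':')
        {
            long value;

            if (sscanf(line + keylen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        line = strchr(line, '\n');      /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Read /proc/meminfo once (Linux only) and print one sample line. */
static int
print_dirty_writeback_sample(void)
{
    char    buf[8192];
    size_t  n;
    FILE   *fp = fopen("/proc/meminfo", "r");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    printf("Dirty=%ld kB Writeback=%ld kB\n",
           meminfo_value_kb(buf, "Dirty"),
           meminfo_value_kb(buf, "Writeback"));
    return 0;
}
```

Calling print_dirty_writeback_sample() in a loop with a short sleep while CREATE DATABASE runs, once with and once without the patch, would give the two series to graph.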
Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 29 December 2009 00:06:28 Tom Lane wrote:
> Andres Freund writes:
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop. Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
>
> I can't help wondering whether that's real or some kind of
> platform-specific artifact. I get numbers more like 3.5s (fsync off)
> vs 4.5s (fsync on) on a machine where I believe the disks aren't lying
> about write-complete. It makes sense that an fsync at the end would be
> a little bit faster, because it would give the kernel some additional
> freedom in scheduling the required I/O, but it isn't cutting the total
> I/O required at all. So I find it really hard to believe a 10x speedup.
I only comfortably have access to two smaller machines without BBU from here (being in the Hacker Jeopardy at the ccc congress ;-)) and both show this behaviour. I guess it's somewhat filesystem dependent.

Andres
Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tuesday 29 December 2009 00:06:28 Tom Lane wrote:
> Andres Freund writes:
> > This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s
> > on my laptop. Still slower than with fsync off (~0.25) but quite a
> > worthy improvement.
> I can't help wondering whether that's real or some kind of
> platform-specific artifact. I get numbers more like 3.5s (fsync off)
> vs 4.5s (fsync on) on a machine where I believe the disks aren't lying
> about write-complete. It makes sense that an fsync at the end would be
> a little bit faster, because it would give the kernel some additional
> freedom in scheduling the required I/O, but it isn't cutting the total
> I/O required at all. So I find it really hard to believe a 10x speedup.
Well, a template database is about 5.5MB here - that shouldn't take too long when written near-sequentially?
As I said, the real benefit only occurred after adding posix_fadvise(.., FADV_DONTNEED), which is somewhat plausible, because i.e. the directory entries don't need to get scheduled for every file and because the kernel can reorder a whole directory nearly sequentially. Without the advice the kernel doesn't know in time that it should write that data back, and it won't do it for 5 seconds by default on linux or such...

I looked at the strace output - it looks sensible timewise to me. If you're interested I can give you output of that.

Andres
Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Andres Freund writes:
> This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s on my
> laptop. Still slower than with fsync off (~0.25) but quite a worthy
> improvement.

I can't help wondering whether that's real or some kind of platform-specific artifact. I get numbers more like 3.5s (fsync off) vs 4.5s (fsync on) on a machine where I believe the disks aren't lying about write-complete. It makes sense that an fsync at the end would be a little bit faster, because it would give the kernel some additional freedom in scheduling the required I/O, but it isn't cutting the total I/O required at all. So I find it really hard to believe a 10x speedup.

regards, tom lane
Re: [HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Monday 28 December 2009 23:54:51 Andres Freund wrote:
> On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> > On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > > If ppl think its worth it I'll create a ticket
> >
> > Thanks, no need. I will post a patch tomorrow or so.
>
> Well. It was a long day...
>
> Anyway.
> In this patch I delay the fsync done in copy_file and simply do a second
> pass over the directory in copy_dir and fsync everything in that pass.
> Including the directory - which was not done before and actually might be
> necessary in some cases.
> I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that the
> copied file reaches storage before the fsync. Without the speed benefits
> were quite a bit smaller and essentially random (which seems sensible).
>
> This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s on
> my laptop. Still slower than with fsync off (~0.25) but quite a worthy
> improvement.
>
> The benefits are obviously bigger if the template database includes
> anything added.
Obviously the patch would be helpful.

Andres

From bd80748883d1328a71607a447677b0bfb1f54ab0 Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Mon, 28 Dec 2009 23:43:57 +0100
Subject: [PATCH] Delay fsyncing files during copying in CREATE DATABASE - this
 dramatically speeds up CREATE DATABASE on non battery backed rotational
 storage. Additionally fsync() the directory to ensure all metadata reaches
 storage.

---
 src/port/copydir.c |   58 +--
 1 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/src/port/copydir.c b/src/port/copydir.c
index a70477e..cde3dc7 100644
*** a/src/port/copydir.c
--- b/src/port/copydir.c
***************
*** 37,42 ****
--- 37,43 ----
  
  
  static void copy_file(char *fromfile, char *tofile);
+ static void fsync_fname(char *fname);
  
  
  /*
*************** copydir(char *fromdir, char *todir, bool
*** 64,69 ****
--- 65,73 ----
                  (errcode_for_file_access(),
                   errmsg("could not open directory \"%s\": %m", fromdir)));
  
+     /*
+      * Copy all the files
+      */
      while ((xlde = ReadDir(xldir, fromdir)) != NULL)
      {
          struct stat fst;
*************** copydir(char *fromdir, char *todir, bool
*** 89,96 ****
          else if (S_ISREG(fst.st_mode))
              copy_file(fromfile, tofile);
      }
- 
      FreeDir(xldir);
  }
  
  /*
--- 93,120 ----
          else if (S_ISREG(fst.st_mode))
              copy_file(fromfile, tofile);
      }
      FreeDir(xldir);
+ 
+     /*
+      * Be paranoid here and fsync all files to ensure we catch problems.
+      */
+     xldir = AllocateDir(fromdir);
+     if (xldir == NULL)
+         ereport(ERROR,
+                 (errcode_for_file_access(),
+                  errmsg("could not open directory \"%s\": %m", fromdir)));
+ 
+     while ((xlde = ReadDir(xldir, fromdir)) != NULL)
+     {
+         struct stat fst;
+ 
+         if (strcmp(xlde->d_name, ".") == 0 ||
+             strcmp(xlde->d_name, "..") == 0)
+             continue;
+ 
+         snprintf(tofile, MAXPGPATH, "%s/%s", todir, xlde->d_name);
+         fsync_fname(tofile);
+     }
  }
  
  /*
*************** copy_file(char *fromfile, char *tofile)
*** 150,162 ****
      }
  
      /*
!      * Be paranoid here to ensure we catch problems.
       */
!     if (pg_fsync(dstfd) != 0)
!         ereport(ERROR,
!                 (errcode_for_file_access(),
!                  errmsg("could not fsync file \"%s\": %m", tofile)));
! 
      if (close(dstfd))
          ereport(ERROR,
                  (errcode_for_file_access(),
--- 174,185 ----
      }
  
      /*
!      * We tell the kernel here to write the data back in order to make
!      * the later fsync cheaper.
       */
! #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
!     posix_fadvise(dstfd, 0, 0, POSIX_FADV_DONTNEED);
! #endif
      if (close(dstfd))
          ereport(ERROR,
                  (errcode_for_file_access(),
*************** copy_file(char *fromfile, char *tofile)
*** 166,168 ****
--- 189,212 ----
  
      pfree(buffer);
  }
+ 
+ /*
+  * fsync a file
+  */
+ static void
+ fsync_fname(char *fname)
+ {
+     int fd = BasicOpenFile(fname, O_RDWR| PG_BINARY,
+                            S_IRUSR | S_IWUSR);
+ 
+     if (fd < 0)
+         ereport(ERROR,
+                 (errcode_for_file_access(),
+                  errmsg("could not create file \"%s\": %m", fname)));
+ 
+     if (pg_fsync(fd) != 0)
+         ereport(ERROR,
+                 (errcode_for_file_access(),
+                  errmsg("could not fsync file \"%s\": %m", fname)));
+     close(fd);
+ }
-- 
1.6.5.12.gd65df24
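One detail worth illustrating about fsyncing a directory by path: on Linux, open(2) with O_RDWR fails with EISDIR when the path is a directory, so a helper like the patch's fsync_fname() would need to open directories read-only. The sketch below is a standalone illustration and not the code that was committed; the name fsync_fname_sketch, the isdir parameter, and the decision to tolerate EBADF/EINVAL when fsyncing a directory are all assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical variant of fsync_fname(): fsync a file or a directory,
 * given its path. Returns 0 on success, -1 on failure (errno is set).
 */
static int
fsync_fname_sketch(const char *fname, int isdir)
{
    int fd;
    int rc;
    int saved_errno;

    assert(fname != NULL);

    /*
     * Directories cannot be opened for writing (EISDIR), so open them
     * read-only; regular files are opened read-write as in the patch.
     */
    fd = open(fname, isdir ? O_RDONLY : O_RDWR);
    if (fd < 0)
        return -1;

    rc = fsync(fd);
    saved_errno = errno;
    close(fd);

    /*
     * Assumption: a platform or filesystem that cannot fsync a directory
     * reports EBADF or EINVAL; treat that as "nothing to do".
     */
    if (rc != 0 && isdir && (saved_errno == EBADF || saved_errno == EINVAL))
        return 0;

    errno = saved_errno;
    return rc;
}
```

In copydir() such a helper would be called with isdir = 0 for every copied file in the second pass and with isdir = 1 for the target directory itself, including each subdirectory when recursing.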
[HACKERS] Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Saturday 12 December 2009 21:38:41 Andres Freund wrote:
> On Saturday 12 December 2009 21:36:27 Michael Clemmons wrote:
> > If ppl think its worth it I'll create a ticket
> Thanks, no need. I will post a patch tomorrow or so.
Well. It was a long day...

Anyway.
In this patch I delay the fsync done in copy_file and simply do a second pass over the directory in copy_dir and fsync everything in that pass, including the directory - which was not done before and actually might be necessary in some cases.
I added a posix_fadvise(..., FADV_DONTNEED) to make it more likely that the copied file reaches storage before the fsync. Without it the speed benefits were quite a bit smaller and essentially random (which seems sensible).

This speeds up CREATE DATABASE from ~9 seconds to something around 0.8s on my laptop. Still slower than with fsync off (~0.25s) but quite a worthy improvement.

The benefits are obviously bigger if the template database includes anything added.

Andres