Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
On Fri 17-01-14 08:57:25, Robert Haas wrote:
On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote:
So this says to me that the WAL is a place where DIO should really be reconsidered. It's mostly sequential writes that need to hit the disk ASAP, and you need to know that they have hit the disk before you can proceed with other operations.

Ironically enough, we actually *have* an option to use O_DIRECT here. But it doesn't work well. See below.

Also, is the WAL actually ever read under normal (non-recovery) conditions, or is it write-only under normal operation? If it's seldom read, then using DIO for it also avoids some double buffering, since it wouldn't go through the pagecache.

This is the first problem: if replication is in use, then the WAL gets read shortly after it gets written. Using O_DIRECT bypasses the kernel cache for the writes, but then the reads stink.

OK, yes, this is hard to fix with direct IO.

Actually, it's not. Block-level caching is the time-honoured answer to this problem, and it's been used very successfully on a large scale by many organisations, e.g. Facebook with MySQL, O_DIRECT, XFS and flashcache sitting on an SSD in front of rotating storage. There are multiple choices for this now - bcache, dm-cache, flashcache, etc. - and they all solve this same problem. And in many cases they do it better than using the page cache, because you can independently scale the size of the block-level cache... And given the size of SSDs these days, being able to put half a TB of flash cache in front of spinning disks is a pretty inexpensive way of solving such IO problems.

If we're forcing the WAL out to disk because of transaction commit, or because we need to write the buffer protected by a certain WAL record only after the WAL hits the platter, then it's fine. But sometimes we're writing WAL just because we've run out of internal buffer space, and we don't want to block waiting for the write to complete. Opening the file with O_SYNC deprives us of the ability to control the timing of the sync relative to the timing of the write.

O_SYNC has a heavy performance penalty. For ext4 it means an extra fs transaction commit whenever there's any metadata changed on the filesystem. Since the mtime/ctime of files will be changed often, this will be the case very often. Therefore: O_DSYNC.

Maybe it'll be useful to have hints that say "always write this file to disk as quick as you can" and "always postpone writing this file to disk for as long as you can" for WAL and temp files respectively. But the rule for the data files, which are the really important case, is not so simple. fsync() is actually a fine API except that it tends to destroy system throughput. Maybe what we need is just for fsync() to be less aggressive, or a less aggressive version of it. We wouldn't mind waiting an almost arbitrarily long time for fsync to complete if other processes could still get their I/O requests serviced in a reasonable amount of time in the meanwhile.

As I wrote in some other email in this thread, using IO priorities for data file checkpoints might actually be the right answer. They will work for IO submitted by fsync(). The downside is that currently IO priorities / IO scheduling classes work only with the CFQ IO scheduler. And I don't see it being implemented anywhere else, because it's the priority-aware scheduling infrastructure in CFQ that causes all the problems with IO concurrency and scalability...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
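The trade-off above - O_SYNC/O_DSYNC taking away control of sync timing versus an explicit fdatasync() - can be sketched as follows. This is an illustrative sketch only (the file path is hypothetical, not PostgreSQL's actual WAL layout):

```python
import os
import tempfile

wal_path = os.path.join(tempfile.mkdtemp(), "wal_segment")  # hypothetical WAL file

# Option 1: O_DSYNC. Every write() returns only once the *data* is on
# stable storage (unlike O_SYNC, it skips non-essential metadata such as
# mtime, avoiding the extra fs transaction commit mentioned above).
# The kernel controls the timing: the flush cannot be deferred.
fd = os.open(wal_path, os.O_CREAT | os.O_WRONLY | os.O_DSYNC, 0o600)
os.write(fd, b"wal record 1")
os.close(fd)

# Option 2: plain write plus explicit fdatasync(). The application decides
# when to pay the flush cost (e.g. at commit), so a write issued only
# because internal buffers filled up does not have to block on the disk.
fd = os.open(wal_path, os.O_WRONLY | os.O_APPEND)
os.write(fd, b"wal record 2")   # may sit in the page cache...
os.fdatasync(fd)                # ...until the application forces it out
os.close(fd)

print(os.path.getsize(wal_path))
```

Both styles end with the data durable; the difference is purely who chooses the moment the flush happens.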
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby j...@nasby.net wrote:
it's very common to create temporary file data that will never, ever, ever actually NEED to hit disk. Where I work, being able to tell the kernel to avoid flushing those files unless the kernel thinks it's got better things to do with that memory would be EXTREMELY valuable.

Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose. ISTR that there was discussion about implementing something analogous in Linux when ext4 got delayed allocation support, but I don't think it got anywhere and I can't find the discussion now. I think the proposed interface was to create and then unlink the file immediately, which serves as a hint that the application doesn't care about persistence.

You're thinking about O_TMPFILE, which is for making temp files that can't be seen in the filesystem namespace, not for preventing them from being written to disk. I don't really like the idea of overloading a namespace directive to have special writeback connotations. What we are getting into the realm of here is generic user-controlled allocation and writeback policy...

Postgres is far from being the only application that wants this; many people resort to tmpfs because of this: https://lwn.net/Articles/499410/

Yes, we covered the possibility of using tmpfs much earlier in the thread, and came to the conclusion that temp files can be larger than memory, so tmpfs isn't the solution here. :)

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
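The two mechanisms discussed - the create-then-unlink hint and O_TMPFILE - look like this in practice. A minimal sketch, assuming Linux; as the message notes, neither actually prevents writeback today:

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()

# Classic idiom: create the file, then unlink it immediately. The data
# stays reachable through the open fd, but nothing in the namespace refers
# to it, and the blocks are freed on close. The proposal was that the
# kernel *could* treat this as a "never needs to persist" hint (it doesn't).
path = os.path.join(tmpdir, "scratch")
fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_EXCL, 0o600)
os.unlink(path)                 # invisible in the namespace from here on
os.write(fd, b"spill data")
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)          # still readable via the fd
os.close(fd)                    # storage reclaimed here

# O_TMPFILE (Linux 3.11+) does the same in one step: an anonymous file
# that never appears in the directory at all. Note it is a namespace
# feature, not a writeback hint.
try:
    fd2 = os.open(tmpdir, os.O_TMPFILE | os.O_RDWR, 0o600)
    os.write(fd2, b"more spill")
    os.close(fd2)
except (AttributeError, OSError):
    pass  # needs Linux >= 3.11 and filesystem support (ext4, XFS, tmpfs)

print(data)
```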
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote:
On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner da...@fromorbit.com wrote:
On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
On 1/15/14, 12:00 AM, Claudio Freire wrote:
My completely unproven theory is that swapping is overwhelmed by near-misses. Ie: a process touches a page, and before it's actually swapped in, another process touches it too, blocking on the other process' read. But the second process doesn't account for that page when evaluating predictive models (ie: read-ahead), so the next I/O by process 2 is unexpected to the kernel. Then the same with 1. Etc... In essence, swap, by a fluke of its implementation, fails utterly to predict the I/O pattern, and results in far sub-optimal reads. Explicit I/O is free from that effect; all read calls are accountable, and that makes a difference. Maybe, if the kernel could be fixed in that respect, you could consider mmap'd files as a suitable form of temporary storage. But that would depend on the success and availability of such a fix/patch.

Another option is to consider some of the more radical ideas in this thread, but only for temporary data. Our write sequencing and other needs are far less stringent for this stuff. -- Jim C.

I suspect that a lot of the temporary data issues can be solved by using tmpfs for temporary files.

Temp files can collectively reach hundreds of gigs. So unless you have terabytes of RAM, you're going to have to write them back to disk.

But there's something here that I'm not getting - you're talking about a data set that you want to keep cache resident that is at least an order of magnitude larger than the cyclic 5-15 minute WAL dataset that ongoing operations need to manage to avoid IO storms. Where do these temporary files fit into this picture, how fast do they grow, and why do they need to be so large in comparison to the ongoing modifications being made to the database?

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
On 1/15/14, 12:00 AM, Claudio Freire wrote:
My completely unproven theory is that swapping is overwhelmed by near-misses. Ie: a process touches a page, and before it's actually swapped in, another process touches it too, blocking on the other process' read. But the second process doesn't account for that page when evaluating predictive models (ie: read-ahead), so the next I/O by process 2 is unexpected to the kernel. Then the same with 1. Etc... In essence, swap, by a fluke of its implementation, fails utterly to predict the I/O pattern, and results in far sub-optimal reads. Explicit I/O is free from that effect; all read calls are accountable, and that makes a difference. Maybe, if the kernel could be fixed in that respect, you could consider mmap'd files as a suitable form of temporary storage. But that would depend on the success and availability of such a fix/patch.

Another option is to consider some of the more radical ideas in this thread, but only for temporary data. Our write sequencing and other needs are far less stringent for this stuff. -- Jim C.

I suspect that a lot of the temporary data issues can be solved by using tmpfs for temporary files.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Thu, Jan 16, 2014 at 08:48:24PM -0500, Robert Haas wrote:
On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote:
But there's something here that I'm not getting - you're talking about a data set that you want to keep cache resident that is at least an order of magnitude larger than the cyclic 5-15 minute WAL dataset that ongoing operations need to manage to avoid IO storms. Where do these temporary files fit into this picture, how fast do they grow, and why do they need to be so large in comparison to the ongoing modifications being made to the database?

[ snip ]

Temp files are something else again. If PostgreSQL needs to sort a small amount of data, like a kilobyte, it'll use quicksort. But if it needs to sort a large amount of data, like a terabyte, it'll use a merge sort.[1]

IOWs the temp files contain data that requires transformation as part of a query operation. So temp file size is bound by the dataset, and growth is determined by the data retrieval and transformation rate. IOWs, there are two very different IO and caching requirements in play here, and tuning the kernel for one actively degrades the performance of the other. Right, got it now.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 07:31:15PM -0500, Tom Lane wrote:
Dave Chinner da...@fromorbit.com writes:
On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
No, we'd be happy to re-request it during each checkpoint cycle, as long as that wasn't an unduly expensive call to make. I'm not quite sure where such requests ought to live though. One idea is to tie them to file descriptors; but the data to be written might be spread across more files than we really want to keep open at one time.

It would be a property of the inode, as that is how writeback is tracked and timed. Set and queried through a file descriptor, though - it's basically the same context that fadvise works through.

Ah, got it. That would be fine on our end, I think. We could probably live with serially checkpointing data in sets of however-many-files-we-can-have-open, if file descriptors are the place to keep the requests.

Inodes live longer than file descriptors, but there's no guarantee that they live from one fd context to another. Hence my question about persistence ;)

I plead ignorance about what an fd context is.

An open-to-close lifetime: fd = open("some/file", ...); ...; close(fd); is a single context. If multiple fd contexts of the same file overlap in lifetime, then the inode is constantly referenced and won't get reclaimed, so the value won't get lost. However, if there is no open fd context, there are no external references to the inode, so it can get reclaimed. Hence there's no guarantee that the inode is present and the writeback property maintained across close-to-open timeframes.

We're ahead of the game as long as it usually works. *nod*

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 09:54:20PM -0600, Jim Nasby wrote:
On 1/14/14, 3:41 PM, Dave Chinner wrote:
On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman mgor...@suse.de wrote:
Whether the problem is with the system call or the programmer is harder to determine. I think the problem is in part that it's not exactly clear when we should call it. So suppose we want to do a checkpoint. What we used to do a long time ago is write everything, and then fsync it all, and then call it good. But that produced horrible I/O storms. So what we do now is do the writes over a period of time, with sleeps in between, and then fsync it all at the end, hoping that the kernel will write some of it before the fsyncs arrive so that we don't get a huge I/O spike. And that sorta works, and it's definitely better than doing it all at full speed, but it's pretty imprecise. If the kernel doesn't write enough of the data out in advance, then there's still a huge I/O storm when we do the fsyncs and everything grinds to a halt. If it writes out more data than needed in advance, it increases the total number of physical writes because we get less write-combining, and that hurts performance, too.

I think there's a pretty important bit that Robert didn't mention: we have a specific *time* target for when we want all the fsync's to complete. People that have problems here tend to tune checkpoints to complete every 5-15 minutes, and they want the write traffic for the checkpoint spread out over 90% of that time interval. To put it another way, fsync's should be done when 90% of the time to the next checkpoint hits, but preferably not a lot before then.

I think that is pretty much understood. I don't recall anyone mentioning a typical checkpoint period, though, so knowing the typical timeframe of IO storms and how much data is typically written in a checkpoint helps us understand the scale of the problem.
It sounds to me like you want the kernel to start background writeback earlier so that it doesn't build up as much dirty data before you require a flush. There are several ways to do this by tweaking writeback knobs. The simplest is probably just to set /proc/sys/vm/dirty_background_bytes to an appropriate threshold (say 50MB) and dirty_expire_centisecs to a few seconds so that background writeback starts and walks all dirty inodes almost immediately. This will keep a steady stream of low-level background IO going, and fsync should then not take very long.

Except that still won't throttle writes, right? That's the big issue here: our users often can't tolerate big spikes in IO latency. They want user requests to always happen within a specific amount of time.

Right, but that's a different problem, and one that IO scheduling tweaks can have a major effect on. e.g. the deadline scheduler should be able to provide a maximum upper bound on read IO latency even while writes are in progress, though how successful it is depends on the nature of the write load and the architecture of the underlying storage. However, the first problem is dealing with the IO storm problem on fsync. Then we can measure the effect of spreading those writes out in time and determine what triggers read starvations (if they are apparent). Then we can look at whether IO scheduling tweaks or blk-io throttling solves those problems, or whether something else needs to be done to make it work in environments where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], the blk-io throttling works almost perfectly with applications that use direct IO.

So while delaying writes potentially reduces the total amount of data you're writing, users that run into problems here ultimately care more about ensuring that their foreground IO completes in a timely fashion. Understood.
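The writeback knobs named above live under /proc/sys/vm and can be inspected without privileges; changing them needs root, so this sketch only reads the current values and shows the suggested experiment in comments:

```python
# Inspect the writeback tunables discussed above. Writing new values
# (e.g. a 50MB dirty_background_bytes) requires root, so that part is
# shown only as commented shell.
knobs = [
    "dirty_background_bytes",  # byte threshold where background writeback starts
    "dirty_background_ratio",  # ...or as a % of reclaimable memory (used when bytes == 0)
    "dirty_expire_centisecs",  # age at which dirty data becomes eligible for writeback
]
settings = {}
for knob in knobs:
    with open(f"/proc/sys/vm/{knob}") as f:
        settings[knob] = int(f.read())

# Suggested experiment from the discussion (run as root):
#   echo $((50 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
#   echo 300 > /proc/sys/vm/dirty_expire_centisecs
print(settings)
```

Note that dirty_background_bytes and dirty_background_ratio are mutually exclusive: setting one zeroes the other.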
Applications that crunch randomly through large data sets are almost always read IO latency bound.

Fundamentally, though, we need bug reports from people seeing these problems when they see them so we can diagnose them on their systems. Trying to discuss/diagnose these problems without knowing anything about the storage, the kernel version, writeback thresholds, etc. really doesn't work because we can't easily determine a root cause.

So is lsf...@linux-foundation.org the best way to accomplish that?

No. That is just the list for organising the LSF/MM summit. ;) For general pagecache and writeback issues, discussions, etc., linux-fsde...@vger.kernel.org is the list to use. LKML simply has too much noise to be useful these days, so I'd avoid it. Otherwise the filesystem-specific lists are a good place to get help for specific problems (e.g. linux-e...@vger.kernel.org and x...@oss.sgi.com). We tend to cross-post to other relevant lists as triage moves into different areas of the storage stack. Also, along the lines
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
Heikki Linnakangas hlinnakan...@vmware.com writes:
On 01/15/2014 07:50 AM, Dave Chinner wrote:
FWIW [and I know you're probably sick of hearing this by now], the blk-io throttling works almost perfectly with applications that use direct IO.

For checkpoint writes, direct I/O actually would be reasonable. Bypassing the OS cache is a good thing in that case - we don't want the written pages to evict other pages from the OS cache, as we already have them in the PostgreSQL buffer cache. But in exchange for that, we'd have to deal with selecting an order to write pages that's appropriate depending on the filesystem layout, other things happening in the system, etc etc. We don't want to build an I/O scheduler, IMO, but we'd have to.

I don't see that as necessary - nobody else needs to do this with direct IO. Indeed, if the application does ascending-offset-order writeback from within a file, then it's replicating exactly what the kernel page cache writeback does. If what the kernel does is good enough for you, then I can't see how doing the same thing with a background thread doing direct IO is going to need any special help.

Writing one page at a time with O_DIRECT from a single process might be quite slow, so we'd probably need to use writev() or asynchronous I/O to work around that. Yeah, and if the system has multiple spindles, we'd need to be issuing multiple O_DIRECT writes concurrently, no?

What we'd really like for checkpointing is to hand the kernel a boatload (several GB) of dirty pages and say "how about you push all this to disk over the next few minutes, in whatever way seems optimal given the storage hardware and system situation. Let us know when you're done."

The issue there is that the kernel has other triggers for needing to clean data. We have no infrastructure to handle variable writeback deadlines at the moment, nor do we have any infrastructure to do roughly metered writeback of such files to disk. I think we could add it to the infrastructure without too much perturbation of the code, but as you've pointed out, that still leaves the fact that there's no obvious interface to configure such behaviour. Would it need to be persistent?

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Heikki Linnakangas hlinnakan...@vmware.com writes:
On 01/15/2014 07:50 AM, Dave Chinner wrote:
FWIW [and I know you're probably sick of hearing this by now], the blk-io throttling works almost perfectly with applications that use direct IO.

For checkpoint writes, direct I/O actually would be reasonable. Bypassing the OS cache is a good thing in that case - we don't want the written pages to evict other pages from the OS cache, as we already have them in the PostgreSQL buffer cache. But in exchange for that, we'd have to deal with selecting an order to write pages that's appropriate depending on the filesystem layout, other things happening in the system, etc etc. We don't want to build an I/O scheduler, IMO, but we'd have to. Writing one page at a time with O_DIRECT from a single process might be quite slow, so we'd probably need to use writev() or asynchronous I/O to work around that.

Yeah, and if the system has multiple spindles, we'd need to be issuing multiple O_DIRECT writes concurrently, no?

writev effectively does do that, doesn't it? But they do have to be on the same file handle, so that could be a problem. I think we need something like sorted checkpoints sooner or later, anyway.

No, it doesn't. writev() allows you to supply multiple user buffers for a single IO at a fixed offset. If the file is contiguous, then it will be issued as a single IO. If you want concurrent DIO, then you need to use multiple threads or AIO.

What we'd really like for checkpointing is to hand the kernel a boatload (several GB) of dirty pages and say "how about you push all this to disk over the next few minutes, in whatever way seems optimal given the storage hardware and system situation. Let us know when you're done."
And most importantly, "Also, please don't freeze up everything else in the process."

If you hand writeback off to the kernel, then writeback for memory reclaim needs to take precedence over metered writeback. If we are low on memory, then cleaning dirty memory quickly to avoid ongoing allocation stalls, failures and potentially OOM conditions is far more important than anything else.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
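The writev() point above - one syscall, many user buffers, but still a single IO at a single offset - can be demonstrated directly. A minimal sketch with a hypothetical data file:

```python
import os
import tempfile

# os.writev() submits several user buffers as ONE write at the file's
# current offset - it does not fan out across offsets or devices. So it
# batches small buffers into a single (possibly contiguous) IO, but
# concurrent direct IO still needs multiple threads/processes or AIO,
# exactly as the thread says.
fd = os.open(os.path.join(tempfile.mkdtemp(), "datafile"),
             os.O_CREAT | os.O_WRONLY, 0o600)
buffers = [b"page-one", b"page-two", b"page-three"]
written = os.writev(fd, buffers)  # single syscall; buffers concatenated in order
os.close(fd)
print(written)                    # total bytes across all three buffers
```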
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Robert Haas wrote:
On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
Filesystems could in theory provide a facility like atomic write (at least up to a certain size, say in the MB range), but it's not so easy, and when there are no strong use cases fs people are reluctant to make their code more complex unnecessarily. OTOH without widespread atomic write support I understand application developers have a similar stance. So it's kind of a chicken-and-egg problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place due to its data=journal mode, so if someone on the PostgreSQL side wanted to research this, knitting some experimental ext4 patches should be doable.

Atomic 8kB writes would improve performance for us quite a lot. Full page writes to WAL are very expensive. I don't remember what percentage of write-ahead log traffic that accounts for, but it's not small.

Essentially, the atomic writes will be journalled data, so initially there is not going to be any difference in performance between journalling the data in userspace and journalling it in the filesystem journal. Indeed, it could be worse, because the filesystem journal is typically much smaller than a database WAL file, and it will flush much more frequently and without the database having any say in when that occurs. AFAICT, we're stuck with sucky WAL until the block layer and hardware support atomic writes.

FWIW, I've certainly considered adding per-file data journalling capabilities to XFS in the past. If we decide that this is the way to proceed (i.e. as a stepping stone towards hardware atomic write support), then I can go back to my notes from a few years ago and see what still needs to be done to support it.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 07:13:27PM -0500, Tom Lane wrote:
Dave Chinner da...@fromorbit.com writes:
On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
And most importantly, "Also, please don't freeze up everything else in the process."

If you hand writeback off to the kernel, then writeback for memory reclaim needs to take precedence over metered writeback. If we are low on memory, then cleaning dirty memory quickly to avoid ongoing allocation stalls, failures and potentially OOM conditions is far more important than anything else.

I think you're in violent agreement, actually. Jeff's point is exactly that we'd rather the checkpoint deadline slid than that the system goes to hell in a handbasket for lack of I/O cycles. Here "metered" really means "do it as a low-priority task".

No, I meant the opposite - in low memory situations, the system is going to go to hell in a handbasket because we are going to cause a writeback IO storm cleaning memory regardless of these IO priorities. i.e. there is no way we'll let low-priority writeback (intended to avoid IO storms) cause OOM conditions to occur. That is, in OOM conditions, cleaning dirty pages becomes one of the highest-priority tasks of the system.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
Dave Chinner da...@fromorbit.com writes:
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
What we'd really like for checkpointing is to hand the kernel a boatload (several GB) of dirty pages and say "how about you push all this to disk over the next few minutes, in whatever way seems optimal given the storage hardware and system situation. Let us know when you're done."

The issue there is that the kernel has other triggers for needing to clean data. We have no infrastructure to handle variable writeback deadlines at the moment, nor do we have any infrastructure to do roughly metered writeback of such files to disk. I think we could add it to the infrastructure without too much perturbation of the code, but as you've pointed out, that still leaves the fact that there's no obvious interface to configure such behaviour. Would it need to be persistent?

No, we'd be happy to re-request it during each checkpoint cycle, as long as that wasn't an unduly expensive call to make. I'm not quite sure where such requests ought to live though. One idea is to tie them to file descriptors; but the data to be written might be spread across more files than we really want to keep open at one time.

It would be a property of the inode, as that is how writeback is tracked and timed. Set and queried through a file descriptor, though - it's basically the same context that fadvise works through.

But the only other idea that comes to mind is some kind of global sysctl, which would probably have security and permissions issues. (One thing that hasn't been mentioned yet in this thread, but maybe is worth pointing out now, is that Postgres does not run as root, and definitely doesn't want to. So we don't want a knob that would require root permissions to twiddle.)

I have assumed all along that requiring root to do stuff would be a bad thing. :)

We could probably live with serially checkpointing data in sets of however-many-files-we-can-have-open, if file descriptors are the place to keep the requests.

Inodes live longer than file descriptors, but there's no guarantee that they live from one fd context to another. Hence my question about persistence ;)

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
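The proposed "set a writeback property through a file descriptor" interface does not exist, but the channel it would use does: posix_fadvise() already hands per-fd cache hints to the kernel. A sketch of the closest existing analogue (the file path is hypothetical):

```python
import os
import tempfile

# There is no "write this inode back over the next N minutes" knob today.
# The nearest existing per-fd machinery is posix_fadvise(), which attaches
# cache-management hints to the open file/inode - the same plumbing the
# proposed writeback property would use.
path = os.path.join(tempfile.mkdtemp(), "relation_segment")  # hypothetical data file
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.write(fd, b"x" * 8192)
os.fdatasync(fd)  # pages must be clean before DONTNEED can drop them

# Tell the kernel we won't reuse these pages soon; like the proposed
# interface, the hint is advisory and unprivileged - no root required,
# which matters for Postgres as noted above.
os.posix_fadvise(fd, 0, 8192, os.POSIX_FADV_DONTNEED)
os.close(fd)
print(os.path.getsize(path))
```

As the message says, the hint lives with the inode: once the last fd is closed and the inode is reclaimed, any such property would be lost, which is exactly the persistence question being raised.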
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
doesn't really know what a device is capable of - it can only measure what the current IO workload is achieving - and it can change based on the IO workload characteristics. Hence applications can track this as well as the kernel does if they need this information for any reason.

Reimplementing I/O schedulers and all the rest of the work that the kernel provides inside Postgres just seems like something outside our competency and that none of us is really excited about doing.

Nobody needs to reimplement IO schedulers in userspace. Direct IO still goes through the block layers where all that merging and IO scheduling occurs.

That argument goes both ways - providing fine-grained control over the page cache contents to userspace doesn't get me excited, either. In fact, it scares the living daylights out of me. It's complex, it's fragile, and it introduces constraints into everything we do in the kernel. Any one of those reasons is grounds for saying no to a proposal, but this idea hits the trifecta.

I'm not saying that O_DIRECT is easy or perfect, but it seems to me to be a more robust, secure, maintainable and simpler solution than trying to give applications direct control over complex internal kernel structures and algorithms.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
On 01/13/2014 02:26 PM, Mel Gorman wrote:
Really? zone_reclaim_mode is often a complete disaster unless the workload is partitioned to fit within NUMA nodes. On older kernels enabling it would sometimes cause massive stalls. I'm actually very surprised to hear it fixes anything and would be interested in hearing more about what sort of circumstances would convince you to enable that thing.

So the problem with the default setting is that it pretty much isolates all FS cache for PostgreSQL to whichever socket the postmaster is running on, and makes the other FS cache unavailable. This means that, for example, if you have two memory banks, then only one of them is available for PostgreSQL filesystem caching ... essentially cutting your available cache in half.

No matter what default NUMA allocation policy we set, there will be an application for which that behaviour is wrong. As such, we've had tools for setting application-specific NUMA policies for quite a few years now, e.g.:

$ man 8 numactl

    --interleave=nodes, -i nodes
        Set a memory interleave policy. Memory will be allocated using
        round robin on nodes. When memory cannot be allocated on the
        current interleave target fall back to other nodes. Multiple
        nodes may be specified on --interleave, --membind and
        --cpunodebind.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote: On 2014-01-13 17:13:51 -0800, James Bottomley wrote: a file into a user provided buffer, thus obtaining a page cache entry and a copy in their userspace buffer, then insert the page of the user buffer back into the page cache as the page cache page ... that's right, isn't it postgress people? Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page isn't needed anymore when reading. And we'd normally write if the page is dirty. So why, exactly, do you even need the kernel page cache here? You've got direct access to the copy of data read into userspace, and you want direct control of when and how the data in that buffer is written and reclaimed. Why push that data buffer back into the kernel and then have to add all sorts of kernel interfaces to control the page you already have control of? Effectively you end up with buffered read/write that's also mapped into the page cache. It's a pretty awful way to hack around mmap. Well, the problem is that you can't really use mmap() for the things we do. Postgres' durability works by guaranteeing that our journal entries (called WAL := Write Ahead Log) are written synced to disk before the corresponding entries of tables and indexes reach the disk. That also allows to group together many random-writes into a few contiguous writes fdatasync()ed at once. Only during a checkpointing phase the big bulk of the data is then (slowly, in the background) synced to disk. Which is the exact algorithm most journalling filesystems use for ensuring durability of their metadata updates. Indeed, here's an interesting piece of architecture that you might like to consider: * Neither XFS and BTRFS use the kernel page cache to back their metadata transaction engines. Why not? 
Because the page cache is too simplistic to adequately represent the complex object hierarchies that the filesystems have, so its flat LRU reclaim algorithms and writeback control mechanisms are a terrible fit and cause lots of performance issues under memory pressure.

IOWs, the two most complex high-performance transaction engines in the Linux kernel have moved to fully customised cache and (direct) IO implementations because their requirements for scalability and performance are far more complex than the kernel page cache infrastructure can provide. Just food for thought.

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
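The userspace-managed buffer Dave describes is essentially what an O_DIRECT read gives you: the data lands in your own buffer with no page cache copy to manage. A minimal sketch (my own helper, not from the thread; it falls back to a buffered read because some filesystems, e.g. tmpfs, reject O_DIRECT):

```python
import mmap
import os

BLKSZ = 4096  # O_DIRECT requires block-aligned buffers, offsets and lengths

def read_first_block(path):
    """Read the first block of `path` into a page-aligned userspace
    buffer, bypassing the kernel page cache where O_DIRECT is supported."""
    flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)  # O_DIRECT is Linux-only
    try:
        fd = os.open(path, flags)
    except OSError:
        fd = os.open(path, os.O_RDONLY)  # e.g. tmpfs: fall back to buffered IO
    try:
        buf = mmap.mmap(-1, BLKSZ)  # anonymous mapping: page-aligned memory
        n = os.readv(fd, [buf])
        return bytes(buf[:n])
    finally:
        os.close(fd)
```

Once the read bypasses the page cache, reclaim and writeback of that copy are entirely the application's problem - which is exactly the control XFS and BTRFS wanted for their metadata caches.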
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote: On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman mgor...@suse.de wrote: Amen to that. Actually, I think NUMA can be (mostly?) fixed by setting zone_reclaim_mode; is there some other problem besides that?

Really? zone_reclaim_mode is often a complete disaster unless the workload is partitioned to fit within NUMA nodes. On older kernels enabling it would sometimes cause massive stalls. I'm actually very surprised to hear it fixes anything and would be interested in hearing more about what sort of circumstances would convince you to enable that thing.

By set I mean set to zero. We've seen multiple instances of people complaining about large amounts of system memory going unused because this setting defaulted to 1.

The other thing that comes to mind is the kernel's caching behavior. We've talked a lot over the years about the difficulties of getting the kernel to write data out when we want it to and to not write data out when we don't want it to. Is sync_file_range() broken? I don't know. I think a few of us have played with it and not been able to achieve a clear win.

Before you go back down the sync_file_range() path, keep in mind that it is not a guaranteed data-integrity operation: it does not force device cache flushes like fsync()/fdatasync() do. Hence it guarantees neither that the metadata pointing at the written data nor the volatile caches in the storage path have been flushed. IOWs, using sync_file_range() does not avoid the need to fsync() a file for data integrity purposes.

Whether the problem is with the system call or the programmer is harder to determine. I think the problem is in part that it's not exactly clear when we should call it. So suppose we want to do a checkpoint. What we used to do a long time ago is write everything, and then fsync it all, and then call it good. But that produced horrible I/O storms.
So what we do now is do the writes over a period of time, with sleeps in between, and then fsync it all at the end, hoping that the kernel will write some of it before the fsyncs arrive so that we don't get a huge I/O spike. And that sorta works, and it's definitely better than doing it all at full speed, but it's pretty imprecise. If the kernel doesn't write enough of the data out in advance, then there's still a huge I/O storm when we do the fsyncs and everything grinds to a halt. If it writes out more data than needed in advance, it increases the total number of physical writes because we get less write-combining, and that hurts performance, too.

Yup, the kernel defaults to maximising bulk write throughput, which means it waits until the last possible moment to issue write IO, precisely to maximise write combining, optimise delayed allocation, etc. There are many good reasons for doing this, and for the majority of workloads it is the right behaviour to have.

It sounds to me like you want the kernel to start background writeback earlier so that it doesn't build up as much dirty data before you require a flush. There are several ways to do this by tweaking writeback knobs. The simplest is probably just to set /proc/sys/vm/dirty_background_bytes to an appropriate threshold (say 50MB) and dirty_expire_centisecs to a few seconds so that background writeback starts and walks all dirty inodes almost immediately. This will keep a steady stream of low-level background IO going, and fsync should then not take very long.

Fundamentally, though, we need bug reports from people seeing these problems when they see them so we can diagnose them on their systems. Trying to discuss/diagnose these problems without knowing anything about the storage, the kernel version, writeback thresholds, etc. really doesn't work because we can't easily determine a root cause.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
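Robert's "spread the writes, then fsync" checkpoint and Dave's integrity caveat can be combined at the application level with sync_file_range(2). A rough sketch (Linux-only; the helper name and chunking are my own, and the ctypes binding is needed because Python's os module has no sync_file_range wrapper):

```python
import ctypes
import os

# Bind sync_file_range(2) from libc; skip the hint where it's unavailable.
try:
    _libc = ctypes.CDLL(None, use_errno=True)
    _sync_file_range = _libc.sync_file_range
    _sync_file_range.argtypes = [ctypes.c_int, ctypes.c_longlong,
                                 ctypes.c_longlong, ctypes.c_uint]
except (OSError, AttributeError):
    _sync_file_range = None  # non-Linux platform

SYNC_FILE_RANGE_WRITE = 2  # start writeback on the range without waiting

def checkpoint_write(fd, chunks):
    """Write chunks sequentially, nudging each into background writeback,
    then fdatasync() once at the end. The hint spreads the IO out over
    time, but sync_file_range() alone flushes neither metadata nor the
    device's volatile write cache, so the final fdatasync() is still
    required for data integrity."""
    offset = 0
    for chunk in chunks:
        os.pwrite(fd, chunk, offset)
        if _sync_file_range is not None:
            _sync_file_range(fd, offset, len(chunk), SYNC_FILE_RANGE_WRITE)
        offset += len(chunk)
    os.fdatasync(fd)  # the actual data-integrity point
```

This is the "steady stream of low-level background IO" pattern driven from userspace instead of from the vm.dirty_* knobs; the two approaches are complementary.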
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 11:40:38AM -0800, Kevin Grittner wrote: Robert Haas robertmh...@gmail.com wrote: Jan Kara j...@suse.cz wrote: Just to get some idea about the sizes - how large are the checkpoints we are talking about that cause IO stalls?

Big. To quantify that, in a production setting we were seeing pauses of up to two minutes with shared_buffers set to 8GB and default dirty page settings for Linux, on a machine with 256GB RAM and 512MB of non-volatile cache on the RAID controller.

There's your problem. By default, background writeback doesn't start until 10% of memory is dirtied, and on your machine that's 25GB of RAM. That's way too high for your workload.

It appears to me that we are seeing large memory machines much more commonly in data centers - a couple of years ago 256GB RAM was only seen in supercomputers. Hence machines of this size are moving from the "tweaking settings for supercomputers is OK" class to the "tweaking settings for enterprise servers is not OK" class. Perhaps what we need to do is deprecate dirty_ratio and dirty_background_ratio as the defaults and move to the byte-based values as the defaults, capped appropriately: e.g. 10/20% of RAM for small machines down to a couple of GB for large machines.

To eliminate stalls we had to drop shared_buffers to 2GB (to limit how many dirty pages could build up out-of-sight from the OS), spread checkpoints to 90% of allowed time (almost no gap between finishing one checkpoint and starting the next) and crank up the background writer so that no dirty page sat unwritten in PostgreSQL shared_buffers for more than 4 seconds. Less aggressive pushing to the OS resulted in the avalanche of writes I previously described, with the corresponding I/O stalls. We approached that incrementally, and that's the point where stalls stopped occurring. We did not adjust the OS thresholds for writing dirty pages, although I know of others who have had to do so.
Essentially, changing dirty_background_bytes, dirty_bytes and dirty_expire_centisecs to be much smaller should make the kernel start writeback much sooner, so you shouldn't have to limit the amount of buffers the application holds to prevent major fsync-triggered stalls... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
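For reference, the knobs Dave mentions live under /proc/sys/vm and are easy to inspect before changing anything. A small sketch (Linux-only; the helper name is mine, and the values in the trailing comment are the examples from this thread, not universal defaults):

```python
from pathlib import Path

VM = Path("/proc/sys/vm")

def writeback_knobs():
    """Read the current writeback tunables (Linux only). A byte-based
    knob reading 0 means its ratio-based variant is in effect instead."""
    names = ("dirty_background_bytes", "dirty_background_ratio",
             "dirty_bytes", "dirty_ratio", "dirty_expire_centisecs")
    return {n: int((VM / n).read_text())
            for n in names if (VM / n).exists()}

# Dave's suggestion expressed as sysctl settings (illustrative values):
#   vm.dirty_background_bytes = 52428800   # ~50MB: start writeback early
#   vm.dirty_expire_centisecs = 300        # consider dirty data old after 3s
```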
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 08:03:28AM +1300, Gavin Flower wrote: On 14/01/14 14:09, Dave Chinner wrote: On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote: On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com wrote: [...]

The more ambitious and interesting direction is to let Postgres tell the kernel what it needs to know to manage everything. To do that we would need the ability to control when pages are flushed out. This is absolutely necessary to maintain consistency. Postgres would need to be able to mark pages as unflushable until some point in time in the future when the journal is flushed. We discussed various ways that interface could work but it would be tricky to keep it low enough overhead to be workable.

IMO, the concept of allowing userspace to pin dirty page cache pages in memory is just asking for trouble. Apart from the obvious memory reclaim and OOM issues, some filesystems won't be able to move their journals forward until the data is flushed. i.e. ordered-mode data writeback on ext3 will have all sorts of deadlock issues that result from pinning pages and then issuing fsync() on another file which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); fsync(fd);? If fsync() blocks because there are pinned pages, and there's no other thread to unpin them, then that code just deadlocked. If fsync() doesn't block and skips the pinned pages, then we haven't done an fsync() at all, and so violated the expectation that users have that after fsync() returns their data is safe on disk. And if we return an error to fsync(), then what the hell does the user do if it is some other application we don't know about that has pinned the pages? And if the kernel unpins them after some time, then we just violated the application's consistency guarantees [...]

What if Postgres could tell the kernel how strongly it wanted to hold on to the pages?
That doesn't get rid of the problems, it just makes it harder to diagnose them when they occur. :/ Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 03:03:39PM -0800, Kevin Grittner wrote: Dave Chinner da...@fromorbit.com wrote: Essentially, changing dirty_background_bytes, dirty_bytes and dirty_expire_centisecs to be much smaller should make the kernel start writeback much sooner and so you shouldn't have to limit the amount of buffers the application has to prevent major fsync triggered stalls... Is there any rule of thumb about where to start with these?

There's no absolute rule here, but the threshold for background writeback needs to consider the amount of dirty data being generated, the rate at which it can be retired, and the checkpoint period the application is configured with. i.e. it needs to be slow enough not to cause serious read IO perturbations, but still fast enough that it avoids peaks at synchronisation points. And most importantly, it needs to be fast enough that it can complete writeback of all the dirty data in a checkpoint before the next checkpoint is triggered.

In general, I find that threshold to be somewhere around 2-5s worth of data writeback - enough to keep a good amount of write combining and the IO pipeline full as work is done, but no more. e.g. if your workload results in writeback rates of 500MB/s, then I'd be setting the dirty limit somewhere around 1-2GB as an initial guess. It's basically a simple trade-off: buffering space against writeback latency. Some applications perform well with increased buffering space (e.g. 10-20s of writeback) while others perform better with extremely low writeback latency (e.g. 0.5-1s).

For example, should a database server maybe have dirty_background_bytes set to 75% of the non-volatile write cache present on the controller, in an attempt to make sure that there is always some slack space for writes?

I don't think the hardware cache size matters, as it's easy to fill such caches very quickly, and so after a couple of seconds the controller will fall back to disk speed anyway.
IMO, what matters is that the threshold is large enough to adequately buffer writes to smooth peaks and troughs in the pipeline. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
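Dave's 2-5 seconds rule of thumb reduces to a one-line calculation. A sketch (function name mine):

```python
def background_threshold_bytes(writeback_bytes_per_sec, buffer_seconds=2.0):
    """Size dirty_background_bytes as a few seconds' worth of device
    writeback, per the 2-5s rule of thumb above."""
    return int(writeback_bytes_per_sec * buffer_seconds)

# e.g. a 500MB/s workload with a 2-4s buffer lands in the 1-2GB range
low = background_threshold_bytes(500 * 1024 ** 2, 2)
high = background_threshold_bytes(500 * 1024 ** 2, 4)
```

Widen buffer_seconds toward 10-20s for throughput-hungry workloads, or shrink it toward 0.5-1s when writeback latency dominates, as described above.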
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 05:38:10PM -0700, Jonathan Corbet wrote: On Wed, 15 Jan 2014 09:23:52 +1100 Dave Chinner da...@fromorbit.com wrote: It appears to me that we are seeing large memory machines much more commonly in data centers - a couple of years ago 256GB RAM was only seen in supercomputers. Hence machines of this size are moving from the "tweaking settings for supercomputers is OK" class to the "tweaking settings for enterprise servers is not OK" class. Perhaps what we need to do is deprecate dirty_ratio and dirty_background_ratio as the defaults and move to the byte-based values as the defaults, capped appropriately: e.g. 10/20% of RAM for small machines down to a couple of GB for large machines.

I had thought that was already in the works... it hits people on far smaller systems than those described here. http://lwn.net/Articles/572911/ I wonder if anybody ever finished this work out for 3.14?

Not that I know of. This patch was suggested as the solution to the slow/fast drive issue that started the whole thread: http://thread.gmane.org/gmane.linux.kernel/1584789/focus=1587059 but I don't see it in a current kernel. It might be in Andrew's tree for 3.14, but I haven't checked.

However, most of the discussion in that thread about dirty limits was a side show that rehashed old territory. Rate limiting and throttling in a generic, scalable manner is a complex problem. We've got some of the infrastructure we need to solve the problem, but there was no conclusion as to the correct way to connect all the dots. Perhaps it's another topic for the LSFMM conf?

Cheers, Dave. -- Dave Chinner da...@fromorbit.com