Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-22 Thread Dave Chinner
On Tue, Jan 21, 2014 at 09:20:52PM +0100, Jan Kara wrote:
 On Fri 17-01-14 08:57:25, Robert Haas wrote:
  On Fri, Jan 17, 2014 at 7:34 AM, Jeff Layton jlay...@redhat.com wrote:
   So this says to me that the WAL is a place where DIO should really be
   reconsidered. It's mostly sequential writes that need to hit the disk
   ASAP, and you need to know that they have hit the disk before you can
   proceed with other operations.
  
  Ironically enough, we actually *have* an option to use O_DIRECT here.
  But it doesn't work well.  See below.
  
   Also, is the WAL actually ever read under normal (non-recovery)
   conditions or is it write-only under normal operation? If it's seldom
   read, then using DIO for them also avoids some double buffering since
   they wouldn't go through pagecache.
  
  This is the first problem: if replication is in use, then the WAL gets
  read shortly after it gets written.  Using O_DIRECT bypasses the
  kernel cache for the writes, but then the reads stink.
   OK, yes, this is hard to fix with direct IO.

Actually, it's not. Block level caching is the time-honoured answer
to this problem, and it's been used very successfully on a large
scale by many organisations. e.g. facebook with MySQL, O_DIRECT, XFS
and flashcache sitting on an SSD in front of rotating storage.
There are multiple choices for this now - bcache, dm-cache,
flashcache, etc, and they all solve this same problem. And in many
cases do it better than using the page cache because you can
independently scale the size of the block level cache...

And given the size of SSDs these days, being able to put half a TB
of flash cache in front of spinning disks is a pretty inexpensive
way of solving such IO problems

  If we're forcing the WAL out to disk because of transaction commit or
  because we need to write the buffer protected by a certain WAL record
  only after the WAL hits the platter, then it's fine.  But sometimes
  we're writing WAL just because we've run out of internal buffer space,
  and we don't want to block waiting for the write to complete.  Opening
  the file with O_SYNC deprives us of the ability to control the timing
  of the sync relative to the timing of the write.
   O_SYNC has a heavy performance penalty. For ext4 it means an extra fs
 transaction commit whenever there's any metadata changed on the filesystem.
 Since mtime & ctime of files will be changed often, that will very often
 be the case.

Therefore: O_DSYNC.

  Maybe it'll be useful to have hints that say "always write this file
  to disk as quick as you can" and "always postpone writing this file to
  disk for as long as you can" for WAL and temp files respectively.  But
  the rule for the data files, which are the really important case, is
  not so simple.  fsync() is actually a fine API except that it tends to
  destroy system throughput.  Maybe what we need is just for fsync() to
  be less aggressive, or a less aggressive version of it.  We wouldn't
  mind waiting an almost arbitrarily long time for fsync to complete if
  other processes could still get their I/O requests serviced in a
  reasonable amount of time in the meanwhile.
   As I wrote in some other email in this thread, using IO priorities for
 data file checkpoint might be actually the right answer. They will work for
 IO submitted by fsync(). The downside is that currently IO priorities / IO
 scheduling classes work only with CFQ IO scheduler.

And I don't see it being implemented anywhere else because it's the
priority aware scheduling infrastructure in CFQ that causes all the
problems with IO concurrency and scalability...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Dave Chinner
On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
 On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby j...@nasby.net wrote:
  it's very common to create temporary file data that will never, ever, ever
  actually NEED to hit disk. Where I work being able to tell the kernel to
  avoid flushing those files unless the kernel thinks it's got better things
  to do with that memory would be EXTREMELY valuable
 
 Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.
 
 ISTR that there was discussion about implementing something analogous
 in Linux when ext4 got delayed allocation support, but I don't think
 it got anywhere and I can't find the discussion now. I think the
 proposed interface was to create and then unlink the file immediately,
 which serves as a hint that the application doesn't care about
 persistence.

You're thinking about O_TMPFILE, which is for making temp files that
can't be seen in the filesystem namespace, not for preventing them
from being written to disk.

I don't really like the idea of overloading a namespace directive to
have special writeback connotations. What we're getting into here is
the realm of generic user-controlled allocation and writeback
policy...

 Postgres is far from being the only application that wants this; many
 people resort to tmpfs because of this:
 https://lwn.net/Articles/499410/

Yes, we covered the possibility of using tmpfs much earlier in the
thread, and came to the conclusion that temp files can be larger
than memory so tmpfs isn't the solution here. :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Dave Chinner
On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote:
 On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner da...@fromorbit.com wrote:
 
  On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
   On 1/15/14, 12:00 AM, Claudio Freire wrote:
   My completely unproven theory is that swapping is overwhelmed by
   near-misses. Ie: a process touches a page, and before it's
   actually swapped in, another process touches it too, blocking on
   the other process' read. But the second process doesn't account
   for that page when evaluating predictive models (ie: read-ahead),
   so the next I/O by process 2 is unexpected to the kernel. Then
   the same with 1. Etc... In essence, swap, by a fluke of its
   implementation, fails utterly to predict the I/O pattern, and
   results in far sub-optimal reads.
   
   Explicit I/O is free from that effect, all read calls are
   accountable, and that makes a difference.
   
   Maybe, if the kernel could be fixed in that respect, you could
   consider mmap'd files as a suitable form of temporary storage.
   But that would depend on the success and availability of such a
   fix/patch.
  
   Another option is to consider some of the more radical ideas in
   this thread, but only for temporary data. Our write sequencing and
   other needs are far less stringent for this stuff.  -- Jim C.
 
  I suspect that a lot of the temporary data issues can be solved by
  using tmpfs for temporary files
 
 
 Temp files can collectively reach hundreds of gigs.

So unless you have terabytes of RAM you're going to have to write
them back to disk.

But there's something here that I'm not getting - you're talking
about a data set that you want to keep cache resident that is at
least an order of magnitude larger than the cyclic 5-15 minute WAL
dataset that ongoing operations need to manage to avoid IO storms.
Where do these temporary files fit into this picture, how fast do
they grow and why do they need to be so large in comparison to
the ongoing modifications being made to the database?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Dave Chinner
On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
 On 1/15/14, 12:00 AM, Claudio Freire wrote:
 My completely unproven theory is that swapping is overwhelmed by
 near-misses. Ie: a process touches a page, and before it's
 actually swapped in, another process touches it too, blocking on
 the other process' read. But the second process doesn't account
 for that page when evaluating predictive models (ie: read-ahead),
 so the next I/O by process 2 is unexpected to the kernel. Then
 the same with 1. Etc... In essence, swap, by a fluke of its
 implementation, fails utterly to predict the I/O pattern, and
 results in far sub-optimal reads.
 
 Explicit I/O is free from that effect, all read calls are
 accountable, and that makes a difference.
 
 Maybe, if the kernel could be fixed in that respect, you could
 consider mmap'd files as a suitable form of temporary storage.
 But that would depend on the success and availability of such a
 fix/patch.
 
 Another option is to consider some of the more radical ideas in
 this thread, but only for temporary data. Our write sequencing and
 other needs are far less stringent for this stuff.  -- Jim C.

I suspect that a lot of the temporary data issues can be solved by
using tmpfs for temporary files

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Dave Chinner
On Thu, Jan 16, 2014 at 08:48:24PM -0500, Robert Haas wrote:
 On Thu, Jan 16, 2014 at 7:31 PM, Dave Chinner da...@fromorbit.com wrote:
  But there's something here that I'm not getting - you're talking
  about a data set that you want to keep cache resident that is at
  least an order of magnitude larger than the cyclic 5-15 minute WAL
  dataset that ongoing operations need to manage to avoid IO storms.
  Where do these temporary files fit into this picture, how fast do
  they grow and why do they need to be so large in comparison to
  the ongoing modifications being made to the database?

[ snip ]

 Temp files are something else again.  If PostgreSQL needs to sort a
 small amount of data, like a kilobyte, it'll use quicksort.  But if it
 needs to sort a large amount of data, like a terabyte, it'll use a
 merge sort.[1] 

IOWs the temp files contain data that requires transformation as
part of a query operation. So, temp file size is bound by the
dataset, growth determined by data retrieval and transformation
rate.

IOWs, there are two very different IO and caching requirements in
play here and tuning the kernel for one actively degrades the
performance of the other. Right, got it now.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Dave Chinner
On Wed, Jan 15, 2014 at 07:31:15PM -0500, Tom Lane wrote:
 Dave Chinner da...@fromorbit.com writes:
  On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
  No, we'd be happy to re-request it during each checkpoint cycle, as
  long as that wasn't an unduly expensive call to make.  I'm not quite
  sure where such requests ought to live though.  One idea is to tie
  them to file descriptors; but the data to be written might be spread
  across more files than we really want to keep open at one time.
 
  It would be a property of the inode, as that is how writeback is
  tracked and timed. Set and queried through a file descriptor,
  though - it's basically the same context that fadvise works
  through.
 
 Ah, got it.  That would be fine on our end, I think.
 
  We could probably live with serially checkpointing data
  in sets of however-many-files-we-can-have-open, if file descriptors are
  the place to keep the requests.
 
  Inodes live longer than file descriptors, but there's no guarantee
  that they live from one fd context to another. Hence my question
  about persistence ;)
 
 I plead ignorance about what an fd context is.

open-to-close life time.

fd = open("some/file", ...);
.....
close(fd);

is a single context. If multiple fd contexts of the same file
overlap in lifetime, then the inode is constantly referenced and the
inode won't get reclaimed, so the value won't get lost. However, if
there is no open fd context, there are no external references to the
inode so it can get reclaimed. Hence there's no guarantee that the
inode is present and the writeback property maintained across
close-to-open timeframes.

 We're ahead of the game as long as it usually works.

*nod*

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Tue, Jan 14, 2014 at 09:54:20PM -0600, Jim Nasby wrote:
 On 1/14/14, 3:41 PM, Dave Chinner wrote:
 On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
 On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman mgor...@suse.de
 wrote: Whether the problem is with the system call or the
 programmer is harder to determine.  I think the problem is in
 part that it's not exactly clear when we should call it.  So
 suppose we want to do a checkpoint.  What we used to do a long
 time ago is write everything, and then fsync it all, and then
 call it good.  But that produced horrible I/O storms.  So what
 we do now is do the writes over a period of time, with sleeps in
 between, and then fsync it all at the end, hoping that the
 kernel will write some of it before the fsyncs arrive so that we
 don't get a huge I/O spike.  And that sorta works, and it's
 definitely better than doing it all at full speed, but it's
 pretty imprecise.  If the kernel doesn't write enough of the
 data out in advance, then there's still a huge I/O storm when we
 do the fsyncs and everything grinds to a halt.  If it writes out
 more data than needed in advance, it increases the total number
 of physical writes because we get less write-combining, and that
 hurts performance, too.
 
 I think there's a pretty important bit that Robert didn't mention:
 we have a specific *time* target for when we want all the fsync's
 to complete. People that have problems here tend to tune
 checkpoints to complete every 5-15 minutes, and they want the
 write traffic for the checkpoint spread out over 90% of that time
 interval. To put it another way, fsync's should be done when 90%
 of the time to the next checkpoint hits, but preferably not a lot
 before then.

I think that is pretty much understood. I don't recall anyone
mentioning a typical checkpoint period, though, so knowing the
typical timeframe of IO storms and how much data is typically
written in a checkpoint helps us understand the scale of the
problem.

 It sounds to me like you want the kernel to start background
 writeback earlier so that it doesn't build up as much dirty data
 before you require a flush. There are several ways to do this by
 tweaking writeback knobs. The simplest is probably just to set
 /proc/sys/vm/dirty_background_bytes to an appropriate threshold
 (say 50MB) and dirty_expire_centisecs to a few seconds so that
 background writeback starts and walks all dirty inodes almost
 immediately. This will keep a steady stream of low level
 background IO going, and fsync should then not take very long.
 
 Except that still won't throttle writes, right? That's the big
 issue here: our users often can't tolerate big spikes in IO
 latency. They want user requests to always happen within a
 specific amount of time.

Right, but that's a different problem and one that io scheduling
tweaks can have a major effect on. e.g. the deadline scheduler
should be able to provide a maximum upper bound on read IO latency
even while writes are in progress, though how successful it is is
dependent on the nature of the write load and the architecture of
the underlying storage.

However, the first problem is dealing with the IO storm problem on
fsync. Then we can measure the effect of spreading those writes out
in time and determine what triggers read starvations (if they are
apparent). Then we can look at whether IO scheduling tweaks or
whether blk-io throttling solves those problems. Or whether
something else needs to be done to make it work in environments
where problems are manifesting.

FWIW [and I know you're probably sick of hearing this by now], but
the blk-io throttling works almost perfectly with applications that
use direct IO.

 So while delaying writes potentially reduces the total amount of
 data you're writing, users that run into problems here ultimately
 care more about ensuring that their foreground IO completes in a
 timely fashion.

Understood. Applications that crunch randomly through large data
sets are almost always read IO latency bound

 Fundamentally, though, we need bug reports from people seeing
 these problems when they see them so we can diagnose them on
 their systems. Trying to discuss/diagnose these problems without
 knowing anything about the storage, the kernel version, writeback
 thresholds, etc really doesn't work because we can't easily
 determine a root cause.
 
 So is lsf...@linux-foundation.org the best way to accomplish that?

No. That is just the list for organising the LFSMM summit. ;)

For general pagecache and writeback issues, discussions, etc,
linux-fsde...@vger.kernel.org is the list to use. LKML simply has
too much noise to be useful these days, so I'd avoid it. Otherwise
the filesystem specific lists are a good place to get help for
specific problems (e.g. linux-e...@vger.kernel.org and
x...@oss.sgi.com). We tend to cross-post to other relevant lists as
triage moves into different areas of the storage stack.

 Also, along the lines

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 01/15/2014 07:50 AM, Dave Chinner wrote:
  FWIW [and I know you're probably sick of hearing this by now], but
  the blk-io throttling works almost perfectly with applications that
  use direct IO.
 
  For checkpoint writes, direct I/O actually would be reasonable. 
  Bypassing the OS cache is a good thing in that case - we don't want the 
  written pages to evict other pages from the OS cache, as we already have 
  them in the PostgreSQL buffer cache.
 
 But in exchange for that, we'd have to deal with selecting an order to
 write pages that's appropriate depending on the filesystem layout,
 other things happening in the system, etc etc.  We don't want to build
 an I/O scheduler, IMO, but we'd have to.

I don't see that as necessary - nobody else needs to do this with
direct IO. Indeed, if the application does ascending offset order
writeback from within a file, then it's replicating exactly what the
kernel page cache writeback does. If what the kernel does is good
enough for you, then I can't see how doing the same thing with
a background thread doing direct IO is going to need any special
help

  Writing one page at a time with O_DIRECT from a single process might be 
  quite slow, so we'd probably need to use writev() or asynchronous I/O to 
  work around that.
 
 Yeah, and if the system has multiple spindles, we'd need to be issuing
 multiple O_DIRECT writes concurrently, no?
 
 What we'd really like for checkpointing is to hand the kernel a boatload
 (several GB) of dirty pages and say "how about you push all this to disk
 over the next few minutes, in whatever way seems optimal given the storage
 hardware and system situation.  Let us know when you're done."

The issue there is that the kernel has other triggers for needing to
clean data. We have no infrastructure to handle variable writeback
deadlines at the moment, nor do we have any infrastructure to do
roughly metered writeback of such files to disk. I think we could
add it to the infrastructure without too much perturbation of the
code, but as you've pointed out that still leaves the fact there's
no obvious interface to configure such behaviour. Would it need to
be persistent?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
 On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 
  Heikki Linnakangas hlinnakan...@vmware.com writes:
   On 01/15/2014 07:50 AM, Dave Chinner wrote:
   FWIW [and I know you're probably sick of hearing this by now], but
   the blk-io throttling works almost perfectly with applications that
   use direct IO.
 
   For checkpoint writes, direct I/O actually would be reasonable.
   Bypassing the OS cache is a good thing in that case - we don't want the
   written pages to evict other pages from the OS cache, as we already have
   them in the PostgreSQL buffer cache.
 
  But in exchange for that, we'd have to deal with selecting an order to
  write pages that's appropriate depending on the filesystem layout,
  other things happening in the system, etc etc.  We don't want to build
  an I/O scheduler, IMO, but we'd have to.
 
   Writing one page at a time with O_DIRECT from a single process might be
   quite slow, so we'd probably need to use writev() or asynchronous I/O to
   work around that.
 
  Yeah, and if the system has multiple spindles, we'd need to be issuing
  multiple O_DIRECT writes concurrently, no?
 
 
 writev effectively does do that, doesn't it?  But they do have to be on the
 same file handle, so that could be a problem.  I think we need something
 like sorted checkpoints sooner or later, anyway.

No, it doesn't. writev() allows you to supply multiple user buffers
for a single IO at a fixed offset. If the file is contiguous, then it
will be issued as a single IO. If you want concurrent DIO, then you
need to use multiple threads or AIO.

  What we'd really like for checkpointing is to hand the kernel a boatload
  (several GB) of dirty pages and say "how about you push all this to disk
  over the next few minutes, in whatever way seems optimal given the storage
  hardware and system situation.  Let us know when you're done."
 
 And most importantly, "Also, please don't freeze up everything else in the
 process"

If you hand writeback off to the kernel, then writeback for memory
reclaim needs to take precedence over metered writeback. If we are
low on memory, then cleaning dirty memory quickly to avoid ongoing
allocation stalls, failures and potentially OOM conditions is far more
important than anything else.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Robert Haas wrote:
 On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara j...@suse.cz wrote:
  Filesystems could in theory provide facility like atomic write (at least up
  to a certain size say in MB range) but it's not so easy and when there are
  no strong usecases fs people are reluctant to make their code more complex
  unnecessarily. OTOH without widespread atomic write support I understand
  application developers have similar stance. So it's kind of chicken and egg
  problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
  due to its data=journal mode so if someone on the PostgreSQL side wanted to
  research on this, knitting some experimental ext4 patches should be doable.
 
 Atomic 8kB writes would improve performance for us quite a lot.  Full
 page writes to WAL are very expensive.  I don't remember what
 percentage of write-ahead log traffic that accounts for, but it's not
 small.

Essentially, atomic writes will be journalled data, so initially
there is not going to be any difference in performance between
journalling the data in userspace and journalling it in the
filesystem journal. Indeed, it could be worse because the filesystem
journal is typically much smaller than a database WAL file, and it
will flush much more frequently and without the database having any
say in when that occurs.

AFAICT, we're stuck with sucky WAL until block layer and hardware
support atomic writes.

FWIW, I've certainly considered adding per-file data journalling
capabilities to XFS in the past. If we decide that this is the way
to proceed (i.e. as a stepping stone towards hardware atomic write
support), then I can go back to my notes from a few years ago and
see what still needs to be done to support it

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 07:13:27PM -0500, Tom Lane wrote:
 Dave Chinner da...@fromorbit.com writes:
  On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote:
   And most importantly, "Also, please don't freeze up everything else in the
   process"
 
  If you hand writeback off to the kernel, then writeback for memory
  reclaim needs to take precedence over metered writeback. If we are
  low on memory, then cleaning dirty memory quickly to avoid ongoing
  allocation stalls, failures and potentially OOM conditions is far more
  important than anything else.
 
 I think you're in violent agreement, actually.  Jeff's point is exactly
 that we'd rather the checkpoint deadline slid than that the system goes
 to hell in a handbasket for lack of I/O cycles.  Here "metered" really
 means "do it as a low-priority task".

No, I meant the opposite - in low memory situations, the system is
going to go to hell in a handbasket because we are going to cause a
writeback IO storm cleaning memory regardless of these IO
priorities. i.e. there is no way we'll let low priority writeback
to avoid IO storms cause OOM conditions to occur. That is, in OOM
conditions, cleaning dirty pages becomes one of the highest priority
tasks of the system

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Dave Chinner
On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote:
 Dave Chinner da...@fromorbit.com writes:
  On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote:
   What we'd really like for checkpointing is to hand the kernel a boatload
   (several GB) of dirty pages and say "how about you push all this to disk
   over the next few minutes, in whatever way seems optimal given the storage
   hardware and system situation.  Let us know when you're done."
 
  The issue there is that the kernel has other triggers for needing to
  clean data. We have no infrastructure to handle variable writeback
  deadlines at the moment, nor do we have any infrastructure to do
  roughly metered writeback of such files to disk. I think we could
  add it to the infrastructure without too much perturbation of the
  code, but as you've pointed out that still leaves the fact there's
  no obvious interface to configure such behaviour. Would it need to
  be persistent?
 
 No, we'd be happy to re-request it during each checkpoint cycle, as
 long as that wasn't an unduly expensive call to make.  I'm not quite
 sure where such requests ought to live though.  One idea is to tie
 them to file descriptors; but the data to be written might be spread
 across more files than we really want to keep open at one time.

It would be a property of the inode, as that is how writeback is
tracked and timed. Set and queried through a file descriptor,
though - it's basically the same context that fadvise works
through.

 But the only other idea that comes to mind is some kind of global sysctl,
 which would probably have security and permissions issues.  (One thing
 that hasn't been mentioned yet in this thread, but maybe is worth pointing
 out now, is that Postgres does not run as root, and definitely doesn't
 want to.  So we don't want a knob that would require root permissions
 to twiddle.)

I have assumed all along that requiring root to do stuff would be a
bad thing. :)

 We could probably live with serially checkpointing data
 in sets of however-many-files-we-can-have-open, if file descriptors are
 the place to keep the requests.

Inodes live longer than file descriptors, but there's no guarantee
that they live from one fd context to another. Hence my question
about persistence ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
[The kernel] doesn't really know what a device is capable of - it can
only measure what the current IO workload is achieving - and it can
change based on the IO workload characteristics. Hence applications
can track this as well as the kernel does if they need this
information for any reason.

 Reimplementing i/o schedulers and all the rest of the work that the

Nobody needs to reimplement IO schedulers in userspace. Direct IO
still goes through the block layers where all that merging and
IO scheduling occurs.

 kernel provides inside Postgres just seems like something outside our
 competency and that none of us is really excited about doing.

That argument goes both ways - providing fine-grained control over
the page cache contents to userspace doesn't get me excited, either.
In fact, it scares the living daylights out of me. It's complex,
it's fragile and it introduces constraints into everything we do in
the kernel. Any one of those reasons is grounds for saying no to a
proposal, but this idea hits the trifecta

I'm not saying that O_DIRECT is easy or perfect, but it seems to me
to be a more robust, secure, maintainable and simpler solution than
trying to give applications direct control over complex internal
kernel structures and algorithms.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
 On 01/13/2014 02:26 PM, Mel Gorman wrote:
  Really?
  
  zone_reclaim_mode is often a complete disaster unless the workload is
  partitioned to fit within NUMA nodes. On older kernels enabling it would
  sometimes cause massive stalls. I'm actually very surprised to hear it
  fixes anything and would be interested in hearing more about what sort
  of circumstances would convince you to enable that thing.
 
 So the problem with the default setting is that it pretty much isolates
 all FS cache for PostgreSQL to whichever socket the postmaster is
 running on, and makes the other FS cache unavailable.  This means that,
 for example, if you have two memory banks, then only one of them is
 available for PostgreSQL filesystem caching ... essentially cutting your
 available cache in half.

No matter what default NUMA allocation policy we set, there will be
an application for which that behaviour is wrong. As such, we've had
tools for setting application-specific NUMA policies for quite a few
years now. e.g:

$ man 8 numactl

   --interleave=nodes, -i nodes
  Set a memory interleave policy. Memory will be
  allocated using round robin on nodes.  When memory
  cannot be allocated on the current interleave target
  fall back to other nodes.  Multiple nodes may be
  specified on --interleave, --membind and
  --cpunodebind.
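For example, this is how one might start the database under an interleaved memory policy. This is a hypothetical invocation: the data directory path is illustrative, and whether interleaving (rather than, say, binding) is the right policy depends entirely on the workload.

```shell
# Hypothetical example: interleave PostgreSQL's memory allocations (and the
# page cache pages it dirties) across all NUMA nodes, so filesystem caching
# is not confined to the postmaster's node.
# (path and flags are illustrative; adjust for your installation)
numactl --interleave=all pg_ctl start -D /var/lib/pgsql/data
```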

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Tue, Jan 14, 2014 at 02:26:25AM +0100, Andres Freund wrote:
 On 2014-01-13 17:13:51 -0800, James Bottomley wrote:
  a file into a user provided buffer, thus obtaining a page cache entry
  and a copy in their userspace buffer, then insert the page of the user
  buffer back into the page cache as the page cache page ... that's right,
  isn't it, postgres people?
 
 Pretty much, yes. We'd probably hint (*advise(DONTNEED)) that the page
 isn't needed anymore when reading. And we'd normally write if the page
 is dirty.

So why, exactly, do you even need the kernel page cache here? You've
got direct access to the copy of data read into userspace, and you
want direct control of when and how the data in that buffer is
written and reclaimed. Why push that data buffer back into the
kernel and then have to add all sorts of kernel interfaces to
control the page you already have control of?

  Effectively you end up with buffered read/write that's also mapped into
  the page cache.  It's a pretty awful way to hack around mmap.
 
 Well, the problem is that you can't really use mmap() for the things we
 do. Postgres' durability works by guaranteeing that our journal entries
 (called WAL := Write Ahead Log) are written & synced to disk before the
 corresponding entries of tables and indexes reach the disk. That also
 allows us to group together many random writes into a few contiguous
 writes fdatasync()ed at once. Only during a checkpoint phase is the big
 bulk of the data then (slowly, in the background) synced to disk.

Which is the exact algorithm most journalling filesystems use for
ensuring durability of their metadata updates.  Indeed, here's an
interesting piece of architecture that you might like to consider:

* Neither XFS nor BTRFS uses the kernel page cache to back their
  metadata transaction engines.

Why not? Because the page cache is too simplistic to adequately
represent the complex object hierarchies that the filesystems have,
and so its flat LRU reclaim algorithms and writeback control
mechanisms are a terrible fit and cause lots of performance issues
under memory pressure.

IOWs, the two most complex high performance transaction engines in
the Linux kernel have moved to fully customised cache and (direct)
IO implementations because the requirements for scalability and
performance are far more complex than the kernel page cache
infrastructure can provide.

Just food for thought...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
 On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman mgor...@suse.de wrote:
  Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
  setting zone_reclaim_mode; is there some other problem besides that?
 
  Really?
 
  zone_reclaim_mode is often a complete disaster unless the workload is
  partitioned to fit within NUMA nodes. On older kernels enabling it would
  sometimes cause massive stalls. I'm actually very surprised to hear it
  fixes anything and would be interested in hearing more about what sort
  of circumstances would convince you to enable that thing.
 
 By "set" I mean "set to zero".  We've seen multiple instances of
 people complaining about large amounts of system memory going unused
 because this setting defaulted to 1.
 
  The other thing that comes to mind is the kernel's caching behavior.
  We've talked a lot over the years about the difficulties of getting
  the kernel to write data out when we want it to and to not write data
  out when we don't want it to.
 
  Is sync_file_range() broken?
 
 I don't know.  I think a few of us have played with it and not been
 able to achieve a clear win.

Before you go back down the sync_file_range path, keep in mind that
it is not a guaranteed data integrity operation: it does not force
device cache flushes like fsync()/fdatasync() do. Hence it guarantees
neither that the metadata pointing at the written data nor the
volatile caches in the storage path have been flushed...

IOWs, using sync_file_range() does not avoid the need to fsync() a
file for data integrity purposes...

 Whether the problem is with the system
 call or the programmer is harder to determine.  I think the problem is
 in part that it's not exactly clear when we should call it.  So
 suppose we want to do a checkpoint.  What we used to do a long time
 ago is write everything, and then fsync it all, and then call it good.
  But that produced horrible I/O storms.  So what we do now is do the
 writes over a period of time, with sleeps in between, and then fsync
 it all at the end, hoping that the kernel will write some of it before
 the fsyncs arrive so that we don't get a huge I/O spike.
 And that sorta works, and it's definitely better than doing it all at
 full speed, but it's pretty imprecise.  If the kernel doesn't write
 enough of the data out in advance, then there's still a huge I/O storm
 when we do the fsyncs and everything grinds to a halt.  If it writes
 out more data than needed in advance, it increases the total number of
 physical writes because we get less write-combining, and that hurts
 performance, too. 

Yup, the kernel defaults to maximising bulk write throughput, which
means it waits until the last possible moment to issue write IO. And
it does that precisely to maximise write combining, optimise delayed
allocation, etc. There are many good reasons for doing this, and for
the majority of workloads it is the right behaviour to have.

It sounds to me like you want the kernel to start background
writeback earlier so that it doesn't build up as much dirty data
before you require a flush. There are several ways to do this by
tweaking writeback knobs. The simplest is probably just to set
/proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
50MB) and dirty_expire_centisecs to a few seconds' worth so that
background writeback starts and walks all dirty inodes almost
immediately. This will keep a steady stream of low-level background
IO going, and fsync should then not take very long.
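As a hedged sketch of that tuning, the knob settings might look like the following. The 50MB threshold and 3-second expiry are the illustrative figures from the suggestion above, not recommendations; the writes require root and do not persist across reboots (use sysctl.conf for that).

```shell
# Illustrative writeback tuning per the suggestion above.
# Start background writeback once ~50MB of dirty data accumulates
# (setting the _bytes knob overrides dirty_background_ratio):
echo $((50 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes

# Consider dirty data eligible for writeback after ~3 seconds
# (the knob's unit is centiseconds):
echo 300 > /proc/sys/vm/dirty_expire_centisecs
```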

Fundamentally, though, we need bug reports from people seeing these
problems when they see them so we can diagnose them on their
systems. Trying to discuss/diagnose these problems without knowing
anything about the storage, the kernel version, writeback
thresholds, etc really doesn't work because we can't easily
determine a root cause.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Tue, Jan 14, 2014 at 11:40:38AM -0800, Kevin Grittner wrote:
 Robert Haas robertmh...@gmail.com wrote:
  Jan Kara j...@suse.cz wrote:
 
  Just to get some idea about the sizes - how large are the
  checkpoints we are talking about that cause IO stalls?
 
  Big.
 
 To quantify that, in a production setting we were seeing pauses of
 up to two minutes with shared_buffers set to 8GB and default dirty
 page settings for Linux, on a machine with 256GB RAM and 512MB
There's your problem.

By default, background writeback doesn't start until 10% of memory
is dirtied, and on your machine that's 25GB of RAM. That's way too
high for your workload.
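To put a number on that (simple arithmetic, using the default 10% dirty_background_ratio mentioned above):

```shell
# Default background writeback threshold: dirty_background_ratio (10%) of RAM
ram_gb=256
dirty_background_ratio=10
echo "background writeback starts at $((ram_gb * dirty_background_ratio / 100)) GB of dirty data"
```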

It appears to me that we are seeing large memory machines much more
commonly in data centers - a couple of years ago 256GB RAM was only
seen in supercomputers. Hence machines of this size are moving from
the "tweaking settings for supercomputers is OK" class to the
"tweaking settings for enterprise servers is not OK" class...

Perhaps what we need to do is deprecate dirty_ratio and
dirty_background_ratio as the default values and move to the
byte-based values as the defaults, capped appropriately.  e.g.
10/20% of RAM for small machines down to a couple of GB for large
machines...

 non-volatile cache on the RAID controller.  To eliminate stalls we
 had to drop shared_buffers to 2GB (to limit how many dirty pages
 could build up out-of-sight from the OS), spread checkpoints to 90%
 of allowed time (almost no gap between finishing one checkpoint and
 starting the next) and crank up the background writer so that no
 dirty page sat unwritten in PostgreSQL shared_buffers for more than
 4 seconds. Less aggressive pushing to the OS resulted in the
 avalanche of writes I previously described, with the corresponding
 I/O stalls.  We approached that incrementally, and that's the point
 where stalls stopped occurring.  We did not adjust the OS
 thresholds for writing dirty pages, although I know of others who
 have had to do so.

Essentially, changing dirty_background_bytes, dirty_bytes and
dirty_expire_centisecs to be much smaller should make the kernel
start writeback much sooner, and so you shouldn't have to limit the
amount of buffers the application has to prevent major
fsync-triggered stalls...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Wed, Jan 15, 2014 at 08:03:28AM +1300, Gavin Flower wrote:
 On 14/01/14 14:09, Dave Chinner wrote:
 On Mon, Jan 13, 2014 at 09:29:02PM +, Greg Stark wrote:
 On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 [...]
 The more ambitious and interesting direction is to let Postgres tell
 the kernel what it needs to know to manage everything. To do that we
 would need the ability to control when pages are flushed out. This is
 absolutely necessary to maintain consistency. Postgres would need to
 be able to mark pages as unflushable until some point in time in the
 future when the journal is flushed. We discussed various ways that
 interface could work but it would be tricky to keep it low enough
 overhead to be workable.
 IMO, the concept of allowing userspace to pin dirty page cache
 pages in memory is just asking for trouble. Apart from the obvious
 memory reclaim and OOM issues, some filesystems won't be able to
 move their journals forward until the data is flushed. i.e. ordered
 mode data writeback on ext3 will have all sorts of deadlock issues
 that result from pinning pages and then issuing fsync() on another
 file which will block waiting for the pinned pages to be flushed.
 
 Indeed, what happens if you do "pin_dirty_pages(fd); fsync(fd);"?
 If fsync() blocks because there are pinned pages, and there's no
 other thread to unpin them, then that code just deadlocked. If
 fsync() doesn't block and skips the pinned pages, then we haven't
 done an fsync() at all, and so violated the expectation that users
 have that after fsync() returns their data is safe on disk. And if
 we return an error to fsync(), then what the hell does the user do
 if it is some other application we don't know about that has pinned
 the pages? And if the kernel unpins them after some time, then we
 just violated the application's consistency guarantees
 
 [...]
 
 What if Postgres could tell the kernel how strongly that it wanted
 to hold on to the pages?

That doesn't get rid of the problems, it just makes it harder to
diagnose them when they occur. :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Tue, Jan 14, 2014 at 03:03:39PM -0800, Kevin Grittner wrote:
 Dave Chinner da...@fromorbit.com wrote:
 
  Essentially, changing dirty_background_bytes, dirty_bytes and
  dirty_expire_centisecs to be much smaller should make the
  kernel start writeback much sooner and so you shouldn't have to
  limit the amount of buffers the application has to prevent major
  fsync triggered stalls...
 
 Is there any rule of thumb about where to start with these?

There's no absolute rule here, but the threshold for background
writeback needs to consider the amount of dirty data being
generated, the rate at which it can be retired, and the checkpoint
period the application is configured with. i.e. it needs to be slow
enough not to cause serious read IO perturbations, but still fast
enough that it avoids peaks at synchronisation points. And most
importantly, it needs to be fast enough that it can complete
writeback of all the dirty data in a checkpoint before the next
checkpoint is triggered.

In general, I find that threshold to be somewhere around 2-5s worth
of data writeback - enough to keep a good amount of write combining
and the IO pipeline full as work is done, but no more.

e.g. if your workload results in writeback rates of 500MB/s, then
I'd be setting the dirty limit somewhere around 1-2GB as an initial
guess. It's basically a simple trade-off of buffering space against
writeback latency. Some applications perform well with increased
buffering space (e.g. 10-20s of writeback) while others perform
better with extremely low writeback latency (e.g. 0.5-1s).
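That rule of thumb reduces to simple arithmetic. The 500MB/s rate and 3-second buffering window below are the assumed figures from the discussion above, not measured values:

```shell
# Rule-of-thumb sizing sketch: dirty threshold ~= writeback rate x buffering window
rate_mb_per_s=500   # assumed sustained writeback rate of the storage
buffer_seconds=3    # within the 2-5s window suggested above
threshold_mb=$((rate_mb_per_s * buffer_seconds))
echo "suggested dirty_background_bytes ~ ${threshold_mb} MB"
```

which lands at roughly 1.5GB, consistent with the 1-2GB initial guess above.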

   For
 example, should a database server maybe have dirty_background_bytes
 set to 75% of the non-volatile write cache present on the
 controller, in an attempt to make sure that there is always some
 slack space for writes?

I don't think the hardware cache size matters, as such caches are
easy to fill very quickly and so after a couple of seconds the
controller will fall back to disk speed anyway. IMO, what matters is
that the threshold is large enough to adequately buffer writes to
smooth peaks and troughs in the pipeline.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Dave Chinner
On Tue, Jan 14, 2014 at 05:38:10PM -0700, Jonathan Corbet wrote:
 On Wed, 15 Jan 2014 09:23:52 +1100
 Dave Chinner da...@fromorbit.com wrote:
 
  It appears to me that we are seeing large memory machines much more
  commonly in data centers - a couple of years ago 256GB RAM was only
  seen in supercomputers. Hence machines of this size are moving from
  the "tweaking settings for supercomputers is OK" class to the
  "tweaking settings for enterprise servers is not OK" class...

  Perhaps what we need to do is deprecate dirty_ratio and
  dirty_background_ratio as the default values and move to the
  byte-based values as the defaults, capped appropriately.  e.g.
  10/20% of RAM for small machines down to a couple of GB for large
  machines...
 
 I had thought that was already in the works...it hits people on far
 smaller systems than those described here.
 
   http://lwn.net/Articles/572911/
 
 I wonder if anybody ever finished this work out for 3.14?

Not that I know of.  This patch was suggested as the solution to the
slow/fast drive issue that started the whole thread:

http://thread.gmane.org/gmane.linux.kernel/1584789/focus=1587059

but I don't see it in a current kernel. It might be in Andrew's tree
for 3.14, but I haven't checked.

However, most of the discussion in that thread about dirty limits
was a side show that rehashed old territory. Rate limiting and
throttling in a generic, scalable manner is a complex problem. We've
got some of the infrastructure we need to solve the problem, but
there was no conclusion as to the correct way to connect all the
dots.  Perhaps it's another topic for the LSFMM conf?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

