[HACKERS] Database/kernel community topic at Collaboration Summit 2014

2014-03-10 Thread Mel Gorman
tabase integrity.

  Completely nuts and this was not mentioned on the list, but arguably
  you could try implementing something like this as a character device
  that allows MAP_SHARED, with ioctls controlling which file and
  offset back the pages within the mapping.  A new mapping would be
  forced resident and read-only. A write would COW the page. It's a
  crazy way of doing something like this but avoids a lot of overhead.
  Even considering the stupid solution might make the general solution
  a bit more obvious.

  For reference, Tom Lane comprehensively
  described the problems with mmap at
  http://www.Postgres.org/message-id/17515.1389715...@sss.pgh.pa.us

  There were some variants of how something like this could be achieved
  but no finalised proposal at the time of writing.

   9. Hint that a page in an anonymous buffer is a copy of a page cache
   page and invalidate the page cache page on COW. This limits the
   amount of double buffering. It's in as a low priority item as it's
   unclear if it's really necessary and also I suspect the implementation
   would be very heavy because of the amount of information we'd have
   to track in the kernel.

It is important to note in general that Postgres has a problem with some
files being written back too aggressively and other files not written back
aggressively enough. Temp files for purposes such as sorting should have
writeback deferred as long as possible. Data file writes that must complete
before portions of the WAL can be discarded should begin writeback early
so the final fsync does not stall for too long.  As Dave Chinner says:

IOWs, there are two very different IO and caching requirements
in play here and tuning the kernel for one actively degrades the
performance of the other.

Robert Haas categorised the IO patterns as follows:

- WAL files are written (and sometimes read) sequentially and
  fsync'd very frequently and it's always good to write the data
  out to disk as soon as possible

- Temp files are written and read sequentially and never fsync'd.
  They should only be written to disk when memory pressure demands
  it (but are a good candidate when that situation comes up)

- Data files are read and written randomly.  They are fsync'd at
  checkpoint time; between checkpoints, it's best not to write
  them sooner than necessary, but when the checkpoint arrives,
  they all need to get out to the disk without bringing the system
  to a standstill

No matter what, it was pointed out that fsync should never be able to screw
the system. Robert Haas again summarised it as follows:

IMHO, the problem is simpler than that: no single process should
be allowed to completely screw over every other process on the
system.  When the checkpointer process starts calling fsync(), the
system begins writing out the data that needs to be fsync()'d so
aggressively that service times for I/O requests from other processes
go through the roof.  It's difficult for me to imagine that any
application on any I/O scheduler is ever happy with that behavior.
We shouldn't need to sprinkle fsync() calls with special magic
juju sauce that says "hey, when you do this, could you try to avoid
causing the rest of the system to COMPLETELY GRIND TO A HALT?".
That should be the *default* behavior, if not the *only* behavior.

It is important to keep this in mind although sometimes the ordering
requirements of the filesystem may make it impossible to achieve.


At LSF/MM last year there was a discussion on whether userspace should
hint that files are "hot" or "cold" so the underlying layers could decide
to relocate some data to faster storage. I tuned out a bit during the
discussion and did not track what happened with it since but I guess that
any developments of that sort would be of interest to the Postgres community.

Some of these wish lists still need polish but could potentially be
discussed further at LSF/MM with a wider audience as well as on the
lists. Then in a world of unicorns and ponies it's a case of picking some of
these hinting wishlists, seeing what it takes to implement it in kernel
and testing it with a suitably patched version of postgres running a test
case driven by something (pgbench presumably).

-- 
Mel Gorman
SUSE Labs


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Mel Gorman
On Fri, Jan 17, 2014 at 03:24:01PM -0500, Gregory Smith wrote:
> On 1/17/14 10:37 AM, Mel Gorman wrote:
> >There is not an easy way to tell. To be 100%, it would require an
> >instrumentation patch or a systemtap script to detect when a
> >particular page is being written back and track the context. There
> >are approximations though. Monitor nr_dirty pages over time.
> 
> I have a benchmarking wrapper for the pgbench testing program called
> pgbench-tools:  https://github.com/gregs1104/pgbench-tools  As of
> October, on Linux it now plots the "Dirty" value from /proc/meminfo
> over time.
> 

Cheers for pointing that out, I was not previously aware of its
existence. While I have some support for running pgbench via another kernel
testing framework (mmtests) the postgres-based tests are miserable. Right
now for me, pgbench is only set up to reproduce a workload that detected a
scheduler regression in the past so that it does not get reintroduced. I'd
like to have it running IO-based tests even though I typically do not
do proper regression testing for IO. I have used sysbench as a workload
generator before but it's not great for a number of reasons.

> I've been working on the problem of how we can make a benchmark test
> case that acts enough like real busy PostgreSQL servers that we can
> share it with kernel developers, and then everyone has an objective
> way to measure changes.  These rate limited tests are working much
> better for that than anything I came up with before.
> 

This would be very welcome and thanks for the other observations on IO
scheduler parameter tuning. They could potentially be used to evaluate any IO
scheduler changes. For example -- deadline scheduler with these parameters
has X transactions/sec throughput with average latency of Y milliseconds
and a maximum fsync latency of Z seconds. Evaluate how well the out-of-box
behaviour compares against it with and without some set of patches.  At the
very least it would be useful for tracking historical kernel performance
over time and bisecting any regressions that got introduced. Once a test
case exists I think many kernel developers (me at least) can run automated
bisections.
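
To make the shape of such a result concrete, here is a rough sketch that
times a series of write+fsync calls on a scratch file and reports
throughput, average fsync latency and worst-case fsync latency. It is not
taken from pgbench-tools or mmtests; the function name, the 8K block size
and the iteration count are assumptions chosen purely for illustration.

```python
import os
import tempfile
import time


def fsync_latency_report(iterations=50, block_size=8192):
    """Write and fsync a scratch file repeatedly, then summarise the timings."""
    latencies = []
    block = b"x" * block_size
    with tempfile.NamedTemporaryFile() as scratch:
        fd = scratch.fileno()
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, block)
            t0 = time.perf_counter()
            os.fsync(fd)  # the call whose latency we care about
            latencies.append(time.perf_counter() - t0)
        elapsed = time.perf_counter() - start
    return {
        "ops_per_sec": iterations / elapsed,
        "avg_fsync_ms": 1000.0 * sum(latencies) / len(latencies),
        "max_fsync_s": max(latencies),
    }


if __name__ == "__main__":
    print(fsync_latency_report())
```

A real test case would drive PostgreSQL itself; the point here is only the
three numbers that a before/after kernel comparison would record.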

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-20 Thread Mel Gorman
On Mon, Jan 20, 2014 at 10:51:41AM +1100, Dave Chinner wrote:
> On Sun, Jan 19, 2014 at 03:37:37AM +0200, Marti Raudsepp wrote:
> > On Wed, Jan 15, 2014 at 5:34 AM, Jim Nasby  wrote:
> > > it's very common to create temporary file data that will never, ever, ever
> > > actually NEED to hit disk. Where I work being able to tell the kernel to
> > > avoid flushing those files unless the kernel thinks it's got better things
> > > to do with that memory would be EXTREMELY valuable
> > 
> > Windows has the FILE_ATTRIBUTE_TEMPORARY flag for this purpose.
> > 
> > ISTR that there was discussion about implementing something analogous
> > in Linux when ext4 got delayed allocation support, but I don't think
> > it got anywhere and I can't find the discussion now. I think the
> > proposed interface was to create and then unlink the file immediately,
> > which serves as a hint that the application doesn't care about
> > persistence.
> 
> You're thinking about O_TMPFILE, which is for making temp files that
> can't be seen in the filesystem namespace, not for preventing them
> from being written to disk.
> 
> I don't really like the idea of overloading a namespace directive to
> have special writeback connotations. What we are getting into the
> realm of here is generic user controlled allocation and writeback
> policy...
> 

Such overloading would be unwelcome. FWIW, I assumed this would be an
fadvise thing. Initially it would be something that controlled writeback
on an inode rather than an fd context and that ignored the offset and
length parameters. Granted, someone will probably throw a fit about adding
a Linux-specific flag to the fadvise64 syscall. POSIX_FADV_NOREUSE is
currently unimplemented and it could be argued that it could be used to
flag temporary files that have a different writeback policy but it's not
clear if that matches the original intent of the POSIX flag.
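
For illustration, this is roughly what the userspace side of such a hint
might look like if POSIX_FADV_NOREUSE were the flag chosen. The call
itself exists today; the deferred-writeback effect is the hypothetical
part, and on current kernels the hint changes nothing. The helper name is
made up for this sketch.

```python
import os
import tempfile


def open_temp_with_noreuse_hint(directory=None):
    """Create an anonymous temp file and flag it with POSIX_FADV_NOREUSE."""
    tmp = tempfile.TemporaryFile(dir=directory)
    # offset=0 and length=0 mean "the whole file" for posix_fadvise().
    # Assumption: a future kernel would read this as "defer writeback";
    # today it is accepted but has no such effect.
    os.posix_fadvise(tmp.fileno(), 0, 0, os.POSIX_FADV_NOREUSE)
    return tmp
```

Note the hint is given per fd with an offset and length, whereas the idea
above is a per-inode policy that ignores both.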

> > Postgres is far from being the only application that wants this; many
> > people resort to tmpfs because of this:
> > https://lwn.net/Articles/499410/
> 
> Yes, we covered the possibility of using tmpfs much earlier in the
> thread, and came to the conclusion that temp files can be larger
> than memory so tmpfs isn't the solution here. :)
> 

And swap IO patterns blow chunks because people rarely want to touch
that area of the code with a 50 foot pole. It gets filed under "if you're
swapping, you already lost"

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Re: Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)

2014-01-20 Thread Mel Gorman
On Fri, Jan 17, 2014 at 11:01:15AM -0800, Josh Berkus wrote:
> Mel,
> 

Hi,

> So we have a few interested parties.  What do we need to do to set up
> the Collab session?
> 

This is great and thanks!

There are two summits of interest here -- LSF/MM which will have all the
filesystem, storage and memory management people at it on March 24-25th
and Collaboration Summit which is on March 26-28th. We're interested in
both.

The LSF/MM committee are going through the first round of topic proposals at
the moment and we're aiming to send out the first set of invites soon. We're
hoping to invite two PostgreSQL people to LSF/MM itself for the dedicated
topic and your feedback on other topics and how they may help or hinder
PostgreSQL would be welcomed.

As LSF/MM is a relatively closed forum I'll be looking into having a
follow-up discussion at Collaboration Summit that is open to a wider and
more dedicated group. That hopefully will result in a small number of
concrete proposals that can be turned into patches over time.

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Re: Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)

2014-01-17 Thread Mel Gorman
On Fri, Jan 17, 2014 at 06:14:37PM +0100, Andres Freund wrote:
> Hi Mel,
> 
> On 2014-01-17 16:31:48 +0000, Mel Gorman wrote:
> > Direct IO, buffered IO, double buffering and wishlists
> > --
> >3. Hint that a page should be dropped immediately when IO completes.
> >   There is already something like this buried in the kernel internals
> >   and sometimes called "immediate reclaim" which comes into play when
> >   pages are bgin invalidated. It should just be a case of investigating
> >   if that is visible to userspace, if not why not and do it in a
> >   semi-sensible fashion.
> 
> "bgin invalidated"?
> 

s/bgin/being/

I admit that "invalidated" in this context is very vague and I did
not explain myself. This paragraph should remind anyone familiar with
VM internals about what happens when invalidate_mapping_pages calls
deactivate_page and how PageReclaim pages are treated by both page reclaim
and the end_page_writeback handler. It's similar but not identical to what
Postgres wants and is a reasonable starting position for an implementation.

> Generally, +1 on the capability to achieve such a behaviour from
> userspace.
> 
> >7. Allow userspace process to insert data into the kernel page cache
> >   without marking the page dirty. This would allow the application
> >   to request that the OS use the application copy of data as page
> >   cache if it does not have a copy already. The difficulty here
> >   is that the application has no way of knowing if something else
> >   has altered the underlying file in the meantime via something like
> >   direct IO. Granted, such activity has probably corrupted the database
> >   already but initial reactions are that this is not a safe interface
> >   and there are coherency concerns.
> 
> I was one of the people suggesting that capability in this thread (after
> pondering about it at the back of my mind for quite some time), and I
> first thought it would never be acceptable for pretty much those
> reasons.
> But on second thought I don't think that line of argument makes too much
> sense. If such an API would require write permissions on the file -
> which it surely would - it wouldn't allow an application to do anything
> it previously wasn't able to.
> And I don't see the dangers of concurrent direct IO as anything
> new. Right now the page's contents reside in userspace memory and aren't
> synced in any way with either the page cache or the actual on disk
> state. And afaik there are already several data races if a file is
> modified and read both via the page cache and direct io.
> 

All of this is true.  The objections may not hold up over time and it may
seem much more reasonable when/if the easier stuff is addressed.

> The scheme that'd allow us is the following:
> When postgres reads a data page, it will continue to first look up the
> page in its shared buffers, if it's not there, it will perform a page
> cache backed read, but instruct that read to immediately remove from the
> page cache afterwards (new API or, posix_fadvise() or whatever).
> As long
> as it's in shared_buffers, postgres will not need to issue new reads, so
> there's no benefit keeping it in the page cache.
> If the page is dirtied, it will be written out normally telling the
> kernel to forget about caching the page (using 3) or possibly direct
> io).
> When a page in postgres's buffers (which wouldn't be set to very large
> values) isn't needed anymore and *not* dirty, it will seed the kernel
> page cache with the current data.
> 

Ordinarily the initial read page could be discarded with fadvise but
the later write would cause the data to be read back in again which is a
waste. The details of avoiding that re-read are tricky from a core kernel
perspective because ordinarily the kernel at that point does not know if
the write is a full complete aligned write of an underlying filesystem
structure or not.  It may need a different write path which potentially
leads into needing changes to the address_space operations on a filesystem
basis -- that would get messy and be a Linux-specific extension. I have
not researched this properly at all, I could be way off but I have a
feeling the details get messy.
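
As a sketch of the read side only, this is what "read it, then discard
it with fadvise" looks like with the interfaces that exist today. The
function name is an assumption and 8192 is PostgreSQL's default page size.
POSIX_FADV_DONTNEED is advisory, so the kernel is free to ignore it, and
it does not drop dirty pages.

```python
import os

BLOCK_SIZE = 8192  # PostgreSQL's default page size


def read_block_and_drop(fd, block_no, block_size=BLOCK_SIZE):
    """Read one block through the page cache, then hint that it can be dropped."""
    offset = block_no * block_size
    data = os.pread(fd, block_size, offset)
    # Advisory only: asks the kernel to discard cached pages in this range.
    os.posix_fadvise(fd, offset, block_size, os.POSIX_FADV_DONTNEED)
    return data
```

Nothing here addresses the later write of the same block, which is where
the messy details described above come in.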

> Now, such a scheme wouldn't likely be zero-copy, but it would avoid
> double buffering.

It wouldn't be zero copy because minimally the data needs to be handed
over to the filesystem for writing to the disk and the interface for that is
offset,length based, not page based. Maybe sometimes it will be zero copy
but it would be a filesystem-s

[HACKERS] Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance (summary v2 2014-1-17)

2014-01-17 Thread Mel Gorman
On Wed, Jan 15, 2014 at 02:14:08PM +, Mel Gorman wrote:
> > One assumption would be that Postgres is perfectly happy with the current
> > kernel behaviour in which case our discussion here is done.
> 
> It has been demonstrated that this statement was farcical.  The thread is
> massive just from interaction with the LSF/MM program committee. I'm hoping
> that there will be Postgres representation at LSF/MM this year to bring
> the issues to a wider audience. I expect that LSF/MM can only commit to
> one person attending the whole summit due to limited seats but we could
> be more flexible for the Postgres track itself so informal meetings
> can be arranged for the evenings and at collab summit.
> 

We still have not decided on a person that can definitely attend but we'll
get back to that shortly. I wanted to revise the summary mail so that
there is a record that can be easily digested without trawling through
archives. As before if I missed something important, prioritised poorly
or emphasised incorrectly then shout at me.

On testing of modern kernels
----------------------------

Josh Berkus claims that most people are using Postgres with 2.6.19 and
consequently there may be poor awareness of recent kernel developments.
This is a disturbingly large window of opportunity for problems to have
been introduced.

Minimally, Postgres has concerns about IO-related stalls which may or may
not exist in current kernels. There were indications that large writes
starve reads. There have been variants of this style of bug in the past but
it's unclear what the exact shape of this problem is and if IO-less dirty
throttling affected it. It is possible that Postgres was burned in the past
by data being written back from reclaim context in low memory situations.
That would have looked like massive stalls with drops in IO throughput
but it was fixed in relatively recent kernels. Any data on historical
tests would be helpful. Alternatively, a pgbench-based reproduction test
could potentially be used by people in the kernel community that track
performance over time and have access to a suitable testing rig.

Postgres bug reports and LKML
-----------------------------

It is claimed that LKML does not welcome bug reports but it's less clear
what the basis of this claim is.  Is it because the reports are ignored? A
possible explanation is that they are simply getting lost in the LKML noise
and there would be better luck if the bug report was cc'd to a specific
subsystem list. A second possibility is the bug report is against an old
kernel and unless it is reproduced on a recent kernel the bug report will
be ignored. Finally it is possible that there is not enough data available
to debug the problem. The worst explanation is that to date the problem
has not been fixable but the details of this have been lost and are now
unknown. Is it possible that some of these bug reports can be refreshed
so at least there is a chance they get addressed?

Apparently there were changes to the reclaim algorithms that crippled
performance without any sysctls. The problem may be compounded by the
introduction of adaptive replacement cache in the shape of the thrash
detection patches currently being reviewed.  Postgres investigated the
use of ARC in the past and ultimately abandoned it. Details are in the
archives (http://www.Postgres.org/search/?m=1&q=arc&l=1&d=-1&s=r). I
have not read them, just noting they exist for future reference.

Sysctls to control VM behaviour are not popular as such tuning parameters
are often used as an excuse to not properly fix the problem. Would it be
possible to describe a test case that shows 2.6.19 performing well and a
modern kernel failing? That would give the VM people a concrete basis to
work from to either fix the problem or identify exactly what sysctls are
required to make this work.

I am confident that any bug related to VM reclaim in this area has been lost.
At least, I recall no instances of it being discussed on linux-mm and it
has not featured at LSF/MM in recent years.

IO Scheduling
-------------

Kevin Grittner has stated that it is known that the DEADLINE and NOOP
schedulers perform better than any alternatives for most database loads.
It would be desirable to quantify this for some test case and see whether
the default scheduler can cope in some way.

The deadline scheduler makes sense to a large extent though. Postgres
is sensitive to large latencies due to IO write spikes. It is at least
plausible that deadline would give more deterministic behaviour for
parallel reads in the presence of large writes assuming there were not
ordering problems between the reads/writes and the underlying filesystem.

For reference, these IO spikes can be massive. If the shared buffer is
completely dirtied in a short space of time then it could be 20-25% of
RAM being dirtied and writeback required in typical configuratio

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-17 Thread Mel Gorman
On Thu, Jan 16, 2014 at 04:30:59PM -0800, Jeff Janes wrote:
> On Wed, Jan 15, 2014 at 2:08 AM, Mel Gorman  wrote:
> 
> > On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > > >
> > > > That could be something we look at. There are cases buried deep in the
> > > > VM where pages get shuffled to the end of the LRU and get tagged for
> > > > reclaim as soon as possible. Maybe you need access to something like
> > > > that via posix_fadvise to say "reclaim this page if you need memory but
> > > > leave it resident if there is no memory pressure" or something similar.
> > > > Not exactly sure what that interface would look like or offhand how it
> > > > could be reliably implemented.
> > > >
> > >
> > > I think the "reclaim this page if you need memory but leave it resident
> > > if there is no memory pressure" hint would be more useful for temporary
> > > working files than for what was being discussed above (shared buffers).
> > > When I do work that needs large temporary files, I often see physical
> > > write IO spike but physical read IO does not.  I interpret that to mean
> > > that the temporary data is being written to disk to satisfy either
> > > dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> > > cache and so disk reads are not needed to satisfy it.  So a hint that
> > > says "this file will never be fsynced so please ignore dirty_*bytes and
> > > dirty_expire_centisecs.
> >
> > It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
> > were the problem here.
> 
> 
> Is there an easy way to tell?  I would guess it has to be at least
> dirty_expire_centisecs, if not both, as a very large sort operation takes a
> lot more than 30 seconds to complete.
> 

There is not an easy way to tell. To be 100% sure, it would require an
instrumentation patch or a systemtap script to detect when a particular page
is being written back and track the context. There are approximations though.
Monitor nr_dirty pages over time. If at the time of the stall there are fewer
dirty pages than allowed by dirty_ratio then dirty_expire_centisecs
kicked in. Alternatively, monitor the process for stalls and, when it stalls,
check /proc/PID/stack to see if it's stuck in balance_dirty_pages or something
similar, which would indicate the process hit dirty_ratio.
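
The first approximation can be sketched in a few lines. This assumes a
kernel whose /proc/vmstat exports both nr_dirty and nr_dirty_threshold;
the function names and the wording of the verdicts are made up for the
example, and a single sample is only a rough guide.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def likely_writeback_trigger(counters):
    """Guess what is driving writeback from one /proc/vmstat sample."""
    if counters["nr_dirty"] >= counters["nr_dirty_threshold"]:
        return "dirty_ratio/dirty_bytes"
    return "dirty_expire_centisecs (or background writeback)"


def sample(path="/proc/vmstat"):
    """Read and parse the live counters."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Calling sample() once a second during a large sort and logging the verdict
next to the observed stalls would give a first indication without
patching anything.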

> > An interface that forces a dirty page to stay dirty
> > regardless of the global system would be a major hazard. It potentially
> > allows the creator of the temporary file to stall all other processes
> > dirtying pages for an unbounded period of time.
> 
> Are the dirty ratio/bytes limits the mechanisms by which adequate clean
> memory is maintained? 

Yes, for file-backed pages.

> I thought those were there just to put a limit on
> how long it would take to execute a sync call should one be issued, and there
> were other settings which said how much clean memory to maintain.  It should
> definitely write out the pages if it needs the memory for other things,
> just not write them out due to fear of how long it would take to sync it if
> a sync was called.  (And if it needs the memory, it should be able to write
> it out quickly as the writes would be mostly sequential, not
> random--although how the kernel can believe me that that will always be the
> case could be a problem)
> 

It has been suggested on more than one occasion that a more sensible
interface would be to "do not allow more dirty data than it takes N seconds
to writeback". The details of how to implement this are tricky and no one
has taken up the challenge yet.
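
The arithmetic behind such an interface is trivial; the hard part is
knowing the writeback bandwidth, which this sketch simply assumes is
given. The function name is hypothetical.

```python
def dirty_limit_bytes(writeback_bytes_per_sec, max_seconds):
    """Largest amount of dirty data that can be flushed within max_seconds."""
    return int(writeback_bytes_per_sec * max_seconds)


if __name__ == "__main__":
    # A device sustaining 100MB/s with a 5 second budget allows 500MB dirty.
    print(dirty_limit_bytes(100 * 1024 * 1024, 5))
```

Real devices do not have one bandwidth figure: it depends on how
sequential the dirty data is and on what else is hitting the disk, which
is a large part of why the implementation is tricky.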

> > I proposed in another part
> > of the thread a hint for open inodes to have the background writer thread
> > ignore dirty pages belonging to that inode. Dirty limits and fsync would
> > still be obeyed. It might also be workable for temporary files but the
> > proposal could be full of holes.
> >
> 
> If calling fsync would fail with an error, would that lower the risk of DoS?
> 

I do not understand the proposal. If there are pages that must remain
dirty and that the kernel cannot touch then there will be the risk that
dirty_ratio number of pages are all untouchable and the system livelocks
until userspace takes an action.

That still leaves the possibility of flagging temp pages that should
only be written to disk if the kernel really needs to.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
On Wed, Jan 15, 2014 at 10:16:27AM -0500, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman  wrote:
> > That applies if the dirty pages are forced to be kept dirty. You call
> > this pinned but pinned has special meaning so I would suggest calling it
> > something like dirty-sticky pages. It could be the case that such hinting
> > will have the pages excluded from dirty background writing but can still
> > be cleaned if dirty limits are hit or if fsync is called. It's a hint,
> > not a forced guarantee.
> >
> > It's still a hand grenade if this is tracked on a per-page basis
> > because what happens if the process crashes? Those pages stay dirty
> > potentially forever. An alternative would be to track this on a per-inode
> > instead of per-page basis. The hint would only exist where there is an
> > open fd for that inode.  Treat it as a privileged call with a sysctl
> > controlling how many dirty-sticky pages can exist in the system with the
> > information presented during OOM kills and maybe it starts becoming a bit
> > more manageable. Dirty-sticky pages are not guaranteed to stay dirty
> > until userspace action, the kernel just stays away until there are no
> > other sensible options.
> 
> I think this discussion is vividly illustrating why this whole line of
> inquiry is a pile of fail.  If all the processes that have the file
> open crash, the changes have to be *thrown away* not written to disk
> whenever the kernel likes.
> 

I realise that now and sorry for the noise.

I later read the parts of the thread that covered the strict ordering
requirements and in a summary mail I split the requirements in two. In one,
there are dirty-sticky pages that the kernel should not writeback unless
it has no other option or fsync is called. This may be suitable for large
temporary files that Postgres does not necessarily want to hit the platter
but also does not have strict ordering requirements for. The second is to
have pages that are strictly kept dirty until the application syncs them. An
unbounded number of these pages would blow up but maybe bounds could be
placed on it. There are no solid conclusions on that part yet.

-- 
Mel Gorman
SUSE Labs




[HACKERS] Re: Linux kernel impact on PostgreSQL performance (summary v1 2014-1-15)

2014-01-15 Thread Mel Gorman
aid above the linux file system is doing fine. What we
want is a few ways to interact with it to let it do even better
when working with Postgres by telling it some stuff it otherwise
would have to second guess and by sometimes giving it back some
cache pages which were copied away for potential modifying but
ended up clean in the end.

And let the linux kernel decide if and how long to keep these pages
in its  cache using its superior knowledge of disk subsystem and
about what else is going on in the system in general.

   5. Allow copy-on-write of page-cache pages to anonymous. This would limit
  the double ram usage to some extent. It's not as simple as having a
  MAP_PRIVATE mapping of a file-backed page because presumably they want
  this data in a shared buffer shared between Postgres processes. The
  implementation details of something like this are hairy because it's
  mmap()-like but not mmap() as it does not have the same writeback
  semantics due to the write ordering requirements Postgres has for
  database integrity.

  Completely nuts and this was not mentioned on the list, but arguably
  you could try implementing something like this as a character device
  that allows MAP_SHARED, with ioctls controlling which file and
  offset back the pages within the mapping.  A new mapping would be
  forced resident and read-only. A write would COW the page. It's a
  crazy way of doing something like this but avoids a lot of overhead.
  Even considering the stupid solution might make the general solution
  a bit more obvious.

  For reference, Tom Lane comprehensively
  described the problems with mmap at
  http://www.Postgres.org/message-id/17515.1389715...@sss.pgh.pa.us

  There were some variants of how something like this could be achieved
  but no finalised proposal at the time of writing.
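
For comparison, this is the existing MAP_PRIVATE behaviour that item 5
says is not sufficient on its own. It is a sketch with a made-up function
name. A write to the private mapping is copied to an anonymous page and
never reaches the file, but the copy is visible only to the process that
wrote it, which is no use for a buffer shared between Postgres processes.

```python
import mmap
import os
import tempfile


def private_mapping_demo():
    """Show that writes to a MAP_PRIVATE mapping never reach the file."""
    with tempfile.TemporaryFile() as f:
        f.write(b"A" * mmap.PAGESIZE)
        f.flush()
        # ACCESS_COPY maps the file copy-on-write (MAP_PRIVATE on Unix).
        with mmap.mmap(f.fileno(), mmap.PAGESIZE, access=mmap.ACCESS_COPY) as m:
            m[0:1] = b"B"  # triggers COW of the page-cache page
            in_mapping = m[0:1]
        on_disk = os.pread(f.fileno(), 1, 0)
    return in_mapping, on_disk


if __name__ == "__main__":
    print(private_mapping_demo())
```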

Not all of these suggestions are viable but some are more viable than
others. Ultimately we would still need a test case showing the benefit
even if that depends on a Postgres patch taking advantage of a new
feature.

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
On Mon, Jan 13, 2014 at 02:19:56PM -0800, James Bottomley wrote:
> On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
> > On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> > > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > > > Well, if we were to collaborate with the kernel community on this then
> > > > presumably we can do better than that for eviction... even to the
> > > > extent of "here's some data from this range in this file. It's (clean|
> > > > dirty). Put it in your cache. Just trust me on this."
> > > 
> > > This should be the madvise() interface (with MADV_WILLNEED and
> > > MADV_DONTNEED) is there something in that interface that is
> > > insufficient?
> > 
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces).
> 
> I understand, that's why you get double buffering: because we can't
> replace a page in the range you give us on read/write.  However, you
> don't have to switch entirely to mmap: you can use mmap/madvise
> exclusively for cache control and still use read/write (and still pay
> the double buffer penalty, of course).  It's only read/write with
> directio that would cause problems here (unless you're planning to
> switch to DIO?).
> 

There are hazards with using mmap/madvise that may or may not be a problem
for them. I think these are well known but just in case:

mmap/munmap intensive workloads may get hammered on taking mmap_sem for
write. The greatest costs are incurred if the application is threaded
and the parallel threads are fault-intensive. I do not think this is the
case for PostgreSQL as it is process based but it is a concern. Even if it's
a single-threaded process, the cost of the mmap_sem cache line bouncing
can be a concern. Outside of that, the mmap/munmap paths are just really
costly and take a lot of work.

madvise has different hazards but let's take DONTNEED as an example because
it's the most likely candidate for use. A DONTNEED hint has three potential
downsides. The first is that mmap_sem taken for read can be very costly
for threaded applications as the cache line bounces. On NUMA machines it
can be a major problem for madvise-intensive workloads. The second is that
the page table teardown frees the pages with the associated costs but most
importantly, an IPI is required afterwards to flush the TLB. If that process
has been running on a lot of different CPUs then the IPI cost can be very
high. The third hazard is that a madvise(DONTNEED) region will incur page
faults on the next accesses, again hammering into mmap_sem and paying all
the costs associated with faulting (allocating the same pages again,
zeroing etc).
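
To make the third hazard concrete, here is a minimal Python sketch. It
assumes Linux and Python 3.8 or later, and the function name is made up for
illustration. It dirties a private anonymous mapping, drops it with
MADV_DONTNEED and reads it back. The second read has to fault every page in
again, and for anonymous memory the pages come back zero-filled.

```python
import mmap

PAGE = mmap.PAGESIZE


def dontneed_roundtrip(npages=4):
    """Fill a private anonymous mapping, drop it, then read it back."""
    length = npages * PAGE
    # MAP_PRIVATE matters here: Python's default for fileno=-1 is a shared
    # mapping, which would typically keep its contents across MADV_DONTNEED.
    buf = mmap.mmap(-1, length,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        buf[:] = b"x" * length            # fault every page in and dirty it
        before = buf[:]
        buf.madvise(mmap.MADV_DONTNEED)   # page tables torn down, pages freed
        after = buf[:]                    # every page faults in again
        return before, after
    finally:
        buf.close()
```

The data written before the hint is gone afterwards, so this is only
suitable for memory whose contents are no longer needed. The sketch says
nothing about the mmap_sem or IPI costs, which need a multi-threaded test
case on real hardware to measure.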

It may be the case that mmap/madvise is still required to handle a double
buffering problem but it's far from being a free lunch and it has costs
that read/write does not have to deal with. Maybe some of these problems
can be fixed or mitigated but this is a case where we need a test case that
demonstrates the problem, even if that requires patching PostgreSQL.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
On Wed, Jan 15, 2014 at 09:44:21AM +, Mel Gorman wrote:
> > 
> > H.  What happens if the process crashes after pinning the dirty
> > pages? How do we even know what process pinned the dirty pages so
> > we can clean up after it? What happens if the same page is pinned by
> > multiple processes? What happens on truncate/hole punch if the
> > partial pages in the range that need to be zeroed and written are
> > pinned? What happens if we do direct IO to a range with pinned,
> > unflushable pages in the page cache?
> > 
> 
> Proposal: A process with an open fd can hint that pages managed by this
>   inode will have dirty-sticky pages. Pages will be ignored by
>   dirty background writing unless there is an fsync call or
>   dirty page limits are hit. The hint is cleared when no process
>   has the file open.
> 

I'm still processing the rest of the thread and putting it into my head
but it's at least clear that this proposal would only cover the case where
large temporary files are created that do not necessarily need to be
persisted. They still have cases where the ordering of writes matters and
the kernel cleaning pages behind their back would lead to corruption.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > > What's not so simple, is figuring out what policy to use. Remember,
> > > you cannot tell the kernel to put some page in its page cache without
> > > reading it or writing it. So, once you make the kernel forget a page,
> > > evicting it from shared buffers becomes quite expensive.
> >
> > posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
> > forcing readahead.
> 
> 
> But telling the kernel to forget a page, then telling it to read it in
> again from disk because it might be needed again in the near future is
> itself very expensive.  We would need to hand the page to the kernel so it
> has it without needing to go to disk to get it.
> 

Yes, this is the unnecessary IO cost I was thinking of.

> 
> > If you evict it prematurely then you do get kinda
> > screwed because you pay the IO cost to read it back in again even if you
> > had enough memory to cache it. Maybe this is the type of kernel-postgres
> > interaction that is annoying you.
> >
> > If you don't evict, the kernel eventually steps in and evicts the wrong
> > thing. If you do evict and it was unnecessary you pay an IO cost.
> >
> > That could be something we look at. There are cases buried deep in the
> > VM where pages get shuffled to the end of the LRU and get tagged for
> > reclaim as soon as possible. Maybe you need access to something like
> > that via posix_fadvise to say "reclaim this page if you need memory but
> > leave it resident if there is no memory pressure" or something similar.
> > Not exactly sure what that interface would look like or offhand how it
> > could be reliably implemented.
> >
> 
> I think the "reclaim this page if you need memory but leave it resident if
> there is no memory pressure" hint would be more useful for temporary
> working files than for what was being discussed above (shared buffers).
>  When I do work that needs large temporary files, I often see physical
> write IO spike but physical read IO does not.  I interpret that to mean
> that the temporary data is being written to disk to satisfy either
> dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> cache and so disk reads are not needed to satisfy it.  So a hint that says
> "this file will never be fsynced so please ignore dirty_*bytes and
> dirty_expire_centisecs. 

It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
were the problem here. An interface that forces a dirty page to stay dirty
regardless of the global system state would be a major hazard. It potentially
allows the creator of the temporary file to stall all other processes
dirtying pages for an unbounded period of time. I proposed in another part
of the thread a hint for open inodes to have the background writer thread
ignore dirty pages belonging to that inode. Dirty limits and fsync would
still be obeyed. It might also be workable for temporary files but the
proposal could be full of holes.
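
As a starting point for finding out which limit is involved, the current
writeback settings can be read from /proc/sys/vm. This is a minimal sketch
assuming Linux; the function name and the choice of knobs are mine, and it
only reports the configuration. It does not show which limit actually
triggered a given write.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")
KNOBS = (
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
    "dirty_background_ratio",
    "dirty_background_bytes",
    "dirty_ratio",
    "dirty_bytes",
)


def read_dirty_knobs(vm_dir=VM_DIR):
    """Return {knob: int} for the knobs that exist and can be read."""
    values = {}
    for name in KNOBS:
        try:
            values[name] = int((Path(vm_dir) / name).read_text().strip())
        except (OSError, ValueError):
            continue  # not Linux, restricted container or unexpected format
    return values
```

Recording these values alongside the observed write IO spikes would make a
report much easier to act on.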

Your alternative here is to create a private anonymous mapping as it
is not subject to dirty limits. This is only a sensible option if the
temporary data is guaranteed to be relatively small. If the shared
buffers, page cache and your temporary data exceed the size of RAM then
data will get discarded or your temporary data will get pushed to swap
and performance will hit the floor.
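
A minimal sketch of the anonymous mapping alternative, assuming a Unix-like
system and Python 3. The helper names are hypothetical and the data is
assumed to be small enough to fit comfortably in RAM.

```python
import mmap


def scratch_buffer(nbytes):
    """Private anonymous memory: no backing file, so no file writeback."""
    return mmap.mmap(-1, nbytes,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                     prot=mmap.PROT_READ | mmap.PROT_WRITE)


def stage_temp_data(chunks):
    """Stage temporary chunks in anonymous memory and read them back."""
    chunks = list(chunks)
    total = sum(len(c) for c in chunks)
    if total == 0:
        return b""           # a zero-length mapping is not allowed
    buf = scratch_buffer(total)
    try:
        for chunk in chunks:
            buf.write(chunk)  # lands in anonymous pages, not the page cache
        buf.seek(0)
        return buf.read(total)
    finally:
        buf.close()           # the memory is released here, nothing hits disk
```

For example, stage_temp_data([b"abc", b"de"]) returns b"abcde". Under
memory pressure these pages can only go to swap, which is the failure mode
described above.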

FWIW, the performance of some IO "benchmarks" used to depend on whether they
could create, write and delete files before any of the data actually hit
the disk -- pretty much exactly the type of behaviour you are looking for.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-15 Thread Mel Gorman
d writing unless there is an fsync call or
dirty page limits are hit. The hint is cleared when no process
has the file open.

If the process crashes, the hint is cleared and the pages get cleaned as
normal.

Multiple processes do not matter as such as all of them will have the file
open. There is a problem if the processes disagree on whether the pages
should be dirty sticky or not. The default would be that a sticky-dirty
hint takes priority although it does mean that a potentially unprivileged
process can cause problems. There would be security concerns here that
have to be taken into account.

fsync and truncate both override the hint. fsync will write the pages,
truncate will discard them.

If there is direct IO on the range then force the sync, invalidate the
page cache, initiate the direct IO as normal.
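
The rules above can be written down as a toy model. To be clear, this is
not a kernel interface: no such hint exists, and the class and method names
are invented purely to make the proposed semantics easier to poke holes in.
It tracks dirty pages of one inode as a set of page indexes.

```python
class StickyInode:
    """Toy model of the proposed dirty-sticky hint (hypothetical, not an API)."""

    def __init__(self):
        self.openers = set()   # pids that have the file open
        self.sticky = False    # the per-inode hint
        self.dirty = set()     # page indexes dirty in the page cache
        self.written = set()   # page indexes that reached storage

    def open(self, pid):
        self.openers.add(pid)

    def hint_sticky(self, pid):
        if pid not in self.openers:
            raise ValueError("hint requires an open fd")
        self.sticky = True     # a sticky request from any opener wins

    def close(self, pid):
        # Also covers a crash: the hint goes away with the last opener.
        self.openers.discard(pid)
        if not self.openers:
            self.sticky = False

    def write(self, page):
        self.dirty.add(page)

    def _flush(self):
        count = len(self.dirty)
        self.written |= self.dirty
        self.dirty.clear()
        return count

    def background_writeback(self, over_dirty_limit=False):
        # Sticky inodes are skipped unless dirty page limits are hit.
        if self.sticky and not over_dirty_limit:
            return 0
        return self._flush()

    def fsync(self):
        return self._flush()   # fsync overrides the hint and writes

    def truncate(self, first_removed_page):
        # truncate overrides the hint and discards the pages.
        self.dirty = {p for p in self.dirty if p < first_removed_page}
        self.written = {p for p in self.written if p < first_removed_page}

    def direct_io(self):
        # Force the sync first; invalidation and the IO itself are not modelled.
        return self._flush()
```

The model makes the unprivileged-process concern easy to see: a single
hint_sticky() call from any opener changes writeback behaviour for every
other process using the file.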

At least one major downside is that the performance will depend on system
parameters and be non-deterministic, particularly in comparison to direct IO.

> These are all complex corner cases that are introduced by allowing
> applications to pin dirty pages in memory. I've only spent a few
> minutes coming up with these, and I'm sure there's more of them.
> As such, I just don't see that allowing userspace to pin dirty
> page cache pages in memory being a workable solution.
> 

From what I've read so far, I'm not convinced they are looking for a
hard *pin* as such. They want better control over the how and the when
of writeback, not absolute control.  I somewhat sympathise with their
reluctance to use direct IO when the kernel should be able to get them most,
if not all, of the potential performance.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread Mel Gorman
On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
> On 01/13/2014 02:26 PM, Mel Gorman wrote:
> > Really?
> > 
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it would
> > sometimes cause massive stalls. I'm actually very surprised to hear it
> > fixes anything and would be interested in hearing more about what sort
> > of circumstances would convince you to enable that thing.
> 
> So the problem with the default setting is that it pretty much isolates
> all FS cache for PostgreSQL to whichever socket the postmaster is
> running on, and makes the other FS cache unavailable. 

I'm not being pedantic but the default depends on the NUMA characteristics of
the machine so I need to know if it was enabled or disabled. Some machines
will default zone_reclaim_mode to 0 and others will default it to 1. In my
experience the majority of bugs that involved zone_reclaim_mode were due
to zone_reclaim_mode enabled by default.  If I see a bug that involves
a file-based workload on a NUMA machine with stalls and/or excessive IO
when there is plenty of memory free then zone_reclaim_mode is the first
thing I check.
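
Checking it is a one-line read of a sysctl file. A minimal sketch, assuming
Linux with /proc mounted; the function names are made up for illustration.

```python
from pathlib import Path


def zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None  # not Linux, no NUMA support or a restricted container


def describe(mode):
    """Turn the raw value into a short human-readable summary."""
    if mode is None:
        return "unknown (file missing or unreadable)"
    if mode == 0:
        return "disabled"
    return "enabled (bitmask %d)" % mode
```

Including the output of describe(zone_reclaim_mode()) in a bug report
answers the enabled-or-disabled question up front.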

I'm guessing from context that in your experience it gets enabled by default
on the machines you care about. This would indeed limit FS cache usage to
the node where the process is initiating IO (postmaster I guess).

> This means that,
> for example, if you have two memory banks, then only one of them is
> available for PostgreSQL filesystem caching ... essentially cutting your
> available cache in half.
> 
> And however slow moving cached pages between memory banks is, it's an
> order of magnitude faster than moving them from disk.  But this isn't
> how the NUMA stuff is configured; it seems to assume that it's less
> expensive to get pages from disk than to move them between banks, so

Yes, this is right. The history behind this "logic" is that it was assumed
NUMA machines would only ever be used for HPC and that the workloads would
always be partitioned to run within NUMA nodes. This has not been the case
for a long time and I would argue that we should leave that thing disabled
by default in all cases. Last time I tried it was met with resistance but
maybe it's time to try again.

> whatever you've got cached on the other bank, it flushes it to disk as
> fast as possible.  I understand the goal was to make memory usage local
> to the processors stuff was running on, but that includes an implicit
> assumption that no individual process will ever want more than one
> memory bank worth of cache.
> 
> So disabling all of the NUMA optimizations is the way to go for any
> workload I personally deal with.
> 

I would hesitate to recommend "all" on the grounds that zone_reclaim_mode
is brain damage and I'd hate to lump all tuning parameters into the same box.

There is an interesting side-line here. If all IO is initiated by one
process in postgres then the memory locality will be sub-optimal.
The consumer of the data may or may not be running on the same
node as the process that read the data from disk. It is possible to
migrate this from user space but the interface is clumsy and assumes the
data is mapped.

Automatic NUMA balancing does not help you here because that thing also
depends on the data being mapped. It does nothing for data accessed via
read/write. There is nothing fundamental that prevents this; it was not
implemented because it was not deemed to be important enough. The amount
of effort spent on addressing this would depend on how important NUMA
locality is for postgres performance.

-- 
Mel Gorman
SUSE Labs




Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
On Mon, Jan 13, 2014 at 11:38:44PM +0100, Jan Kara wrote:
> On Mon 13-01-14 22:26:45, Mel Gorman wrote:
> > The flipside is also meant to hold true. If you know data will be needed
> > in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
> > the implementation it does a forced read-ahead on the range of pages of
> > interest. It doesn't look like it would block.
>   That's not quite true. POSIX_FADV_WILLNEED still needs to map logical
> file offsets to physical disk blocks and create IO requests. This happens
> synchronously. So if your disk is congested and relevant metadata is out of
> cache, or we simply run out of free IO requests, POSIX_FADV_WILLNEED can
> block for a significant amount of time.
> 

Umm, yes, you're right. It also potentially stalls allocating the pages
up front even though it will only try and direct reclaim pages once.
That can stall in some circumstances, particularly if there are a number
of processes trying to reclaim memory.

That kinda sucks though. One point of discussion would be to check if
this is an interface that can be used and, if so, whether it is required
to never block. If it is, is there something we can do about it -- queue
the IO asynchronously if you can but if the kernel would block then do not
bother? That does mean that fadvise is not guaranteeing that the pages will
be resident in the future but that was not the intent of the interface
anyway.
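
Whether the call blocks in practice is easy to measure from user space.
This is a minimal sketch assuming Linux or another platform where Python
exposes os.posix_fadvise; the function name is hypothetical.

```python
import os
import time


def prefetch(fd, offset, length):
    """Request readahead of a byte range; return seconds spent in the call."""
    start = time.monotonic()
    # Advisory only: success does not guarantee the pages end up resident.
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    return time.monotonic() - start
```

On an idle system with warm metadata the returned time should be tiny. A
test case that shows it growing under a congested disk or heavy reclaim
would be exactly the kind of data that is useful for this discussion.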

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
On Mon, Jan 13, 2014 at 06:27:03PM -0200, Claudio Freire wrote:
> On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
> > On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>
> >> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
> >> wrote:
> >>>
> >>> On a related note, there's also the problem of double-buffering.  When
> >>> we read a page into shared_buffers, we leave a copy behind in the OS
> >>> buffers, and similarly on write-out.  It's very unclear what to do
> >>> about this, since the kernel and PostgreSQL don't have intimate
> >>> knowledge of what each other are doing, but it would be nice to solve
> >>> somehow.
> >>
> >>
> >>
> >> There you have a much harder algorithmic problem.
> >>
> >> You can basically control duplication with fadvise and WONTNEED. The
> >> problem here is not the kernel and whether or not it allows postgres
> >> to be smart about it. The problem is... what kind of smarts
> >> (algorithm) to use.
> >
> >
> > Isn't this a fairly simple matter of when we read a page into shared buffers
> > tell the kernel do forget that page? And a corollary to that for when we
> > dump a page out of shared_buffers (here kernel, please put this back into
> > your cache).
> 
> 
> That's my point. In terms of kernel-postgres interaction, it's fairly simple.
> 
> What's not so simple, is figuring out what policy to use. Remember,
> you cannot tell the kernel to put some page in its page cache without
> reading it or writing it. So, once you make the kernel forget a page,
> evicting it from shared buffers becomes quite expensive.

posix_fadvise(POSIX_FADV_WILLNEED) is meant to cover this case by
forcing readahead. If you evict it prematurely then you do get kinda
screwed because you pay the IO cost to read it back in again even if you
had enough memory to cache it. Maybe this is the type of kernel-postgres
interaction that is annoying you.

If you don't evict, the kernel eventually steps in and evicts the wrong
thing. If you do evict and it was unnecessary you pay an IO cost.

That could be something we look at. There are cases buried deep in the
VM where pages get shuffled to the end of the LRU and get tagged for
reclaim as soon as possible. Maybe you need access to something like
that via posix_fadvise to say "reclaim this page if you need memory but
leave it resident if there is no memory pressure" or something similar.
Not exactly sure what that interface would look like or offhand how it
could be reliably implemented.

-- 
Mel Gorman
SUSE Labs




Re: [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
kicking.

> On a related note, there's also the problem of double-buffering.  When
> we read a page into shared_buffers, we leave a copy behind in the OS
> buffers, and similarly on write-out.  It's very unclear what to do
> about this, since the kernel and PostgreSQL don't have intimate
> knowledge of what each other are doing, but it would be nice to solve
> somehow.
> 

If it's mapped, clean and you do not need it any more then
madvise(MADV_DONTNEED). If you are accessing the data via a file handle,
then I would expect posix_fadvise(POSIX_FADV_DONTNEED). Offhand, I do
not know how it behaved historically but right now it will usually sync
the data and then discard the pages. I say usually because it will not
necessarily sync if the storage is congested and there is no guarantee it
will be discarded. In older kernels, there was a bug where small calls to
posix_fadvise() would not work at all. This was fixed in 3.9.
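
For the file handle case, the pattern would look roughly like the sketch
below. It assumes a platform where Python exposes os.posix_fadvise, and the
function name is invented for illustration. It is not how PostgreSQL does
its reads.

```python
import os


def read_then_drop(path, offset, length):
    """Read a range into our own buffer, then hint the kernel to drop its copy."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.pread(fd, length, offset)   # our copy of the data
        # Advisory only: the kernel may keep the pages, for example if they
        # are dirty and the storage is congested.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)
```

Whether the kernel really dropped the pages is not visible from this code.
Verifying that needs something like mincore() on a mapping of the file.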

The flipside is also meant to hold true. If you know data will be needed
in the near future then posix_fadvise(POSIX_FADV_WILLNEED). Glancing at
the implementation it does a forced read-ahead on the range of pages of
interest. It doesn't look like it would block.

The completely different approach for double buffering is direct IO but
there may be reasons why you are avoiding that and are unhappy with the
interfaces that are meant to work.

Just from the start, it looks like there are a number of problem areas.
Some may be fixed -- in which case we should identify what fixed it, what
kernel version and see whether it can be verified with a test case or
whether we managed to break something else in the process. Other bugs may
still exist because we believe some interface works how users want when it
is in fact unfit for purpose for some reason.

-- 
Mel Gorman
SUSE Labs




[HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Mel Gorman
Hi,

I'm the chair for Linux Storage, Filesystem and Memory Management Summit 2014
(LSF/MM). A CFP was sent out last month (https://lwn.net/Articles/575681/)
that you may have seen already.

In recent years we have had at least one topic that was shared between
all three tracks that was led by a person outside of the usual kernel
development community. I am checking if the PostgreSQL community
would be willing to volunteer someone to lead a topic discussing
PostgreSQL performance with recent kernels or to highlight regressions
or future developments you feel are potentially a problem. With luck
someone suitable is already travelling to the collaboration summit
(http://events.linuxfoundation.org/events/collaboration-summit) and it
would not be too inconvenient to drop in for LSF/MM as well.

There are two reasons why I'm suggesting this. First, PostgreSQL was the
basis of a test used to highlight a scheduler problem around kernel 3.6
but otherwise in my experience it is rare that PostgreSQL is part of a
bug report.  I am skeptical this particular bug report was a typical use
case for PostgreSQL (pgbench, read-only, many threads, very small in-memory
database). I wonder why reports related to PostgreSQL are not more common.
One assumption would be that PostgreSQL is perfectly happy with the current
kernel behaviour in which case our discussion here is done.

This brings me to the second reason -- there is evidence
that the PostgreSQL community is not happy with the current
direction of kernel development. The most obvious example is this thread
http://postgresql.1045698.n5.nabble.com/Why-we-are-going-to-have-to-go-DirectIO-td5781471.html
but I suspect there are others. The thread alleges that the kernel community
are in the business of pushing hackish changes into the IO stack without
much thought or testing although the linked article describes a VM and not
a storage problem. I'm not here to debate the kernel's regression testing
or development methodology but LSF/MM is one place where a large number
of people involved with the IO layers will be attending.  If you have a
concrete complaint then here is a soap box.

Does the PostgreSQL community have a problem with recent kernels,
particularly with respect to the storage, filesystem or memory management
layers? If yes, do you have some data that can highlight this and can you
volunteer someone to represent your interests to the kernel community? Are
current developments in the IO layer counter to the PostgreSQL requirements?
If so, what developments, why are they a problem, do you have a suggested
alternative or some idea of what we should watch out for? The track topic
would be up to you but just as a hint, we'd need something a lot more
concrete than "you should test more".

-- 
Mel Gorman
SUSE Labs

