Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 15:09 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 3:00 PM, James Bottomley
>  wrote:
> >> Doesn't sound exactly like what I had in mind.  What I was suggesting
> >> is an analogue of read() that, if it reads full pages of data to a
> >> page-aligned address, shares the data with the buffer cache until it's
> >> first written instead of actually copying the data.
> >
> > The only way to make this happen is mmap the file to the buffer and use
> > MADV_WILLNEED.
> >
> >>   The pages are
> >> write-protected so that an attempt to write the address range causes a
> >> page fault.  In response to such a fault, the pages become anonymous
> >> memory and the buffer cache no longer holds a reference to the page.
> >
> > OK, so here I thought of another madvise() call to switch the region to
> > anonymous memory.  A page fault works too, of course, it's just that one
> > per page in the mapping will be expensive.
> 
> I don't think either of these ideas works for us.  We start by
> creating a chunk of shared memory that all processes (we do not use
> threads) will have mapped at a common address, and we read() and
> write() into that chunk.

Yes, that's what I was thinking: it's a cache.  About how many files
comprise this cache?  Are you thinking it's too difficult for every
process to map the files?

> > Do you care about handling aliases ... what happens if someone else
> > reads from the file, or will that never occur?  The reason for asking is
> > that it's much easier if someone else mmapping the file gets your
> > anonymous memory than we create an alias in the page cache.
> 
> All reads and writes go through the buffer pool stored in shared
> memory, but any of the processes that have that shared memory region
> mapped could be responsible for any individual I/O request.

That seems to be possible with the abstraction.  The initial mapping
gets the file backed pages: you can do madvise to read them (using
readahead), flush them (using wontneed) and flip them to anonymous
(using something TBD).  Since it's a shared mapping API based on the
file, any of the mapping processes can do any operation.  Future mappers
of the file get the mix of real and anon memory, so it's truly shared.

Given that you want to use this as a shared cache, it seems that the API
to flip back from anon to file mapped is wontneed.  That would also
trigger writeback of any dirty pages in the previously anon region ...
which you could force with msync.  As far as I can see, this is
identical to read/write on a shared region with the exception that you
don't need to copy in and out of the page cache.
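
As a concrete sketch of that flow, using only calls that exist today
(MADV_WILLNEED for the readahead, msync() plus MADV_DONTNEED for the
flush; the flip to anonymous has no present-day equivalent and is
omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Shared mapping: every process that maps the file sees the same
     * pages, so any of them can issue the advice below. */
    char *buf = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* "Read" the range: kick off readahead into the page cache. */
    madvise(buf, st.st_size, MADV_WILLNEED);

    /* Touch (and dirty) the first page; volatile so the store survives. */
    ((volatile char *)buf)[0] ^= 0;

    /* "Flush" the range: push dirty pages out, then drop our claim. */
    msync(buf, st.st_size, MS_SYNC);
    madvise(buf, st.st_size, MADV_DONTNEED);

    munmap(buf, st.st_size);
    close(fd);
    return 0;
}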

From our point of view, the implementation is nice because the pages
effectively never leave the page cache.  We just use an extra per-page
flag (which I'll get shot for suggesting) to alter the writeout path
(which is where the complexity that may kill the implementation lies).

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 12:39 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
>  wrote:
> > On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
> >> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas  wrote:
> >> > In terms of avoiding double-buffering, here's my thought after reading
> >> > what's been written so far.  Suppose we read a page into our buffer
> >> > pool.  Until the page is clean, it would be ideal for the mapping to
> >> > be shared between the buffer cache and our pool, sort of like
> >> > copy-on-write.  That way, if we decide to evict the page, it will
> >> > still be in the OS cache if we end up needing it again (remember, the
> >> > OS cache is typically much larger than our buffer pool).  But if the
> >> > page is dirtied, then instead of copying it, just have the buffer pool
> >> > forget about it, because at that point we know we're going to write
> >> > the page back out anyway before evicting it.
> >> >
> >> > This would be pretty similar to copy-on-write, except without the
> >> > copying.  It would just be forget-from-the-buffer-pool-on-write.
> >>
> >> But... either copy-on-write or forget-on-write needs a page fault, and
> >> thus a page mapping.
> >>
> >> Is a page fault more expensive than copying 8k?
> >>
> >> (I really don't know).
> >
> > A page fault can be expensive, yes ... but perhaps you don't need one.
> >
> > What you want is a range of memory that's read from a file but treated
> > as anonymous for writeout (i.e. written to swap if we need to reclaim
> > it). Then at some time later, you want to designate it as written back
> > to the file instead so you control the writeout order.  I'm not sure we
> > can do this: the separation between file backed and anonymous pages is
> > pretty deeply ingrained into the OS, but if it were possible, is that
> > what you want?
> 
> Doesn't sound exactly like what I had in mind.  What I was suggesting
> is an analogue of read() that, if it reads full pages of data to a
> page-aligned address, shares the data with the buffer cache until it's
> first written instead of actually copying the data.

The only way to make this happen is mmap the file to the buffer and use
MADV_WILLNEED.

>   The pages are
> write-protected so that an attempt to write the address range causes a
> page fault.  In response to such a fault, the pages become anonymous
> memory and the buffer cache no longer holds a reference to the page.

OK, so here I thought of another madvise() call to switch the region to
anonymous memory.  A page fault works too, of course; it's just that one
fault per page in the mapping will be expensive.
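
For what it's worth, the fault path can be mimicked from userspace
today, which at least makes the per-page cost measurable.  A sketch
(the real proposal, turning the faulted page anonymous inside the
kernel, obviously cannot be done from a SIGSEGV handler):

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

static void on_write_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* Unprotect just the faulting page; the write is then retried.  A
     * real implementation would record "page is now private/dirty" here. */
    void *page = (void *)((uintptr_t)si->si_addr &
                          ~((uintptr_t)page_size - 1));
    if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_write_fault;
    sigaction(SIGSEGV, &sa, NULL);

    size_t len = 16 * page_size;
    char *region = mmap(NULL, len, PROT_READ,      /* write-protected */
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    /* Every first write to a page costs one fault plus one mprotect(). */
    for (size_t i = 0; i < len; i += page_size)
        region[i] = 1;

    printf("dirtied %zu pages, one fault each\n", len / page_size);
    munmap(region, len);
    return 0;
}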

Do you care about handling aliases ... what happens if someone else
reads from the file, or will that never occur?  The reason for asking is
that it's much easier if someone else mmapping the file gets your
anonymous memory than if we create an alias in the page cache.

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
> James Bottomley  writes:
> > The current mechanism for coherency between a userspace cache and the
> > in-kernel page cache is mmap ... that's the only way you get the same
> > page in both currently.
> 
> Right.
> 
> > glibc used to have an implementation of read/write in terms of mmap, so
> > it should be possible to insert it into your current implementation
> > without a major rewrite.  The problem I think this brings you is
> > uncontrolled writeback: you don't want dirty pages to go to disk until
> > you issue a write()
> 
> Exactly.
> 
> > I think we could fix this with another madvise():
> > something like MADV_WILLUPDATE telling the page cache we expect to alter
> > the pages again, so don't be aggressive about cleaning them.
> 
> "Don't be aggressive" isn't good enough.  The prohibition on early write
> has to be absolute, because writing a dirty page before we've done
> whatever else we need to do results in a corrupt database.  It has to
> be treated like a write barrier.
> 
> > The problem is we can't give you absolute control of when pages are
> > written back because that interface can be used to DoS the system: once
> > we get too many dirty uncleanable pages, we'll thrash looking for memory
> > and the system will livelock.
> 
> Understood, but that makes this direction a dead end.  We can't use
> it if the kernel might decide to write anyway.

No, I'm sorry, that's never going to be possible.  No user space
application has all the facts.  If we give you an interface to force
unconditional holding of dirty pages in core, you'll livelock the system
eventually because you made a wrong decision to hold too many dirty
pages.  I don't understand why this has to be absolute: if you advise
us to hold the pages dirty and we do so, up until it becomes a choice
between holding on to the pages and thrashing the system into a
livelock, why would you ever choose the latter?  And if, as I'm
assuming, you never would, why don't you want the kernel to make that
choice for you?

James





Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas  wrote:
> >
> > In terms of avoiding double-buffering, here's my thought after reading
> > what's been written so far.  Suppose we read a page into our buffer
> > pool.  Until the page is clean, it would be ideal for the mapping to
> > be shared between the buffer cache and our pool, sort of like
> > copy-on-write.  That way, if we decide to evict the page, it will
> > still be in the OS cache if we end up needing it again (remember, the
> > OS cache is typically much larger than our buffer pool).  But if the
> > page is dirtied, then instead of copying it, just have the buffer pool
> > forget about it, because at that point we know we're going to write
> > the page back out anyway before evicting it.
> >
> > This would be pretty similar to copy-on-write, except without the
> > copying.  It would just be forget-from-the-buffer-pool-on-write.
> 
> 
> But... either copy-on-write or forget-on-write needs a page fault, and
> thus a page mapping.
> 
> Is a page fault more expensive than copying 8k?
> 
> (I really don't know).

A page fault can be expensive, yes ... but perhaps you don't need one. 

What you want is a range of memory that's read from a file but treated
as anonymous for writeout (i.e. written to swap if we need to reclaim
it).  Then at some time later, you want to designate it as written back
to the file instead so you control the writeout order.  I'm not sure we
can do this: the separation between file backed and anonymous pages is
pretty deeply ingrained into the OS, but if it were possible, is that
what you want?
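
If I'm reading the requirement right, the nearest existing primitive is
a MAP_PRIVATE file mapping: reads come from the page cache, but any
page you dirty becomes a private, swap-backed copy that the kernel will
never write to the file.  The missing half is exactly the "designate it
as file-backed again" step; today the only way back is an explicit
write().  A sketch:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Private mapping: reads are satisfied from the page cache, but a
     * write gives this process its own anonymous copy of the page
     * (swap-backed under reclaim, never written to the file). */
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[0] ^= 0x1;        /* this page is now effectively anonymous */

    /* The only route back to the file today: an explicit write at the
     * right offset, under the application's own ordering rules. */
    if (pwrite(fd, p, st.st_size, 0) < 0)
        perror("pwrite");

    munmap(p, st.st_size);
    close(fd);
    return 0;
}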

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 11:48 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 11:44 AM, James Bottomley
>  wrote:
> > No, I'm sorry, that's never going to be possible.  No user space
> > application has all the facts.  If we give you an interface to force
> > unconditional holding of dirty pages in core you'll livelock the system
> > eventually because you made a wrong decision to hold too many dirty
> > pages.   I don't understand why this has to be absolute: if you advise
> > us to hold the pages dirty and we do up until it becomes a choice to
> > hold on to the pages or to thrash the system into a livelock, why would
> > you ever choose the latter?  And if, as I'm assuming, you never would,
> > why don't you want the kernel to make that choice for you?
> 
> If you don't understand how write-ahead logging works, this
> conversation is going nowhere.  Suffice it to say that the word
> "ahead" is not optional.

No, I do ... you mean that the order of writeout, if we have to do it,
is important.  In the rest of the kernel, we do this with barriers,
which cause ordered grouping of I/O chunks.  If we could force a
similar ordering in the writeout code, would that be enough?
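
To make the constraint concrete, the rule being described looks roughly
like this (file names and the record format are made up, not
PostgreSQL's): the log record must be durable before the data page it
describes is allowed to reach disk.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 8192

static int flush_page(int wal_fd, const char *wal_rec, size_t wal_len,
                      int data_fd, const char *page, off_t page_off)
{
    /* 1. Append the WAL record and make it durable ... */
    if (write(wal_fd, wal_rec, wal_len) != (ssize_t)wal_len) return -1;
    if (fdatasync(wal_fd) != 0) return -1;

    /* 2. ... only then may the data page be written out. */
    if (pwrite(data_fd, page, PAGE_SIZE, page_off) != PAGE_SIZE) return -1;
    return 0;
}

int main(void)
{
    int wal_fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
    int data_fd = open("table.dat", O_WRONLY | O_CREAT, 0600);
    if (wal_fd < 0 || data_fd < 0) { perror("open"); return 1; }

    char page[PAGE_SIZE];
    memset(page, 0xAB, sizeof(page));

    const char wal_rec[] = "update page 0";
    if (flush_page(wal_fd, wal_rec, sizeof(wal_rec), data_fd, page, 0) != 0)
        perror("flush_page");

    close(wal_fd);
    close(data_fd);
    return 0;
}

The question in the thread is whether a kernel-enforced write ordering
could take the place of the application holding the dirty data page
back itself.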

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Tue, 2014-01-14 at 15:39 +0100, Hannu Krosing wrote:
> On 01/14/2014 09:39 AM, Claudio Freire wrote:
> > On Tue, Jan 14, 2014 at 5:08 AM, Hannu Krosing  
> > wrote:
> >> Again, as said above the linux file system is doing fine. What we
> >> want is a few ways to interact with it to let it do even better when
> >> working with postgresql by telling it some stuff it otherwise would
> >> have to second guess and by sometimes giving it back some cache
> >> pages which were copied away for potential modifying but ended
> >> up clean in the end.
> > You don't need new interfaces. Only a slight modification of what
> > fadvise DONTNEED does.
> >
> > This insistence in injecting pages from postgres to kernel is just a
> > bad idea. 
> Do you think it would be possible to map copy-on-write pages
> from linux cache to postgresql cache ?
> 
> this would be a step in direction of solving the double-ram-usage
> of pages which have not been read from syscache to postgresql
> cache without sacrificing linux read-ahead (which I assume does
> not happen when reads bypass system cache).

The current mechanism for coherency between a userspace cache and the
in-kernel page cache is mmap ... that's the only way you get the same
page in both currently.

glibc used to have an implementation of read/write in terms of mmap, so
it should be possible to insert it into your current implementation
without a major rewrite.  The problem I think this brings you is
uncontrolled writeback: you don't want dirty pages to go to disk until
you issue a write().  I think we could fix this with another madvise():
something like MADV_WILLUPDATE, telling the page cache we expect to
alter the pages again, so don't be aggressive about cleaning them.  Plus
all the other issues with mmap() ... but if you can detail those, we
might be able to fix them.
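
In other words, something shaped like this (a sketch, not glibc's
actual code; the function names and the 8kB block size are just
illustrative, and MADV_WILLUPDATE appears only in a comment because no
such flag exists today):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* "Read" block blockno of fd by mapping it instead of copying it, so
 * the cached copy is the page cache page itself. */
static void *map_block(int fd, long blockno)
{
    void *p = mmap(NULL, BLCKSZ, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)blockno * BLCKSZ);
    if (p == MAP_FAILED)
        return NULL;

    /* Proposed, not real: madvise(p, BLCKSZ, MADV_WILLUPDATE) would say
     * "we intend to dirty these pages again, don't clean them eagerly".
     * Without it, writeback timing is entirely up to the kernel, which
     * is the uncontrolled-writeback problem above. */
    return p;
}

/* "Evict" the block: force any dirty data out, then drop the mapping. */
static int unmap_block(void *p)
{
    if (msync(p, BLCKSZ, MS_SYNC) != 0)
        return -1;
    return munmap(p, BLCKSZ);
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    void *blk = map_block(fd, 0);       /* "read" block 0 */
    if (blk != NULL)
        unmap_block(blk);               /* "evict" it again */

    close(fd);
    return 0;
}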

> and we can write back the copy at the point when it is safe (from
> postgresql perspective)  to let the system write them back ?

Using MADV_WILLUPDATE, possibly ... you're still not going to have
absolute control.  The kernel will write back the pages if the dirty
limits are exceeded, for instance, but we could tune it to be useful.

> Do you think it is possible to make it work with good performance
> for a few million 8kb pages ?
> 
> > At the very least, it still needs postgres to know too much
> > of the filesystem (block layout) to properly work. Ie: pg must be
> > required to put entire filesystem-level blocks into the page cache,
> > since that's how the page cache works. 
> I was more thinking of an simple write() interface with extra
> flags/sysctls to tell kernel that "we already have this on disk"
> > At the very worst, it may
> > introduce serious security and reliability implications, when
> > applications can destroy the consistency of the page cache (even if
> > full access rights are checked, there's still the possibility this
> > inconsistency might be exploitable).
> If you allow write() which just writes clean pages, I can not see
> where the extra security concerns are beyond what normal
> write can do.

The problem is we can't give you absolute control of when pages are
written back because that interface can be used to DoS the system: once
we get too many dirty uncleanable pages, we'll thrash looking for memory
and the system will livelock.

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-14 Thread James Bottomley
On Mon, 2014-01-13 at 19:48 -0500, Trond Myklebust wrote:
> On Jan 13, 2014, at 19:03, Hannu Krosing  wrote:
> 
> > On 01/13/2014 09:53 PM, Trond Myklebust wrote:
> >> On Jan 13, 2014, at 15:40, Andres Freund  wrote:
> >> 
> >>> On 2014-01-13 15:15:16 -0500, Robert Haas wrote:
>  On Mon, Jan 13, 2014 at 1:51 PM, Kevin Grittner  
>  wrote:
> > I notice, Josh, that you didn't mention the problems many people
> > have run into with Transparent Huge Page defrag and with NUMA
> > access.
>  Amen to that.  Actually, I think NUMA can be (mostly?) fixed by
>  setting zone_reclaim_mode; is there some other problem besides that?
> >>> I think that fixes some of the worst instances, but I've seen machines
> >>> spending horrible amounts of CPU (& BUS) time in page reclaim
> >>> nonetheless. If I analyzed it correctly it's in RAM << working set
> >>> workloads where RAM is pretty large and most of it is used as page
> >>> cache. The kernel ends up spending a huge percentage of time finding and
> >>> potentially defragmenting pages when looking for victim buffers.
> >>> 
>  On a related note, there's also the problem of double-buffering.  When
>  we read a page into shared_buffers, we leave a copy behind in the OS
>  buffers, and similarly on write-out.  It's very unclear what to do
>  about this, since the kernel and PostgreSQL don't have intimate
>  knowledge of what each other are doing, but it would be nice to solve
>  somehow.
> >>> I've wondered before if there wouldn't be a chance for postgres to say
> >>> "my dear OS, that the file range 0-8192 of file x contains y, no need to
> >>> reread" and do that when we evict a page from s_b but I never dared to
> >>> actually propose that to kernel people...
> >> O_DIRECT was specifically designed to solve the problem of double 
> >> buffering 
> >> between applications and the kernel. Why are you not able to use that in 
> >> these situations?
> > What is asked is the opposite of O_DIRECT - the write from a buffer inside
> > postgresql to linux *buffercache* and telling linux that it is the same
> > as what
> > is currently on disk, so don't bother to write it back ever.
> 
> I don’t understand. Are we talking about mmap()ed files here? Why
> would the kernel be trying to write back pages that aren’t dirty?

No ... if I have it right, it's pretty awful: they want to do a read of
a file into a user-provided buffer, thus obtaining a page cache entry
and a copy in their userspace buffer, then insert the page of the user
buffer back into the page cache as the page cache page ... that's right,
isn't it, postgres people?

Effectively you end up with buffered read/write that's also mapped into
the page cache.  It's a pretty awful way to hack around mmap.

James






Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread James Bottomley
On Mon, 2014-01-13 at 21:29 +, Greg Stark wrote:
> On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund  wrote:
> > For one, postgres doesn't use mmap for files (and can't without major
> > new interfaces). Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> > horrible consequences for performance/scalability - very quickly you
> > contend on locks in the kernel.
> 
> 
> I may as well dump this in this thread. We've discussed this in person
> a few times, including at least once with Ted T'so when he visited
> Dublin last year.
> 
> The fundamental conflict is that the kernel understands better the
> hardware and other software using the same resources, Postgres
> understands better its own access patterns. We need to either add
> interfaces so Postgres can teach the kernel what it needs about its
> access patterns or add interfaces so Postgres can find out what it
> needs to know about the hardware context.
> 
> The more ambitious and interesting direction is to let Postgres tell
> the kernel what it needs to know to manage everything. To do that we
> would need the ability to control when pages are flushed out. This is
> absolutely necessary to maintain consistency. Postgres would need to
> be able to mark pages as unflushable until some point in time in the
> future when the journal is flushed. We discussed various ways that
> interface could work but it would be tricky to keep it low enough
> overhead to be workable.

So in this case, the question would be what additional information we
need to exchange that isn't covered by the existing interfaces.  Between
madvise and splice, we seem to have most of what you want; what's
missing?

> The less exciting, more conservative option would be to add kernel
> interfaces to teach Postgres about things like raid geometries. Then
> Postgres could use directio and decide to do prefetching based on the
> raid geometry, how much available i/o bandwidth and iops is available,
> etc.
> 
> Reimplementing i/o schedulers and all the rest of the work that the
> kernel provides inside Postgres just seems like something outside our
> competency and that none of us is really excited about doing.

This would also be a well-trodden path ... I believe that some large
database company introduced Direct IO for roughly this purpose.
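
For reference, the direct I/O path looks like this (a minimal sketch;
the 4096-byte alignment is an assumption about the device's logical
block size).  The page cache is bypassed entirely, which is precisely
why the application then has to supply its own caching, prefetching and
scheduling:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN  4096
#define BLCKSZ 8192

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT needs an aligned buffer, offset and length. */
    void *buf;
    if (posix_memalign(&buf, ALIGN, BLCKSZ) != 0) {
        fprintf(stderr, "out of memory\n");
        return 1;
    }

    ssize_t n = pread(fd, buf, BLCKSZ, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes around the page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}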

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread James Bottomley
On Mon, 2014-01-13 at 22:12 +0100, Andres Freund wrote:
> On 2014-01-13 12:34:35 -0800, James Bottomley wrote:
> > On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> > > Well, if we were to collaborate with the kernel community on this then
> > > presumably we can do better than that for eviction... even to the
> > > extent of "here's some data from this range in this file. It's (clean|
> > > dirty). Put it in your cache. Just trust me on this."
> > 
> > This should be the madvise() interface (with MADV_WILLNEED and
> > MADV_DONTNEED) is there something in that interface that is
> > insufficient?
> 
> For one, postgres doesn't use mmap for files (and can't without major
> new interfaces).

I understand; that's why you get double buffering: because we can't
replace a page in the range you give us on read/write.  However, you
don't have to switch entirely to mmap: you can use mmap/madvise
exclusively for cache control and still use read/write (and still pay
the double-buffering penalty, of course).  It's only read/write with
direct I/O that would cause problems here (unless you're planning to
switch to DIO?).
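
That split might look roughly like this (a sketch; the advice-only
mapping and the 8kB block size are illustrative).  The data still moves
through pread()/pwrite(), so the extra copy remains, and the mapping
exists only so madvise() has something to act on:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Advice-only mapping: never used for the data transfer itself. */
    char *adv = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (adv == MAP_FAILED) { perror("mmap"); return 1; }

    off_t block = 0;
    char page[BLCKSZ];

    /* Hint first, then do the I/O the way the buffer pool already does. */
    madvise(adv + block * BLCKSZ, BLCKSZ, MADV_WILLNEED);
    if (pread(fd, page, BLCKSZ, block * BLCKSZ) < 0)
        perror("pread");

    /* Once the block is cached above, drop our interest in the range;
     * the kernel may keep or reclaim its copy as it sees fit. */
    madvise(adv + block * BLCKSZ, BLCKSZ, MADV_DONTNEED);

    munmap(adv, st.st_size);
    close(fd);
    return 0;
}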

>  Frequently mmap()/madvise()/munmap()ing 8kb chunks has
> horrible consequences for performance/scalability - very quickly you
> contend on locks in the kernel.

Is this because of problems in the mmap_sem?

> Also, that will mark that page dirty, which isn't what we want in this
> case.

You mean madvise(page_addr)?  It shouldn't ... the state of the dirty
bit should only be updated by actual writes.  Which MADV_ primitive is
causing the dirty marking?  We might be able to fix it (unless there's
some weird corner case I don't know about).

>  One major usecase is transplanting a page comming from postgres'
> buffers into the kernel's buffercache because the latter has a much
> better chance of properly allocating system resources across independent
> applications running.

If you want to share pages between the application and the page cache,
the only known interface is mmap ... perhaps we can discuss how to
improve mmap for you?

We also do have a way of transplanting pages: it's called splice.  How
do the semantics of splice differ from what you need?
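
For completeness, the splice-based way of "transplanting" a user page
into a file's page cache is shaped like this (a sketch; whether the
gifted page is genuinely stolen rather than copied is up to the kernel,
so treat it as the interface, not a zero-copy guarantee):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    int pipefd[2];
    if (pipe(pipefd) != 0) { perror("pipe"); return 1; }

    /* A page-aligned user buffer, standing in for one buffer-pool page. */
    char *page;
    if (posix_memalign((void **)&page, 4096, BLCKSZ) != 0) return 1;
    memset(page, 0xAB, BLCKSZ);

    struct iovec iov = { .iov_base = page, .iov_len = BLCKSZ };

    /* Gift the pages to the pipe; after this the buffer must not be
     * modified or reused (so it is deliberately never freed here). */
    if (vmsplice(pipefd[1], &iov, 1, SPLICE_F_GIFT) < 0)
        perror("vmsplice");

    /* Move them from the pipe into the file, i.e. into its page cache. */
    loff_t off = 0;
    if (splice(pipefd[0], NULL, fd, &off, BLCKSZ, SPLICE_F_MOVE) < 0)
        perror("splice");

    close(pipefd[0]);
    close(pipefd[1]);
    close(fd);
    return 0;
}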

> Oh, and the kernel's page-cache management while far from perfect,
> actually scales much better than postgres'.

Well, then, it sounds like the best way forward would be to get
postgres to use the kernel page cache more efficiently.

James






Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread James Bottomley
On Mon, 2014-01-13 at 14:32 -0600, Jim Nasby wrote:
> On 1/13/14, 2:27 PM, Claudio Freire wrote:
> > On Mon, Jan 13, 2014 at 5:23 PM, Jim Nasby  wrote:
> >> On 1/13/14, 2:19 PM, Claudio Freire wrote:
> >>>
> >>> On Mon, Jan 13, 2014 at 5:15 PM, Robert Haas 
> >>> wrote:
> 
>  On a related note, there's also the problem of double-buffering.  When
>  we read a page into shared_buffers, we leave a copy behind in the OS
>  buffers, and similarly on write-out.  It's very unclear what to do
>  about this, since the kernel and PostgreSQL don't have intimate
>  knowledge of what each other are doing, but it would be nice to solve
>  somehow.
> >>>
> >>>
> >>>
> >>> There you have a much harder algorithmic problem.
> >>>
> >>> You can basically control duplication with fadvise and WONTNEED. The
> >>> problem here is not the kernel and whether or not it allows postgres
> >>> to be smart about it. The problem is... what kind of smarts
> >>> (algorithm) to use.
> >>
> >>
> >> Isn't this a fairly simple matter of when we read a page into shared 
> >> buffers
> >> tell the kernel do forget that page? And a corollary to that for when we
> >> dump a page out of shared_buffers (here kernel, please put this back into
> >> your cache).
> >
> >
> > That's my point. In terms of kernel-postgres interaction, it's fairly 
> > simple.
> >
> > What's not so simple, is figuring out what policy to use. Remember,
> > you cannot tell the kernel to put some page in its page cache without
> > reading it or writing it. So, once you make the kernel forget a page,
> > evicting it from shared buffers becomes quite expensive.
> 
> Well, if we were to collaborate with the kernel community on this then
> presumably we can do better than that for eviction... even to the
> extent of "here's some data from this range in this file. It's (clean|
> dirty). Put it in your cache. Just trust me on this."

This should be the madvise() interface (with MADV_WILLNEED and
MADV_DONTNEED); is there something in that interface that is
insufficient?
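
The fadvise() half of that, as it stands today (a sketch; the file name
and block number are made up): you can tell the kernel to forget a
range you have just copied into shared_buffers, but there is no inverse
call that hands the data back on eviction, which is exactly the gap
being discussed.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(void)
{
    int fd = open("relation.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char page[BLCKSZ];
    off_t blockno = 42;                         /* illustrative */

    if (pread(fd, page, BLCKSZ, blockno * BLCKSZ) < 0) {
        perror("pread");
        return 1;
    }

    /* The block now lives in our own cache; drop the kernel's copy so
     * the data isn't buffered twice. */
    posix_fadvise(fd, blockno * BLCKSZ, BLCKSZ, POSIX_FADV_DONTNEED);

    /* On eviction of a clean page there is no "put this back" advice;
     * the only options are re-reading it later or never dropping it. */
    close(fd);
    return 0;
}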

James



