On 14/01/14 14:09, Dave Chinner wrote:
On Mon, Jan 13, 2014 at 09:29:02PM +0000, Greg Stark wrote:
On Mon, Jan 13, 2014 at 9:12 PM, Andres Freund <and...@2ndquadrant.com> wrote:
[...]
The more ambitious and interesting direction is to let Postgres tell
the kernel what it needs to know to manage everything. To do that we
would need the ability to control when pages are flushed out. This is
absolutely necessary to maintain consistency. Postgres would need to
be able to mark pages as unflushable until some point in time in the
future when the journal is flushed. We discussed various ways that
interface could work but it would be tricky to keep it low enough
overhead to be workable.
IMO, the concept of allowing userspace to pin dirty page cache
pages in memory is just asking for trouble. Apart from the obvious
memory reclaim and OOM issues, some filesystems won't be able to
move their journals forward until the data is flushed. i.e. ordered
mode data writeback on ext3 will have all sorts of deadlock issues
that result from pinning pages and then issuing fsync() on another
file which will block waiting for the pinned pages to be flushed.

Indeed, what happens if you do pin_dirty_pages(fd); .... fsync(fd);?
If fsync() blocks because there are pinned pages, and there's no
other thread to unpin them, then that code just deadlocked. If
fsync() doesn't block and skips the pinned pages, then we haven't
done an fsync() at all, and so violated the expectation that users
have that after fsync() returns their data is safe on disk. And if
we return an error to fsync(), then what the hell does the user do
if it is some other application we don't know about that has pinned
the pages? And if the kernel unpins them after some time, then we
just violated the application's consistency guarantees....

[...]

What if Postgres could tell the kernel how strongly it wanted to hold on to the pages?

Say a byte (this is arbitrary; it could even be a single hint bit meaning "please, Please, PLEASE don't flush, if that is okay with you Mr Kernel..."), so the strength would be S = (unsigned byte value)/256, giving 0 <= S < 1.

S = 0      flush now.
0 < S < 1  flush if the 'need' is greater than S
S = 1      never flush (note a value of 1 cannot occur, as max S = 255/256)
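
To make the rule in the table concrete, here is a minimal sketch in Python. The names strength() and should_flush() are invented for illustration, 'need' is assumed to be a number on the same 0..1 scale as S, and nothing like this exists as a kernel interface today:

```python
def strength(hint_byte: int) -> float:
    """Map an unsigned byte hint (0..255) to S = hint/256, so 0 <= S < 1."""
    if not 0 <= hint_byte <= 255:
        raise ValueError("hint must fit in an unsigned byte")
    return hint_byte / 256


def should_flush(hint_byte: int, need: float) -> bool:
    """Hypothetical policy: flush at once when S == 0, else only when need > S."""
    s = strength(hint_byte)
    return s == 0 or need > s
```

Because the largest hint is 255, S tops out at 255/256, so a large enough 'need' can always force a flush.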

Postgres could use low non-zero S values if it thinks that pages /might/ still be useful later, and very high values when it is /more certain/. I am sure Postgres must sometimes know when some pages are more important to hold onto than others, hence my feeling that S should be more than one bit.

The kernel might simply flush pages starting with those that have low values of S and working upwards until it has freed enough memory to resolve its memory pressure. So an explicit numerical value of 'need' (as implied above) is not required. Also, any practical implementation would not use 'S' as a float/double, but would use integer values for 'S' & 'need' - assuming that 'need' did have to be an actual value, which I suspect would not be required.
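
A rough sketch of that ordering, again in Python and using integer hints only. The Page type, its fields, and pick_pages_to_flush() are hypothetical names for illustration, and treating each page's size as the memory gained by flushing it is a simplifying assumption:

```python
from typing import List, NamedTuple


class Page(NamedTuple):
    page_id: int
    hint: int   # integer strength hint, 0..255 (0 means "flush now")
    size: int   # bytes assumed to be reclaimable once the page is flushed


def pick_pages_to_flush(pages: List[Page], bytes_needed: int) -> List[Page]:
    """Choose pages in ascending hint order until enough memory is covered.

    Pages with hint 0 are always chosen, since they asked to be flushed now.
    """
    chosen = []
    freed = 0
    for page in sorted(pages, key=lambda p: p.hint):
        if page.hint != 0 and freed >= bytes_needed:
            break
        chosen.append(page)
        freed += page.size
    return chosen
```

With this scheme no explicit 'need' value is compared against S; the amount of memory pressure alone decides how far up the ordering the flushing goes.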

This way the kernel is free to flush all such pages, when sufficient need arises - yet usually, when there is sufficient memory, the pages will be held unflushed.


Cheers,
Gavin
