On Mon, Jan 16, 2012 at 8:59 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> [ interesting description of problem scenario and necessary conditions for 
> reproducing it ]

This is about what I thought was happening, but I'm still not quite
sure how to recreate it in the lab.

Have you had a chance to test whether Linux 3.2 does any better in
this area?  As I understand it, it doesn't do anything particularly
interesting about the willingness of the kernel to cache gigantic
amounts of dirty data, but (1) supposedly it does a better job not
yanking the disk head around by just putting foreground processes to
sleep while writes happen in the background, rather than having the
foreground processes compete with the background writer for control of
the disk head; and (2) instead of having a sharp edge where background
writing kicks in, it tries to gradually ratchet up the pressure to get
things written out.
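
For what it's worth, a crude harness along these lines (untested
sketch; the file name and sizes are arbitrary) might be enough to
compare kernels in the lab: dirty a pile of data through the page
cache and report any individual write() that stalls.

/*
 * Crude write-stall probe: dirty a lot of data through the page cache
 * and report any write() call that takes more than 100 ms.  Untested
 * sketch; file name and sizes are arbitrary.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)                 /* 1 MB per write() */
#define TOTAL   (8LL * 1024 * 1024 * 1024)    /* dirty 8 GB in total */

static double
elapsed_ms(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000.0 +
           (b->tv_nsec - a->tv_nsec) / 1000000.0;
}

int
main(void)
{
    char       *buf = malloc(CHUNK);
    int         fd = open("stall-probe.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    long long   written = 0;

    if (buf == NULL || fd < 0)
    {
        perror("setup");
        return 1;
    }
    memset(buf, 'x', CHUNK);

    while (written < TOTAL)
    {
        struct timespec start, stop;

        clock_gettime(CLOCK_MONOTONIC, &start);
        if (write(fd, buf, CHUNK) != CHUNK)
        {
            perror("write");
            return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &stop);

        if (elapsed_ms(&start, &stop) > 100.0)
            printf("stall: %.0f ms after %lld MB\n",
                   elapsed_ms(&start, &stop), written / CHUNK);
        written += CHUNK;
    }
    close(fd);
    return 0;
}

On older kernels I'd expect that to show long stalls once the dirty
limits are hit; if 3.2 really does ratchet the pressure up gradually,
the stalls should be shorter and more evenly spread.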

Somehow I can't shake the feeling that this is fundamentally a Linux
problem, and that it's going to be nearly impossible to work around in
user space without some help from the kernel.  I guess in some sense
it's reasonable that calling fsync() blasts the data at the platter at
top speed, but if that leads to starving everyone else on the system
then it starts to seem a lot less reasonable: part of the kernel's job
is to guarantee all processes fair access to shared resources, and if
it doesn't do that, we're always going to be playing catch-up.
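
About the only lever I can think of in user space, at least on Linux,
is sync_file_range(): nudge the kernel to start writeback on ranges
we've already written, so the final fsync() isn't one giant burst.
Roughly like this (untested sketch; the helper name is invented):

/*
 * Sketch: start asynchronous writeback of a range we've just written,
 * without waiting for it, so the eventual fsync() finds most of the
 * data already on its way to disk.  Linux-specific; sync_file_range()
 * is not portable.
 */
#define _GNU_SOURCE
#include <fcntl.h>

static void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    /* SYNC_FILE_RANGE_WRITE queues writeback but does not wait for it. */
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

That doesn't reduce the amount of data that has to reach the
platters, but it spreads the writeback out instead of leaving it all
for one big fsync() at the end.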

>> Just one random thought: I wonder if it would make sense to cap the
>> delay after each sync to the time spending performing that sync.  That
>> would make the tuning of the delay less sensitive to the total number
>> of files, because we won't unnecessarily wait after each sync when
>> they're not actually taking any time to complete.
>
> This is one of the attractive ideas in this area that didn't work out so
> well when tested.  The problem is that writes into a battery-backed write
> cache will show zero latency for some time until the cache is filled...and
> then you're done.  You have to pause anyway, even though it seems write
> speed is massive, to give the cache some time to drain to disk between syncs
> that push data toward it.  Even though it absorbed your previous write with
> no delay, that doesn't mean it takes no time to process that write.  With
> proper write caching, that processing is just happening asynchronously.
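
Just so we're looking at the same thing, the capped-delay idea as I
read it is roughly this (a sketch with made-up names, not actual
checkpointer code):

/*
 * Capped-delay sketch: never sleep longer after a sync than the sync
 * itself took.  With a battery-backed write cache the measured time
 * stays near zero until the cache fills, so the cap removes the pause
 * exactly when it's most needed, which is exactly your point.
 */
static void
pause_after_sync(double sync_seconds, double configured_delay_seconds)
{
    double      delay = configured_delay_seconds;

    if (delay > sync_seconds)
        delay = sync_seconds;   /* cap the pause at the sync's own cost */

    pg_usleep((long) (delay * 1000000.0));
}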

Hmm, OK.  Well, to borrow a page from one of your other ideas, how
about keeping track of the number of fsync requests queued for each
file, and making the delay proportional to that number?  We might have
written the same block more than once, so it could be an overestimate,
but it rubs me the wrong way to think that a checkpoint is going to
finish late because somebody ran a CREATE TABLE statement that touched
5 or 6 catalogs, and now we've got to pause for 15-18 seconds because
they've each got one dirty block.  :-(
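
In code terms I'm picturing something like this (hand-wavy sketch,
names invented): draw the per-file pause from one overall budget in
proportion to how many requests are queued for that file, instead of
a flat sleep per file.

/*
 * Hand-wavy sketch (names invented): spread a fixed sync-phase delay
 * budget across files in proportion to the number of fsync requests
 * queued for each one.  A catalog file with a single dirty block then
 * costs almost no extra delay, while a file with thousands of queued
 * requests gets most of the pause.
 */
static long
delay_after_sync_usec(int requests_for_this_file, int total_requests,
                      long total_delay_budget_usec)
{
    if (total_requests <= 0)
        return 0;

    return (long) (((double) requests_for_this_file / total_requests) *
                   total_delay_budget_usec);
}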

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
