On Mon, Jan 16, 2012 at 8:59 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> [ interesting description of problem scenario and necessary conditions for
> reproducing it ]

This is about what I thought was happening, but I'm still not quite
sure how to recreate it in the lab.  Have you had a chance to test
whether Linux 3.2 does any better in this area?  As I understand it,
it doesn't do anything particularly interesting about the kernel's
willingness to cache gigantic amounts of dirty data, but (1)
supposedly it does a better job of not yanking the disk head around,
by putting foreground processes to sleep while writes happen in the
background rather than having the foreground processes compete with
the background writer for control of the disk head; and (2) instead
of having a sharp edge where background writing kicks in, it tries to
ratchet up the pressure gradually to get things written out.

Somehow I can't shake the feeling that this is fundamentally a Linux
problem, and that it's going to be nearly impossible to work around
in user space without some help from the kernel.  I guess in some
sense it's reasonable that calling fsync() blasts the data at the
platter at top speed, but if that leads to starving everyone else on
the system, then it starts to seem a lot less reasonable: part of the
kernel's job is to guarantee all processes fair access to shared
resources, and if it doesn't do that, we're always going to be
playing catch-up.

>> Just one random thought: I wonder if it would make sense to cap the
>> delay after each sync to the time spent performing that sync.  That
>> would make the tuning of the delay less sensitive to the total number
>> of files, because we won't unnecessarily wait after each sync when
>> they're not actually taking any time to complete.
>
> This is one of the attractive ideas in this area that didn't work out so
> well when tested.  The problem is that writes into a battery-backed write
> cache will show zero latency for some time until the cache is filled...and
> then you're done.  You have to pause anyway, even though it seems write
> speed is massive, to give the cache some time to drain to disk between syncs
> that push data toward it.  Even though it absorbed your previous write with
> no delay, that doesn't mean it takes no time to process that write.  With
> proper write caching, that processing is just happening asynchronously.

Hmm, OK.  Well, to borrow a page from one of your other ideas, how
about keeping track of the number of fsync requests queued for each
file, and making the delay proportional to that number?  We might
have written the same block more than once, so it could be an
overestimate, but it rubs me the wrong way to think that a checkpoint
is going to finish late because somebody ran a CREATE TABLE statement
that touched 5 or 6 catalogs, and now we've got to pause for 15-18
seconds because they've each got one dirty block.  :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
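
P.S. To make that last idea concrete, here's a rough sketch of the
kind of thing I mean.  None of these names or constants exist in the
tree; they're made up for illustration, with the 3-second full pause
inferred from the 5-or-6-catalogs-taking-15-18-seconds arithmetic
above.

/*
 * Hypothetical sketch only: scale the post-fsync pause by the number
 * of fsync requests that were queued for the file just synced.
 * nrequests can overcount distinct dirty blocks (the same block may
 * have been queued more than once), so treat it as an upper bound
 * and cap the pause rather than letting it grow without limit.
 */
#include <unistd.h>

#define FULL_PAUSE_USEC     3000000   /* assumed full per-file pause: 3s */
#define REQS_FOR_FULL_PAUSE 100       /* assumed saturation point */

static void
pause_after_sync(int nrequests)
{
    long    usec;

    if (nrequests >= REQS_FOR_FULL_PAUSE)
        usec = FULL_PAUSE_USEC;
    else
        usec = (long) FULL_PAUSE_USEC * nrequests / REQS_FOR_FULL_PAUSE;

    if (usec > 0)
        usleep((useconds_t) usec);
}

With something like that, a catalog with a single queued request would
earn a 30ms pause instead of the full 3 seconds, so the CREATE TABLE
case above costs tens of milliseconds rather than 15-18 seconds.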
This is about what I thought was happening, but I'm still not quite sure how to recreate it in the lab. Have you had a chance to test with Linux 3.2 does any better in this area? As I understand it, it doesn't do anything particularly interesting about the willingness of the kernel to cache gigantic amounts of dirty data, but (1) supposedly it does a better job not yanking the disk head around by just putting foreground processes to sleep while writes happen in the background, rather than having the foreground processes compete with the background writer for control of the disk head; and (2) instead of having a sharp edge where background writing kicks in, it tries to gradually ratchet up the pressure to get things written out. Somehow I can't shake the feeling that this is fundamentally a Linux problem, and that it's going to be nearly impossible to work around in user space without some help from the kernel. I guess in some sense it's reasonable that calling fsync() blasts the data at the platter at top speed, but if that leads to starving everyone else on the system then it starts to seem a lot less reasonable: part of the kernel's job is to guarantee all processes fair access to shared resources, and if it doesn't do that, we're always going to be playing catch-up. >> Just one random thought: I wonder if it would make sense to cap the >> delay after each sync to the time spending performing that sync. That >> would make the tuning of the delay less sensitive to the total number >> of files, because we won't unnecessarily wait after each sync when >> they're not actually taking any time to complete. > > This is one of the attractive ideas in this area that didn't work out so > well when tested. The problem is that writes into a battery-backed write > cache will show zero latency for some time until the cache is filled...and > then you're done. You have to pause anyway, even though it seems write > speed is massive, to give the cache some time to drain to disk between syncs > that push data toward it. Even though it absorbed your previous write with > no delay, that doesn't mean it takes no time to process that write. With > proper write caching, that processing is just happening asynchronously. Hmm, OK. Well, to borrow a page from one of your other ideas, how about keeping track of the number of fsync requests queued for each file, and make the delay proportional to that number? We might have written the same block more than once, so it could be an overestimate, but it rubs me the wrong way to think that a checkpoint is going to finish late because somebody ran a CREATE TABLE statement that touched 5 or 6 catalogs, and now we've got to pause for 15-18 seconds because they've each got one dirty block. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers