Robert Haas wrote:
What is the basis for thinking that the sync should get the same
amount of time as the writes?  That seems pretty arbitrary.  Right
now, you're allowing 3 seconds per fsync, which could be a lot more or
a lot less than 40% of the total checkpoint time...

Just that it's where I ended up when fighting with this for a month on the system I've seen the most problems on. The 3 second number was reverse engineered from a computation that said "aim for an interval of X minutes; we have Y relations on average involved in the checkpoint". The direction my latest patch is struggling to go is computing a reasonable time automatically in the same way--count the relations, do a time estimate, and add enough delay so the sync calls are spread linearly over the given time range.
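
To give an idea of what I mean, here's a rough, untested sketch of that spreading computation, with made-up names--the real patch has to work against the checkpointer's pending-sync bookkeeping rather than a simple loop like this:

    #include <time.h>

    /*
     * Illustration only, not the actual patch code: spread nrels sync
     * calls evenly across sync_window_secs by sleeping in between, so
     * the last sync lands at the end of the window.
     */
    static void
    sync_spread(int nrels, double sync_window_secs,
                void (*sync_one_relation) (int relno))
    {
        double      delay_secs;
        int         relno;

        delay_secs = (nrels > 1) ? sync_window_secs / (nrels - 1) : 0.0;

        for (relno = 0; relno < nrels; relno++)
        {
            sync_one_relation(relno);   /* does the fsync for one relation */
            if (relno < nrels - 1)
            {
                struct timespec ts;

                ts.tv_sec = (time_t) delay_secs;
                ts.tv_nsec = (long) ((delay_secs - (double) ts.tv_sec) * 1e9);
                nanosleep(&ts, NULL);
            }
        }
    }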


the checkpoint activity is always going to be spikey if it does
anything at all, so spacing it out *more* isn't obviously useful.

One of the components to the write queue is some notion that writes that have been waiting longest should eventually be flushed out. Linux has a tunable called dirty_expire_centisecs which suggests it enforces just that, set to a default of 30 seconds. This is why 5-minute interval checkpoints with default parameters, which effectively spread the checkpoint writes over 2.5 minutes, can work under the current design. Anything you wrote between T+0 and T+2:00 *should* have been written out already by the time you reach T+2:30 and sync. Unfortunately, when the system gets busy, there is "congestion control" logic that basically throws out any guarantee that writes start shortly after the expiration time.
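
For reference, that expiration window is exposed in centiseconds, so the 30 second default shows up like this:

    $ cat /proc/sys/vm/dirty_expire_centisecs
    3000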

It turns out that the only things that really work are the tunables that block new writes from happening once the queue is full, but they can't be set low enough to work well in earlier kernels when combined with lots of RAM. Using the terminology of the kernel docs (http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt), at some point you hit a threshold where "a process generating disk writes will itself start writeback." This is analogous to the PostgreSQL situation where backends do their own fsync calls. The kernel will eventually move to where processes trying to write new data are instead recruited into being additional sources of write flushing. That's the part you just can't make aggressive enough on older kernels; dirty writers can always win. Ideally, the system never digs itself into a hole larger than you can afford to wait to write out. It's a transaction speed vs. latency trade-off though, and the older kernels just don't consider the latency side well enough.
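
To be concrete about which knobs I mean, this is roughly as low as the ratio-based ones can usefully go in sysctl.conf (illustration only; the dirty_background_ratio value is just an example, not something I'm recommending):

    # Percentages of RAM; dirty_ratio is the point where a process doing
    # writes must start doing writeback itself, and 5 is its useful floor
    vm.dirty_ratio = 5
    vm.dirty_background_ratio = 1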

There is a new mechanism in the latest kernels to control this much better: dirty_bytes and dirty_background_bytes are the tunables. I haven't had a chance to test them yet. As mentioned upthread, some of the bleeding edge kernels that have this feature available are showing such large general performance regressions in our tests, compared to the boring old RHEL5 kernel, that whether this feature works or not is irrelevant. I haven't yet tracked down which of the newer kernel distributions work well for PostgreSQL performance-wise and which don't.

I'm hoping that when I get there, I'll see results like http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages , where the ideal setting for dirty_bytes to keep latency under control with a BBWC (battery-backed write cache) was 15MB. To put that into perspective, the lowest useful setting you can give dirty_ratio is 5% of RAM. That's 410MB on my measly 8GB desktop, and 3.2GB on the 64GB production server I've been trying to tune.
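
For comparison, the sysctl.conf entry that report's result translates to would be something like this (just a sketch based on the numbers above, not something I've validated here yet):

    # dirty_bytes overrides dirty_ratio when set; 15MB as in that report
    vm.dirty_bytes = 15728640
    # versus the 5% dirty_ratio floor: ~410MB at 8GB of RAM, ~3.2GB at 64GB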

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


