Robert Haas wrote:
What is the basis for thinking that the sync should get the same
amount of time as the writes?  That seems pretty arbitrary.  Right
now, you're allowing 3 seconds per fsync, which could be a lot more or
a lot less than 40% of the total checkpoint time...

Just that it's where I ended up when fighting with this for a month on the system I've seen the most problems on. The 3 second number was reverse engineered from a computation that said "aim for an interval of X minutes; we have Y relations on average involved in the checkpoint". The direction my latest patch is struggling to go is computing a reasonable time automatically in the same way--count the relations, do a time estimate, and add enough delay so the sync calls are spread linearly over the given time range.
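
To give an idea of what I mean, here's a rough, untested sketch of that spreading computation, with made-up names--the real patch has to work against the checkpointer's pending-sync bookkeeping rather than a simple loop like this:

    #include <time.h>

    /*
     * Illustration only, not the actual patch code: spread nrels sync
     * calls evenly across sync_window_secs by sleeping in between, so
     * the last sync lands at the end of the window.
     */
    static void
    sync_spread(int nrels, double sync_window_secs,
                void (*sync_one_relation) (int relno))
    {
        double      delay_secs;
        int         relno;

        delay_secs = (nrels > 1) ? sync_window_secs / (nrels - 1) : 0.0;

        for (relno = 0; relno < nrels; relno++)
        {
            sync_one_relation(relno);   /* does the fsync for one relation */
            if (relno < nrels - 1)
            {
                struct timespec ts;

                ts.tv_sec = (time_t) delay_secs;
                ts.tv_nsec = (long) ((delay_secs - (double) ts.tv_sec) * 1e9);
                nanosleep(&ts, NULL);
            }
        }
    }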


the checkpoint activity is always going to be spikey if it does
anything at all, so spacing it out *more* isn't obviously useful.

One of the components to the write queue is some notion that writes that have been waiting longest should eventually be flushed out. Linux has a tunable called dirty_expire_centisecs which suggests it enforces just that, set to a default of 30 seconds. This is why 5-minute interval checkpoints with default parameters, which effectively spread the checkpoint writes over 2.5 minutes, can work under the current design. Anything you wrote between T+0 and T+2:00 *should* have been written out already by the time you reach T+2:30 and sync. Unfortunately, when the system gets busy, there is "congestion control" logic that basically throws out any guarantee that writes start shortly after the expiration time.
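
For reference, that expiration window is exposed in centiseconds, so the 30 second default shows up like this:

    $ cat /proc/sys/vm/dirty_expire_centisecs
    3000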

It turns out that the only things that really work are the tunables that block new writes from happening once the queue is full, but they can't be set low enough to work well in earlier kernels when combined with lots of RAM. Using the terminology of the kernel docs (http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt), at some point you hit a threshold where "a process generating disk writes will itself start writeback." This is analogous to the PostgreSQL situation where backends do their own fsync calls. The kernel will eventually move to where processes trying to write new data are instead recruited into being additional sources of write flushing. That's the part you just can't make aggressive enough on older kernels; dirty writers can always win. Ideally, the system never digs itself into a hole larger than you can afford to wait to write out. It's a transaction speed vs. latency trade-off though, and the older kernels just don't consider the latency side well enough.
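
To be concrete about which knobs I mean, this is roughly as low as the ratio-based ones can usefully go in sysctl.conf (illustration only; the dirty_background_ratio value is just an example, not something I'm recommending):

    # Percentages of RAM; dirty_ratio is the point where a process doing
    # writes must start doing writeback itself, and 5 is its useful floor
    vm.dirty_ratio = 5
    vm.dirty_background_ratio = 1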

There is a new mechanism in the latest kernels to control this much better: dirty_bytes and dirty_background_bytes are the tunables. I haven't had a chance to test them yet. As mentioned upthread, some of the bleeding edge kernels that have this feature available are showing such large general performance regressions in our tests, compared to the boring old RHEL5 kernel, that whether this feature works or not is irrelevant. I haven't yet tracked down which of the newer kernel distributions work well for PostgreSQL performance-wise and which don't.

I'm hoping that when I get there, I'll see results like http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages , where the ideal setting for dirty_bytes to keep latency under control with a BBWC (battery-backed write cache) was 15MB. To put that into perspective, the lowest useful setting you can give dirty_ratio is 5% of RAM. That's 410MB on my measly 8GB desktop, and 3.2GB on the 64GB production server I've been trying to tune.
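
For comparison, the sysctl.conf entry that report's result translates to would be something like this (just a sketch based on the numbers above, not something I've validated here yet):

    # dirty_bytes overrides dirty_ratio when set; 15MB as in that report
    vm.dirty_bytes = 15728640
    # versus the 5% dirty_ratio floor: ~410MB at 8GB of RAM, ~3.2GB at 64GB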

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


