On 7/16/13 12:46 PM, Ants Aasma wrote:

> Spread checkpoints sprinkles the writes out over a long period and the
> general tuning advice is to heavily bound the amount of memory the OS is
> willing to keep dirty.

That's arguing that you can make this feature useful if you tune in a particular way. That's interesting, but the goal here isn't to prove the existence of some workload that a change is useful for. You can usually find a test case that validates any performance patch as helpful if you search for one. Everyone who has submitted a sorted checkpoint patch, for example, has found some setup where it shows significant gains. We're trying to keep performance stable across a much wider set of possibilities, though.

Let's talk about default parameters instead, which quickly demonstrates where your assumptions fail. The server I happen to be running pgbench tests on today has 72GB of RAM running SL6 with the RedHat-derived kernel 2.6.32-358.11.1. This is a very popular middle-grade server configuration nowadays. There dirty_background_ratio is 10 (percent), which means that roughly 7GB of RAM can be used for write caching before background writeback even starts. Note that this is a fairly low write cache tuning compared to a survey of systems in the field; lots of people have servers with earlier kernels where these ratios can be as high as 20 or even 40% instead.

The current feasible tuning for shared_buffers suggests a value of 8GB is near the upper limit, beyond which cache-related overhead makes further increases counterproductive. Your examples show 53% of shared_buffers dirty at checkpoint time, which is typical. The checkpointer is then writing out just over 4GB of data.
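
To make the arithmetic concrete, here is a minimal sketch, assuming Linux /proc interfaces and plugging in the numbers from this message (the 8GB shared_buffers and 53% dirty fraction are just the example figures above, not anything the script can discover on its own):

def meminfo_kb(field):
    # Pull a field like "MemTotal" out of /proc/meminfo; values are in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

ram_gb = meminfo_kb("MemTotal") / (1024 * 1024)            # ~72 on this server

with open("/proc/sys/vm/dirty_background_ratio") as f:
    dirty_background_ratio = int(f.read())                  # 10 (percent) here

os_write_cache_gb = ram_gb * dirty_background_ratio / 100   # ~7.2 GB

shared_buffers_gb = 8      # near the practical ceiling discussed above
dirty_fraction = 0.53      # fraction dirty at checkpoint time in the examples

checkpoint_write_gb = shared_buffers_gb * dirty_fraction    # ~4.2 GB

print("OS write cache before background writeback: %.1f GB" % os_write_cache_gb)
print("Data written by one checkpoint:             %.1f GB" % checkpoint_write_gb)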

With that background, which process here has more data to make decisions with?

-The operating system has 7GB of writes it's trying to optimize. That potentially includes backend, background writer, checkpoint, temp table, statistics, log, and WAL data. The scheduler is also considering read operations.

-The checkpointer process has 4GB of writes from rarely written shared memory it's trying to optimize.

This is why, if you take the opposite approach from yours today and go searching for workloads where sorting is counterproductive, those are equally easy to find. Any test of write speed I do starts with about 50 different scale/client combinations. Why do I suggest pgbench-tools as a way to do performance tests? It's because an automated sweep of client setups like the one it does is the minimum necessary to create enough workload variation to evaluate a change to the database's write path. It's really amazing how often doing that shows a proposed change is just shuffling the good and bad cases around. That's been the case for every sorting and fsync delay change submitted so far. I'm not even interested in testing today's submission, because I tried that particular approach for a few months, twice so far, and it fell apart on just as many workloads as it helped.
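
For reference, the kind of sweep pgbench-tools automates looks roughly like the sketch below; the scale and client lists are made-up placeholders here, not the combinations pgbench-tools actually runs, and the database name is assumed:

import subprocess

DB = "pgbench"                    # assumed test database name
SCALES = [100, 500, 1000]         # placeholder scales
CLIENTS = [1, 4, 16, 32, 64]      # placeholder client counts
DURATION = 600                    # seconds per individual test

for scale in SCALES:
    # Rebuild the pgbench tables at this scale.
    subprocess.run(["pgbench", "-i", "-s", str(scale), DB], check=True)
    for clients in CLIENTS:
        # Standard TPC-B-like write test; save the output for later comparison.
        result = subprocess.run(
            ["pgbench", "-c", str(clients), "-j", str(min(clients, 8)),
             "-T", str(DURATION), DB],
            capture_output=True, text=True, check=True)
        with open("results-s%d-c%d.txt" % (scale, clients), "w") as f:
            f.write(result.stdout)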

> The checkpointer has the best long term overview of the situation here, OS
> scheduling only has the short term view of outstanding read and write
> requests.

True only if shared_buffers is large compared to the OS write cache, which was not the case in the example I generated with all of a minute's work. I regularly see servers where Linux's "Dirty" area becomes a multiple of the dirty buffers written by a checkpoint. I can usually make that happen at will with CLUSTER and VACUUM on big tables. The idea that the checkpointer has a long-term view while the OS has only a short one presumes a setup that I would say is possible but not common.
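
If you want to watch that happen, polling the kernel's Dirty counter while a CLUSTER or VACUUM of a large table runs in another session is enough. A minimal sketch, again assuming Linux:

import time

def dirty_mb():
    # Read the Dirty line from /proc/meminfo; the kernel reports it in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1]) / 1024.0
    return 0.0

# Run CLUSTER or VACUUM on a big table from another session and watch the
# OS dirty data climb well past what one checkpoint writes; Ctrl-C to stop.
while True:
    print("Dirty: %.0f MB" % dirty_mb())
    time.sleep(5)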

> kernel settings: dirty_background_bytes = 32M, dirty_bytes = 128M.

You disclaimed this as a best-case scenario, and it is a low throughput / low latency tuning. That's fine, but if Postgres optimizes itself toward those cases, it runs the risk of detuning high throughput servers with large caches. I've posted examples before showing very low write caches like this leading to VACUUM running at half its normal speed or worse, as a simple example of where a positive change in one area can backfire badly on another workload. That particular problem was so common that I recently updated pgbench-tools to track table maintenance time between tests, because that demonstrated an issue even when the TPS numbers all looked fine.
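
Just to put those two tunings side by side, a quick back-of-the-envelope sketch using the 72GB example server above:

# Compare the quoted byte-based tuning against the ratio-based setting on
# the 72GB example server described earlier in this message.
ratio_threshold_mb = 72 * 1024 * 10 / 100   # dirty_background_ratio = 10% -> ~7373 MB
bytes_threshold_mb = 32                     # dirty_background_bytes = 32M
print("background writeback threshold shrinks by roughly %dx"
      % (ratio_threshold_mb / bytes_threshold_mb))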

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


