On 01/16/2012 11:00 AM, Robert Haas wrote:
Also, I am still struggling with what the right benchmarking methodology even is to judge whether
any patch in this area "works".  Can you provide more details about
your test setup?

The "test" setup is a production server with a few hundred users at peak workload, reading and writing to the database. Each RAID controller (couple of them with their own tablespaces) has either 512MG or 1GB of battery-backed write cache. The setup that leads to the bad situation happens like this:

-The steady stream of backend writes that happen between checkpoints has filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB of "Dirty:"

-Since we have shared_buffers set to 512MB to try to keep checkpoint storms from being too bad, there might be 300MB of dirty pages involved in the checkpoint. The write phase dumps all of this into Linux's cache, so there's now closer to 3GB of dirty data there. With 64GB of RAM, though, that's still only 4.7%--just below the effective lower range for dirty_background_ratio. Linux is perfectly content to let it all sit there.

-Sync phase begins. Between absorption and the new checkpoint writes, there are >300 segments to sync waiting here.

-The first few syncs force data out of Linux's cache and into the BBWC. Some of these return almost instantly. Others block for a moderate number of seconds. That's not necessarily a showstopper, on XFS at least. So long as the checkpointer is not being given all of the I/O in the system, the fact that it's stuck waiting for a sync doesn't mean the server is unresponsive to the needs of other backends. Early data might look like this:

DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec

[Here 'gap' measures how closely the sync pause feature is hitting its target, which was set to 3 seconds. This output is from an earlier version of this patch. The timing issues I used to measure went away in the current implementation because, with the checkpointer split out, it no longer has to worry about doing background writer LRU work.]
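
To make the pattern those numbers illustrate more concrete, here is a minimal standalone sketch of fsync-then-pause timing. It is written from scratch for illustration and is not code from the patch; the file list, the 3 second pause, and the output format are all placeholders.

/*
 * Sketch only: fsync each file named on the command line, time the call,
 * then pause for a fixed interval so downstream caches get a chance to
 * drain before the next sync.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double
elapsed_msec(struct timeval start, struct timeval stop)
{
    return (stop.tv_sec - start.tv_sec) * 1000.0 +
        (stop.tv_usec - start.tv_usec) / 1000.0;
}

int
main(int argc, char **argv)
{
    unsigned int sync_pause_sec = 3;    /* stand-in for checkpoint_sync_pause */

    for (int i = 1; i < argc; i++)
    {
        struct timeval start, stop;
        int fd = open(argv[i], O_RDONLY);

        if (fd < 0)
            continue;

        gettimeofday(&start, NULL);
        fsync(fd);      /* may return almost instantly while a BBWC absorbs it,
                         * or block for many seconds once caches fill up */
        gettimeofday(&stop, NULL);
        close(fd);

        printf("Sync #%d time=%f msec\n", i, elapsed_msec(start, stop));

        sleep(sync_pause_sec);  /* the pause measured as 'gap' above */
    }
    return 0;
}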

But after a few hundred of these syncs, every downstream cache is filled up. The result is some really ugly sync times, like #164 here:

DEBUG:  Sync #160 time=1147.386000 gap=2801.047000 msec
DEBUG:  Sync #161 time=0.004000 gap=4075.115000 msec
DEBUG:  Sync #162 time=0.005000 gap=2943.966000 msec
DEBUG:  Sync #163 time=962.769000 gap=3003.906000 msec
DEBUG:  Sync #164 time=45125.991000 gap=3033.228000 msec
DEBUG:  Sync #165 time=4.031000 gap=2818.013000 msec
DEBUG:  Sync #166 time=212.537000 gap=3039.979000 msec
DEBUG:  Sync #167 time=0.005000 gap=2820.023000 msec
...
DEBUG:  Sync #355 time=2.550000 gap=2806.425000 msec
LOG:  Sync 355 files longest=45125.991000 msec average=1276.177977 msec

During the 45 second window while #164 is happening, a pile of clients get stuck, unable to do any I/O. The RAID controller that used to have a useful mix of data is now completely filled with >=512MB of random writes. It's now failing to write as fast as new data is coming in. Eventually that leads to pressure building up in Linux's cache. Now you're in the bad place: dirty_background_ratio is crossed, Linux is now worried about spooling all cached writes to disk as fast as it can, the checkpointer is sync'ing its own important data to disk as fast as it can too, and all caches are inefficient because they're full.

To recreate a scenario like this, I've realized the benchmark needs to have a couple of characteristics:

-It has to focus on transaction latency instead of throughput. We know that doing syncs more often will lower throughput due to reduced reordering etc.

-It cannot run at maximum possible speed all the time. The system needs to keep up with the load most of the time, with the sync phase of checkpoints causing I/O to queue faster than it drains, saturating all the caches and then blocking backends. Ideally, "Dirty:" in /proc/meminfo will reach >90% of the dirty_background_ratio trigger line around the same time the sync phase starts; see the monitoring sketch after this list.

-There should be a lot of clients doing a mix of work. The way Linux I/O works, the scheduling for readers vs. writers is complicated, and this is one of the few areas where things like CFQ vs. Deadline matter.
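
For watching that /proc/meminfo condition during a run, a throwaway monitor along these lines is enough. This is my own sketch, assuming the standard Linux "Dirty:" and "MemTotal:" fields; compare its output against whatever vm.dirty_background_ratio is set to on the test system.

/*
 * Print "Dirty:" from /proc/meminfo as a percentage of MemTotal once per
 * second, to watch how close a benchmark run gets to the background
 * writeback trigger.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    for (;;)
    {
        FILE   *f = fopen("/proc/meminfo", "r");
        char    line[128];
        long    mem_total_kb = 0;
        long    dirty_kb = 0;

        if (f == NULL)
            return 1;

        while (fgets(line, sizeof(line), f) != NULL)
        {
            if (strncmp(line, "MemTotal:", 9) == 0)
                sscanf(line + 9, "%ld", &mem_total_kb);
            else if (strncmp(line, "Dirty:", 6) == 0)
                sscanf(line + 6, "%ld", &dirty_kb);
        }
        fclose(f);

        if (mem_total_kb > 0)
            printf("Dirty: %ld kB = %.2f%% of RAM\n",
                   dirty_kb, 100.0 * dirty_kb / mem_total_kb);

        sleep(1);
    }
}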

I've now realized that one reason I never got anywhere with this while running pgbench tests is that pgbench always runs at 100% of capacity. It fills all the caches involved completely, as fast as it can, and every checkpoint starts with them already filled to capacity. So when latency gets bad at checkpoint time, no amount of clever reordering will help keep those writes from interfering with other processes. There just isn't any room left to work with.

What I think is needed instead is a write-heavy benchmark with a think time in it, so that we can dial the workload up to, say, 90% of I/O capacity, but have it spike to 100% when checkpoint sync happens. Then rearrangements in syncing that reduce caching pressure should be visible as a latency reduction in client response times. My guess is that dbt-2 can be configured to provide such a workload, and I don't see a way forward here except to fully embrace that and start over with it.
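
To illustrate the sort of throttling I mean with made-up numbers (plain duty-cycle arithmetic, nothing dbt-2 specific):

/*
 * If a transaction keeps the I/O subsystem busy for work_msec, holding a
 * client at target_util requires a think time of
 * work_msec * (1/target_util - 1) between transactions.  Example: 9 msec
 * of work at 90% utilization needs a 1 msec think time.
 */
#include <stdio.h>

int
main(void)
{
    double work_msec = 9.0;     /* made-up average transaction I/O time */
    double target_util = 0.90;  /* run the disks at ~90% of capacity */
    double think_msec = work_msec * (1.0 / target_util - 1.0);

    printf("think time per transaction: %.2f msec\n", think_msec);  /* 1.00 */
    return 0;
}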

Just one random thought: I wonder if it would make sense to cap the
delay after each sync to the time spent performing that sync.  That
would make the tuning of the delay less sensitive to the total number
of files, because we won't unnecessarily wait after each sync when
they're not actually taking any time to complete.

This is one of the attractive ideas in this area that didn't work out so well when tested. The problem is that writes into a battery-backed write cache will show zero latency for some time until the cache is filled...and then you're done. You have to pause anyway, even though it seems write speed is massive, to give the cache some time to drain to disk between syncs that push data toward it. Even though it absorbed your previous write with no delay, that doesn't mean it takes no time to process that write. With proper write caching, that processing is just happening asynchronously.
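
Here is a sketch of that capping idea with made-up names, fed the sync times from the log above, just to show where it goes wrong; this is not code from any patch.

#include <stdio.h>

/* Proposed rule: never pause longer than the sync itself took. */
static double
capped_pause_msec(double configured_pause_msec, double observed_sync_msec)
{
    return (observed_sync_msec < configured_pause_msec) ?
        observed_sync_msec : configured_pause_msec;
}

int
main(void)
{
    double configured = 3000.0; /* a 3 second sync pause */
    double observed[] = {0.004, 21.969, 45125.991}; /* sync times from the log */

    for (int i = 0; i < 3; i++)
        printf("sync took %.3f msec -> pause %.3f msec\n",
               observed[i], capped_pause_msec(configured, observed[i]));

    /*
     * The 0.004 msec case is the problem: the BBWC absorbed the write with
     * no visible latency, so the capped pause is ~0 and the next sync fires
     * immediately, even though that data still has to be destaged to disk
     * asynchronously.
     */
    return 0;
}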

This is related to another observation about what went wrong when we tried deploying my fully auto-tuning sync spread patch onto production. If the sync phase of the checkpoint starts to fall behind, and you've configured for a sync pause, you have to just suck that up and accept you'll finish late[1]. When you do get into the situation where the cache is completely filled, writes will slow dramatically. In the above log example, sync #164 taking 45 seconds means that #165 will surely be considered behind schedule. If you use that feedback to reduce the sync pause, on the theory that you're behind schedule and cannot afford to pause anymore, you've degenerated right back to the original troubled behavior: sync calls issued as fast as the OS will accept them, with no delay between them.

[1] Where I think I'm going to end up with this eventually is that checkpoint_sync_pause becomes the important tunable. The parameter that then gets auto-tuned is checkpoint_timeout. If you have 300 relations to sync and you have to wait 10 seconds between syncs to get latency down, the server is going to inform you that an hour between checkpoints is all you can do here.
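
The arithmetic behind that last sentence, worked through with the numbers above:

#include <stdio.h>

/*
 * 300 files to sync with a 10 second pause between syncs puts a floor of
 * about 50 minutes under the sync phase alone, before any write-phase
 * spreading -- hence checkpoints roughly an hour apart.
 */
int
main(void)
{
    int     files_to_sync = 300;
    double  sync_pause_sec = 10.0;
    double  floor_sec = files_to_sync * sync_pause_sec;

    printf("sync phase floor: %.0f seconds (%.0f minutes)\n",
           floor_sec, floor_sec / 60.0);    /* 3000 seconds = 50 minutes */
    return 0;
}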

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

