Re: [HACKERS] Checkpoint sync pause
On Sun, Feb 12, 2012 at 10:49 PM, Amit Kapila amit.kap...@huawei.com wrote: Without sorted checkpoints (or some other fancier method) you have to write out the entire pool before you can do any fsyncs. Or you have to do multiple fsyncs of the same file, with at least one occurring after the entire pool was written. With a sorted checkpoint, you can start issuing once-only fsyncs very early in the checkpoint process. I think that on large servers, that would be the main benefit, not the actually more efficient IO. (On small servers I've seen sorted checkpoints be much faster on shutdown checkpoints, but not on natural checkpoints, and presumably this improvement *is* due to better ordering). On your servers, you need big delays between fsyncs and not between writes (as they are buffered until the fsync). But in other situations, people need the delays between the writes. By using sorted checkpoints with fsyncs between each file, the delays between writes are naturally delays between fsyncs as well. So I think the benefit of using sorted checkpoints is that code to improve your situations is less likely to degrade someone else's situation, without having to introduce an extra layer of tunables. What I understood is that you are suggesting, it is better to do sorted checkpoints which essentially means flush nearby buffers together. More importantly, you can issue an fsync after all pages for any given file are written, thus naturally spreading out the fsyncs instead of reserving them to until the end, or some arbitrary fraction of the checkpoint cycle. For this purpose, the buffers only need to be sorted by physical file they are in, not by block order within the file. However if does this way, might be it will violate Oracle Patent (20050044311 - Reducing disk IO by full-cache write-merging). I am not very sure about it. But you can refer it once. Thank you. I was not aware of it, and am constantly astonished what kinds of things are patentable. 
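The ordering argument above (group buffers by physical file, then fsync each file once as soon as its last page is written) can be sketched as follows. This is only an illustration in Python, not PostgreSQL code: `DirtyBuffer` and `plan_sorted_checkpoint` are hypothetical names, and the operations are returned as a list instead of being issued against real files.

```python
from collections import defaultdict
from typing import NamedTuple


class DirtyBuffer(NamedTuple):
    file: str   # physical file the page belongs to
    block: int  # block number within that file


def plan_sorted_checkpoint(dirty):
    """Return the operations a file-sorted checkpoint would issue.

    Buffers are grouped by file only; block order within a file is left
    as found. Each file is fsynced exactly once, right after its last
    write, so fsyncs are spread through the checkpoint instead of being
    saved for the end.
    """
    by_file = defaultdict(list)
    for buf in dirty:
        by_file[buf.file].append(buf)

    ops = []
    for file in sorted(by_file):
        for buf in by_file[file]:
            ops.append(("write", file, buf.block))
        ops.append(("fsync", file, None))
    return ops


# Illustrative input; the pool order is deliberately interleaved.
dirty = [
    DirtyBuffer("base/16385/11766", 7),
    DirtyBuffer("global/12007", 1),
    DirtyBuffer("base/16385/11766", 2),
]
for op in plan_sorted_checkpoint(dirty):
    print(op)
```

Without the grouping step, the first fsync could not safely be issued until every buffer in the pool had been written, which is the point being made about unsorted checkpoints.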
>> I think the linked list is a bit of a red herring. Many of the concepts people discuss implementing on the linked list could just as easily be implemented with the clock sweep. And I've seen no evidence at all that the clock sweep is the problem. The LWLock that protects it can obviously be a problem, but that seems to be due to the overhead of acquiring a contended lock, not the work done under the lock. Reducing the lock-strength around this might be a good idea, but that reduction could be done just as easily (and as far as I can tell, more easily) with the clock sweep than the linked list.

> with clock-sweep, there are many chances that backend needs to traverse more to find a suitable buffer.

Maybe, but I have not seen any evidence that this is the case. My analyses, experiments, and simulations show that when the buffer allocations are high, the mere act of running the sweep that often keeps average usagecount low, so the average sweep is very short.

> However, if clean buffer is put in freelist, it can be directly picked from there.

Not directly, you have to take a lock.

Cheers,

Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checkpoint sync pause
Without sorted checkpoints (or some other fancier method) you have to write out the entire pool before you can do any fsyncs. Or you have to do multiple fsyncs of the same file, with at least one occurring after the entire pool was written. With a sorted checkpoint, you can start issuing once-only fsyncs very early in the checkpoint process. I think that on large servers, that would be the main benefit, not the actually more efficient IO. (On small servers I've seen sorted checkpoints be much faster on shutdown checkpoints, but not on natural checkpoints, and presumably this improvement *is* due to better ordering). On your servers, you need big delays between fsyncs and not between writes (as they are buffered until the fsync). But in other situations, people need the delays between the writes. By using sorted checkpoints with fsyncs between each file, the delays between writes are naturally delays between fsyncs as well. So I think the benefit of using sorted checkpoints is that code to improve your situations is less likely to degrade someone else's situation, without having to introduce an extra layer of tunables. What I understood is that you are suggesting, it is better to do sorted checkpoints which essentially means flush nearby buffers together. However if does this way, might be it will violate Oracle Patent (20050044311 - Reducing disk IO by full-cache write-merging). I am not very sure about it. But you can refer it once. I think the linked list is a bit of a red herring. Many of the concepts people discuss implementing on the linked list could just as easily be implemented with the clock sweep. And I've seen no evidence at all that the clock sweep is the problem. The LWLock that protects can obviously be a problem, but that seems to be due to the overhead of acquiring a contended lock, not the work done under the lock. 
Reducing the lock-strength around this might be a good idea, but that reduction could be done just as easily (and as far as I can tell, more easily) with the clock sweep than the linked list. with clock-sweep, there are many chances that backend needs to traverse more to find a suitable buffer. However, if clean buffer is put in freelist, it can be directly picked from there. -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Jeff Janes Sent: Monday, February 13, 2012 12:14 AM To: Greg Smith Cc: Robert Haas; PostgreSQL-development Subject: Re: [HACKERS] Checkpoint sync pause On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith gsm...@gregsmith.com wrote: On 02/03/2012 11:41 PM, Jeff Janes wrote: -The steady stream of backend writes that happen between checkpoints have filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB Dirty: backend writes includes bgwriter writes, right? Right. Has using a newer kernal with dirty_background_bytes been tried, so it could be set to a lower level? If so, how did it do? Or does it just refuse to obey below the 5% level, as well? Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster than it improves checkpoint latency. Does it cause VACUUM to create latency for other processes (like the checkpoint syncs do, by gumming up the IO for everyone) or does VACUUM just slow down without effecting other tasks? It seems to me that just lowering dirty_background_bytes (while not also lowering dirty_bytes) should not cause the latter to happen, but it seems like these kernel tunables never do exactly what they advertise. This may not be relevant to the current situation, but I wonder if we don't need a vacuum_cost_page_dirty_seq so that if the pages we are dirtying are consecutive (or at least closely spaced) they cost less, in anticipation that the eventual writes will be combined and thus consume less IO resources. 
I would think it would be common for some regions of table to be intensely dirtied, and some to be lightly dirtied (but still aggregating up to a considerable amount of random IO). But the vacuum process might also need to be made more bursty, as even if it generates sequential dirty pages the IO system might write them randomly anyway if there are too many delays interspersed Since the sort of servers that have checkpoint issues are quite often ones that have VACUUM ones, too, that whole path doesn't seem very productive. The one test I haven't tried yet is whether increasing the size of the VACUUM ring buffer might improve how well the server responds to a lower write cache. I wouldn't expect this to help. It seems like it would hurt, as it just leaves the data for even longer (however long it takes to circumnavigate the ring buffer) before there is any possibility of it getting written. I guess it does increase the chances that the dirty pages will accidentally get written by the bgwriter rather than the vacuum itself, but I doubt that that would
Re: [HACKERS] Checkpoint sync pause
On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith gsm...@gregsmith.com wrote: On 02/03/2012 11:41 PM, Jeff Janes wrote: -The steady stream of backend writes that happen between checkpoints have filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB Dirty: backend writes includes bgwriter writes, right? Right. Has using a newer kernal with dirty_background_bytes been tried, so it could be set to a lower level? If so, how did it do? Or does it just refuse to obey below the 5% level, as well? Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster than it improves checkpoint latency. Does it cause VACUUM to create latency for other processes (like the checkpoint syncs do, by gumming up the IO for everyone) or does VACUUM just slow down without effecting other tasks? It seems to me that just lowering dirty_background_bytes (while not also lowering dirty_bytes) should not cause the latter to happen, but it seems like these kernel tunables never do exactly what they advertise. This may not be relevant to the current situation, but I wonder if we don't need a vacuum_cost_page_dirty_seq so that if the pages we are dirtying are consecutive (or at least closely spaced) they cost less, in anticipation that the eventual writes will be combined and thus consume less IO resources. I would think it would be common for some regions of table to be intensely dirtied, and some to be lightly dirtied (but still aggregating up to a considerable amount of random IO). But the vacuum process might also need to be made more bursty, as even if it generates sequential dirty pages the IO system might write them randomly anyway if there are too many delays interspersed Since the sort of servers that have checkpoint issues are quite often ones that have VACUUM ones, too, that whole path doesn't seem very productive. 
The one test I haven't tried yet is whether increasing the size of the VACUUM ring buffer might improve how well the server responds to a lower write cache. I wouldn't expect this to help. It seems like it would hurt, as it just leaves the data for even longer (however long it takes to circumnavigate the ring buffer) before there is any possibility of it getting written. I guess it does increase the chances that the dirty pages will accidentally get written by the bgwriter rather than the vacuum itself, but I doubt that that would be significant. ... Was the sorted checkpoint with an fsync after every file (real file, not VFD) one of the changes you tried? ... I haven't had very good luck with sorting checkpoints at the PostgreSQL relation level on server-size systems. There is a lot of sorting already happening at both the OS (~3GB) and BBWC (=512MB) levels on this server. My own tests on my smaller test server--with a scaled down OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting as a useful technique on top of that. It's never bubbled up to being considered a likely win on the production one as a result. Without sorted checkpoints (or some other fancier method) you have to write out the entire pool before you can do any fsyncs. Or you have to do multiple fsyncs of the same file, with at least one occurring after the entire pool was written. With a sorted checkpoint, you can start issuing once-only fsyncs very early in the checkpoint process. I think that on large servers, that would be the main benefit, not the actually more efficient IO. (On small servers I've seen sorted checkpoints be much faster on shutdown checkpoints, but not on natural checkpoints, and presumably this improvement *is* due to better ordering). On your servers, you need big delays between fsyncs and not between writes (as they are buffered until the fsync). But in other situations, people need the delays between the writes. 
By using sorted checkpoints with fsyncs between each file, the delays between writes are naturally delays between fsyncs as well. So I think the benefit of using sorted checkpoints is that code to improve your situations is less likely to degrade someone else's situation, without having to introduce an extra layer of tunables. What I/O are they trying to do? It seems like all your data is in RAM (if not, I'm surprised you can get queries to ran fast enough to create this much dirty data). So they probably aren't blocking on reads which are being interfered with by all the attempted writes. Reads on infrequently read data. Long tail again; even though caching is close to 100%, the occasional outlier client who wants some rarely accessed page with their personal data on it shows up. Pollute the write caches badly enough, and what happens to reads mixed into there gets very fuzzy. Depends on the exact mechanics of the I/O scheduler used in the kernel version deployed. OK, but I would still think it is a minority of transactions which need at least one of those infrequently read data and most
Re: [HACKERS] Checkpoint sync pause
On 02/03/2012 11:41 PM, Jeff Janes wrote:
>> -The steady stream of backend writes that happen between checkpoints have filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB "Dirty:"
>
> backend writes includes bgwriter writes, right?

Right.

> Has using a newer kernel with dirty_background_bytes been tried, so it could be set to a lower level? If so, how did it do? Or does it just refuse to obey below the 5% level, as well?

Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster than it improves checkpoint latency. Since the sort of servers that have checkpoint issues are quite often ones that have VACUUM ones, too, that whole path doesn't seem very productive. The one test I haven't tried yet is whether increasing the size of the VACUUM ring buffer might improve how well the server responds to a lower write cache.

> If there is 3GB of dirty data spread over 300 segments, and each segment is about full-sized (1GB), then on average 1% of each segment is dirty?
>
> If that analysis holds, then it seems like there is simply an awful lot of data that has to be written randomly, no matter how clever the re-ordering is. In other words, it is not that a harried or panicked kernel or RAID controller is failing to do good re-ordering when it has opportunities to, it is just that you dirty your data too randomly for substantial reordering to be possible even under ideal conditions.

Averages are deceptive here. This data follows the usual distribution for real-world data, which is that there is a hot chunk of data that receives far more writes than average (particularly index blocks), along with a long tail of segments that are only seeing one or two 8K blocks modified (catalog data, stats, application metadata).

Plenty of useful reordering happens here. It's happening in Linux's cache and in the controller's cache. The constant stream of checkpoint syncs doesn't stop that.

It does seem to do two bad things though: a) makes some of these bad cache filled situations more likely, and b) doesn't leave any I/O capacity unused for clients to get some work done. One of the real possibilities I've been considering more lately is that the value we've seen of the pauses during sync isn't so much about optimizing I/O, that instead it's from allowing a brief window of client backend I/O to slip in there between the cache-filling checkpoint syncs.

> Does the BBWC, once given an fsync command and reporting success, write out those blocks forthwith, or does it lolly-gag around like the kernel (under non-fsync) does? If it is waiting around for write-combining opportunities that will never actually materialize in sufficient quantities to make up for the wait, how to get it to stop?
>
> Was the sorted checkpoint with an fsync after every file (real file, not VFD) one of the changes you tried?

As far as I know the typical BBWC is always working. When a read or a write comes in, it starts moving immediately. When it gets behind, it starts making seek decisions more intelligently based on visibility of the whole queue. But they don't delay doing any work at all the way Linux does.

I haven't had very good luck with sorting checkpoints at the PostgreSQL relation level on server-size systems. There is a lot of sorting already happening at both the OS (~3GB) and BBWC (>=512MB) levels on this server. My own tests on my smaller test server--with a scaled down OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting as a useful technique on top of that. It's never bubbled up to being considered a likely win on the production one as a result.
DEBUG: Sync #1 time=21.969000 gap=0.00 msec DEBUG: Sync #2 time=40.378000 gap=0.00 msec DEBUG: Sync #3 time=12574.224000 gap=3007.614000 msec DEBUG: Sync #4 time=91.385000 gap=2433.719000 msec DEBUG: Sync #5 time=2119.122000 gap=2836.741000 msec DEBUG: Sync #6 time=67.134000 gap=2840.791000 msec DEBUG: Sync #7 time=62.005000 gap=3004.823000 msec DEBUG: Sync #8 time=0.004000 gap=2818.031000 msec DEBUG: Sync #9 time=0.006000 gap=3012.026000 msec DEBUG: Sync #10 time=302.75 gap=3003.958000 msec Syncs 3 and 5 kind of surprise me. It seems like the times should be more bimodal. If the cache is already full, why doesn't the system promptly collapse, like it does later? And if it is not, why would it take 12 seconds, or even 2 seconds? Or is this just evidence that the gaps you are inserting are partially, but not completely, effective? Given a mix of completely random I/O, a 24 disk array like this system has is lucky to hit 20MB/s clearing it out. It doesn't take too much of that before even 12 seconds makes sense. The 45 second pauses are the ones demonstrating the controller's cached is completely overwhelmed. It's rare to see caching turn truly bimodal, because the model for it has both a variable ingress and egress rate involved. Even as the checkpoint sync is pushing stuff
Re: [HACKERS] Checkpoint sync pause
On Mon, Jan 16, 2012 at 5:59 PM, Greg Smith g...@2ndquadrant.com wrote:
> On 01/16/2012 11:00 AM, Robert Haas wrote:
>> Also, I am still struggling with what the right benchmarking methodology even is to judge whether any patch in this area works. Can you provide more details about your test setup?
>
> The test setup is a production server with a few hundred users at peak workload, reading and writing to the database. Each RAID controller (couple of them with their own tablespaces) has either 512MB or 1GB of battery-backed write cache. The setup that leads to the bad situation happens like this:
>
> -The steady stream of backend writes that happen between checkpoints have filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB "Dirty:"

backend writes includes bgwriter writes, right?

> -Since we have shared_buffers set to 512MB to try and keep checkpoint storms from being too bad, there might be 300MB of dirty pages involved in the checkpoint. The write phase dumps this all into Linux's cache. There's now closer to 3GB of dirty data there. @64GB of RAM, this is still only 4.7% though--just below the effective lower range for dirty_background_ratio.

Has using a newer kernel with dirty_background_bytes been tried, so it could be set to a lower level? If so, how did it do? Or does it just refuse to obey below the 5% level, as well?

> Linux is perfectly content to let it all sit there.
>
> -Sync phase begins. Between absorption and the new checkpoint writes, there are 300 segments to sync waiting here.

If there is 3GB of dirty data spread over 300 segments, and each segment is about full-sized (1GB), then on average 1% of each segment is dirty?

If that analysis holds, then it seems like there is simply an awful lot of data that has to be written randomly, no matter how clever the re-ordering is.
In other words, it is not that a harried or panicked kernel or RAID control is failing to do good re-ordering when it has opportunities to, it is just that you dirty your data too randomly for substantial reordering to be possible even under ideal conditions. Does the BBWC, once given an fsync command and reporting success, write out those block forthwith, or does it lolly-gag around like the kernel (under non-fsync) does? If it is waiting around for write-combing opportunities that will never actually materialize in sufficient quantities to make up for the wait, how to get it to stop? Was the sorted checkpoint with an fsync after every file (real file, not VFD) one of the changes you tried? -The first few syncs force data out of Linux's cache and into the BBWC. Some of these return almost instantly. Others block for a moderate number of seconds. That's not necessarily a showstopper, on XFS at least. So long as the checkpointer is not being given all of the I/O in the system, the fact that it's stuck waiting for a sync doesn't mean the server is unresponsive to the needs of other backends. Early data might look like this: DEBUG: Sync #1 time=21.969000 gap=0.00 msec DEBUG: Sync #2 time=40.378000 gap=0.00 msec DEBUG: Sync #3 time=12574.224000 gap=3007.614000 msec DEBUG: Sync #4 time=91.385000 gap=2433.719000 msec DEBUG: Sync #5 time=2119.122000 gap=2836.741000 msec DEBUG: Sync #6 time=67.134000 gap=2840.791000 msec DEBUG: Sync #7 time=62.005000 gap=3004.823000 msec DEBUG: Sync #8 time=0.004000 gap=2818.031000 msec DEBUG: Sync #9 time=0.006000 gap=3012.026000 msec DEBUG: Sync #10 time=302.75 gap=3003.958000 msec Syncs 3 and 5 kind of surprise me. It seems like the times should be more bimodal. If the cache is already full, why doesn't the system promptly collapse, like it does later? And if it is not, why would it take 12 seconds, or even 2 seconds? Or is this just evidence that the gaps you are inserting are partially, but not completely, effective? 
[Here 'gap' is a precise measurement of how close the sync pause feature is working, with it set to 3 seconds. This is from an earlier version of this patch. All the timing issues I used to measure went away in the current implementation because it doesn't have to worry about doing background writer LRU work anymore, with the checkpointer split out] But after a few hundred of these, every downstream cache is filled up. The result is seeing more really ugly sync times, like #164 here: DEBUG: Sync #160 time=1147.386000 gap=2801.047000 msec DEBUG: Sync #161 time=0.004000 gap=4075.115000 msec DEBUG: Sync #162 time=0.005000 gap=2943.966000 msec DEBUG: Sync #163 time=962.769000 gap=3003.906000 msec DEBUG: Sync #164 time=45125.991000 gap=3033.228000 msec DEBUG: Sync #165 time=4.031000 gap=2818.013000 msec DEBUG: Sync #166 time=212.537000 gap=3039.979000 msec DEBUG: Sync #167 time=0.005000 gap=2820.023000 msec ... DEBUG: Sync #355 time=2.55 gap=2806.425000 msec LOG: Sync 355 files longest=45125.991000 msec average=1276.177977 msec At the same time #164 is
Re: [HACKERS] Checkpoint sync pause
On Mon, Jan 16, 2012 at 8:59 PM, Greg Smith g...@2ndquadrant.com wrote: [ interesting description of problem scenario and necessary conditions for reproducing it ] This is about what I thought was happening, but I'm still not quite sure how to recreate it in the lab. Have you had a chance to test with Linux 3.2 does any better in this area? As I understand it, it doesn't do anything particularly interesting about the willingness of the kernel to cache gigantic amounts of dirty data, but (1) supposedly it does a better job not yanking the disk head around by just putting foreground processes to sleep while writes happen in the background, rather than having the foreground processes compete with the background writer for control of the disk head; and (2) instead of having a sharp edge where background writing kicks in, it tries to gradually ratchet up the pressure to get things written out. Somehow I can't shake the feeling that this is fundamentally a Linux problem, and that it's going to be nearly impossible to work around in user space without some help from the kernel. I guess in some sense it's reasonable that calling fsync() blasts the data at the platter at top speed, but if that leads to starving everyone else on the system then it starts to seem a lot less reasonable: part of the kernel's job is to guarantee all processes fair access to shared resources, and if it doesn't do that, we're always going to be playing catch-up. Just one random thought: I wonder if it would make sense to cap the delay after each sync to the time spending performing that sync. That would make the tuning of the delay less sensitive to the total number of files, because we won't unnecessarily wait after each sync when they're not actually taking any time to complete. This is one of the attractive ideas in this area that didn't work out so well when tested. 
> The problem is that writes into a battery-backed write cache will show zero latency for some time until the cache is filled...and then you're done. You have to pause anyway, even though it seems write speed is massive, to give the cache some time to drain to disk between syncs that push data toward it. Even though it absorbed your previous write with no delay, that doesn't mean it takes no time to process that write. With proper write caching, that processing is just happening asynchronously.

Hmm, OK. Well, to borrow a page from one of your other ideas, how about keeping track of the number of fsync requests queued for each file, and making the delay proportional to that number? We might have written the same block more than once, so it could be an overestimate, but it rubs me the wrong way to think that a checkpoint is going to finish late because somebody ran a CREATE TABLE statement that touched 5 or 6 catalogs, and now we've got to pause for 15-18 seconds because they've each got one dirty block. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
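A rough sketch of the "delay proportional to queued fsync requests" idea, in Python. Nothing here comes from the patch: the function name, the per-request constant, and the upper cap (set to the 3-second pause used elsewhere in this thread) are all assumptions made for illustration.

```python
def sync_pause_seconds(queued_requests, seconds_per_request=0.01, max_pause=3.0):
    """Pause after syncing one file, scaled by how many fsync requests
    were queued for it.

    seconds_per_request and max_pause are hypothetical tuning values;
    the count may overestimate dirty data if a block was written twice.
    """
    return min(max_pause, seconds_per_request * queued_requests)


# Six catalogs with one dirty block each, versus one heavily written segment.
catalog_total = sum(sync_pause_seconds(1) for _ in range(6))
hot_segment = sync_pause_seconds(5000)

print(round(catalog_total, 3))  # about 0.06 s instead of 6 * 3 s
print(hot_segment)              # capped at 3.0 s
```

With a fixed 3-second pause the six catalog files alone would cost 18 seconds, which is the case the paragraph above complains about.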
Re: [HACKERS] Checkpoint sync pause
On Mon, Jan 16, 2012 at 2:57 AM, Greg Smith g...@2ndquadrant.com wrote: ... 2012-01-16 02:39:01.184 EST [25052]: DEBUG: checkpoint sync: number=34 file=base/16385/11766 time=0.006 msec 2012-01-16 02:39:01.184 EST [25052]: DEBUG: checkpoint sync delay: seconds left=3 2012-01-16 02:39:01.284 EST [25052]: DEBUG: checkpoint sync delay: seconds left=2 2012-01-16 02:39:01.385 EST [25052]: DEBUG: checkpoint sync delay: seconds left=1 2012-01-16 02:39:01.860 EST [25052]: DEBUG: checkpoint sync: number=35 file=global/12007 time=375.710 msec 2012-01-16 02:39:01.860 EST [25052]: DEBUG: checkpoint sync delay: seconds left=3 2012-01-16 02:39:01.961 EST [25052]: DEBUG: checkpoint sync delay: seconds left=2 2012-01-16 02:39:02.061 EST [25052]: DEBUG: checkpoint sync delay: seconds left=1 2012-01-16 02:39:02.161 EST [25052]: DEBUG: checkpoint sync: number=36 file=base/16385/11754 time=0.008 msec 2012-01-16 02:39:02.555 EST [25052]: LOG: checkpoint complete: wrote 2586 buffers (63.1%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=2.422 s, sync=13.282 s, total=16.123 s; sync files=36, longest=1.085 s, average=0.040 s No docs yet, really need a better guide to tuning checkpoints as they exist now before there's a place to attach a discussion of this to. Yeah, I think this is an area where a really good documentation patch might help more users than any code we could write. On the technical end, I dislike this a little bit because the parameter is clearly something some people are going to want to set, but it's not at all clear what value they should set it to and it has complex interactions with the other checkpoint settings - and the user's hardware configuration. If there's no way to make it more self-tuning, then perhaps we should just live with that, but it would be nice to come up with something more user-transparent. Also, I am still struggling with what the right benchmarking methodology even is to judge whether any patch in this area works. 
Can you provide more details about your test setup?

Just one random thought: I wonder if it would make sense to cap the delay after each sync to the time spent performing that sync. That would make the tuning of the delay less sensitive to the total number of files, because we won't unnecessarily wait after each sync when they're not actually taking any time to complete. It's probably easier to estimate the number of segments that are likely to contain lots of dirty data than to estimate the total number of segments that you might have touched at least once since the last checkpoint, and there's no particular reason to think the latter is really what you should be tuning on anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
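To make the "cap the delay at the time the sync took" suggestion concrete, here is a small Python sketch applied to the three sync times from the log excerpt quoted earlier in this message. The function name is hypothetical, and the 3000 ms default is simply the pause setting visible in that log; whether such a cap behaves well on real hardware is a separate question the sketch does not answer.

```python
def capped_pause_ms(sync_time_ms, configured_pause_ms=3000.0):
    """Pause after a sync, never longer than the sync itself took."""
    return min(configured_pause_ms, sync_time_ms)


# Sync times (msec) for files 34, 35 and 36 in the log excerpt.
sync_times_ms = [0.006, 375.710, 0.008]

fixed_total = 3000.0 * len(sync_times_ms)
capped_total = sum(capped_pause_ms(t) for t in sync_times_ms)

print(fixed_total)              # 9000.0 ms of pausing with a fixed delay
print(round(capped_total, 3))   # well under 400 ms with the cap
```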
Re: [HACKERS] Checkpoint sync pause
On 01/16/2012 11:00 AM, Robert Haas wrote:
> Also, I am still struggling with what the right benchmarking methodology even is to judge whether any patch in this area works. Can you provide more details about your test setup?

The test setup is a production server with a few hundred users at peak workload, reading and writing to the database. Each RAID controller (couple of them with their own tablespaces) has either 512MB or 1GB of battery-backed write cache. The setup that leads to the bad situation happens like this:

-The steady stream of backend writes that happen between checkpoints have filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB "Dirty:"

-Since we have shared_buffers set to 512MB to try and keep checkpoint storms from being too bad, there might be 300MB of dirty pages involved in the checkpoint. The write phase dumps this all into Linux's cache. There's now closer to 3GB of dirty data there. @64GB of RAM, this is still only 4.7% though--just below the effective lower range for dirty_background_ratio. Linux is perfectly content to let it all sit there.

-Sync phase begins. Between absorption and the new checkpoint writes, there are 300 segments to sync waiting here.

-The first few syncs force data out of Linux's cache and into the BBWC. Some of these return almost instantly. Others block for a moderate number of seconds. That's not necessarily a showstopper, on XFS at least. So long as the checkpointer is not being given all of the I/O in the system, the fact that it's stuck waiting for a sync doesn't mean the server is unresponsive to the needs of other backends.
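The 4.7% figure in the list above can be checked with a few lines of Python. This is a sketch under simplifying assumptions: it compares dirty data against total RAM as the text does, whereas the kernel's own threshold is computed from available (dirtyable) memory, and the sample `Dirty:` line is a made-up value chosen to match the 2.5GB mentioned above. Function names are illustrative.

```python
GB = 1024 ** 3


def dirty_kb(meminfo_text):
    """Extract the Dirty: value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("Dirty:"):
            return int(line.split()[1])
    raise ValueError("no Dirty: line found")


def dirty_fraction(dirty_bytes, ram_bytes):
    """Dirty data as a fraction of total RAM (a simplification)."""
    return dirty_bytes / ram_bytes


# Hypothetical meminfo line corresponding to about 2.5GB of dirty data.
sample = "Dirty:           2621440 kB"
before_checkpoint = dirty_kb(sample) * 1024

# After the write phase the text puts the total at roughly 3GB on 64GB of RAM.
frac = dirty_fraction(3 * GB, 64 * GB)

print(before_checkpoint / GB)   # 2.5
print(f"{frac:.1%}")            # 4.7%
print(frac < 0.05)              # True: still under a 5% background threshold
```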
Early data might look like this:

DEBUG: Sync #1 time=21.969000 gap=0.00 msec
DEBUG: Sync #2 time=40.378000 gap=0.00 msec
DEBUG: Sync #3 time=12574.224000 gap=3007.614000 msec
DEBUG: Sync #4 time=91.385000 gap=2433.719000 msec
DEBUG: Sync #5 time=2119.122000 gap=2836.741000 msec
DEBUG: Sync #6 time=67.134000 gap=2840.791000 msec
DEBUG: Sync #7 time=62.005000 gap=3004.823000 msec
DEBUG: Sync #8 time=0.004000 gap=2818.031000 msec
DEBUG: Sync #9 time=0.006000 gap=3012.026000 msec
DEBUG: Sync #10 time=302.75 gap=3003.958000 msec

[Here 'gap' is a precise measurement of how closely the sync pause feature is working, with it set to 3 seconds. This is from an earlier version of this patch. All the timing issues I used to measure went away in the current implementation because it doesn't have to worry about doing background writer LRU work anymore, with the checkpointer split out]

But after a few hundred of these, every downstream cache is filled up. The result is seeing more really ugly sync times, like #164 here:

DEBUG: Sync #160 time=1147.386000 gap=2801.047000 msec
DEBUG: Sync #161 time=0.004000 gap=4075.115000 msec
DEBUG: Sync #162 time=0.005000 gap=2943.966000 msec
DEBUG: Sync #163 time=962.769000 gap=3003.906000 msec
DEBUG: Sync #164 time=45125.991000 gap=3033.228000 msec
DEBUG: Sync #165 time=4.031000 gap=2818.013000 msec
DEBUG: Sync #166 time=212.537000 gap=3039.979000 msec
DEBUG: Sync #167 time=0.005000 gap=2820.023000 msec
...
DEBUG: Sync #355 time=2.55 gap=2806.425000 msec
LOG: Sync 355 files longest=45125.991000 msec average=1276.177977 msec

At the same time #164 is happening, that 45 second long window, a pile of clients will get stuck where they can't do any I/O. The RAID controller that used to have a useful mix of data is now completely filled with >=512MB of random writes. It's now failing to write as fast as new data is coming in. Eventually that leads to pressure building up in Linux's cache.
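For anyone wanting to pick out the bad syncs from DEBUG output like the above, a small Python sketch follows. It assumes only the line format shown in this message (`Sync #N time=... gap=... msec`); the function name and the 1000 ms "slow" threshold are arbitrary choices for illustration.

```python
import re

# Matches the per-file lines shown above, whether one per line or run together.
SYNC_RE = re.compile(r"Sync #(\d+) time=([\d.]+) gap=([\d.]+) msec")


def parse_syncs(log_text):
    """Return (sync number, sync time in msec, gap in msec) tuples."""
    return [(int(n), float(t), float(g)) for n, t, g in SYNC_RE.findall(log_text)]


log = """DEBUG: Sync #160 time=1147.386000 gap=2801.047000 msec
DEBUG: Sync #161 time=0.004000 gap=4075.115000 msec
DEBUG: Sync #162 time=0.005000 gap=2943.966000 msec
DEBUG: Sync #163 time=962.769000 gap=3003.906000 msec
DEBUG: Sync #164 time=45125.991000 gap=3033.228000 msec
DEBUG: Sync #165 time=4.031000 gap=2818.013000 msec
"""

syncs = parse_syncs(log)
worst = max(syncs, key=lambda s: s[1])
slow = [n for n, t, _ in syncs if t > 1000.0]

print(worst)  # (164, 45125.991, 3033.228)
print(slow)   # [160, 164]
```

The summary `LOG:` line is not matched by the pattern, since it has no `#` or `gap=` field.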
Now you're in the bad place: dirty_background_ratio is crossed, Linux is now worried about spooling all cached writes to disk as fast as it can, the checkpointer is sync'ing its own important data to disk as fast as it can too, and all caches are inefficient because they're full.

To recreate a scenario like this, I've realized the benchmark needs to have a couple of characteristics:

-It has to focus on transaction latency instead of throughput. We know that doing syncs more often will lower throughput due to reduced reordering etc.

-It cannot run at maximum possible speed all the time. It needs to be the case that the system keeps up with the load during the rest of the time, but the sync phase of checkpoints causes I/O to queue faster than it's draining, thus saturating all caches and then blocking backends. Ideally, "Dirty:" in /proc/meminfo will reach 90% of the dirty_background_ratio trigger line around the same time the sync phase starts.

-There should be a lot of clients doing a mix of work. The way Linux I/O works, the scheduling for readers vs. writers is complicated, and this is one of the few areas where things like CFQ vs. Deadline matter.

I've realized now one reason I never got anywhere with this while running pgbench tests is that pgbench always runs at 100% of
Re: [HACKERS] Checkpoint sync pause
On 1/16/12 5:59 PM, Greg Smith wrote:
What I think is needed instead is a write-heavy benchmark with a think time in it, so that we can dial the workload up to, say, 90% of I/O capacity, but that spikes to 100% when checkpoint sync happens. Then rearrangements in syncing that reduce caching pressure should be visible as a latency reduction in client response times. My guess is that dbt-2 can be configured to provide such a workload, and I don't see a way forward here except for me to fully embrace that and start over with it.

You can do this with custom pgbench workloads, thanks to random and sleep functions. Somebody went and made pgbench programmable, I don't remember who.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
[HACKERS] Checkpoint sync pause
Last year at this point, I submitted an increasingly complicated checkpoint sync spreading feature. I wasn't able to prove any repeatable drop in sync time latency from those patches. While that was going on, and continuing into recently, the production server that started all this with its sync time latency issues didn't stop having that problem. Data collection continued, new patches were tried.

There was a really simple triage step Simon and I made before getting into the complicated ones: just delay for a few seconds between every single sync call made during a checkpoint. That approach is still hanging around that server's patched PostgreSQL package set, and it still works better than anything more complicated we've tried so far. The recent split of background writer and checkpointer makes that whole thing even easier to do without rippling out to have unexpected consequences.

In order to be able to tune this usefully, you need to know how many files a typical checkpoint syncs. That could be available without needing log scraping by using the "Publish checkpoint timing and sync files summary data to pg_stat_bgwriter" addition I just submitted. People who set this new checkpoint_sync_pause value too high can face checkpoints running over schedule, but you can measure how bad your exposure is with the new view information.

I owe the community a lot of data to prove this is useful before I'd expect it to be taken seriously. I was planning to leave this whole area alone until 9.3. But since recent submissions may pull me back into trying various ways of rearranging the write path for 9.2, I wanted to have my own miniature horse in that race. It works simply:

...
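On the tuning point above: since the pause is taken after each file sync, a rough lower bound on the time added to a checkpoint is the number of synced files multiplied by the pause. The C sketch below illustrates that estimate only. The function names and the idea of comparing against a fixed time budget are assumptions for illustration and are not part of the patch.

```c
#include <assert.h>

/*
 * Lower bound, in seconds, on the time a checkpoint spends purely
 * pausing: one pause per synced file.  Actual sync time is extra.
 */
long
sync_pause_floor_secs(int sync_files, int pause_secs)
{
    assert(sync_files >= 0 && pause_secs >= 0);
    return (long) sync_files * pause_secs;
}

/*
 * Returns 1 when the pauses alone exceed the given time budget
 * (for example, the checkpoint interval), 0 otherwise.
 */
int
sync_pause_overruns(int sync_files, int pause_secs, int budget_secs)
{
    return sync_pause_floor_secs(sync_files, pause_secs) > budget_secs;
}
```

As a purely illustrative input, 355 files at a 3 second pause gives a floor of 1065 seconds, well past a 300 second budget, while 36 files at the same pause gives 108 seconds.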
2012-01-16 02:39:01.184 EST [25052]: DEBUG: checkpoint sync: number=34 file=base/16385/11766 time=0.006 msec
2012-01-16 02:39:01.184 EST [25052]: DEBUG: checkpoint sync delay: seconds left=3
2012-01-16 02:39:01.284 EST [25052]: DEBUG: checkpoint sync delay: seconds left=2
2012-01-16 02:39:01.385 EST [25052]: DEBUG: checkpoint sync delay: seconds left=1
2012-01-16 02:39:01.860 EST [25052]: DEBUG: checkpoint sync: number=35 file=global/12007 time=375.710 msec
2012-01-16 02:39:01.860 EST [25052]: DEBUG: checkpoint sync delay: seconds left=3
2012-01-16 02:39:01.961 EST [25052]: DEBUG: checkpoint sync delay: seconds left=2
2012-01-16 02:39:02.061 EST [25052]: DEBUG: checkpoint sync delay: seconds left=1
2012-01-16 02:39:02.161 EST [25052]: DEBUG: checkpoint sync: number=36 file=base/16385/11754 time=0.008 msec
2012-01-16 02:39:02.555 EST [25052]: LOG: checkpoint complete: wrote 2586 buffers (63.1%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=2.422 s, sync=13.282 s, total=16.123 s; sync files=36, longest=1.085 s, average=0.040 s

No docs yet, really need a better guide to tuning checkpoints as they exist now before there's a place to attach a discussion of this to.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0b792d2..54da69a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -142,6 +142,7 @@ static BgWriterShmemStruct *BgWriterShmem;
 int CheckPointTimeout = 300;
 int CheckPointWarning = 30;
 double CheckPointCompletionTarget = 0.5;
+int CheckPointSyncPause = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -157,6 +158,8 @@ static bool am_checkpointer = false;
 
 static bool ckpt_active = false;
 
+static int checkpoint_flags = 0;
+
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
 static XLogRecPtr ckpt_start_recptr;
@@ -643,6 +646,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!am_checkpointer)
 		return;
 
+	/* Cache this value for a later sync delay */
+	checkpoint_flags=flags;
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind
 	 * schedule, in which case we just try to catch up as quickly as possible.
@@ -685,6 +691,72 @@ CheckpointWriteDelay(int flags, double progress)
 }
 
 /*
+ * CheckpointSyncDelay -- control rate of checkpoint sync stage
+ *
+ * This function is called after each relation sync performed by mdsync().
+ * It delays for a fixed period while still making sure to absorb
+ * incoming fsync requests.
+ *
+ * Due to where this is called with the md layer, it's not practical
+ * for it to be directly passed the checkpoint flags. It's expected
+ * they will have been stashed within the checkpointer's local state
+ * by a call to CheckpointWriteDelay.
+ *
+ */
+void
+CheckpointSyncDelay()
+{
+	static int absorb_counter = WRITES_PER_ABSORB;
+	int sync_delay_secs = CheckPointSyncPause;
+
+	/* Do nothing if checkpoint is being executed by non-checkpointer process */
+	if