Re: [HACKERS] Spread checkpoint sync
On Thu, Feb 10, 2011 at 10:30 PM, Greg Smith wrote:
> 3) The existing write spreading code in the background writer needs to be
> overhauled, too, before spreading the syncs around is going to give the
> benefits I was hoping for.

I've been thinking about this problem a bit. It strikes me that the whole
notion of a background writer delay is probably wrong-headed. Instead of
having fixed-length cycles, we might want to make the delay dependent on
whether we're actually keeping up. So during each cycle, we decide how many
buffers we want to clean, and we write 'em. Then we go to sleep. When we
wake up again, we figure out whether we kept up. If the number of buffers we
wrote during the prior cycle was more than the required number, then we'll
sleep longer the next time, up to some maximum; if we didn't write enough,
we'll reduce the sleep.

Along with this, we'd want to change the minimum rate of writing checkpoint
buffers from 1 per cycle to 1 for every 200 ms, or something like that. We
could even possibly have a system where backends wake the background writer
up early if they notice that it's not keeping up, although it's not exactly
clear what a good algorithm would be. Another thing that would be really
nice is if backends could somehow let the background writer know when
they're using a BufferAccessStrategy, and somehow convince the background
writer to write those buffers out to the OS at top speed.

> I want to make this problem go away, but as you can see spreading the sync
> calls around isn't enough. I think the main write loop needs to get spread
> out more, too, so that the background writer is trying to work at a more
> reasonable pace. I am pleased I've been able to reproduce this painful
> behavior at home using test data, because that much improves my odds of
> being able to isolate its cause and test solutions. But it's a tricky
> problem, and I'm certainly not going to fix it in the next week.

Thanks for working on this.
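The adaptive-delay loop sketched above could look roughly like the following. This is a toy simulation of the idea, not actual PostgreSQL code; the names and constants (`MIN_DELAY_MS`, `MAX_DELAY_MS`, the doubling/halving policy) are all invented for illustration.

```python
# Toy sketch of an adaptive background-writer delay: after each cleaning
# cycle, compare buffers written against the cleaning target and stretch
# or shrink the sleep accordingly. Illustrative only; these are not
# actual PostgreSQL GUCs or policies.

MIN_DELAY_MS = 10      # floor for the sleep between cycles
MAX_DELAY_MS = 1000    # "up to some maximum"

def next_delay(current_delay_ms, buffers_written, buffers_needed):
    """Return the sleep to use before the next cleaning cycle."""
    if buffers_written >= buffers_needed:
        # We kept up: back off and sleep longer next time.
        return min(current_delay_ms * 2, MAX_DELAY_MS)
    else:
        # We fell behind: wake up sooner.
        return max(current_delay_ms // 2, MIN_DELAY_MS)
```

Whether a multiplicative adjustment like this behaves well under bursty workloads is exactly the kind of question the "what a good algorithm would be" caveat is about.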
I hope we get a better handle on it for 9.2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
Looks like it's time to close the book on this one for 9.1 development...the
unfortunate results are at
http://www.2ndquadrant.us/pgbench-results/index.htm

Test set #12 is the one with spread sync I was hoping would turn out better
than #9, the reference I was trying to improve on. TPS is about 5% slower on
the scale=500 and 15% slower on the scale=1000 tests with sync spread out.
Even worse, maximum latency went up a lot.

I am convinced of a couple of things now:

1) Most of the benefit we were seeing from the original patch I submitted
was simply from doing much better at absorbing fsync requests from backends
while the checkpoint sync was running. The already committed fsync
compaction patch effectively removes that problem though, to the extent
it's possible to do so, making the remaining pieces here not as useful in
its wake.

2) I need to start over testing here with something that isn't 100% write
all of the time the way pgbench is. It's really hard to isolate out latency
improvements when the test program guarantees all associated write caches
will be completely filled at every moment. Also, with it I can't see any
benefit from changes that improve performance only for readers, which makes
it quite unrealistic relative to real-world workloads.

3) The existing write spreading code in the background writer needs to be
overhauled, too, before spreading the syncs around is going to give the
benefits I was hoping for.

Given all that, I'm going to take my feedback and give the test server a
much deserved break. I'm happy that the fsync compaction patch has made 9.1
much more tolerant of write-heavy loads than earlier versions, so it's not
like no progress was made in this release.

For anyone who wants more details here...the news on this spread sync
implementation is not all bad.
If you compare this result from HEAD, with scale=1000 and clients=256:
http://www.2ndquadrant.us/pgbench-results/611/index.html

Against its identically configured result with spread sync:
http://www.2ndquadrant.us/pgbench-results/708/index.html

There are actually significantly fewer times in the >2000 ms latency area.
That shows up as a reduction in the 90th percentile latency figures I
compute, and you can see it in the graph if you look at how much denser the
points are in the 2000 - 4000 ms area on #611. But that's a pretty weak
change.

The most disappointing part here relative to what I was hoping is what
happens with bigger buffer caches. The main idea driving this approach was
that it would enable larger values of shared_buffers without the checkpoint
spikes being as bad. Test set #13 tries that out, by increasing
shared_buffers from 256MB to 4GB, along with a big enough increase in
checkpoint_segments to make most checkpoints time based. Not only did
smaller scale TPS drop in half, all kinds of bad things happened to latency.
Here's a sample of the sort of dysfunctional checkpoints that came out of
that:

2011-02-10 02:41:17 EST: LOG: checkpoint starting: xlog
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: estimated segments=22
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=1 file=base/16384/16768 time=150.008 msec
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=2 file=base/16384/16749 time=0.002 msec
2011-02-10 02:53:15 EST: DEBUG: checkpoint sync: number=3 file=base/16384/16749_fsm time=0.001 msec
2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=4 file=base/16384/16761 time=8014.102 msec
2011-02-10 02:53:23 EST: DEBUG: checkpoint sync: number=5 file=base/16384/16752_vm time=0.002 msec
2011-02-10 02:53:35 EST: DEBUG: checkpoint sync: number=6 file=base/16384/16761.5 time=11739.038 msec
2011-02-10 02:53:37 EST: DEBUG: checkpoint sync: number=7 file=base/16384/16761.6 time=2205.721 msec
2011-02-10 02:53:45 EST: DEBUG: checkpoint sync: number=8 file=base/16384/16761.2 time=8273.849 msec
2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=9 file=base/16384/16766 time=20874.167 msec
2011-02-10 02:54:06 EST: DEBUG: checkpoint sync: number=10 file=base/16384/16762 time=0.002 msec
2011-02-10 02:54:08 EST: DEBUG: checkpoint sync: number=11 file=base/16384/16761.3 time=2440.441 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=12 file=base/16384/16766.1 time=635.839 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=13 file=base/16384/16752_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=14 file=base/16384/16764 time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=15 file=base/16384/16768_fsm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=16 file=base/16384/16761_vm time=0.001 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=17 file=base/16384/16761.4 time=150.702 msec
2011-02-10 02:54:09 EST: DEBUG: checkpoint sync: number=18 file=base/16384/16752 time=0.002 msec
2011-02-10 02:54:09 EST: DEB
Re: [HACKERS] Spread checkpoint sync
Kevin Grittner wrote:
> There are occasional posts from those wondering why their read-only
> queries are so slow after a bulk load, and why they are doing heavy
> writes. (I remember when I posted about that, as a relative newbie, and
> I know I've seen others.)

Sure; I created http://wiki.postgresql.org/wiki/Hint_Bits a while back
specifically to have a resource to explain that mystery to offer people.
But there's a difference between having a performance issue that people
don't understand, and having a real bottleneck you can't get rid of. My
experience is that people who have hint bit issues run into them as a
minor side-effect of a larger vacuum issue, and that if you get that under
control they're only a minor detail in comparison. Makes it hard to get
too excited about optimizing them.

--
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Greg Smith wrote:
> As a larger statement on this topic, I'm never very excited about
> redesigning here starting from any point other than "saw a
> bottleneck on a production system". There's a long list
> of such things already around waiting to be addressed, and I've
> never seen any good evidence of work related to hint bits being on
> it. Please correct me if you know of some--I suspect you do from
> the way you're bringing this up.

There are occasional posts from those wondering why their read-only
queries are so slow after a bulk load, and why they are doing heavy
writes. (I remember when I posted about that, as a relative newbie, and I
know I've seen others.)

I think worst case is probably:

- Bulk load data.
- Analyze (but don't vacuum) the new data.
- Start a workload with a lot of small, concurrent random reads.
- Watch performance tank when the write cache gluts.

This pattern is why we've adopted a pretty strict rule in our shop that we
run VACUUM FREEZE ANALYZE between a bulk load and putting the database
back into production. It's probably a bigger issue for those who can't do
that.

-Kevin
Re: [HACKERS] Spread checkpoint sync
Cédric Villemain wrote:
> Is it worth a new thread with the different IO improvements done so far
> or on-going, and how we may add new GUCs (if required!) with
> intelligence between those patches? (For instance, hint bit IO limit
> probably needs a tunable to define something similar to
> hint_write_completion_target and/or an IO throttling strategy, ...items
> which are still in gestation...)

Maybe, but I wouldn't bring all that up right now. Trying to wrap up the
CommitFest, too distracting, etc.

As a larger statement on this topic, I'm never very excited about
redesigning here starting from any point other than "saw a bottleneck on a
production system". There's a long list of such things already around
waiting to be addressed, and I've never seen any good evidence of work
related to hint bits being on it. Please correct me if you know of some--I
suspect you do from the way you're bringing this up. If we were to
consider kicking off some larger work here, I would drive that by first
asking where the data supporting that work being necessary is. It's hard
enough to fix a bottleneck that's staring right at you; trying to address
one that's just theorized is impossible.

--
Greg Smith 2ndQuadrant US
Re: [HACKERS] Spread checkpoint sync
2011/2/7 Greg Smith :
> Robert Haas wrote:
>>
>> With the fsync queue compaction patch applied, I think most of this is
>> now not needed. Attached please find an attempt to isolate the
>> portion that looks like it might still be useful. The basic idea of
>> what remains here is to make the background writer still do its normal
>> stuff even when it's checkpointing. In particular, with this patch
>> applied, PG will:
>>
>> 1. Absorb fsync requests a lot more often during the sync phase.
>> 2. Still try to run the cleaning scan during the sync phase.
>> 3. Pause for 3 seconds after every fsync.
>
> Yes, the bits you extracted were the remaining useful parts from the
> original patch. Today was quiet here because there were sports on or
> something, and I added full auto-tuning magic to the attached update. I
> need to kick off benchmarks and report back tomorrow to see how well this
> does, but any additional patch here would only be code cleanup on the
> messy stuff I did in here (plus proper implementation of the pair of
> GUCs). This has finally gotten to the exact logic I've been meaning to
> complete as spread sync since the idea was first postponed in 8.3, with
> the benefit of some fsync absorption improvements along the way too.
>
> The automatic timing is modeled on the existing
> checkpoint_completion_target concept, except with a new tunable (not yet
> added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right
> now. What I think I want to do is make the existing
> checkpoint_completion_target now be the target for the end of the sync
> phase, matching its name; people who bumped it up won't necessarily even
> have to change anything. Then the new GUC can be
> checkpoint_write_target, representing the target that is in there right
> now.

Is it worth a new thread with the different IO improvements done so far or
on-going, and how we may add new GUCs (if required!) with intelligence
between those patches?
(For instance, hint bit IO limit probably needs a tunable to define
something similar to hint_write_completion_target and/or an IO throttling
strategy, ...items which are still in gestation...)

> I tossed the earlier idea of counting relations to sync based on the
> write phase data as too inaccurate after testing, and with it for now
> goes checkpoint sorting. Instead, I just take a first pass over
> pendingOpsTable to get a total number of things to sync, which will
> always match the real count barring strange circumstances (like dropping
> a table).
>
> As for automatically determining the interval, I take the number of
> syncs that have finished so far, divide by the total, and get a number
> between 0.0 and 1.0 that represents progress on the sync phase. I then
> use the same basic CheckpointWriteDelay logic that is there for
> spreading writes out, except with the new sync target. I realized that
> if we assume the checkpoint writes should have finished in
> CheckPointCompletionTarget worth of time or segments, we can compute a
> new progress metric with the formula:
>
> progress = CheckPointCompletionTarget +
>     (1.0 - CheckPointCompletionTarget) * finished / goal;
>
> Where "finished" is the number of segments written out, while "goal" is
> the total. To turn this into an example, let's say the default
> parameters are set, we've finished the writes, and finished 1 out of 4
> syncs; that much work will be considered:
>
> progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625
>
> On a scale that effectively aims to have sync work finished by 0.8.
>
> I don't use quite the same logic as CheckpointWriteDelay, though. It
> turns out the existing checkpoint_completion implementation doesn't
> always work like I thought it did, which provides some very interesting
> insight into why my attempts to work around checkpoint problems haven't
> worked as well as expected the last few years. I thought that what it
> did was wait until an amount of time determined by the target was
> reached until it did the next write. That's not quite it; what it
> actually does is check progress against the target, then sleep exactly
> one nap interval if it is ahead of schedule. That is only the same thing
> if you have a lot of buffers to write relative to the amount of time
> involved. There's some alternative logic if you don't have
> bgwriter_lru_maxpages set, but in the normal situation it effectively
> means that:
>
> maximum write spread time = bgwriter_delay * checkpoint dirty blocks
>
> No matter how far apart you try to spread the checkpoints. Now,
> typically, when people run into these checkpoint spikes in production,
> reducing shared_buffers improves that. But I now realize that doing so
> will then reduce the average number of dirty blocks participating in the
> checkpoint, and therefore potentially pull the spread down at the same
> time! Also, if you try and tune bgwriter_delay down to get better
> background cleaning, you're also reducing the maximum spread. Between thi
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote:
> With the fsync queue compaction patch applied, I think most of this is
> now not needed. Attached please find an attempt to isolate the portion
> that looks like it might still be useful. The basic idea of what remains
> here is to make the background writer still do its normal stuff even
> when it's checkpointing. In particular, with this patch applied, PG
> will:
>
> 1. Absorb fsync requests a lot more often during the sync phase.
> 2. Still try to run the cleaning scan during the sync phase.
> 3. Pause for 3 seconds after every fsync.

Yes, the bits you extracted were the remaining useful parts from the
original patch. Today was quiet here because there were sports on or
something, and I added full auto-tuning magic to the attached update. I
need to kick off benchmarks and report back tomorrow to see how well this
does, but any additional patch here would only be code cleanup on the
messy stuff I did in here (plus proper implementation of the pair of
GUCs). This has finally gotten to the exact logic I've been meaning to
complete as spread sync since the idea was first postponed in 8.3, with
the benefit of some fsync absorption improvements along the way too.

The automatic timing is modeled on the existing
checkpoint_completion_target concept, except with a new tunable (not yet
added as a GUC) currently called CheckPointSyncTarget, set to 0.8 right
now. What I think I want to do is make the existing
checkpoint_completion_target now be the target for the end of the sync
phase, matching its name; people who bumped it up won't necessarily even
have to change anything. Then the new GUC can be checkpoint_write_target,
representing the target that is in there right now.

I tossed the earlier idea of counting relations to sync based on the write
phase data as too inaccurate after testing, and with it for now goes
checkpoint sorting.
Instead, I just take a first pass over pendingOpsTable to get a total
number of things to sync, which will always match the real count barring
strange circumstances (like dropping a table).

As for automatically determining the interval, I take the number of syncs
that have finished so far, divide by the total, and get a number between
0.0 and 1.0 that represents progress on the sync phase. I then use the
same basic CheckpointWriteDelay logic that is there for spreading writes
out, except with the new sync target. I realized that if we assume the
checkpoint writes should have finished in CheckPointCompletionTarget worth
of time or segments, we can compute a new progress metric with the
formula:

progress = CheckPointCompletionTarget +
    (1.0 - CheckPointCompletionTarget) * finished / goal;

Where "finished" is the number of segments written out, while "goal" is
the total. To turn this into an example, let's say the default parameters
are set, we've finished the writes, and finished 1 out of 4 syncs; that
much work will be considered:

progress = 0.5 + (1.0 - 0.5) * 1 / 4 = 0.625

On a scale that effectively aims to have sync work finished by 0.8.

I don't use quite the same logic as CheckpointWriteDelay, though. It turns
out the existing checkpoint_completion implementation doesn't always work
like I thought it did, which provides some very interesting insight into
why my attempts to work around checkpoint problems haven't worked as well
as expected the last few years. I thought that what it did was wait until
an amount of time determined by the target was reached until it did the
next write. That's not quite it; what it actually does is check progress
against the target, then sleep exactly one nap interval if it is ahead of
schedule. That is only the same thing if you have a lot of buffers to
write relative to the amount of time involved.
There's some alternative logic if you don't have bgwriter_lru_maxpages
set, but in the normal situation it effectively means that:

maximum write spread time = bgwriter_delay * checkpoint dirty blocks

No matter how far apart you try to spread the checkpoints. Now, typically,
when people run into these checkpoint spikes in production, reducing
shared_buffers improves that. But I now realize that doing so will then
reduce the average number of dirty blocks participating in the checkpoint,
and therefore potentially pull the spread down at the same time! Also, if
you try and tune bgwriter_delay down to get better background cleaning,
you're also reducing the maximum spread. Between this issue and the bad
behavior when the fsync queue fills, no wonder this has been so hard to
tune out of production systems. At some point, the reduction in spread
defeats further attempts to reduce the size of what's written at
checkpoint time, by lowering the amount of data involved.

What I do instead is nap until just after the planned schedule, then
execute the sync. What ends up happening then is that there can be a long
pause between the end of the write phase and
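The sync-phase progress metric in this message can be checked numerically. The following is a direct transcription of the formula (`sync_progress` is just an illustrative name); the worked example matches the 0.625 figure above.

```python
# Sync-phase progress metric from the message: once the write phase
# (budgeted as CheckPointCompletionTarget worth of the checkpoint) is
# done, the remaining fraction is apportioned across the fsyncs.

def sync_progress(completion_target, finished, goal):
    """Progress (0.0-1.0) through the checkpoint, given `finished` of
    `goal` syncs completed after the write phase ended."""
    return completion_target + (1.0 - completion_target) * finished / goal

# Worked example: default target 0.5, writes done, 1 of 4 syncs finished.
p = sync_progress(0.5, 1, 4)   # 0.5 + 0.5 * 1/4 = 0.625
```

On the proposed scale, the checkpointer then naps whenever this value is ahead of the elapsed-time fraction of the 0.8 sync target.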
Re: [HACKERS] Spread checkpoint sync
On Fri, Feb 4, 2011 at 2:08 PM, Greg Smith wrote:
> -The total number of buffers I'm computing based on the checkpoint writes
> being sorted is not a perfect match to the number reported by the
> "checkpoint complete" status line. Sometimes they are the same, sometimes
> not. Not sure why yet.

My first guess would be that in the cases where it's not the same, some
backend evicted the buffer before the background writer got to it. That's
expected under heavy contention for shared_buffers.

> -The estimate for "expected to need sync" computed as a by-product of the
> checkpoint sorting is not completely accurate either. This particular one
> has a fairly large error in it, percentage-wise, being off by 3 with a
> total of 11. Presumably these are absorbed fsync requests that were
> already queued up before the checkpoint even started. So any time
> estimate I drive based off of this count is only going to be approximate.

As previously noted, I wonder if we ought to sync queued-up requests that
don't require writes before beginning the write phase.

> -The order in which the sync phase processes files is unrelated to the
> order in which they are written out. Note that 17216.10 here, the biggest
> victim (cause?) of the I/O spike, isn't even listed among the checkpoint
> writes!

That's awful. If more than 50% of the I/O is going to happen from one
fsync() call, that seems to put a pretty pessimal bound on how much
improvement we can hope to achieve here. Or am I missing something?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
As already mentioned in the broader discussion at
http://archives.postgresql.org/message-id/4d4c4610.1030...@2ndquadrant.com
I'm seeing no solid performance swing in the checkpoint sorting code
itself. Better sometimes, worse others, but never by a large amount.

Here's what the statistics part derived from the sorted data looks like on
a real checkpoint spike:

2011-02-04 07:02:51 EST: LOG: checkpoint starting: xlog
2011-02-04 07:02:51 EST: DEBUG: BufferSync 10 dirty blocks in relation.segment_fork 17216.0_2
2011-02-04 07:02:51 EST: DEBUG: BufferSync 159 dirty blocks in relation.segment_fork 17216.0_1
2011-02-04 07:02:51 EST: DEBUG: BufferSync 10 dirty blocks in relation.segment_fork 17216.3_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 548 dirty blocks in relation.segment_fork 17216.4_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 808 dirty blocks in relation.segment_fork 17216.5_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 799 dirty blocks in relation.segment_fork 17216.6_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 807 dirty blocks in relation.segment_fork 17216.7_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 716 dirty blocks in relation.segment_fork 17216.8_0
2011-02-04 07:02:51 EST: DEBUG: BufferSync 3857 buffers to write, 8 total dirty segment file(s) expected to need sync
2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=1 file=base/16384/17216.5 time=1324.614 msec
2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=2 file=base/16384/17216.4 time=0.002 msec
2011-02-04 07:03:31 EST: DEBUG: checkpoint sync: number=3 file=base/16384/17216_fsm time=0.001 msec
2011-02-04 07:03:47 EST: DEBUG: checkpoint sync: number=4 file=base/16384/17216.10 time=16446.753 msec
2011-02-04 07:03:53 EST: DEBUG: checkpoint sync: number=5 file=base/16384/17216.8 time=5804.252 msec
2011-02-04 07:03:53 EST: DEBUG: checkpoint sync: number=6 file=base/16384/17216.7 time=0.001 msec
2011-02-04 07:03:54 EST: DEBUG: compacted fsync request queue from 32768 entries to 2 entries
2011-02-04 07:03:54 EST: CONTEXT: writing block 1642223 of relation base/16384/17216
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=7 file=base/16384/17216.11 time=6350.577 msec
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=8 file=base/16384/17216.9 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=9 file=base/16384/17216.6 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=10 file=base/16384/17216.3 time=0.001 msec
2011-02-04 07:04:00 EST: DEBUG: checkpoint sync: number=11 file=base/16384/17216_vm time=0.001 msec
2011-02-04 07:04:00 EST: LOG: checkpoint complete: wrote 3813 buffers (11.6%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=39.073 s, sync=29.926 s, total=69.003 s; sync files=11, longest=16.446 s, average=2.720 s

You can see that it ran out of fsync absorption space in the middle of the
sync phase, which is usually when compaction is needed, but the recent
patch to fix that kicked in and did its thing. Couple of observations:

-The total number of buffers I'm computing based on the checkpoint writes
being sorted is not a perfect match to the number reported by the
"checkpoint complete" status line. Sometimes they are the same, sometimes
not. Not sure why yet.

-The estimate for "expected to need sync" computed as a by-product of the
checkpoint sorting is not completely accurate either. This particular one
has a fairly large error in it, percentage-wise, being off by 3 with a
total of 11. Presumably these are absorbed fsync requests that were
already queued up before the checkpoint even started. So any time estimate
I drive based off of this count is only going to be approximate.

-The order in which the sync phase processes files is unrelated to the
order in which they are written out. Note that 17216.10 here, the biggest
victim (cause?) of the I/O spike, isn't even listed among the checkpoint
writes!
The fuzziness here is a bit disconcerting, and I'll keep digging for why
it happens. But I don't see any reason not to continue forward using the
rough count here to derive a nap time from, which I can then feed into the
"useful leftovers" patch that Robert already refactored here. I can always
sharpen up that estimate later; next I need to get some solid results I
can share on what the delay time does to the throughput/latency pattern.

--
Greg Smith 2ndQuadrant US
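One simple way to turn a rough count of files needing sync into a nap time is to split whatever remains of the checkpoint's time budget evenly across the remaining fsyncs. This is my sketch of the idea, not the patch's actual code; all names here are invented for illustration.

```python
# Illustrative sketch: derive a per-fsync nap from a rough sync count.
# Not the actual patch logic; names are hypothetical.

def sync_nap_ms(elapsed_ms, budget_ms, syncs_done, syncs_total):
    """Split the remaining checkpoint time budget evenly across the
    fsyncs still to run. If we are already past the budget (remaining
    time negative), skip napping entirely."""
    remaining_syncs = syncs_total - syncs_done
    if remaining_syncs <= 0:
        return 0
    remaining_ms = budget_ms - elapsed_ms
    return max(0, remaining_ms // remaining_syncs)
```

An inaccurate sync count only stretches or squeezes the naps slightly, which is why a rough estimate is good enough to proceed with here.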
Re: [HACKERS] Spread checkpoint sync
Michael Banck wrote:
> On Sat, Jan 15, 2011 at 05:47:24AM -0500, Greg Smith wrote:
>> For example, the pre-release Squeeze numbers we're seeing are awful so
>> far, but it's not really done yet either.
>
> Unfortunately, it does not look like Debian squeeze will change any more
> (or has changed much since your post) at this point, except for maybe
> further stable kernel updates. Which file system did you see those awful
> numbers on and could you maybe go into some more detail?

Once the release comes out any day now I'll see if I can duplicate them on
hardware I can talk about fully, and share the ZCAV graphs if the problem
is still there. The server I've been running all of the extended pgbench
tests in this thread on is running Ubuntu simply as a temporary way to get
2.6.32 before Squeeze ships. Last time I tried installing one of the
Squeeze betas I didn't get anywhere; hoping the installer bug I ran into
has been sorted when I try again.

--
Greg Smith 2ndQuadrant US
Re: [HACKERS] Spread checkpoint sync
On Sat, Jan 15, 2011 at 05:47:24AM -0500, Greg Smith wrote:
> For example, the pre-release Squeeze numbers we're seeing are awful so
> far, but it's not really done yet either.

Unfortunately, it does not look like Debian squeeze will change any more
(or has changed much since your post) at this point, except for maybe
further stable kernel updates.

Which file system did you see those awful numbers on, and could you maybe
go into some more detail?

Thanks,

Michael

--
I did send an email to propose multithreading to grub-devel on the first
of april. Unfortunately everyone thought I was serious ;-)
Re: [HACKERS] Spread checkpoint sync
Tom Lane wrote:
> Bruce Momjian writes:
>> My trivial idea was: let's assume we checkpoint every 10 minutes, and
>> it takes 5 minutes for us to write the data to the kernel. If no one
>> else is writing to those files, we can safely wait maybe 5 more minutes
>> before issuing the fsync. If, however, hundreds of writes are coming in
>> for the same files in those final 5 minutes, we should fsync right away.
>
> Huh? I would surely hope we could assume that nobody but Postgres is
> writing the database files? Or are you considering that the bgwriter
> doesn't know exactly what the backends are doing? That's true, but
> I still maintain that we should design the bgwriter's behavior on the
> assumption that writes from backends are negligible. Certainly the
> backends aren't issuing fsyncs.

Right, no one else is writing but us. When I said "no one else" I meant no
other bgwriter writes are going to the files we wrote as part of the
checkpoint, but have not fsync'ed yet. I assume we have two write streams
--- the checkpoint writes, which we know at the start of the checkpoint,
and the bgwriter writes that are happening in an unpredictable way based
on database activity.

--
Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Re: [HACKERS] Spread checkpoint sync
Bruce Momjian writes:
> My trivial idea was: let's assume we checkpoint every 10 minutes, and
> it takes 5 minutes for us to write the data to the kernel. If no one
> else is writing to those files, we can safely wait maybe 5 more minutes
> before issuing the fsync. If, however, hundreds of writes are coming in
> for the same files in those final 5 minutes, we should fsync right away.

Huh? I would surely hope we could assume that nobody but Postgres is
writing the database files? Or are you considering that the bgwriter
doesn't know exactly what the backends are doing? That's true, but I still
maintain that we should design the bgwriter's behavior on the assumption
that writes from backends are negligible. Certainly the backends aren't
issuing fsyncs.

regards, tom lane
Re: [HACKERS] Spread checkpoint sync
Kevin Grittner wrote: > Robert Haas wrote: > > > I also think Bruce's idea of calling fsync() on each relation just > > *before* we start writing the pages from that relation might have > > some merit. > > What bothers me about that is that you may have a lot of the same > dirty pages in the OS cache as the PostgreSQL cache, and you've just > ensured that the OS will write those *twice*. I'm pretty sure that > the reason the aggressive background writer settings we use have not > caused any noticeable increase in OS disk writes is that many > PostgreSQL writes of the same buffer keep an OS buffer page from > becoming stale enough to get flushed until PostgreSQL writes to it > taper off. Calling fsync() right before doing "one last push" of > the data could be really pessimal for some workloads. OK, maybe my idea needs to be adjusted and we should trigger an early fsync if non-fsync writes are coming in for blocks _other_ than the ones we already wrote for that checkpoint. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] Spread checkpoint sync
On Tue, Feb 1, 2011 at 12:58 PM, Kevin Grittner wrote: > Robert Haas wrote: > >> I also think Bruce's idea of calling fsync() on each relation just >> *before* we start writing the pages from that relation might have >> some merit. > > What bothers me about that is that you may have a lot of the same > dirty pages in the OS cache as the PostgreSQL cache, and you've just > ensured that the OS will write those *twice*. I'm pretty sure that > the reason the aggressive background writer settings we use have not > caused any noticeable increase in OS disk writes is that many > PostgreSQL writes of the same buffer keep an OS buffer page from > becoming stale enough to get flushed until PostgreSQL writes to it > taper off. Calling fsync() right before doing "one last push" of > the data could be really pessimal for some workloads. I was thinking about what Greg reported here: http://archives.postgresql.org/pgsql-hackers/2010-11/msg01387.php If the amount of pre-checkpoint dirty data is 3GB and the checkpoint is writing 250MB, then you shouldn't have all that many extra writes... but you might have some, and that might be enough to send the whole thing down the tubes. InnoDB apparently handles this problem by advancing the redo pointer in small steps instead of in large jumps. AIUI, in addition to tracking the LSN of each page, they also track the first-dirtied LSN. That lets you checkpoint to an arbitrary LSN by flushing just the pages with an older first-dirtied LSN. So instead of doing a checkpoint every hour, you might do a mini-checkpoint every 10 minutes. Since the mini-checkpoints each need to flush less data, they should be less disruptive than a full checkpoint. But that, too, will generate some extra writes. Basically, any idea that involves calling fsync() more often is going to tend to smooth out the I/O load at the cost of some increase in the total number of writes. 
If we don't want any increase at all in the number of writes, spreading out the fsync() calls is pretty much the only other option. I'm worried that even with good tuning that won't be enough to tamp down the latency spikes. But maybe it will be... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
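The InnoDB-style scheme described above — tracking a first-dirtied LSN for each page in addition to its latest LSN, and advancing the redo pointer in small steps by flushing only the pages first dirtied before some target LSN — can be sketched as a toy model. This is illustrative only, not PostgreSQL or InnoDB code; all names are invented:

```python
# Toy model of incremental ("mini") checkpoints by first-dirtied LSN.
# A mini-checkpoint to target_lsn flushes only the pages that first
# became dirty before that LSN; the redo pointer can then advance.

class Page:
    def __init__(self, page_id, first_dirtied_lsn, latest_lsn):
        self.page_id = page_id
        self.first_dirtied_lsn = first_dirtied_lsn  # LSN when page first became dirty
        self.latest_lsn = latest_lsn                # LSN of most recent modification

def mini_checkpoint(dirty_pages, target_lsn):
    """Split dirty pages into those that must be flushed to advance the
    redo pointer to target_lsn, and those that can stay dirty."""
    to_flush = [p for p in dirty_pages if p.first_dirtied_lsn < target_lsn]
    remaining = [p for p in dirty_pages if p.first_dirtied_lsn >= target_lsn]
    return to_flush, remaining

pages = [Page(1, 100, 500), Page(2, 300, 310), Page(3, 450, 460)]
flush, keep = mini_checkpoint(pages, target_lsn=400)
# pages 1 and 2 are flushed; page 3 was first dirtied after the target
```

Each mini-checkpoint flushes less data than a full checkpoint, at the cost of the extra writes Robert notes: a hot page flushed at LSN 400 may be dirtied and flushed again before the next step.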
Re: [HACKERS] Spread checkpoint sync
Greg Smith wrote: > Greg Smith wrote: > > I think the right way to compute "relations to sync" is to finish the > > sorted writes patch I sent over a not quite right yet update to already > Attached update now makes much more sense than the misguided patch I > submitted two weeks ago. This takes the original sorted write code, > first adjusting it so it only allocates the memory its tag structure is > stored in once (in a kind of lazy way I can improve on right now). It > then computes a bunch of derived statistics from a single walk of the > sorted data on each pass through. Here's an example of what comes out: In that patch, I would like to see a meta-comment explaining why the sorting is happening and what we hope to gain. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: > Back to your idea: One problem with trying to bound the unflushed data > is that it's not clear what the bound should be. I've had this mental > model where we want the OS to write out pages to disk, but that's not > always true, per Greg Smith's recent posts about Linux kernel tuning > slowing down VACUUM. A possible advantage of the Momjian algorithm > (as it's known in the literature) is that we don't actually start > forcing anything out to disk until we have a reason to do so - namely, > an impending checkpoint. My trivial idea was: let's assume we checkpoint every 10 minutes, and it takes 5 minutes for us to write the data to the kernel. If no one else is writing to those files, we can safely wait maybe 5 more minutes before issuing the fsync. If, however, hundreds of writes are coming in for the same files in those final 5 minutes, we should fsync right away. My idea is that our delay between writes and fsync should somehow be controlled by how many writes to the same files are coming into the kernel while we are considering waiting, because the only downside to delaying is the accumulation of non-critical writes coming into the kernel for the same files we are going to fsync later. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com
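Bruce's heuristic can be made concrete with a small sketch — delay the fsync up to some cap, but issue it immediately once enough new writes have piled up on a file we already wrote for the checkpoint. The threshold names and values here are invented placeholders, not anything in PostgreSQL:

```python
# Toy decision rule for the heuristic above (all numbers invented):
# wait up to max_delay ticks before fsyncing a checkpoint-written file,
# but sync right away if heavy write traffic keeps landing on it, since
# waiting longer only queues more non-critical data behind the fsync.

def should_fsync_now(recent_writes_to_file, ticks_waited,
                     max_delay=5, write_threshold=100):
    # Hundreds of incoming writes to a file we must sync anyway: sync now.
    if recent_writes_to_file >= write_threshold:
        return True
    # Otherwise keep waiting, up to the maximum allowed delay.
    return ticks_waited >= max_delay
```

With a quiet file the delay runs to its cap; with a hot file the fsync fires immediately, matching the "hundreds of writes in those final 5 minutes" case in the example.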
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: > I also think Bruce's idea of calling fsync() on each relation just > *before* we start writing the pages from that relation might have > some merit. What bothers me about that is that you may have a lot of the same dirty pages in the OS cache as the PostgreSQL cache, and you've just ensured that the OS will write those *twice*. I'm pretty sure that the reason the aggressive background writer settings we use have not caused any noticeable increase in OS disk writes is that many PostgreSQL writes of the same buffer keep an OS buffer page from becoming stale enough to get flushed until PostgreSQL writes to it taper off. Calling fsync() right before doing "one last push" of the data could be really pessimal for some workloads. -Kevin
Re: [HACKERS] Spread checkpoint sync
On Mon, Jan 31, 2011 at 4:28 PM, Tom Lane wrote: > Robert Haas writes: >> Back to the idea at hand - I proposed something a bit along these >> lines upthread, but my idea was to proactively perform the fsyncs on >> the relations that had gone the longest without a write, rather than >> the ones with the most dirty data. > > Yeah. What I meant to suggest, but evidently didn't explain well, was > to use that or something much like it as the rule for deciding *what* to > fsync next, but to use amount-of-unsynced-data-versus-threshold as the > method for deciding *when* to do the next fsync. Oh, I see. Yeah, that could be a good algorithm. I also think Bruce's idea of calling fsync() on each relation just *before* we start writing the pages from that relation might have some merit. (I'm assuming here that we are sorting the writes.) That should tend to result in the end-of-checkpoint fsyncs being quite fast, because we'll only have as much dirty data floating around as we actually wrote during the checkpoint, which according to Greg Smith is usually a small fraction of the total data in need of flushing. Also, if one of the pre-write fsyncs takes a long time, then that'll get factored into our calculations of how fast we need to write the remaining data to finish the checkpoint on schedule. Of course there's still the possibility that the I/O system literally can't finish a checkpoint in X minutes, but even in that case, the I/O saturation will hopefully be more spread out across the entire checkpoint instead of falling like a hammer at the very end. Back to your idea: One problem with trying to bound the unflushed data is that it's not clear what the bound should be. I've had this mental model where we want the OS to write out pages to disk, but that's not always true, per Greg Smith's recent posts about Linux kernel tuning slowing down VACUUM. 
A possible advantage of the Momjian algorithm (as it's known in the literature) is that we don't actually start forcing anything out to disk until we have a reason to do so - namely, an impending checkpoint. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
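Robert's pacing point — that a long pre-write fsync gets factored into how fast the remaining data must be written — falls out naturally if the required write rate is always recomputed from the buffers and seconds actually remaining. A minimal sketch, not the actual checkpoint scheduling code:

```python
# Illustrative pacing calculation: if a pre-write fsync stalls for a
# while, seconds_remaining shrinks, so the rate needed to finish the
# checkpoint on schedule rises automatically for the data still left.

def required_write_rate(buffers_remaining, seconds_remaining):
    """Buffers per second needed to finish the checkpoint on time."""
    if seconds_remaining <= 0:
        return float("inf")   # behind schedule: write as fast as possible
    return buffers_remaining / seconds_remaining

# 1000 buffers with 100 s left -> 10 buffers/s.  If a slow fsync then
# eats 50 s while only 200 buffers get written, the remaining 800
# buffers in 50 s require 16 buffers/s.
```

The hoped-for effect is exactly what the mail describes: I/O saturation spread across the whole checkpoint instead of falling like a hammer at the very end.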
Re: [HACKERS] Spread checkpoint sync
Greg Smith wrote:
> I think the right way to compute "relations to sync" is to finish the sorted writes patch I sent over a not quite right yet update to already

Attached update now makes much more sense than the misguided patch I submitted two weeks ago. This takes the original sorted write code, first adjusting it so it only allocates the memory its tag structure is stored in once (in a kind of lazy way I can improve on right now). It then computes a bunch of derived statistics from a single walk of the sorted data on each pass through. Here's an example of what comes out:

DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11809.0_0
DEBUG: BufferSync 2 dirty blocks in relation.segment_fork 11811.0_0
DEBUG: BufferSync 3 dirty blocks in relation.segment_fork 11812.0_0
DEBUG: BufferSync 3 dirty blocks in relation.segment_fork 16496.0_0
DEBUG: BufferSync 28 dirty blocks in relation.segment_fork 16499.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11638.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11640.0_0
DEBUG: BufferSync 2 dirty blocks in relation.segment_fork 11641.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11642.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11644.0_0
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11645.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11661.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11663.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11664.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11672.0_0
DEBUG: BufferSync 1 dirty blocks in relation.segment_fork 11685.0_0
DEBUG: BufferSync 2097 buffers to write, 17 total dirty segment file(s) expected to need sync

This is the first checkpoint after starting to populate a new pgbench database.
The next four show it extending into new segments:

DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.1_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.2_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.3_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync
DEBUG: BufferSync 2048 dirty blocks in relation.segment_fork 16508.4_0
DEBUG: BufferSync 2048 buffers to write, 1 total dirty segment file(s) expected to need sync

The fact that it's always showing 2048 dirty blocks on these makes me think I'm computing something wrong still, but the general idea here is working now. I had to use some magic from the md layer to let bufmgr.c know how its writes were going to get mapped into file segments and correspondingly fsync calls later. Not happy about breaking the API encapsulation there, but don't see an easy way to compute that data at the per-segment level--and it's not like that's going to change in the near future anyway.

I like this approach for providing a map of how to spread syncs out for a couple of reasons:

- It computes data that could be used to drive sync spread timing in a relatively short amount of simple code.
- You get write sorting at the database level helping out the OS. Everything I've been seeing recently on benchmarks says Linux at least needs all the help it can get in that regard, even if block order doesn't necessarily align perfectly with disk order.
- It's obvious how to take this same data and build a future model where the time allocated for fsyncs was proportional to how much that particular relation was touched.

Benchmarks of just the impact of the sorting step and continued bug swatting to follow.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..ef9df7d 100644
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 48,53 ****
--- 48,63 ----
  #include "utils/rel.h"
  #include "utils/resowner.h"

+ /*
+  * Checkpoint time mapping between the buffer id values and the associated
+  * buffer tags of dirty buffers to write
+  */
+ typedef struct BufAndTag
+ {
+     int         buf_id;
+     BufferTag   tag;
+     BlockNumber segNum;
+ } BufAndTag;

  /* Note: these two macros only work on shared buffers, not local ones! */
  #define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))

*************** int target_prefetch_pages = 0;
*** 78,83 ****
--- 88,96 ----
  static volatile BufferDesc *InProgressBuf =
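The per-segment accounting implied by the DEBUG output above can be sketched as follows. This is an illustrative reconstruction, not the patch itself; it assumes the stock PostgreSQL layout of 1 GB segment files holding 8 kB blocks (131072 blocks per segment):

```python
# Rough sketch of the bookkeeping behind output like
#   "BufferSync 2048 dirty blocks in relation.segment_fork 16508.1_0":
# sort the dirty buffer tags, then one walk counts dirty blocks per
# relation segment and how many segment files will need an fsync.

BLOCKS_PER_SEGMENT = 131072   # 1 GB segment / 8 kB block, stock build

def summarize_dirty(buffer_tags):
    """buffer_tags: iterable of (relfilenode, fork, blocknum) tuples.
    Returns per-(rel, segment, fork) dirty-block counts, the total
    number of buffers to write, and the number of dirty segments."""
    counts = {}
    for rel, fork, block in sorted(buffer_tags):
        seg = block // BLOCKS_PER_SEGMENT
        key = (rel, seg, fork)
        counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return counts, total, len(counts)

tags = [(16508, 0, 0), (16508, 0, 1), (16508, 0, 131072), (11809, 0, 3)]
counts, nbuffers, nsegs = summarize_dirty(tags)
# counts[(16508, 0, 0)] == 2: two dirty blocks in segment 0 of rel 16508;
# block 131072 falls into segment 1, a separate file needing its own fsync
```

The segment split is the detail Greg needed "magic from the md layer" for: bufmgr.c normally deals in block numbers and doesn't know where segment-file boundaries, and hence fsync targets, fall.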
Re: [HACKERS] Spread checkpoint sync
Tom Lane wrote:
> Robert Haas writes:
>> 3. Pause for 3 seconds after every fsync.
>> I think something along the lines of #3 is probably a good idea,
> Really? Any particular delay is guaranteed wrong.

'3 seconds' is just a placeholder for whatever comes out of a "total time scheduled to sync / relations to sync" computation. (Still doing all my thinking in terms of time, although I recognize a showdown with segment-based checkpoints is coming too.) I think the right way to compute "relations to sync" is to finish the sorted writes patch I sent over a not quite right yet update to already, which is my next thing to work on here.

I remain pessimistic that any attempt to issue fsync calls without the maximum possible delay after asking the kernel to write things out first will work out well. My recent tests with low values of dirty_bytes on Linux just reinforce how bad that can turn out. In addition to computing the relation count while sorting them, placing writes in-order by relation and then doing all writes followed by all syncs should place the database right in the middle of the throughput/latency trade-off here. It will have had the maximum amount of time we can give it to sort and flush writes for any given relation before it is asked to sync it. I don't want to try and be any smarter than that without trying to be a *lot* smarter--timing individual sync calls, feedback loops on time estimation, etc.

At this point I have to agree with Robert's observation that splitting checkpoints into checkpoint_write_target and checkpoint_sync_target is the only reasonable thing left that might be possible to complete in a short period. So that's how this can compute the total time numerator here. The main thing I will warn about in relation to the discussion today is the danger of true deadline-oriented scheduling in this area. The checkpoint process may discover the sync phase is falling behind expectations because the individual sync calls are taking longer than expected.
If that happens, aiming for the "finish on target anyway" goal puts you right back to a guaranteed nasty write spike again. I think many people would prefer logging the overrun as tuning feedback for the DBA rather than to accelerate, which is likely to make the problem even worse if the checkpoint is falling behind. But since ultimately the feedback for this will be "make the checkpoints longer or increase checkpoint_sync_target", sync acceleration to meet the deadline isn't unacceptable; the DBA can try both of those themselves if seeing spikes. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
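Greg's "total time scheduled to sync / relations to sync" placeholder, plus his log-don't-accelerate caveat, might look roughly like this. An illustrative sketch only; the GUC named in the warning comes straight from the discussion, everything else is invented:

```python
import logging

def sync_pause(sync_phase_budget, total_syncs):
    """'3 seconds' was a placeholder for exactly this: spread the
    time scheduled for the sync phase evenly across the fsync calls."""
    return sync_phase_budget / total_syncs

def on_schedule_overrun(elapsed, sync_phase_budget):
    """If individual fsyncs ran long, log the overrun as tuning
    feedback for the DBA instead of accelerating the remaining syncs,
    which would pile more load onto an already struggling I/O system."""
    if elapsed > sync_phase_budget:
        logging.warning("checkpoint sync phase overran its budget by "
                        "%.1fs; consider longer checkpoints or a higher "
                        "checkpoint_sync_target",
                        elapsed - sync_phase_budget)
        return True
    return False
```

Whether an overrun should only be logged or should also shorten later pauses is exactly the policy question debated above; the sketch takes Greg's conservative side.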
Re: [HACKERS] Spread checkpoint sync
Robert Haas writes: > Back to the idea at hand - I proposed something a bit along these > lines upthread, but my idea was to proactively perform the fsyncs on > the relations that had gone the longest without a write, rather than > the ones with the most dirty data. Yeah. What I meant to suggest, but evidently didn't explain well, was to use that or something much like it as the rule for deciding *what* to fsync next, but to use amount-of-unsynced-data-versus-threshold as the method for deciding *when* to do the next fsync. regards, tom lane
Re: [HACKERS] Spread checkpoint sync
Tom Lane wrote:
> I wonder whether it'd be useful to keep track of the total amount of data written-and-not-yet-synced, and to issue fsyncs often enough to keep that below some parameter; the idea being that the parameter would limit how much dirty kernel disk cache there is. Of course, ideally the kernel would have a similar tunable and this would be a waste of effort on our part...

I wanted to run the tests again before reporting in detail here, because the results are so bad, but I threw out an initial report about trying to push this down to be the kernel's job at http://blog.2ndquadrant.com/en/2011/01/tuning-linux-for-low-postgresq.html

So far it looks like the newish Linux dirty_bytes parameter works well at reducing latency by limiting how much dirty data can pile up before it gets nudged heavily toward disk. But the throughput drop you pay on VACUUM in particular is brutal, I'm seeing over a 50% slowdown in some cases. I suspect we need to let the regular cleaner and backend writes queue up in the largest possible cache for VACUUM, so it benefits as much as possible from elevator sorting of writes. I suspect this being the worst case now for a tightly controlled write cache is an unintended side-effect of the ring buffer implementation it uses now.

Right now I'm running the same tests on XFS instead of ext3, and those are just way more sensible all around; I'll revisit this on that filesystem and ext4. The scale=500 tests I've been running lots of lately are a full 3X TPS faster on XFS relative to ext3, with about 1/8 as much worst-case latency.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: > Back to the idea at hand - I proposed something a bit along these > lines upthread, but my idea was to proactively perform the fsyncs on > the relations that had gone the longest without a write, rather than > the ones with the most dirty data. I'm not sure which is better. > Obviously, doing the ones that have "gone idle" gives the OS more time > to write out the data, but OTOH it might not succeed in purging much > dirty data. Doing the ones with the most dirty data will definitely > reduce the size of the final checkpoint, but might also cause a > latency spike if it's triggered immediately after heavy write activity > on that file. Crazy idea #2 --- it would be interesting if you issued an fsync _before_ you wrote out data to a file that needed an fsync. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] Spread checkpoint sync
On Mon, Jan 31, 2011 at 12:11 PM, Tom Lane wrote: > Robert Haas writes: >> On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane wrote: >>> I wonder whether it'd be useful to keep track of the total amount of >>> data written-and-not-yet-synced, and to issue fsyncs often enough to >>> keep that below some parameter; the idea being that the parameter would >>> limit how much dirty kernel disk cache there is. Of course, ideally the >>> kernel would have a similar tunable and this would be a waste of effort >>> on our part... > >> It's not clear to me how you'd maintain that information without it >> turning into a contention bottleneck. > > What contention bottleneck? I was just visualizing the bgwriter process > locally tracking how many writes it'd issued. Backend-issued writes > should happen seldom enough to be ignorable for this purpose. Ah. Well, if you ignore backend writes, then yes, there's no contention bottleneck. However, I seem to recall Greg Smith showing a system at PGCon last year with a pretty respectable volume of backend writes (30%?) and saying "OK, so here's a healthy system". Perhaps I'm misremembering. But at any rate any backend that is using a BufferAccessStrategy figures to do a lot of its own writes. This is probably an area for improvement in future releases, if we can figure out how to do it: if we're doing a bulk load into a system with 4GB of shared_buffers using a 16MB ring buffer, we'd ideally like the background writer - or somebody other than the foreground process - to go nuts on those buffers, writing them out as fast as it possibly can - rather than letting the backend do it when the ring wraps around. Back to the idea at hand - I proposed something a bit along these lines upthread, but my idea was to proactively perform the fsyncs on the relations that had gone the longest without a write, rather than the ones with the most dirty data. I'm not sure which is better. 
Obviously, doing the ones that have "gone idle" gives the OS more time to write out the data, but OTOH it might not succeed in purging much dirty data. Doing the ones with the most dirty data will definitely reduce the size of the final checkpoint, but might also cause a latency spike if it's triggered immediately after heavy write activity on that file. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
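The two candidate policies — fsync the file that has gone the longest without a write, or the file with the most dirty data — can be put side by side in a sketch. The per-file stats table and its field names here are invented for illustration:

```python
# Two fsync-target policies from the discussion, over hypothetical
# per-file statistics (field names invented for this sketch).
files = {
    "16508.1": {"dirty_bytes": 64 << 20, "last_write": 100.0},  # hot, lots dirty
    "16499.0": {"dirty_bytes": 2 << 20,  "last_write": 40.0},   # idle a while
}

def pick_longest_idle(stats):
    """Robert's proposal: sync the file untouched the longest, giving
    the OS maximum time to have cleaned it in the background."""
    return min(stats, key=lambda f: stats[f]["last_write"])

def pick_most_dirty(stats):
    """Alternative: sync the file with the most dirty data, shrinking
    the work left for the end-of-checkpoint fsyncs."""
    return max(stats, key=lambda f: stats[f]["dirty_bytes"])
```

On this toy table the policies disagree, which is exactly the trade-off described above: the idle file's fsync is likely cheap but purges little, while the hot file's fsync purges a lot but risks a latency spike.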
Re: [HACKERS] Spread checkpoint sync
Robert Haas writes: > On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane wrote: >> I wonder whether it'd be useful to keep track of the total amount of >> data written-and-not-yet-synced, and to issue fsyncs often enough to >> keep that below some parameter; the idea being that the parameter would >> limit how much dirty kernel disk cache there is. Of course, ideally the >> kernel would have a similar tunable and this would be a waste of effort >> on our part... > It's not clear to me how you'd maintain that information without it > turning into a contention bottleneck. What contention bottleneck? I was just visualizing the bgwriter process locally tracking how many writes it'd issued. Backend-issued writes should happen seldom enough to be ignorable for this purpose. regards, tom lane
Re: [HACKERS] Spread checkpoint sync
On Mon, Jan 31, 2011 at 12:01 PM, Tom Lane wrote: > Robert Haas writes: >> 3. Pause for 3 seconds after every fsync. > >> I think something along the lines of #3 is probably a good idea, > > Really? Any particular delay is guaranteed wrong. What I was getting at was - I think it's probably a good idea not to do the fsyncs at top speed, but I'm not too sure how they should be spaced out. I agree a fixed delay isn't necessarily right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
Robert Haas writes: > 3. Pause for 3 seconds after every fsync. > I think something along the lines of #3 is probably a good idea, Really? Any particular delay is guaranteed wrong. regards, tom lane
Re: [HACKERS] Spread checkpoint sync
On Mon, Jan 31, 2011 at 11:51 AM, Tom Lane wrote: > Robert Haas writes: >> On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane wrote: >>> That sounds like you have an entirely wrong mental model of where the >>> cost comes from. Those times are not independent. > >> Yeah, Greg Smith made the same point a week or three ago. But it >> seems to me that there is potential value in overlaying the write and >> sync phases to some degree. For example, if the write phase is spread >> over 15 minutes and you have 30 files, then by, say, minute 7, it's >> probably OK to flush the file you wrote first. > > Yeah, probably, but we can't do anything as stupid as file-by-file. Eh? > I wonder whether it'd be useful to keep track of the total amount of > data written-and-not-yet-synced, and to issue fsyncs often enough to > keep that below some parameter; the idea being that the parameter would > limit how much dirty kernel disk cache there is. Of course, ideally the > kernel would have a similar tunable and this would be a waste of effort > on our part... It's not clear to me how you'd maintain that information without it turning into a contention bottleneck. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
Robert Haas writes: > On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane wrote: >> That sounds like you have an entirely wrong mental model of where the >> cost comes from. Those times are not independent. > Yeah, Greg Smith made the same point a week or three ago. But it > seems to me that there is potential value in overlaying the write and > sync phases to some degree. For example, if the write phase is spread > over 15 minutes and you have 30 files, then by, say, minute 7, it's > probably OK to flush the file you wrote first. Yeah, probably, but we can't do anything as stupid as file-by-file. I wonder whether it'd be useful to keep track of the total amount of data written-and-not-yet-synced, and to issue fsyncs often enough to keep that below some parameter; the idea being that the parameter would limit how much dirty kernel disk cache there is. Of course, ideally the kernel would have a similar tunable and this would be a waste of effort on our part... regards, tom lane
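Tom's locally-tracked counter might be sketched like this — the bgwriter alone counts bytes written but not yet synced and triggers an fsync when a ceiling is crossed, so there is no shared state and hence no contention. The class and threshold names are invented placeholders:

```python
# Sketch (invented names) of the written-and-not-yet-synced tracker:
# a single process keeps a private running total, so no locking or
# shared counters are involved, which is Tom's answer to the
# contention-bottleneck concern.

class BgWriterSyncTracker:
    def __init__(self, max_unsynced_bytes):
        self.max_unsynced = max_unsynced_bytes  # the tunable "parameter"
        self.unsynced = 0

    def note_write(self, nbytes):
        """Record a write; returns True when it is time to fsync."""
        self.unsynced += nbytes
        return self.unsynced >= self.max_unsynced

    def note_sync(self):
        """An fsync was issued; the dirty-cache bound resets."""
        self.unsynced = 0
```

The caveat Robert raises still applies: the total is only accurate if backend-issued writes are rare enough to ignore, which bulk loads through a BufferAccessStrategy ring buffer can violate.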
Re: [HACKERS] Spread checkpoint sync
On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane wrote: > Heikki Linnakangas writes: >> IMHO we should re-consider the patch to sort the writes. Not so much >> because of the performance gain that gives, but because we can then >> re-arrange the fsyncs so that you write one file, then fsync it, then >> write the next file and so on. > > Isn't that going to make performance worse not better? Generally you > want to give the kernel as much scheduling flexibility as possible, > which you do by issuing the write as far before the fsync as you can. > An arrangement like the above removes all cross-file scheduling freedom. > For example, if two files are on different spindles, you've just > guaranteed that no I/O overlap is possible. > >> That way the time taken by the fsyncs >> is distributed between the writes, > > That sounds like you have an entirely wrong mental model of where the > cost comes from. Those times are not independent. Yeah, Greg Smith made the same point a week or three ago. But it seems to me that there is potential value in overlaying the write and sync phases to some degree. For example, if the write phase is spread over 15 minutes and you have 30 files, then by, say, minute 7, it's probably OK to flush the file you wrote first. Waiting longer isn't necessarily going to help - the kernel has probably written what it is going to write without prodding. In fact, it might be that on a busy system, you could lose by waiting *too long* to perform the fsync. The cleaning scan and/or backends may kick out additional dirty buffers that will now have to get forced down to disk, even though you don't really care about them (because they were dirtied after the checkpoint write had already been done). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
Heikki Linnakangas writes: > IMHO we should re-consider the patch to sort the writes. Not so much > because of the performance gain that gives, but because we can then > re-arrange the fsyncs so that you write one file, then fsync it, then > write the next file and so on. Isn't that going to make performance worse not better? Generally you want to give the kernel as much scheduling flexibility as possible, which you do by issuing the write as far before the fsync as you can. An arrangement like the above removes all cross-file scheduling freedom. For example, if two files are on different spindles, you've just guaranteed that no I/O overlap is possible. > That way the time taken by the fsyncs > is distributed between the writes, That sounds like you have an entirely wrong mental model of where the cost comes from. Those times are not independent. regards, tom lane
Re: [HACKERS] Spread checkpoint sync
On 31.01.2011 16:44, Robert Haas wrote: On Mon, Jan 31, 2011 at 3:04 AM, Itagaki Takahiro wrote: On Mon, Jan 31, 2011 at 13:41, Robert Haas wrote: 1. Absorb fsync requests a lot more often during the sync phase. 2. Still try to run the cleaning scan during the sync phase. 3. Pause for 3 seconds after every fsync. So if we want the checkpoint to finish in, say, 20 minutes, we can't know whether the write phase needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59. We probably need deadline-based scheduling, that is being used in write() phase. If we want to sync 100 files in 20 minutes, each file should be sync'ed in 12 seconds if we think each fsync takes the same time. If we would have better estimation algorithm (file size? dirty ratio?), each fsync chould have some weight factor. But deadline-based scheduling is still needed then. Right. I think the problem is balancing the write and sync phases. For example, if your operating system is very aggressively writing out dirty pages to disk, then you want the write phase to be as long as possible and the sync phase can be very short because there won't be much work to do. But if your operating system is caching lots of stuff in memory and writing dirty pages out to disk only when absolutely necessary, then the write phase could be relatively quick without much hurting anything, but the sync phase will need to be long to keep from crushing the I/O system. The trouble is, we don't really have a priori way to know which it's doing. Maybe we could try to tune based on the behavior of previous checkpoints, ... IMHO we should re-consider the patch to sort the writes. Not so much because of the performance gain that gives, but because we can then re-arrange the fsyncs so that you write one file, then fsync it, then write the next file and so on. That way the time taken by the fsyncs is distributed between the writes, so we don't need to accurately estimate how long each will take. 
If one fsync takes a long time, the writes that follow will just be done a bit faster to catch up. ... but I'm wondering if we oughtn't to take the cheesy path first and split checkpoint_completion_target into checkpoint_write_target and checkpoint_sync_target. That's another parameter to set, but I'd rather add a parameter that people have to play with to find the right value than impose an arbitrary rule that creates unavoidable bad performance in certain environments. That is of course simpler.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
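Heikki's write-one-file-then-fsync-it scheme needs no per-file time estimates because the pacing self-corrects. The catch-up arithmetic can be sketched with a hypothetical helper (this is not actual bgwriter code):

```c
#include <assert.h>

/*
 * Given the total time budget for the checkpoint, the number of files
 * still to process, and the time already consumed (writes plus any slow
 * fsyncs), return the seconds available for the next file.  A long
 * fsync shrinks the remaining budget, so the files that follow are
 * simply paced faster.
 */
double
next_file_budget(double total_budget, int files_left, double elapsed)
{
    double remaining = total_budget - elapsed;

    if (remaining < 0)
        remaining = 0;          /* behind schedule: stop sleeping entirely */
    return remaining / files_left;
}
```

If one fsync eats a minute of the budget, every remaining file just gets a smaller slice, which is exactly the "done a bit faster to catch up" behavior described above.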
Re: [HACKERS] Spread checkpoint sync
On Mon, Jan 31, 2011 at 3:04 AM, Itagaki Takahiro wrote:
> On Mon, Jan 31, 2011 at 13:41, Robert Haas wrote:
>> 1. Absorb fsync requests a lot more often during the sync phase.
>> 2. Still try to run the cleaning scan during the sync phase.
>> 3. Pause for 3 seconds after every fsync.
>>
>> So if we want the checkpoint
>> to finish in, say, 20 minutes, we can't know whether the write phase
>> needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.
>
> We probably need deadline-based scheduling, like the one already used in
> the write() phase. If we want to sync 100 files in 20 minutes, each file
> should be sync'ed in 12 seconds if we think each fsync takes the same
> time. If we had a better estimation algorithm (file size? dirty ratio?),
> each fsync could have some weight factor. But deadline-based scheduling
> is still needed then.

Right. I think the problem is balancing the write and sync phases. For example, if your operating system is very aggressively writing out dirty pages to disk, then you want the write phase to be as long as possible and the sync phase can be very short because there won't be much work to do. But if your operating system is caching lots of stuff in memory and writing dirty pages out to disk only when absolutely necessary, then the write phase could be relatively quick without much hurting anything, but the sync phase will need to be long to keep from crushing the I/O system. The trouble is, we don't really have an a priori way to know which it's doing. Maybe we could try to tune based on the behavior of previous checkpoints, but I'm wondering if we oughtn't to take the cheesy path first and split checkpoint_completion_target into checkpoint_write_target and checkpoint_sync_target. That's another parameter to set, but I'd rather add a parameter that people have to play with to find the right value than impose an arbitrary rule that creates unavoidable bad performance in certain environments.
> BTW, we should not sleep in full-speed checkpoint. CHECKPOINT command,
> shutdown, pg_start_backup(), and some of the checkpoints during recovery
> might not want to sleep.

Yeah, I think that's understood.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
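The proposed GUC split reduces to simple arithmetic: each target is a fraction of the checkpoint interval marking when its phase must be done. A sketch, assuming hypothetical checkpoint_write_target and checkpoint_sync_target values (only checkpoint_completion_target actually exists) expressed as fractions of checkpoint_timeout, with write_target <= sync_target <= 1.0:

```c
#include <assert.h>

/* Deadlines for the two checkpoint phases, in seconds from checkpoint start. */
typedef struct
{
    double write_end;
    double sync_end;
} PhaseDeadlines;

PhaseDeadlines
phase_deadlines(double checkpoint_timeout,
                double write_target, double sync_target)
{
    PhaseDeadlines d;

    d.write_end = checkpoint_timeout * write_target;    /* write phase ends here */
    d.sync_end = checkpoint_timeout * sync_target;      /* sync phase ends here */
    return d;
}
```

With a 20-minute interval, write_target=0.5 and sync_target=0.9 would leave minutes 10 through 18 for the sync phase, instead of guessing where one phase ends and the other begins.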
Re: [HACKERS] Spread checkpoint sync
On Mon, Jan 31, 2011 at 13:41, Robert Haas wrote:
> 1. Absorb fsync requests a lot more often during the sync phase.
> 2. Still try to run the cleaning scan during the sync phase.
> 3. Pause for 3 seconds after every fsync.
>
> So if we want the checkpoint
> to finish in, say, 20 minutes, we can't know whether the write phase
> needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

We probably need deadline-based scheduling, like the one already used in the write() phase. If we want to sync 100 files in 20 minutes, each file should be sync'ed in 12 seconds if we think each fsync takes the same time. If we had a better estimation algorithm (file size? dirty ratio?), each fsync could have some weight factor. But deadline-based scheduling is still needed then.

BTW, we should not sleep in full-speed checkpoint. CHECKPOINT command, shutdown, pg_start_backup(), and some of the checkpoints during recovery might not want to sleep.

-- Itagaki Takahiro
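Itagaki's deadline scheduling with a weight factor could look like the following illustrative helper (not PostgreSQL code; the weights might come from file size or dirty ratio, as he suggests):

```c
#include <assert.h>

/*
 * Each file's share of the sync-phase budget is proportional to its
 * weight.  With equal weights this reduces to the example in the mail:
 * 100 files in a 1200 s budget = 12 s each.
 */
double
file_sync_budget(double phase_budget, double file_weight, double total_weight)
{
    return phase_budget * file_weight / total_weight;
}
```

A file carrying twice the weight (say, a fully dirty 1GB segment next to a half-dirty one) would then get twice the time slice, while the overall deadline still holds.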
Re: [HACKERS] Spread checkpoint sync
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith wrote: > I've attached an updated version of the initial sync spreading patch here, > one that applies cleanly on top of HEAD and over top of the sync > instrumentation patch too. The conflict that made that hard before is gone > now. With the fsync queue compaction patch applied, I think most of this is now not needed. Attached please find an attempt to isolate the portion that looks like it might still be useful. The basic idea of what remains here is to make the background writer still do its normal stuff even when it's checkpointing. In particular, with this patch applied, PG will:

1. Absorb fsync requests a lot more often during the sync phase.
2. Still try to run the cleaning scan during the sync phase.
3. Pause for 3 seconds after every fsync.

I suspect that #1 is probably a good idea. It seems pretty clear based on your previous testing that the fsync compaction patch should be sufficient to prevent us from hitting the wall, but if we're going to do any kind of nontrivial work here then cleaning the queue is a sensible thing to do along the way, and there's little downside. I also suspect #2 is a good idea. The fact that we're checkpointing doesn't mean that the system suddenly doesn't require clean buffers, and the experimentation I've done recently (see: limiting hint bit I/O) convinces me that it's pretty expensive from a performance standpoint when backends have to start writing out their own buffers, so continuing to do that work during the sync phase of a checkpoint, just as we do during the write phase, seems pretty sensible. I think something along the lines of #3 is probably a good idea, but the current coding doesn't take checkpoint_completion_target into account. The underlying problem here is that it's at least somewhat reasonable to assume that if we write() a whole bunch of blocks, each write() will take approximately the same amount of time.
But this is not true at all with respect to fsync() - they neither take the same amount of time as each other, nor is there any fixed ratio between write() time and fsync() time to go by. So if we want the checkpoint to finish in, say, 20 minutes, we can't know whether the write phase needs to be finished by minute 10 or 15 or 16 or 19 or only by 19:59.

One idea I have is to try to get some of the fsyncs out of the queue at times other than end-of-checkpoint. Even if this resulted in some modest increase in the total number of fsync() calls, it might improve performance by causing data to be flushed to disk in smaller chunks. For example, suppose we kept an LRU list of pending fsync requests - every time we remember an fsync request for a particular relation, we move it to the head (hot end) of the LRU. And periodically we pull the tail entry off the list and fsync it - say, after checkpoint_timeout / (# of items in the list). That way, when we arrive at the end of the checkpoint and start syncing everything, the syncs hopefully complete more quickly because we've already forced a bunch of the data down to disk. That algorithm may well be too stupid or just not work in real life, but perhaps there's some variation that would be sensible. The point is: instead of or in addition to trying to spread out the sync phase, we might want to investigate whether it's possible to reduce its size.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4df69c2..36da084 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -726,6 +726,53 @@ CheckpointWriteDelay(int flags, double progress)
 }
 
 /*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+	pg_time_t	now;
+	pg_time_t	sync_start_time;
+	int			sync_delay_secs;
+
+	/*
+	 * Delay after each sync, in seconds. This could be a parameter. But
+	 * since ideally this will be auto-tuning in the near future, not
+	 * assigning it a GUC setting yet.
+	 */
+#define EXTRA_SYNC_DELAY	3
+
+	/* Do nothing if checkpoint is being executed by non-bgwriter process */
+	if (!am_bg_writer)
+		return;
+
+	sync_start_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Perform the usual bgwriter duties.
+	 */
+	for (;;)
+	{
+		AbsorbFsyncRequests();
+		BgBufferSync();
+		CheckArchiveTimeout();
+		BgWriterNap();
+
+		/*
+		 * Are we there yet?
+		 */
+		now = (pg_time_t) time(NULL);
+		sync_delay_secs = now - sync_start_time;
+		if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+			break;
+	}
+}
+
+/*
  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
  * in time?
  *
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/
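The LRU-of-pending-requests idea from the message above can be modeled with a toy structure. None of this is PostgreSQL code: plain ints stand in for the real relation/segment request tags, and there is no locking against concurrent backends or integration with the existing pending-fsync bookkeeping in md.c.

```c
#include <assert.h>
#include <string.h>

#define MAX_PENDING 8

static int  lru[MAX_PENDING];   /* lru[0] is the hot end */
static int  nlru = 0;

/* Remember an fsync request, moving its tag to the hot end of the LRU. */
void
remember_fsync(int tag)
{
    int     i;

    for (i = 0; i < nlru; i++)
        if (lru[i] == tag)
            break;              /* already pending: just promote it */
    if (i == nlru && nlru < MAX_PENDING)
        nlru++;                 /* brand-new entry */
    if (i == nlru)
        i = nlru - 1;           /* table full: overwrite the coldest */
    memmove(&lru[1], &lru[0], i * sizeof(int));
    lru[0] = tag;
}

/*
 * Periodic tick: return the coldest tag, which the caller would fsync
 * now and drop from the list, or -1 if nothing is pending.
 */
int
fsync_coldest(void)
{
    if (nlru == 0)
        return -1;
    return lru[--nlru];
}
```

The hot end holds relations still being written to heavily (syncing them early would be wasted work); the cold end holds files nobody has touched for a while, which are exactly the cheap ones to push to disk before the checkpoint's sync phase starts.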
Re: [HACKERS] Spread checkpoint sync
On Fri, Jan 28, 2011 at 12:53 AM, Greg Smith wrote: > Where there are still very ugly maximum latency figures here in every case, > these periods just aren't as wide with the patch in place. OK, committed the patch, with some additional commenting, and after fixing the compiler warning Chris Browne noticed. > P.S. Yes, I know I have other review work to do as well. Starting on the > rest of that tomorrow. *cracks whip* Man, this thing doesn't work at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: During each cluster, the system probably slows way down, and then recovers when the queue is emptied. So the TPS improvement isn't at all a uniform speedup, but simply relief from the stall that would otherwise result from a full queue.

That does seem to be the case here. http://www.2ndquadrant.us/pgbench-results/index.htm now has results from a long test series, at two database scales that caused many backend fsyncs during earlier tests. Set #5 is the existing server code, #6 is with the patch applied. There are zero backend fsync calls with the patch applied, which isn't surprising given how simple the schema is on this test case. An average 14% TPS gain appears at a scale of 500 and an 8% one at 1000; the attached CSV file summarizes the average figures for the archives. The gains do appear to come from smoothing out the dead periods that normally occur during the sync phase of the checkpoint. For example, here are the fastest runs at scale=1000/clients=256 with and without the patch:

http://www.2ndquadrant.us/pgbench-results/436/index.html (tps=361)
http://www.2ndquadrant.us/pgbench-results/486/index.html (tps=380)

Here the difference in how much less of a slowdown there is around the checkpoint end points is really obvious, and clearly an improvement. You can see the same thing to a lesser extent at the other end of the scale; here are the fastest runs at scale=500/clients=16:

http://www.2ndquadrant.us/pgbench-results/402/index.html (tps=590)
http://www.2ndquadrant.us/pgbench-results/462/index.html (tps=643)

Where there are still very ugly maximum latency figures here in every case, these periods just aren't as wide with the patch in place. I'm moving on to some brief testing of the newer kernel behavior here, then returning to testing the other checkpoint spreading ideas on top of this compaction patch, presuming something like it will end up being committed first.
I think it's safe to say I can throw away the changes that tried to alter the fsync absorption code in what I submitted before, as this scheme does a much better job of avoiding that problem than those earlier queue alteration ideas. I'm glad Robert grabbed the right one from the pile of ideas I threw out for what else might help here.

P.S. Yes, I know I have other review work to do as well. Starting on the rest of that tomorrow.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

,,"Unmodified",,"Compacted Fsync",,,
"scale","clients","tps","max_latency","tps","max_latency","TPS Gain","% Gain"
500,16,557,17963.41,631,17116.31,74,13.3%
500,32,587,25838.8,655,24311.54,68,11.6%
500,64,628,35198.39,727,38040.39,99,15.8%
500,128,621,41001.91,687,48195.77,66,10.6%
500,256,632,49610.39,747,46799.48,115,18.2%
,,,
1000,16,306,39298.95,321,40826.58,15,4.9%
1000,32,314,40120.35,345,27910.51,31,9.9%
1000,64,334,46244.86,358,45138.1,24,7.2%
1000,128,343,72501.57,372,47125.46,29,8.5%
1000,256,321,80588.63,350,83232.14,29,9.0%
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Based on what I saw looking at this, I'm thinking that the backend fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs spread uniformly throughout the test, but clusters of 100 or more that happen in very quick succession, followed by relief when the background writer gets around to emptying the queue.

That's exactly the case. You'll be running along fine, the queue will fill, and then hundreds of them can pile up in seconds. Since the worst of that seemed to be during the sync phase of the checkpoint, adding additional queue management logic there is where we started. I thought this compaction idea would be more difficult to implement than your patch proved to be, though, so doing this first is working out quite well instead. This is what all the log messages from the patch look like here, at scale=500 and shared_buffers=256MB:

DEBUG: compacted fsync request queue from 32768 entries to 11 entries

That's an 8GB database, and from looking at the relative sizes I'm guessing 7 entries refer to the 1GB segments of the accounts table, 2 to its main index, and the other 2 are likely branches/tellers data. Since the production system I ran into this on regularly has about 400 file segments dirtied, at a higher shared_buffers than that, I expect this will demolish this class of problem on it, too.

I'll have all the TPS over time graphs available to publish by the end of my day here, including tests at a scale of 1000 as well. Those should give a little more insight into how the patch is actually impacting high-level performance. I don't dare disturb the ongoing tests by copying all that data out of there until they're finished; that will be a few hours yet. My only potential concern over committing this is that I haven't done a sanity check over whether it impacts the fsync mechanics in a way that might cause an issue.
Your assumptions there are documented and look reasonable on quick review; I just haven't had much time yet to look for flaws in them. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
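The essence of what a "compacted fsync request queue from 32768 entries to 11 entries" message implies is deduplication of the queue. A stripped-down illustration (the committed patch dedups real request tags with a hash table and handles special queue entries; here plain ints stand in for the tags and an O(n^2) scan for the hash lookup):

```c
#include <assert.h>

/* Compact the queue in place, dropping duplicates; returns the new length. */
int
compact_fsync_queue(int *queue, int len)
{
    int     out = 0;

    for (int i = 0; i < len; i++)
    {
        int     dup = 0;

        for (int j = 0; j < out; j++)
            if (queue[j] == queue[i])
            {
                dup = 1;        /* this segment is already queued */
                break;
            }
        if (!dup)
            queue[out++] = queue[i];    /* keep first occurrence, in order */
    }
    return out;
}
```

Since a pgbench-style workload dirties the same handful of segments over and over, tens of thousands of queued requests collapse to roughly one entry per file, which is why the queue stops filling up at all.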
Re: [HACKERS] Spread checkpoint sync
On Thu, Jan 27, 2011 at 12:18 PM, Greg Smith wrote:
> Greg Smith wrote:
>> I think a helpful next step here would be to put Robert's fsync compaction
>> patch into here and see if that helps. There are enough backend syncs
>> showing up in the difficult workloads (scale>=1000, clients >=32) that its
>> impact should be obvious.
>
> Initial tests show everything expected from this change and more. This took
> me a while to isolate because of issues where the filesystem involved
> degraded over time, giving a heavy bias toward a faster first test run,
> before anything was fragmented. I just had to do a whole new mkfs on the
> database/xlog disks when switching between test sets in order to eliminate
> that.
>
> At a scale of 500, I see the following average behavior:
>
> Clients  TPS  backend-fsync
> 16       557   155
> 32       587   572
> 64       628   843
> 128      621  1442
> 256      632  2504
>
> On one run through with the fsync compaction patch applied this turned into:
>
> Clients  TPS  backend-fsync
> 16       637     0
> 32       621     0
> 64       721     0
> 128      716     0
> 256      841     0
>
> So not only are all the backend fsyncs gone, there is a very clear TPS
> improvement too. The change in results at >=64 clients is well above the
> usual noise threshold in these tests.
>
> The problem where individual fsync calls during checkpoints can take a long
> time is not appreciably better. But I think this will greatly reduce the
> odds of running into the truly dysfunctional breakdown, where checkpoint and
> backend fsync calls compete with one another, that caused the worst-case
> situation kicking off this whole line of research here.

Dude! That's pretty cool. Thanks for doing that measurement work - that's really awesome. Barring objections, I'll go ahead and commit my patch.
Based on what I saw looking at this, I'm thinking that the backend fsyncs probably happen in clusters - IOW, it's not 2504 backend fsyncs spread uniformly throughout the test, but clusters of 100 or more that happen in very quick succession, followed by relief when the background writer gets around to emptying the queue. During each cluster, the system probably slows way down, and then recovers when the queue is emptied. So the TPS improvement isn't at all a uniform speedup, but simply relief from the stall that would otherwise result from a full queue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
Greg Smith wrote: I think a helpful next step here would be to put Robert's fsync compaction patch into here and see if that helps. There are enough backend syncs showing up in the difficult workloads (scale>=1000, clients >=32) that its impact should be obvious.

Initial tests show everything expected from this change and more. This took me a while to isolate because of issues where the filesystem involved degraded over time, giving a heavy bias toward a faster first test run, before anything was fragmented. I just had to do a whole new mkfs on the database/xlog disks when switching between test sets in order to eliminate that.

At a scale of 500, I see the following average behavior:

Clients  TPS  backend-fsync
16       557   155
32       587   572
64       628   843
128      621  1442
256      632  2504

On one run through with the fsync compaction patch applied this turned into:

Clients  TPS  backend-fsync
16       637     0
32       621     0
64       721     0
128      716     0
256      841     0

So not only are all the backend fsyncs gone, there is a very clear TPS improvement too. The change in results at >=64 clients is well above the usual noise threshold in these tests.

The problem where individual fsync calls during checkpoints can take a long time is not appreciably better. But I think this will greatly reduce the odds of running into the truly dysfunctional breakdown, where checkpoint and backend fsync calls compete with one another, that caused the worst-case situation kicking off this whole line of research here.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
> To be frank, I really don't care about fixing this behavior on ext3,
> especially in the context of that sort of hack. That filesystem is not
> the future, it's not possible to ever really make it work right, and
> every minute spent on pandering to its limitations would be better spent
> elsewhere IMHO. I'm starting with the ext3 benchmarks just to provide
> some proper context for the worst-case behavior people can see right
> now, and to make sure refactoring here doesn't make things worse on it.
> My target is same or slightly better on ext3, much better on XFS and ext4.

Please don't forget that we need to avoid performance regressions on NTFS and ZFS as well. They don't need to improve, but we can't let them regress. I think we can ignore BSD/UFS and Solaris/UFS, as well as HFS+, though.

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Idea #4: For ext3 filesystems that like to dump the entire buffer cache instead of only the requested file, write a little daemon that runs alongside of (and completely independently of) PostgreSQL. Every 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and closes the file, thus dumping the cache and preventing a ridiculous growth in the amount of data to be sync'd at checkpoint time.

Today's data suggests this problem has been resolved in the latest kernels. I saw the "giant flush/series of small flushes" pattern quite easily on the CentOS5 system I last did heavy pgbench testing on. The one I'm testing now has kernel 2.6.23 (Ubuntu 10.04), and it doesn't show it at all. Here's what a bad checkpoint looks like on this system:

LOG: checkpoint starting: xlog
DEBUG: checkpoint sync: number=1 file=base/24746/36596.8 time=7651.601 msec
DEBUG: checkpoint sync: number=2 file=base/24746/36506 time=0.001 msec
DEBUG: checkpoint sync: number=3 file=base/24746/36596.2 time=1891.695 msec
DEBUG: checkpoint sync: number=4 file=base/24746/36596.4 time=7431.441 msec
DEBUG: checkpoint sync: number=5 file=base/24746/36515 time=0.216 msec
DEBUG: checkpoint sync: number=6 file=base/24746/36596.9 time=4422.892 msec
DEBUG: checkpoint sync: number=7 file=base/24746/36596.12 time=954.242 msec
DEBUG: checkpoint sync: number=8 file=base/24746/36237_fsm time=0.002 msec
DEBUG: checkpoint sync: number=9 file=base/24746/36503 time=0.001 msec
DEBUG: checkpoint sync: number=10 file=base/24746/36584 time=41.401 msec
DEBUG: checkpoint sync: number=11 file=base/24746/36596.7 time=885.921 msec
DEBUG: checkpoint sync: number=12 file=base/24813/30774 time=0.002 msec
DEBUG: checkpoint sync: number=13 file=base/24813/24822 time=0.005 msec
DEBUG: checkpoint sync: number=14 file=base/24746/36801 time=49.801 msec
DEBUG: checkpoint sync: number=15 file=base/24746/36601.2 time=610.996 msec
DEBUG: checkpoint sync: number=16 file=base/24746/36596 time=16154.361 msec
DEBUG: checkpoint sync: number=17 file=base/24746/36503_vm time=0.001 msec
DEBUG: checkpoint sync: number=18 file=base/24746/36508 time=0.000 msec
DEBUG: checkpoint sync: number=19 file=base/24746/36596.10 time=9759.898 msec
DEBUG: checkpoint sync: number=20 file=base/24746/36596.3 time=3392.727 msec
DEBUG: checkpoint sync: number=21 file=base/24746/36237 time=0.150 msec
DEBUG: checkpoint sync: number=22 file=base/24746/36596.11 time=9153.437 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 1057833 of relation base/24746/36596
[>800 more of these]
DEBUG: checkpoint sync: number=23 file=base/24746/36601.1 time=48697.179 msec
DEBUG: could not forward fsync request because request queue is full
DEBUG: checkpoint sync: number=24 file=base/24746/36597 time=0.080 msec
DEBUG: checkpoint sync: number=25 file=base/24746/36237_vm time=0.001 msec
DEBUG: checkpoint sync: number=26 file=base/24813/24822_fsm time=0.001 msec
DEBUG: checkpoint sync: number=27 file=base/24746/36503_fsm time=0.000 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 20619 of relation base/24746/36601
DEBUG: checkpoint sync: number=28 file=base/24746/36506_fsm time=0.000 msec
DEBUG: checkpoint sync: number=29 file=base/24746/36596_vm time=0.040 msec
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 278967 of relation base/24746/36596
DEBUG: could not forward fsync request because request queue is full
CONTEXT: writing block 1582400 of relation base/24746/36596
DEBUG: checkpoint sync: number=30 file=base/24746/36596.6 time=0.002 msec
DEBUG: checkpoint sync: number=31 file=base/24813/11647 time=0.004 msec
DEBUG: checkpoint sync: number=32 file=base/24746/36601 time=201.632 msec
DEBUG: checkpoint sync: number=33 file=base/24746/36801_fsm time=0.001 msec
DEBUG: checkpoint sync: number=34 file=base/24746/36596.5 time=0.001 msec
DEBUG: checkpoint sync: number=35 file=base/24746/36599 time=0.000 msec
DEBUG: checkpoint sync: number=36 file=base/24746/36587 time=0.005 msec
DEBUG: checkpoint sync: number=37 file=base/24746/36596_fsm time=0.001 msec
DEBUG: checkpoint sync: number=38 file=base/24746/36596.1 time=0.001 msec
DEBUG: checkpoint sync: number=39 file=base/24746/36801_vm time=0.001 msec
LOG: checkpoint complete: wrote 9515 buffers (29.0%); 0 transaction log file(s) added, 0 removed, 64 recycled; write=32.409 s, sync=111.615 s, total=144.052 s; sync files=39, longest=48.697 s, average=2.853 s

Here the file that's been brutally delayed via backend contention is #23, after already seeing quite long delays on the earlier ones. That I've never seen with earlier kernels running ext3. This is good in that it makes it more likely a spread sync approach that works on XFS will also work on these newer kernels with ext4. Then the only group we wouldn't be able to help if that works the ex
Re: [HACKERS] Spread checkpoint sync
2011/1/18 Greg Smith :
> Bruce Momjian wrote:
>> Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?
>
> The idea of having a dead period doing no work at all between write phase
> and sync phase may have some merit. I don't have enough test data yet on
> some more fundamental issues in this area to comment on whether that smaller
> optimization would be valuable. It may be a worthwhile concept to throw
> into the sequencing.

Could we have some pause without a strict rule like 'stop for 30 sec'? (Case in point: my hardware is very good and I can write 400MB/sec with no interruption, XXX IOPS.) I wonder if we are going to have issues with "RAID firmware + BBU + linux scheduler", because we are adding 'unexpected' behavior in the middle.

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Spread checkpoint sync
Bruce Momjian wrote: Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00?

The idea of having a dead period doing no work at all between write phase and sync phase may have some merit. I don't have enough test data yet on some more fundamental issues in this area to comment on whether that smaller optimization would be valuable. It may be a worthwhile concept to throw into the sequencing.

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Jim Nasby wrote: Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production system is in flames... Can we change the ereport that happens in that case from DEBUG1 to WARNING? Or provide some other means to track it?

That's why we already added pg_stat_bgwriter.buffers_backend_fsync to track the problem before trying to improve it. It was driving me crazy on a production server not having any visibility into when it happened. I haven't seen that we need anything beyond that so far. In the context of this new patch for example, if you get to where a backend does its own sync, you'll know it did a compaction as part of that. The existing statistic would tell you enough.

There's now enough data in test set 3 at http://www.2ndquadrant.us/pgbench-results/index.htm to start to see how this breaks down on a moderately big system (well, by most people's standards, but not Jim's, for whom this is still a toy). Note the backend_sync column on the right, very end of the page; that's the relevant counter I'm commenting on:

scale=175: Some backend fsync with 64 clients, 2/3 runs.
scale=250: Significant backend fsync with 32 and 64 clients, every run.
scale=500: Moderate to large backend fsync at any client count >=16. This seems to be the worst spot of those mapped. Above here, I would guess the TPS numbers drop enough that the fsync request queue activity drops, too.
scale=1000: Backend fsync starting at 8 clients.
scale=2000: Backend fsync starting at 16 clients. By here I think the TPS volumes are getting low enough that clients are stuck significantly more often waiting for seeks rather than fsync.

Looks like the most effective spot for me to focus testing on with this server is scales of 500 and 1000, with 16 to 64 clients. Now that I've got the scale fine tuned better, I may crank up the client counts too and see what that does.
I'm glad these are appearing in reasonable volume here though; I was starting to get nervous about only having NDA-restricted results to work against. Some days you just have to cough up for your own hardware.

I just tagged pgbench-tools-0.6.0 and pushed to GitHub/git.postgresql.org with the changes that track and report on buffers_backend_fsync if anyone else wants to try this out. It includes those numbers if you have a 9.1 with them, otherwise it just reports 0 for it all the time; detection of the feature wasn't hard to add. The end portion of a config file for the program (the first part specifies host/username info and the like) that would replicate the third test set here is:

MAX_WORKERS="4"
SCRIPT="tpc-b.sql"
SCALES="1 10 100 175 250 500 1000 2000"
SETCLIENTS="4 8 16 32 64"
SETTIMES=3
RUNTIME=600
TOTTRANS=""

-- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
On Mon, Jan 17, 2011 at 6:07 PM, Jim Nasby wrote: > On Jan 15, 2011, at 8:15 AM, Robert Haas wrote: >> Well, the point of this is not to save time in the bgwriter - I'm not >> surprised to hear that wasn't noticeable. The point is that when the >> fsync request queue fills up, backends start performing an fsync *for >> every block they write*, and that's about as bad for performance as >> it's possible to be. So it's worth going to a little bit of trouble >> to try to make sure it doesn't happen. It didn't happen *terribly* >> frequently before, but it does seem to be common enough to worry about >> - e.g. on one occasion, I was able to reproduce it just by running >> pgbench -i -s 25 or something like that on a laptop. > > Wow, that's the kind of thing that would be incredibly difficult to figure > out, especially while your production system is in flames... Can we change > ereport that happens in that case from DEBUG1 to WARNING? Or provide some > other means to track it? Something like this? http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=3134d8863e8473e3ed791e27d484f9e548220411 -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
On Jan 15, 2011, at 8:15 AM, Robert Haas wrote: > Well, the point of this is not to save time in the bgwriter - I'm not > surprised to hear that wasn't noticeable. The point is that when the > fsync request queue fills up, backends start performing an fsync *for > every block they write*, and that's about as bad for performance as > it's possible to be. So it's worth going to a little bit of trouble > to try to make sure it doesn't happen. It didn't happen *terribly* > frequently before, but it does seem to be common enough to worry about > - e.g. on one occasion, I was able to reproduce it just by running > pgbench -i -s 25 or something like that on a laptop. Wow, that's the kind of thing that would be incredibly difficult to figure out, especially while your production system is in flames... Can we change ereport that happens in that case from DEBUG1 to WARNING? Or provide some other means to track it? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: Have you ever tested Robert's other idea of having a metronome process do a periodic fsync on a dummy file which is located on the same ext3fs as the table files? I think that that would be interesting to see. To be frank, I really don't care about fixing this behavior on ext3, especially in the context of that sort of hack. That filesystem is not the future, it's not possible to ever really make it work right, and every minute spent on pandering to its limitations would be better spent elsewhere IMHO. I'm starting with the ext3 benchmarks just to provide some proper context for the worst-case behavior people can see right now, and to make sure refactoring here doesn't make things worse on it. My target is same or slightly better on ext3, much better on XFS and ext4. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
On Sun, Jan 16, 2011 at 7:13 PM, Greg Smith wrote: > I have finished a first run of benchmarking the current 9.1 code at various > sizes. See http://www.2ndquadrant.us/pgbench-results/index.htm for many > details. The interesting stuff is in Test Set 3, near the bottom. That's > the first one that includes buffer_backend_fsync data. This is all on ext3 so > far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04. > > The results are classic Linux in 2010: latency pauses from checkpoint sync > will easily leave the system at a dead halt for a minute, with the worst one > observed this time dropping still for 108 seconds. That one is weird, but > these two are completely average cases: > > http://www.2ndquadrant.us/pgbench-results/210/index.html > http://www.2ndquadrant.us/pgbench-results/215/index.html > > I think a helpful next step here would be to put Robert's fsync compaction > patch into here and see if that helps. There are enough backend syncs > showing up in the difficult workloads (scale>=1000, clients >=32) that its > impact should be obvious. Have you ever tested Robert's other idea of having a metronome process do a periodic fsync on a dummy file which is located on the same ext3fs as the table files? I think that that would be interesting to see. Cheers, Jeff
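The metronome process Jeff describes is simple enough to sketch end to end. This is a hypothetical illustration (not code from the thread); the file path is made up, and it assumes that on ext3 in data=ordered mode an fsync of any file forces out the shared journal, and with it much of the other dirty data on the filesystem:

```python
import os


def metronome_tick(path):
    """Rewrite and fsync a 1-byte dummy file on the target filesystem.

    On ext3 (data=ordered), the fsync of even this trivial file forces a
    journal commit, which drags along other pending writes -- the point of
    the metronome is to pay that cost in small installments every 30s
    instead of all at once at checkpoint sync time.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC)
    try:
        os.write(fd, b"x")  # change the byte so there is something to flush
        os.fsync(fd)
    finally:
        os.close(fd)
```

A real daemon would just call `metronome_tick()` in a loop with a `time.sleep(30)` between ticks, running completely independently of PostgreSQL.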
Re: [HACKERS] Spread checkpoint sync
Greg Smith wrote: > One of the components to the write queue is some notion that writes that > have been waiting longest should eventually be flushed out. Linux has > this number called dirty_expire_centisecs which suggests it enforces > just that, set to a default of 30 seconds. This is why some 5-minute > interval checkpoints with default parameters, effectively spreading the > checkpoint over 2.5 minutes, can work under the current design. > Anything you wrote at T+0 to T+2:00 *should* have been written out > already when you reach T+2:30 and sync. Unfortunately, when the system > gets busy, there is this "congestion control" logic that basically > throws out any guarantee of writes starting shortly after the expiration > time. Should we be writing until 2:30 then sleep 30 seconds and fsync at 3:00? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Spread checkpoint sync
On Sun, Jan 16, 2011 at 10:13 PM, Greg Smith wrote: > I have finished a first run of benchmarking the current 9.1 code at various > sizes. See http://www.2ndquadrant.us/pgbench-results/index.htm for many > details. The interesting stuff is in Test Set 3, near the bottom. That's > the first one that includes buffer_backend_fsync data. This is all on ext3 so > far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04. > > The results are classic Linux in 2010: latency pauses from checkpoint sync > will easily leave the system at a dead halt for a minute, with the worst one > observed this time dropping still for 108 seconds. I wish I understood better what makes Linux systems "freeze up" under heavy I/O load. Linux - like other UNIX-like systems - generally has reasonably effective mechanisms for preventing a single task from monopolizing the (or a) CPU in the presence of other processes that also wish to be time-sliced, but the same thing doesn't appear to be true of I/O. > I think a helpful next step here would be to put Robert's fsync compaction > patch into here and see if that helps. There are enough backend syncs > showing up in the difficult workloads (scale>=1000, clients >=32) that its > impact should be obvious. Thanks for doing this work. I look forward to the results. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
I have finished a first run of benchmarking the current 9.1 code at various sizes. See http://www.2ndquadrant.us/pgbench-results/index.htm for many details. The interesting stuff is in Test Set 3, near the bottom. That's the first one that includes buffer_backend_fsync data. This is all on ext3 so far, but is using a newer 2.6.32 kernel, the one from Ubuntu 10.04. The results are classic Linux in 2010: latency pauses from checkpoint sync will easily leave the system at a dead halt for a minute, with the worst one observed this time dropping still for 108 seconds. That one is weird, but these two are completely average cases: http://www.2ndquadrant.us/pgbench-results/210/index.html http://www.2ndquadrant.us/pgbench-results/215/index.html I think a helpful next step here would be to put Robert's fsync compaction patch into here and see if that helps. There are enough backend syncs showing up in the difficult workloads (scale>=1000, clients >=32) that its impact should be obvious. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
On Sun, Jan 16, 2011 at 7:32 PM, Jeff Janes wrote: > But since you already wrote a patch to do the whole thing, I figured > I'd time it. Thanks! > I arranged to test an instrumented version of your patch under large > shared_buffers of 4GB, conditions that would maximize the opportunity > for it to take a long time. Running your compaction to go from 524288 > to a handful (14 to 29, depending on run) took between 36 and 39 > milliseconds. > > For comparison, doing just the memcpy part of AbsorbFsyncRequest on > a full queue took from 24 to 27 milliseconds. > > They are close enough to each other that I am no longer interested in > partial deduplication. But both are long enough that I wonder if > having a hash table in shared memory that is kept unique automatically > at each update might not be worthwhile. There are basically three operations that we care about here: (1) time to add an fsync request to the queue, (2) time to absorb requests from the queue, and (3) time to compact the queue. The first is by far the most common, and at least in any situation that anyone's analyzed so far, the second will be far more common than the third. Therefore, it seems unwise to accept any slowdown in #1 to speed up either #2 or #3, and a hash table probe is definitely going to be slower than what's required to add an element under the status quo. We could perhaps mitigate this by partitioning the hash table. Alternatively, we could split the queue in half and maintain a global variable - protected by the same lock - indicating which half is currently open for insertions. The background writer would grab the lock, flip the global, release the lock, and then drain the half not currently open to insertions; the next iteration would flush the other half. However, it's unclear to me that either of these things has any value. I can't remember any reports of contention on the BgWriterCommLock, so it seems like changing the logic as minimally as possible is the way to go. 
(In contrast, note that the WAL insert lock, proc array lock, and lock manager/buffer manager partition locks are all known to be heavily contended in certain workloads.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
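For reference, the compaction operation being weighed here (#3) is just order-preserving deduplication. The actual patch does this in C over the shared-memory request array; the logic can be sketched in a few lines, with a simplified hashable request key standing in for the real BgWriterRequest struct:

```python
def compact_fsync_queue(requests):
    """Order-preserving dedup of fsync requests, as in the compaction patch.

    Each request is a hashable key, e.g. (tablespace, relation, segment).
    A backend only needs *some* fsync of that file to happen after its
    write, so duplicate entries in the queue are redundant and droppable.
    """
    seen = set()
    compacted = []
    for req in requests:
        if req not in seen:
            seen.add(req)
            compacted.append(req)
    return compacted
```

On the pathological pgbench cases upthread, this is the step where thousands of queued entries collapse to a handful ("4096 entries compressed to 14"), letting the next several thousand backend requests take the fast path again.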
Re: [HACKERS] Spread checkpoint sync
On Tue, Jan 11, 2011 at 5:27 PM, Robert Haas wrote: > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith wrote: >> One of the ideas Simon and I had been considering at one point was adding >> some better de-duplication logic to the fsync absorb code, which I'm >> reminded by the pattern here might be helpful independently of other >> improvements. > > Hopefully I'm not stepping on any toes here, but I thought this was an > awfully good idea and had a chance to take a look at how hard it would > be today while en route from point A to point B. The answer turned > out to be "not very", so PFA a patch that seems to work. I tested it > by attaching gdb to the background writer while running pgbench, and > it eliminated the backend fsyncs without even breaking a sweat. I had been concerned about how long the lock would be held, and I was pondering ways to do only partial deduplication to reduce the time. But since you already wrote a patch to do the whole thing, I figured I'd time it. I arranged to test an instrumented version of your patch under large shared_buffers of 4GB, conditions that would maximize the opportunity for it to take a long time. Running your compaction to go from 524288 to a handful (14 to 29, depending on run) took between 36 and 39 milliseconds. For comparison, doing just the memcpy part of AbsorbFsyncRequest on a full queue took from 24 to 27 milliseconds. They are close enough to each other that I am no longer interested in partial deduplication. But both are long enough that I wonder if having a hash table in shared memory that is kept unique automatically at each update might not be worthwhile. Cheers, Jeff
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote:
> What is the basis for thinking that the sync should get the same amount of time as the writes? That seems pretty arbitrary. Right now, you're allowing 3 seconds per fsync, which could be a lot more or a lot less than 40% of the total checkpoint time...

Just that it's where I ended up at when fighting with this for a month on the system I've seen the most problems at. The 3 second number was reverse-engineered from a computation that said "aim for an interval of X minutes; we have Y relations on average involved in the checkpoint". The direction my latest patch is struggling to go is computing a reasonable time automatically in the same way--count the relations, do a time estimate, add enough delay so the sync calls should be spread linearly over the given time range.

> the checkpoint activity is always going to be spikey if it does anything at all, so spacing it out *more* isn't obviously useful.

One of the components to the write queue is some notion that writes that have been waiting longest should eventually be flushed out. Linux has this number called dirty_expire_centisecs which suggests it enforces just that, set to a default of 30 seconds. This is why some 5-minute interval checkpoints with default parameters, effectively spreading the checkpoint over 2.5 minutes, can work under the current design. Anything you wrote at T+0 to T+2:00 *should* have been written out already when you reach T+2:30 and sync. Unfortunately, when the system gets busy, there is this "congestion control" logic that basically throws out any guarantee of writes starting shortly after the expiration time. It turns out that the only thing that really works are the tunables that block new writes from happening once the queue is full, but they can't be set low enough to work well in earlier kernels when combined with lots of RAM. 
Using the terminology of http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt at some point you hit a point where "a process generating disk writes will itself start writeback." This is analogous to the PostgreSQL situation where backends do their own fsync calls. The kernel will eventually move to where those trying to write new data are instead recruited into being additional sources of write flushing. That's the part you just can't make aggressive enough on older kernels; dirty writers can always win. Ideally, the system never digs itself into a hole larger than you can afford to wait to write out. It's a transaction speed vs. latency thing though, and the older kernels just don't consider the latency side well enough. There is a new mechanism in the latest kernels to control this much better: dirty_bytes and dirty_background_bytes are the tunables. I haven't had a chance to test yet. As mentioned upthread, some of the bleeding edge kernels that have this feature available are showing such large general performance regressions in our tests, compared to the boring old RHEL5 kernel, that whether this feature works or not is irrelevant. I haven't tracked down which new kernel distributions work well performance-wise and which don't yet for PostgreSQL. I'm hoping that when I get there, I'll see results like http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages , where the ideal setting for dirty_bytes to keep latency under control with BBWC was 15MB. To put that into perspective, the lowest useful setting you can set dirty_ratio to is 5% of RAM. That's 410MB on my measly 8GB desktop, and 3.3GB on the 64GB production server I've been trying to tune. 
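The closing arithmetic is easy to check directly. A quick sketch (function and unit conventions mine; the thread mixes GB/GiB loosely, which is why the 64GB figure lands at about 3.2 rather than 3.3):

```python
def min_dirty_ratio_bytes(ram_bytes, ratio_percent=5):
    """Dirty data allowed at the lowest useful vm.dirty_ratio setting (5%)."""
    return ram_bytes * ratio_percent // 100

MiB = 1024 ** 2
GiB = 1024 ** 3

desktop_floor = min_dirty_ratio_bytes(8 * GiB)   # "measly 8GB desktop"
server_floor = min_dirty_ratio_bytes(64 * GiB)   # 64GB production server
# vm.dirty_bytes, by contrast, accepts an absolute figure -- like the
# 15MB that kept latency under control in the serverfault report.
```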
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
On Sat, Jan 15, 2011 at 5:57 PM, Greg Smith wrote: > I was just giving an example of how I might do an initial split. There's a > checkpoint happening now at time T; we have a rough idea that it needs to be > finished before some upcoming time T+D. Currently with default parameters > this becomes: > > Write: 0.5 * D; Sync: 0 > > Even though Sync obviously doesn't take zero. The slop here is enough that > it usually works anyway. > > I was suggesting that a quick reshuffling to: > > Write: 0.4 * D; Sync: 0.4 * D > > Might be a good first candidate for how to split the time up better. What is the basis for thinking that the sync should get the same amount of time as the writes? That seems pretty arbitrary. Right now, you're allowing 3 seconds per fsync, which could be a lot more or a lot less than 40% of the total checkpoint time, but I have a pretty clear sense of why that's a sensible thing to try: you give the rest of the system a moment or two to get some I/O done for something other than the checkpoint before flushing the next batch of buffers. But the checkpoint activity is always going to be spikey if it does anything at all, so spacing it out *more* isn't obviously useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
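For concreteness, the candidate splits being compared in this exchange work out as below. This is an illustrative sketch only (the percentages are from the thread; the function, variable names, and 5-minute example interval are mine):

```python
def checkpoint_windows(interval_s, write_pct, sync_pct):
    """Split a checkpoint interval D (seconds) into write and sync windows."""
    return interval_s * write_pct // 100, interval_s * sync_pct // 100

D = 300  # an example 5-minute checkpoint interval

current_default = checkpoint_windows(D, 50, 0)  # Write: 0.5 * D; Sync: 0
proposed_split = checkpoint_windows(D, 40, 40)  # Write: 0.4 * D; Sync: 0.4 * D
max_spread_now = checkpoint_windows(D, 90, 0)   # completion_target = 0.9
```

Robert's objection is visible in the numbers: the proposal's write window (120s here) is well under the 270s that a completion_target of 0.9 allows today, even though the proposal finally budgets explicit time for the sync phase.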
Re: [HACKERS] Spread checkpoint sync
On Sat, Jan 15, 2011 at 14:05, Robert Haas wrote: > Idea #4: For ext3 filesystems that like to dump the entire buffer > cache instead of only the requested file, write a little daemon that > runs alongside of (and completely independently of) PostgreSQL. Every > 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and > closes the file, thus dumping the cache and preventing a ridiculous > growth in the amount of data to be sync'd at checkpoint time. Wouldn't it be easier to just mount in data=writeback mode? This provides a similar level of journaling as most other file systems. Regards, Marti
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote:
> That seems like a bad idea - don't we routinely recommend that people crank this up to 0.9? You'd be effectively bounding the upper range of this setting to a value less than the lowest value we recommend anyone use today.

I was just giving an example of how I might do an initial split. There's a checkpoint happening now at time T; we have a rough idea that it needs to be finished before some upcoming time T+D. Currently with default parameters this becomes:

Write: 0.5 * D; Sync: 0

Even though Sync obviously doesn't take zero. The slop here is enough that it usually works anyway. I was suggesting that a quick reshuffling to:

Write: 0.4 * D; Sync: 0.4 * D

Might be a good first candidate for how to split the time up better. The fact that this gives less writing time than the current biggest spread possible:

Write: 0.9 * D; Sync: 0

Is true. It's also true that in the case where sync time really is zero, this new default would spread writes less than the current default. I think that's optimistic, but it could happen if checkpoints are small and you have a good write cache. Step back from that a second though. Ultimately, the person who is getting checkpoints at a 5 minute interval, and is being nailed by spikes, should have the option of just increasing the parameters to make that interval bigger. First you increase the measly default segments to a reasonable range, then checkpoint_completion_target is the second one you can try. But from there, you quickly move onto making checkpoint_timeout longer. At some point, there is no option but to give up checkpoints every 5 minutes as being practical, and make the average interval longer. Whether or not a refactoring here makes things slightly worse for cases closer to the default doesn't bother me too much. What bothers me is the way trying to stretch checkpoints out further fails to deliver as well as it should. 
I'd be OK with saying "to get the exact same spread situation as in older versions, you may need to retarget for checkpoints every 6 minutes" *if* in the process I get a much better sync latency distribution in most cases. Here's an interesting data point from the customer site this all started at, one I don't think they'll mind sharing since it helps make the situation more clear to the community. After applying this code to spread sync out, in order to get their server back to functional we had to move all the parameters involved up to where checkpoints were spaced 35 minutes apart. It just wasn't possible to write any faster than that without disrupting foreground activity. The whole current model where people think of this stuff in terms of segments and completion targets is a UI disaster. The direction I want to go in is where users can say "make sure checkpoints happen every N minutes", and something reasonable happens without additional parameter fiddling. And if the resulting checkpoint I/O spike is too big, they just increase the timeout to N+1 or N*2 to spread the checkpoint further. Getting too bogged down thinking in terms of the current, really terrible interface is something I'm trying to break myself of. Long-term, I want there to be checkpoint_timeout, and all the other parameters are gone, replaced by an internal implementation of the best practices proven to work even on busy systems. I don't have as much clarity on exactly what that best practice is the way that, say, I just suggested exactly how to eliminate wal_buffers as an important thing to manually set. But that's the dream UI: you shoot for a checkpoint interval, and something reasonable happens; if that's too intense, you just increase the interval to spread further. There probably will be small performance regression possible vs. the current code with parameter combination that happen to work well on it. 
Preserving every one of those is something that's not as important to me as making the tuning interface simple and clear. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
On Sat, Jan 15, 2011 at 10:31 AM, Greg Smith wrote: > That's going to give worse performance than the current code in some cases. OK. >> How does the checkpoint target give you any time to sync them? Unless >> you squeeze the writes together more tightly, but that seems sketchy. > > Obviously the checkpoint target idea needs to be shuffled around some too. > I was thinking of making the new default 0.8, and having it split the time > in half for write and sync. That will make the write phase close to the > speed people are seeing now, at the default of 0.5, while giving some window > for spread sync too. The exact way to redistribute that around I'm not so > concerned about yet. When I get to where that's the most uncertain thing > left I'll benchmark the TPS vs. latency trade-off and see what happens. If > the rest of the code is good enough but this just needs to be tweaked, > that's a perfect thing to get beta feedback to finalize. That seems like a bad idea - don't we routinely recommend that people crank this up to 0.9? You'd be effectively bounding the upper range of this setting to a value less than the lowest value we recommend anyone use today. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
On Sat, 2011-01-15 at 09:15 -0500, Robert Haas wrote: > On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs wrote: > > On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote: > >> Robert Haas wrote: > >> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith wrote: > >> > > >> > > One of the ideas Simon and I had been considering at one point was > >> > > adding > >> > > some better de-duplication logic to the fsync absorb code, which I'm > >> > > reminded by the pattern here might be helpful independently of other > >> > > improvements. > >> > > > >> > > >> > Hopefully I'm not stepping on any toes here, but I thought this was an > >> > awfully good idea and had a chance to take a look at how hard it would > >> > be today while en route from point A to point B. The answer turned > >> > out to be "not very", so PFA a patch that seems to work. I tested it > >> > by attaching gdb to the background writer while running pgbench, and > >> > it eliminate the backend fsyncs without even breaking a sweat. > >> > > >> > >> No toe damage, this is great, I hadn't gotten to coding for this angle > >> yet at all. Suffering from an overload of ideas and (mostly wasted) > >> test data, so thanks for exploring this concept and proving it works. > > > > No toe damage either, but are we sure we want the de-duplication logic > > and in this place? > > > > I was originally of the opinion that de-duplicating the list would save > > time in the bgwriter, but that guess was wrong by about two orders of > > magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable. > > Well, the point of this is not to save time in the bgwriter - I'm not > surprised to hear that wasn't noticeable. The point is that when the > fsync request queue fills up, backends start performing an fsync *for > every block they write*, and that's about as bad for performance as > it's possible to be. So it's worth going to a little bit of trouble > to try to make sure it doesn't happen. 
It didn't happen *terribly* > frequently before, but it does seem to be common enough to worry about > - e.g. on one occasion, I was able to reproduce it just by running > pgbench -i -s 25 or something like that on a laptop. > > With this patch applied, there's no performance impact vs. current > code in the very, very common case where space remains in the queue - > 999 times out of 1000, writing to the fsync queue will be just as fast > as ever. But in the unusual case where the queue has been filled up, > compacting the queue is much much faster than performing an fsync, and > the best part is that the compaction is generally massive. I was > seeing things like "4096 entries compressed to 14". So clearly even > if the compaction took as long as the fsync itself it would be worth > it, because the next 4000+ guys who come along again go through the > fast path. But in fact I think it's much faster than an fsync. > > In order to get pathological behavior even with this patch applied, > you'd need to have NBuffers pending fsync requests and they'd all have > to be different. I don't think that's theoretically impossible, but > Greg's research seems to indicate that even on busy systems we don't > come even a little bit close to the circumstances that would cause it > to occur in practice. Every other change we might make in this area > will further improve this case, too: for example, doing an absorb > after each fsync would presumably help, as would the more drastic step > of splitting the bgwriter into two background processes (one to do > background page cleaning, and the other to do checkpoints, for > example). But even without those sorts of changes, I think this is > enough to effectively eliminate the full fsync queue problem in > practice, which seems worth doing independently of anything else. You've persuaded me. 
-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote:
> I'll believe it when I see it. How about this:
>
> a 1
> a 2
> sync a
> b 1
> b 2
> sync b
> c 1
> c 2
> sync c
>
> Or maybe some variant, where we become willing to fsync a file a certain number of seconds after writing the last block, or when all the writes are done, whichever comes first.

That's going to give worse performance than the current code in some cases. The goal of what's in there now is that you get a sequence like this:

a1
b1
a2 [Filesystem writes a1]
b2 [Filesystem writes b1]
sync a [Only has to write a2]
sync b [Only has to write b2]

This idea works until you get to where the filesystem write cache is so large that it becomes lazier about writing things. The fundamental idea--push writes out some time before the sync, in hopes the filesystem will get to them before then--is not unsound. On some systems, doing the sync more aggressively than that will be a regression. This approach just breaks down in some cases, and those cases are happening more now because their likelihood scales with total RAM. I don't want to screw the people with smaller systems, who may be getting considerable benefit from the existing sequence. Today's little systems--which are very similar to the high-end ones the spread checkpoint stuff was developed on during 8.3--do get some benefit from it as far as I know. Anyway, now that the ability to get logging on all this stuff went in during the last CF, it's way easier to just set up a random system to run tests in this area than it used to be. Whatever testing does happen should include, say, a 2GB laptop with a single hard drive in it. I think that's the bottom of what is reasonable to consider a reasonable target for tweaking write performance on, given hardware 9.1 is likely to be deployed on.

> How does the checkpoint target give you any time to sync them? Unless you squeeze the writes together more tightly, but that seems sketchy.

Obviously the checkpoint target idea needs to be shuffled around some too. 
I was thinking of making the new default 0.8, and having it split the time in half for write and sync. That will make the write phase close to the speed people are seeing now, at the default of 0.5, while giving some window for spread sync too. The exact way to redistribute that around I'm not so concerned about yet. When I get to where that's the most uncertain thing left I'll benchmark the TPS vs. latency trade-off and see what happens. If the rest of the code is good enough but this just needs to be tweaked, that's a perfect thing to get beta feedback to finalize.

> Well you don't have to put it in shared memory on account of any of that. You can just hang it on a global variable.

Hmm. Because it's so similar to other things being allocated in shared memory, I just automatically pushed it over to there. But you're right; it doesn't need to be that complicated. Nobody is touching it but the background writer.

> If we can find something that's a modest improvement on the status quo and we can be confident in quickly, good, but I'd rather have 9.1 go out the door on time without fully fixing this than delay the release.

I'm not somebody who needs to be convinced of that. There are two near commit quality pieces of this out there now:

1) Keep some BGW cleaning and fsync absorption going while sync is happening, rather than starting it and ignoring everything else until it's done.
2) Compact fsync requests when the queue fills

If that's all we can get for 9.1, it will still be a major improvement. I realize I only have a very short period of time to complete a major integration breakthrough on the pieces floating around before the goal here has to drop to something less ambitious. I head to the West Coast for a week on the 23rd; I'll be forced to throw in the towel at that point if I can't get the better ideas we have in pieces here all assembled well by then. 
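The interleaved ordering Greg defends above (a1 b1 a2 b2, then sync a, sync b) can be sketched as plain file I/O. This is an illustration of the sequencing only, not actual bgwriter code; the round-robin gives each file's early blocks time to reach disk before that file's sync arrives:

```python
import os


def interleaved_write_then_sync(blocks_by_file):
    """Write one block per file round-robin, then fsync each file once.

    blocks_by_file maps a path to an ordered list of byte blocks,
    mimicking the a1 b1 a2 b2 ... sync a, sync b sequence.
    """
    fds = {path: os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC)
           for path in blocks_by_file}
    try:
        rounds = max(len(blocks) for blocks in blocks_by_file.values())
        for i in range(rounds):
            # One block from each file per round: a1, b1, then a2, b2, ...
            for path, blocks in blocks_by_file.items():
                if i < len(blocks):
                    os.write(fds[path], blocks[i])
        for path in blocks_by_file:
            # Ideally only each file's last block is still dirty by now.
            os.fsync(fds[path])
    finally:
        for fd in fds.values():
            os.close(fd)
```

Whether the filesystem really flushes the early blocks before the sync arrives is exactly the point in dispute; with a very large write cache the kernel may sit on everything until the fsync anyway.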
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Spread checkpoint sync
On Sat, Jan 15, 2011 at 9:25 AM, Greg Smith wrote:
> Once upon a time we got a patch from Itagaki Takahiro whose purpose was to sort writes before sending them out:
>
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

Ah, a fine idea!

> Which has very low odds of the sync on "a" finishing quickly, we'd get this one:
>
> table block
> a 1
> a 2
> b 1
> b 2
> c 1
> c 2
> sync a
> sync b
> sync c
>
> Which sure seems like a reasonable way to improve the odds data has been written before the associated sync comes along.

I'll believe it when I see it. How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a certain number of seconds after writing the last block, or when all the writes are done, whichever comes first. It seems to me that it's going to be a bear to figure out what fraction of the checkpoint you've completed if you put all of the syncs at the end, and this whole problem appears to be predicated on the assumption that the OS *isn't* writing out in a timely fashion. Are we sure that postponing the fsync relative to the writes is anything more than wishful thinking?

> Also, I could just traverse the sorted list with some simple logic to count the number of unique files, and then set the delay between fsync writes based on it. In the above, once the list was sorted, easy to just see how many times the table name changes on a linear scan of the sorted data. 3 files, so if the checkpoint target gives me, say, a minute of time to sync them, I can delay 20 seconds between. Simple math, and exactly the sort I used to get reasonable behavior on the busy production system this all started on.

How does the checkpoint target give you any time to sync them? Unless you squeeze the writes together more tightly, but that seems sketchy.

> So I fixed the bitrot on the old sorted patch, which was fun as it came from before the 8.3 changes. It seemed to work.
> I then moved the structure it uses to hold the list of buffers to write, the thing that's sorted, into shared memory. It's got a predictable maximum size, relying on palloc in the middle of the checkpoint code seems bad, and there's some potential gain from not reallocating it every time through.

Well you don't have to put it in shared memory on account of any of that. You can just hang it on a global variable.

> There's good bits in the patch I submitted for the last CF and in the patch you wrote earlier this week. This unfinished patch may be a valuable idea to fit in there too once I fix it, or maybe it's fundamentally flawed and one of the other ideas you suggested (or I have sitting on the potential design list) will work better. There's a patch integration problem that needs to be solved here, but I think almost all the individual pieces are available. I'd hate to see this fail to get integrated now just for lack of time, considering the problem is so serious when you run into it.

Likewise, but committing something half-baked is no good either. I think we're in a position to crush the full-fsync-queue problem flat (my patch should do that, and there are several other obvious things we can do for extra certainty) but the problem of spreading out the fsyncs looks to me like something we don't completely know how to solve. If we can find something that's a modest improvement on the status quo and we can be confident in quickly, good, but I'd rather have 9.1 go out the door on time without fully fixing this than delay the release.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote:
> Idea #2: At the beginning of a checkpoint when we scan all the buffers, count the number of buffers that need to be synced for each relation. Use the same hashtable that we use for tracking pending fsync requests. Then, interleave the writes and the fsyncs...
>
> Idea #3: Stick with the idea of a fixed delay between fsyncs, but compute how many fsyncs you think you're ultimately going to need at the start of the checkpoint, and back up the target completion time by 3 s per fsync from the get-go, so that the checkpoint still finishes on schedule.

What I've been working on is something halfway between these two ideas. I have a patch, and it doesn't work right yet because I just broke it, but since I have some faint hope this will all come together any minute now I'm going to share it before someone announces a deadline has passed or something. (whistling)

I'm going to add this messy thing and the patch you submitted upthread to the CF list; I'll review yours, I'll either fix the remaining problem in this one myself or rewrite to one of your ideas, and then it's on to a round of benchmarking.

Once upon a time we got a patch from Itagaki Takahiro whose purpose was to sort writes before sending them out:

http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

This didn't work reliably for everyone because of the now well-understood ext3 issues--I never replicated that speedup at the time, for example. And this was before the spread checkpoint code was in 8.3. The hope was that it wasn't really going to be necessary after that anyway.

Back to today...instead of something complicated, it struck me that if I just had a count of exactly how many files were involved in each checkpoint, that would be helpful. I could keep the idea of a fixed delay between fsyncs, but just auto-tune that delay amount based on the count. And how do you count the number of unique things in a list? Well, you can always sort them.
I thought that if the sorted writes patch got back to functional again, it could serve two purposes. It would group all of the writes for a file together, and if you did the syncs in the same sorted order they would have the maximum odds of discovering the data was already written. So rather than this possible order:

table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly, we'd get this one:

table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds data has been written before the associated sync comes along. Also, I could just traverse the sorted list with some simple logic to count the number of unique files, and then set the delay between fsync writes based on it. In the above, once the list was sorted, easy to just see how many times the table name changes on a linear scan of the sorted data. 3 files, so if the checkpoint target gives me, say, a minute of time to sync them, I can delay 20 seconds between. Simple math, and exactly the sort I used to get reasonable behavior on the busy production system this all started on. There's some unresolved trickiness in the segment-driven checkpoint case, but one thing at a time.

So I fixed the bitrot on the old sorted patch, which was fun as it came from before the 8.3 changes. It seemed to work.
Somewhere along the way, it started doing this instead of what I wanted:

BadArgument("!(((header->context) != ((void *)0) && ((((Node*)((header->context)))->type) == T_AllocSetContext)))", File: "mcxt.c", Line: 589)

(that's from initdb, not a good sign) And it's left me wondering whether this whole idea is a dead end I used up my window of time wandering down. There's good bits in the patch I submitted for the last CF and in the patch you wrote earlier this week. This unfinished patch may be a valuable idea to fit in there too once I fix it, or maybe it's fundamentally flawed and one of the other ideas you suggested (or I have sitting on the potential design list) will work better. There's a patch integration problem that needs to be solved here, but I think almost all the individual pieces are available. I'd hate to see this fail to get integrated now just for lack of time, considering the problem is so serious when you run into it.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c i
Re: [HACKERS] Spread checkpoint sync
On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs wrote: > On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote: >> Robert Haas wrote: >> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith wrote: >> > >> > > One of the ideas Simon and I had been considering at one point was adding >> > > some better de-duplication logic to the fsync absorb code, which I'm >> > > reminded by the pattern here might be helpful independently of other >> > > improvements. >> > > >> > >> > Hopefully I'm not stepping on any toes here, but I thought this was an >> > awfully good idea and had a chance to take a look at how hard it would >> > be today while en route from point A to point B. The answer turned >> > out to be "not very", so PFA a patch that seems to work. I tested it >> > by attaching gdb to the background writer while running pgbench, and >> > it eliminate the backend fsyncs without even breaking a sweat. >> > >> >> No toe damage, this is great, I hadn't gotten to coding for this angle >> yet at all. Suffering from an overload of ideas and (mostly wasted) >> test data, so thanks for exploring this concept and proving it works. > > No toe damage either, but are we sure we want the de-duplication logic > and in this place? > > I was originally of the opinion that de-duplicating the list would save > time in the bgwriter, but that guess was wrong by about two orders of > magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable. Well, the point of this is not to save time in the bgwriter - I'm not surprised to hear that wasn't noticeable. The point is that when the fsync request queue fills up, backends start performing an fsync *for every block they write*, and that's about as bad for performance as it's possible to be. So it's worth going to a little bit of trouble to try to make sure it doesn't happen. It didn't happen *terribly* frequently before, but it does seem to be common enough to worry about - e.g. 
on one occasion, I was able to reproduce it just by running pgbench -i -s 25 or something like that on a laptop. With this patch applied, there's no performance impact vs. current code in the very, very common case where space remains in the queue - 999 times out of 1000, writing to the fsync queue will be just as fast as ever. But in the unusual case where the queue has been filled up, compacting the queue is much much faster than performing an fsync, and the best part is that the compaction is generally massive. I was seeing things like "4096 entries compressed to 14". So clearly even if the compaction took as long as the fsync itself it would be worth it, because the next 4000+ guys who come along again go through the fast path. But in fact I think it's much faster than an fsync. In order to get pathological behavior even with this patch applied, you'd need to have NBuffers pending fsync requests and they'd all have to be different. I don't think that's theoretically impossible, but Greg's research seems to indicate that even on busy systems we don't come even a little bit close to the circumstances that would cause it to occur in practice. Every other change we might make in this area will further improve this case, too: for example, doing an absorb after each fsync would presumably help, as would the more drastic step of splitting the bgwriter into two background processes (one to do background page cleaning, and the other to do checkpoints, for example). But even without those sorts of changes, I think this is enough to effectively eliminate the full fsync queue problem in practice, which seems worth doing independently of anything else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote: > Robert Haas wrote: > > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith wrote: > > > > > One of the ideas Simon and I had been considering at one point was adding > > > some better de-duplication logic to the fsync absorb code, which I'm > > > reminded by the pattern here might be helpful independently of other > > > improvements. > > > > > > > Hopefully I'm not stepping on any toes here, but I thought this was an > > awfully good idea and had a chance to take a look at how hard it would > > be today while en route from point A to point B. The answer turned > > out to be "not very", so PFA a patch that seems to work. I tested it > > by attaching gdb to the background writer while running pgbench, and > > it eliminate the backend fsyncs without even breaking a sweat. > > > > No toe damage, this is great, I hadn't gotten to coding for this angle > yet at all. Suffering from an overload of ideas and (mostly wasted) > test data, so thanks for exploring this concept and proving it works. No toe damage either, but are we sure we want the de-duplication logic and in this place? I was originally of the opinion that de-duplicating the list would save time in the bgwriter, but that guess was wrong by about two orders of magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Spread checkpoint sync
On Sat, Jan 15, 2011 at 5:47 AM, Greg Smith wrote: > No toe damage, this is great, I hadn't gotten to coding for this angle yet > at all. Suffering from an overload of ideas and (mostly wasted) test data, > so thanks for exploring this concept and proving it works. Yeah - obviously I want to make sure that someone reviews the logic carefully, since a loss of fsyncs or a corruption of the request queue could affect system stability, but only very rarely, since you'd need full fsync queue + crash. But the code is pretty simple, so it should be possible to convince ourselves as to its correctness (or otherwise). Obviously, major credit to you and Simon for identifying the problem and coming up with a proposed fix. > I'm not sure what to do with the rest of the work I've been doing in this > area here, so I'm tempted to just combine this new bit from you with the > older patch I submitted, streamline, and see if that makes sense. Expected > to be there already, then "how about spending 5 minutes first checking out > that autovacuum lock patch again" turned out to be a wild underestimate. I'd rather not combine the patches, because this one is pretty simple and just does one thing, but feel free to write something that applies over top of it. Looking through your old patch (sync-spread-v3), there seem to be a couple of components there: - Compact the fsync queue based on percentage fill rather than number of writes per absorb. If we apply my queue-compacting logic, do we still need this? The queue compaction may hold the BgWriterCommLock for slightly longer than AbsorbFsyncRequests() would, but I'm not inclined to jump to the conclusion that this is worth getting excited about. The whole idea of accessing BgWriterShmem->num_requests without the lock gives me the willies anyway - sure, it'll probably work OK most of the time, especially on x86, but it seems hard to predict whether there will be occasional bad behavior on platforms with weak memory ordering. 
- Call pgstat_send_bgwriter() at the end of AbsorbFsyncRequests(). Not sure what the motivation for this is. - CheckpointSyncDelay(), to make sure that we absorb fsync requests and free up buffers during a long checkpoint. I think this part is clearly valuable, although I'm not sure we've satisfactorily solved the problem of how to spread out the fsyncs and still complete the checkpoint on schedule. As to that, I have a couple of half-baked ideas I'll throw out so you can laugh at them. Some of these may be recycled versions of ideas you've already had/mentioned, so, again, credit to you for getting the ball rolling. Idea #1: When we absorb fsync requests, don't just remember that there was an fsync request; also remember the time of said fsync request. If a new fsync request arrives for a segment for which we're already remembering an fsync request, update the timestamp on the request. Periodically scan the fsync request queue for requests older than, say, 30 s, and perform one such request. The idea is - if we wrote a bunch of data to a relation and then haven't touched it for a while, force it out to disk before the checkpoint actually starts so that the volume of work required by the checkpoint is lessened. Idea #2: At the beginning of a checkpoint when we scan all the buffers, count the number of buffers that need to be synced for each relation. Use the same hashtable that we use for tracking pending fsync requests. Then, interleave the writes and the fsyncs. Start by performing any fsyncs that need to happen but have no buffers to sync (i.e. everything that must be written to that relation has already been written). Then, start performing the writes, decrementing the pending-write counters as you go. If the pending-write count for a relation hits zero, you can add it to the list of fsyncs that can be performed before the writes are finished. 
If it doesn't hit zero (perhaps because a non-bgwriter process wrote a buffer that we were going to write anyway), then we'll do it at the end. One problem with this - aside from complexity - is that most likely most fsyncs would either happen at the beginning or very near the end, because there's no reason to assume that buffers for the same relation would be clustered together in shared_buffers. But I'm inclined to think that at least the idea of performing fsyncs for which no dirty buffers remain in shared_buffers at the beginning of the checkpoint rather than at the end might have some value. Idea #3: Stick with the idea of a fixed delay between fsyncs, but compute how many fsyncs you think you're ultimately going to need at the start of the checkpoint, and back up the target completion time by 3 s per fsync from the get-go, so that the checkpoint still finishes on schedule. Idea #4: For ext3 filesystems that like to dump the entire buffer cache instead of only the requested file, write a little daemon that runs alongside of (and completely independently of) PostgreSQL. Every 30 s,
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote:
> On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith wrote:
>> One of the ideas Simon and I had been considering at one point was adding some better de-duplication logic to the fsync absorb code, which I'm reminded by the pattern here might be helpful independently of other improvements.
>
> Hopefully I'm not stepping on any toes here, but I thought this was an awfully good idea and had a chance to take a look at how hard it would be today while en route from point A to point B. The answer turned out to be "not very", so PFA a patch that seems to work. I tested it by attaching gdb to the background writer while running pgbench, and it eliminated the backend fsyncs without even breaking a sweat.

No toe damage, this is great, I hadn't gotten to coding for this angle yet at all. Suffering from an overload of ideas and (mostly wasted) test data, so thanks for exploring this concept and proving it works.

I'm not sure what to do with the rest of the work I've been doing in this area here, so I'm tempted to just combine this new bit from you with the older patch I submitted, streamline, and see if that makes sense. Expected to be there already, then "how about spending 5 minutes first checking out that autovacuum lock patch again" turned out to be a wild underestimate.

Part of the problem is that it's become obvious to me over the last month that right now is a terrible time to be doing Linux benchmarks that impact filesystem sync behavior. The recent kernel changes showing up in the next rev of the enterprise distributions--like RHEL6 and Debian Squeeze both working to get a stable 2.6.32--have made testing a nightmare. I don't want to dump a lot of time into optimizing for <2.6.32 if this problem changes its form in newer kernels, but the distributions built around newer kernels are just not fully baked enough yet to tell. For example, the pre-release Squeeze numbers we're seeing are awful so far, but it's not really done yet either.
I expect that 3-6 months from today, all will have settled down enough that I can make some sense of it. Lately my work with the latest distributions has just been burning time installing stuff that doesn't work quite right yet. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith wrote: > Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the > time now has made me reconsider how important one potential bit of > refactoring here would be. I managed to catch one of the situations where > really popular relations were being heavily updated in a way that was > competing with the checkpoint on my test system (which I can happily share > the logs of), with the instrumentation patch applied but not the spread sync > one: > > LOG: checkpoint starting: xlog > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 7747 of relation base/16424/16442 > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 42688 of relation base/16424/16437 > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 9723 of relation base/16424/16442 > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 58117 of relation base/16424/16437 > DEBUG: could not forward fsync request because request queue is full > CONTEXT: writing block 165128 of relation base/16424/16437 > [330 of these total, all referring to the same two relations] > > DEBUG: checkpoint sync: number=1 file=base/16424/16448_fsm > time=10132.83 msec > DEBUG: checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec > DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec > DEBUG: checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec > DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec > DEBUG: checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec > DEBUG: checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec > DEBUG: checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 > msec > DEBUG: checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 > msec > DEBUG: checkpoint sync: 
number=10 file=base/16424/16428 time=0.001000 msec > DEBUG: checkpoint sync: number=11 file=base/16424/16425 time=0.00 msec > DEBUG: checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 > msec > DEBUG: checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 > msec > LOG: checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log > file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, > total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s > > Note here how the checkpoint was hung on trying to get 16448_fsm written > out, but the backends were issuing constant competing fsync calls to these > other relations. This is very similar to the production case this patch was > written to address, which I hadn't been able to share a good example of yet. > That's essentially what it looks like, except with the contention going on > for minutes instead of seconds. > > One of the ideas Simon and I had been considering at one point was adding > some better de-duplication logic to the fsync absorb code, which I'm > reminded by the pattern here might be helpful independently of other > improvements. Hopefully I'm not stepping on any toes here, but I thought this was an awfully good idea and had a chance to take a look at how hard it would be today while en route from point A to point B. The answer turned out to be "not very", so PFA a patch that seems to work. I tested it by attaching gdb to the background writer while running pgbench, and it eliminate the backend fsyncs without even breaking a sweat. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 4457df6..f6cd8dc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -182,6 +182,7 @@
 static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
 static bool IsCheckpointOnSchedule(double progress);
 static bool ImmediateCheckpointRequested(void);
+static bool CompactBgwriterRequestQueue(void);

 /* Signal handlers */

@@ -1090,10 +1091,20 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
     /* Count all backend writes regardless of if they fit in the queue */
     BgWriterShmem->num_backend_writes++;

+    /*
+     * If the background writer isn't running or the request queue is full,
+     * the backend will have to perform its own fsync request. But before
+     * forcing that to happen, we can try to compact the background writer
+     * request queue.
+     */
     if (BgWriterShmem->bgwriter_pid == 0 ||
-        BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
+        (BgWriterShmem->num_requests >= BgWriterShmem->max_requests
+         && !CompactBgwriterRequestQueue()))
     {
-        /* Also count the subset where backends have to do their own fsync */
+        /*
+         * Count the subset of writes where backends have to do their own
+         * fsync
+         */
         BgWriterShm
Re: [HACKERS] Spread checkpoint sync
On Mon, 2010-12-06 at 23:26 -0300, Alvaro Herrera wrote: > Why would multiple bgwriter processes worry you? Because it complicates the tracking of files requiring fsync. As Greg says, the last attempt to do that was a lot of code. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Spread checkpoint sync
Alvaro Herrera wrote:
> Why would multiple bgwriter processes worry you? Of course, it wouldn't work to have multiple processes trying to execute a checkpoint simultaneously, but what if we separated the tasks so that one process is in charge of checkpoints, and another one is in charge of the LRU scan?

I was commenting more in the context of development resource allocation. Moving toward that design would be helpful, but it alone isn't enough to improve the checkpoint sync issues. My concern is that putting work into that area will be a distraction from making progress on those. If individual syncs take so long that the background writer gets lost for a while executing them, and therefore doesn't do LRU cleanup, you've got a problem that LRU-related improvements probably aren't enough to solve.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Excerpts from Greg Smith's message of dom dic 05 20:02:48 -0300 2010:
> What ends up happening if you push toward fully sync I/O is the design you see in some other databases, where you need multiple writer processes. Then requests for new pages can continue to allocate as needed, while keeping any one write from blocking things. That's one sort of a way to simulate asynchronous I/O, and you can substitute true async I/O instead in many of those implementations. We didn't have much luck with portability on async I/O when that was last experimented with, and having multiple background writer processes seems like overkill; that whole direction worries me.

Why would multiple bgwriter processes worry you? Of course, it wouldn't work to have multiple processes trying to execute a checkpoint simultaneously, but what if we separated the tasks so that one process is in charge of checkpoints, and another one is in charge of the LRU scan?

-- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Spread checkpoint sync
Rob Wultsch wrote:
> Forgive me, but is all of this a step on the slippery slope to direct io? And is this a bad thing?

I don't really think so. There's an important difference in my head between direct I/O, where the kernel is told "write this immediately!", and what I'm trying to achieve. I want to give the kernel an opportunity to write blocks out in an efficient way, so that it can take advantage of elevator sorting, write combining, and similar tricks. But, eventually, those writes have to make it out to disk. Linux claims to have concepts like a "deadline" for I/O to happen, but they turn out to not be so effective once the system gets backed up with enough writes. Since fsync time is the only effective deadline, I'm progressing from the standpoint that adjusting when it happens relative to the write will help, while still allowing the kernel an opportunity to get the writes out on its own schedule.

What ends up happening if you push toward fully sync I/O is the design you see in some other databases, where you need multiple writer processes. Then requests for new pages can continue to allocate as needed, while keeping any one write from blocking things. That's one sort of a way to simulate asynchronous I/O, and you can substitute true async I/O instead in many of those implementations. We didn't have much luck with portability on async I/O when that was last experimented with, and having multiple background writer processes seems like overkill; that whole direction worries me.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] Spread checkpoint sync
On Sun, Dec 5, 2010 at 2:53 PM, Greg Smith wrote: > Heikki Linnakangas wrote: >> >> If you fsync() a file with one dirty page in it, it's going to return very >> quickly, but a 1GB file will take a while. That could be problematic if you >> have a thousand small files and a couple of big ones, as you would want to >> reserve more time for the big ones. I'm not sure what to do about it, maybe >> it's not a problem in practice. > > It's a problem in practice allright, with the bulk-loading situation being > the main one you'll hit it. If somebody is running a giant COPY to populate > a table at the time the checkpoint starts, there's probably a 1GB file of > dirty data that's unsynced around there somewhere. I think doing anything > about that situation requires an additional leap in thinking about buffer > cache evicition and fsync absorption though. Ultimately I think we'll end > up doing sync calls for relations that have gone "cold" for a while all the > time as part of BGW activity, not just at checkpoint time, to try and avoid > this whole area better. That's a lot more than I'm trying to do in my first > pass of improvements though. > > In the interest of cutting the number of messy items left in the official > CommitFest, I'm going to mark my patch here "Returned with Feedback" and > continue working in the general direction I was already going. Concept > shared, underlying patches continue to advance, good discussion around it; > those were my goals for this CF and I think we're there. > > I have a good idea how to autotune the sync spread that's hardcoded in the > current patch. I'll work on finishing that up and organizing some more > extensive performance tests. Right now I'm more concerned about finishing > the tests around the wal_sync_method issues, which are related to this and > need to get sorted out a bit more urgently. 
> > -- > Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD > PostgreSQL Training, Services and Support www.2ndQuadrant.us > Forgive me, but is all of this a step on the slippery slope to direct io? And is this a bad thing? -- Rob Wultsch wult...@gmail.com
Re: [HACKERS] Spread checkpoint sync
Heikki Linnakangas wrote: If you fsync() a file with one dirty page in it, it's going to return very quickly, but a 1GB file will take a while. That could be problematic if you have a thousand small files and a couple of big ones, as you would want to reserve more time for the big ones. I'm not sure what to do about it, maybe it's not a problem in practice. It's a problem in practice all right, with the bulk-loading situation being the main one where you'll hit it. If somebody is running a giant COPY to populate a table at the time the checkpoint starts, there's probably a 1GB file of dirty data that's unsynced around there somewhere. I think doing anything about that situation requires an additional leap in thinking about buffer cache eviction and fsync absorption though. Ultimately I think we'll end up doing sync calls for relations that have gone "cold" for a while all the time as part of BGW activity, not just at checkpoint time, to try and avoid this whole area better. That's a lot more than I'm trying to do in my first pass of improvements though. In the interest of cutting the number of messy items left in the official CommitFest, I'm going to mark my patch here "Returned with Feedback" and continue working in the general direction I was already going. Concept shared, underlying patches continue to advance, good discussion around it; those were my goals for this CF and I think we're there. I have a good idea how to autotune the sync spread that's hardcoded in the current patch. I'll work on finishing that up and organizing some more extensive performance tests. Right now I'm more concerned about finishing the tests around the wal_sync_method issues, which are related to this and need to get sorted out a bit more urgently.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] Spread checkpoint sync
Greg Stark wrote: Using sync_file_range you can specify the set of blocks to sync and then block on them only after some time has passed. But there's no documentation on how this relates to the I/O scheduler so it's not clear it would have any effect on the problem. I believe this is the exact spot we're stalled at in regards to getting this improved on the Linux side, as I understand it at least. *The* answer for this class of problem on Linux is to use sync_file_range, and I don't think we'll ever get any sympathy from those kernel developers until we do. But that's a Linux-specific call, so doing that is going to add a write path fork with platform-specific code into the database. If I thought sync_file_range was a silver bullet guaranteed to make this better, maybe I'd go for that. I think there's some relatively low-hanging fruit on the database side that would do better before going to that extreme though, thus the patch. We might still have to delay the beginning of the sync to allow the dirty blocks to be synced naturally and then when we issue it still end up catching a lot of other i/o as well. Whether it's "lots" or not is really workload dependent. I work from the assumption that the blocks being written out by the checkpoint are the most popular ones in the database, the ones that accumulate a high usage count and stay there. If that's true, my guess is that the writes being done while the checkpoint is executing are a bit less likely to be touching the same files. You raise a valid concern, I just haven't seen that actually happen in practice yet. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [HACKERS] Spread checkpoint sync
On Thu, Dec 2, 2010 at 2:24 PM, Greg Stark wrote: > On Wed, Dec 1, 2010 at 4:25 AM, Greg Smith wrote: >>> I ask because I don't have a mental model of how the pause can help. >>> Given that this dirty data has been hanging around for many minutes >>> already, what is a 3 second pause going to heal? >>> >> >> The difference is that once an fsync call is made, dirty data is much more >> likely to be forced out. It's the one thing that bypasses all other ways >> the kernel might try to avoid writing the data > > I had always assumed the problem was that other I/O had been done to > the files in the meantime. I.e. the fsync is not just syncing the > checkpoint but any other blocks that had been flushed since the > checkpoint started. It strikes me that we might start the syncs of the files that the checkpoint isn't going to dirty further at the start of the checkpoint, and do the rest at the end. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
> Using sync_file_range you can specify the set of blocks to sync and > then block on them only after some time has passed. But there's no > documentation on how this relates to the I/O scheduler so it's not > clear it would have any effect on the problem. We might still have to > delay the beginning of the sync to allow the dirty blocks to be synced > naturally and then when we issue it still end up catching a lot of > other i/o as well. This *really* sounds like we should be working with the FS geeks on making the OS do this work for us. Greg, you wanna go to LinuxCon next year? -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Spread checkpoint sync
On Wed, Dec 1, 2010 at 4:25 AM, Greg Smith wrote: >> I ask because I don't have a mental model of how the pause can help. >> Given that this dirty data has been hanging around for many minutes >> already, what is a 3 second pause going to heal? >> > > The difference is that once an fsync call is made, dirty data is much more > likely to be forced out. It's the one thing that bypasses all other ways > the kernel might try to avoid writing the data I had always assumed the problem was that other I/O had been done to the files in the meantime. I.e. the fsync is not just syncing the checkpoint but any other blocks that had been flushed since the checkpoint started. The longer the checkpoint is spread over the more other I/O is included as well. Using sync_file_range you can specify the set of blocks to sync and then block on them only after some time has passed. But there's no documentation on how this relates to the I/O scheduler so it's not clear it would have any effect on the problem. We might still have to delay the beginning of the sync to allow the dirty blocks to be synced naturally and then when we issue it still end up catching a lot of other i/o as well. -- greg
Re: [HACKERS] Spread checkpoint sync
On 01.12.2010 23:30, Greg Smith wrote: Heikki Linnakangas wrote: Do you have any idea how to autotune the delay between fsyncs? I'm thinking to start by counting the number of relations that need them at the beginning of the checkpoint. Then use the same basic math that drives the spread writes, where you assess whether you're on schedule or not based on segment/time progress relative to how many have been sync'd out of that total. At a high level I think that idea translates over almost directly into the existing write spread code. Was hoping for a sanity check from you in particular about whether that seems reasonable or not before diving into the coding. Sounds reasonable to me. fsync()s are a lot less uniform than write()s, though. If you fsync() a file with one dirty page in it, it's going to return very quickly, but a 1GB file will take a while. That could be problematic if you have a thousand small files and a couple of big ones, as you would want to reserve more time for the big ones. I'm not sure what to do about it, maybe it's not a problem in practice. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Spread checkpoint sync
Heikki Linnakangas wrote: Do you have any idea how to autotune the delay between fsyncs? I'm thinking to start by counting the number of relations that need them at the beginning of the checkpoint. Then use the same basic math that drives the spread writes, where you assess whether you're on schedule or not based on segment/time progress relative to how many have been sync'd out of that total. At a high level I think that idea translates over almost directly into the existing write spread code. Was hoping for a sanity check from you in particular about whether that seems reasonable or not before diving into the coding. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
On 01.12.2010 06:25, Greg Smith wrote: Jeff Janes wrote: I ask because I don't have a mental model of how the pause can help. Given that this dirty data has been hanging around for many minutes already, what is a 3 second pause going to heal? The difference is that once an fsync call is made, dirty data is much more likely to be forced out. It's the one thing that bypasses all other ways the kernel might try to avoid writing the data--both the dirty ratio guidelines and the congestion control logic--and forces those writes to happen as soon as they can be scheduled. If you graph the amount of data shown "Dirty:" by /proc/meminfo over time, once the sync calls start happening it's like a descending staircase pattern, dropping a little bit as each sync fires. Do you have any idea how to autotune the delay between fsyncs? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote: Have you tested out this "absorb during syncing phase" code without the sleep between the syncs? I.e. so that it is still a tight loop, but the loop alternates between sync and absorb, with no intentional pause? Yes; that's how it was developed. It helped to have just the extra absorb work without the pauses, but that alone wasn't enough to really improve things on the server we ran into this problem badly on. I ask because I don't have a mental model of how the pause can help. Given that this dirty data has been hanging around for many minutes already, what is a 3 second pause going to heal? The difference is that once an fsync call is made, dirty data is much more likely to be forced out. It's the one thing that bypasses all other ways the kernel might try to avoid writing the data--both the dirty ratio guidelines and the congestion control logic--and forces those writes to happen as soon as they can be scheduled. If you graph the amount of data shown as "Dirty:" by /proc/meminfo over time, once the sync calls start happening it's like a descending staircase pattern, dropping a little bit as each sync fires. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
On Sun, Nov 14, 2010 at 3:48 PM, Greg Smith wrote: ... > One change that turned out to be necessary rather than optional--to get good > performance from the system under tuning--was to make regular background > writer activity, including fsync absorb checks, happen during these sync > pauses. The existing code ran the checkpoint sync work in a pretty tight > loop, which as I alluded to in an earlier patch today can lead to the > backends competing with the background writer to get their sync calls > executed. This squashes that problem if the background writer is setup > properly. Have you tested out this "absorb during syncing phase" code without the sleep between the syncs? I.e. so that it is still a tight loop, but the loop alternates between sync and absorb, with no intentional pause? I wonder if all the improvement you see might not be due entirely to the absorb between syncs, and none or very little from the sleep itself. I ask because I don't have a mental model of how the pause can help. Given that this dirty data has been hanging around for many minutes already, what is a 3 second pause going to heal? The healing power of clearing out the absorb queue seems much more obvious. Cheers, Jeff
Re: [HACKERS] Spread checkpoint sync
> Maybe, but it's hard to argue that the current implementation--just > doing all of the sync calls as fast as possible, one after the other--is > going to produce worst-case behavior in a lot of situations. Given that > it's not a huge amount of code to do better, I'd rather do some work in > that direction, instead of presuming the kernel authors will ever make > this go away. Spreading the writes out as part of the checkpoint rework > in 8.3 worked better than any kernel changes I've tested since then, and > I'm not real optimistic about this getting resolved at the system level. > So long as the database changes aren't antagonistic toward kernel > improvements, I'd prefer to have some options here that become effective > as soon as the database code is done. Besides, even if kernel/FS authors did improve things, the improvements would not be available on production platforms for years. And, for that matter, while Linux and BSD are pretty responsive to our feedback, Apple, Microsoft and Oracle are most definitely not. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Spread checkpoint sync
Ron Mayer wrote: Might smoother checkpoints be better solved by talking to the OS vendors & virtual-memory-tuning-knob-authors to work with them on exposing the ideal knobs; rather than saying that our only tool is a hammer (fsync) so the problem must be handled as a nail. Maybe, but it's hard to argue that the current implementation--just doing all of the sync calls as fast as possible, one after the other--is going to produce worst-case behavior in a lot of situations. Given that it's not a huge amount of code to do better, I'd rather do some work in that direction, instead of presuming the kernel authors will ever make this go away. Spreading the writes out as part of the checkpoint rework in 8.3 worked better than any kernel changes I've tested since then, and I'm not real optimistic about this getting resolved at the system level. So long as the database changes aren't antagonistic toward kernel improvements, I'd prefer to have some options here that become effective as soon as the database code is done. I've attached an updated version of the initial sync spreading patch here, one that applies cleanly on top of HEAD and over top of the sync instrumentation patch too. The conflict that made that hard before is gone now. Having the pg_stat_bgwriter.buffers_backend_fsync patch available all the time now has made me reconsider how important one potential bit of refactoring here would be.
I managed to catch one of the situations where really popular relations were being heavily updated in a way that was competing with the checkpoint on my test system (which I can happily share the logs of), with the instrumentation patch applied but not the spread sync one: LOG: checkpoint starting: xlog DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 7747 of relation base/16424/16442 DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 42688 of relation base/16424/16437 DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 9723 of relation base/16424/16442 DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 58117 of relation base/16424/16437 DEBUG: could not forward fsync request because request queue is full CONTEXT: writing block 165128 of relation base/16424/16437 [330 of these total, all referring to the same two relations] DEBUG: checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.83 msec DEBUG: checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec DEBUG: checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec DEBUG: checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec DEBUG: checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec DEBUG: checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec DEBUG: checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec DEBUG: checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec DEBUG: checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec DEBUG: checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec DEBUG: checkpoint sync: number=11 file=base/16424/16425 time=0.00 msec DEBUG: checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec DEBUG: checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec LOG: checkpoint 
complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s Note here how the checkpoint was hung on trying to get 16448_fsm written out, but the backends were issuing constant competing fsync calls to these other relations. This is very similar to the production case this patch was written to address, which I hadn't been able to share a good example of yet. That's essentially what it looks like, except with the contention going on for minutes instead of seconds. One of the ideas Simon and I had been considering at one point was adding some better de-duplication logic to the fsync absorb code, which I'm reminded by the pattern here might be helpful independently of other improvements. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Supportwww.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 620b197..501cab8 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -143,8 +143,8 @@ typedef struct static BgWriterShmemStruct *BgWriterShmem; -/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */ -#define WRITES_PER_ABSORB 1000 +/* Fraction of fsync absorb queue that needs to be filled before acting */ +#def
Re: [HACKERS] Spread checkpoint sync
Josh Berkus wrote: > On 11/20/10 6:11 PM, Jeff Janes wrote: >> True, but I think that changing these from their defaults is not >> considered to be a dark art reserved for kernel hackers, i.e. they are >> something that sysadmins are expected to tweak to suit their >> workload, just like the shmmax and such. > > I disagree. Linux kernel hackers know about these kinds of parameters, > and I suppose that Linux performance experts do. But very few > sysadmins, in my experience, have any idea. To me, a lot of this conversation feels parallel to the arguments that occasionally come up debating writing directly to raw disks, bypassing the filesystems altogether. Might smoother checkpoints be better solved by talking to the OS vendors & virtual-memory-tuning-knob-authors to work with them on exposing the ideal knobs; rather than saying that our only tool is a hammer (fsync) so the problem must be handled as a nail. Hypothetically - what would the ideal knobs be? Something like madvise WONTNEED but that leaves pages in the OS's cache after writing them?
Re: [HACKERS] Spread checkpoint sync
2010/11/21 Andres Freund : > On Sunday 21 November 2010 23:19:30 Martijn van Oosterhout wrote: >> For a similar problem we had (kernel buffering too much) we had success >> using the fadvise and madvise WONTNEED syscalls to force the data to >> exit the cache much sooner than it would otherwise. This was on Linux >> and it had the side-effect that the data was deleted from the kernel >> cache, which we wanted, but probably isn't appropriate here. > Yep, works fine. Although it has the issue that the data will get read again > if > archiving/SR is enabled. mmhh, the current code does call DONTNEED or WILLNEED for WAL depending on whether archiving is off or on. This matters *only* once the data is written (fsync, fdatasync); before that it should not have an effect. > >> There is also sync_file_range, but that's linux specific, although >> close to what you want I think. It would allow you to work with blocks >> smaller than 1GB. > Unfortunately that puts the data under quite high write-out pressure inside > the kernel - which is not what you actually want because it limits reordering > and such significantly. > > It would be nicer if you could get a mix of both semantics (looking at it, > depending on the approach that seems to be about a 10 line patch to the > kernel). I.e. indicate that you want to write the pages soonish, but don't put > it on the head of the writeout queue. > > Andres -- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Spread checkpoint sync
On Sun, Nov 21, 2010 at 4:54 PM, Greg Smith wrote: > Let me throw some numbers out [...] Interesting. > Ultimately what I want to do here is some sort of smarter write-behind sync > operation, perhaps with a LRU on relations with pending fsync requests. The > idea would be to sync relations that haven't been touched in a while in > advance of the checkpoint even. I think that's similar to the general idea > Robert is suggesting here, to get some sync calls flowing before all of the > checkpoint writes have happened. I think that the final sync calls will > need to get spread out regardless, and since doing that requires a fairly > small amount of code too that's why we started with that. Doing some kind of background fsync-ing might indeed be sensible, but I agree that's secondary to trying to spread out the fsyncs during the checkpoint itself. I guess the question is what we can do there sensibly without an unreasonable amount of new code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Spread checkpoint sync
On 11/20/10 6:11 PM, Jeff Janes wrote: > True, but I think that changing these from their defaults is not > considered to be a dark art reserved for kernel hackers, i.e. they are > something that sysadmins are expected to tweak to suit their workload, > just like the shmmax and such. I disagree. Linux kernel hackers know about these kinds of parameters, and I suppose that Linux performance experts do. But very few sysadmins, in my experience, have any idea. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Spread checkpoint sync
On Sunday 21 November 2010 23:19:30 Martijn van Oosterhout wrote: > For a similar problem we had (kernel buffering too much) we had success > using the fadvise and madvise WONTNEED syscalls to force the data to > exit the cache much sooner than it would otherwise. This was on Linux > and it had the side-effect that the data was deleted from the kernel > cache, which we wanted, but probably isn't appropriate here. Yep, works fine. Although it has the issue that the data will get read again if archiving/SR is enabled. > There is also sync_file_range, but that's linux specific, although > close to what you want I think. It would allow you to work with blocks > smaller than 1GB. Unfortunately that puts the data under quite high write-out pressure inside the kernel - which is not what you actually want because it limits reordering and such significantly. It would be nicer if you could get a mix of both semantics (looking at it, depending on the approach that seems to be about a 10 line patch to the kernel). I.e. indicate that you want to write the pages soonish, but don't put it on the head of the writeout queue. Andres
Re: [HACKERS] Spread checkpoint sync
On Sun, Nov 21, 2010 at 04:54:00PM -0500, Greg Smith wrote: > Ultimately what I want to do here is some sort of smarter write-behind > sync operation, perhaps with a LRU on relations with pending fsync > requests. The idea would be to sync relations that haven't been touched > in a while in advance of the checkpoint even. I think that's similar to > the general idea Robert is suggesting here, to get some sync calls > flowing before all of the checkpoint writes have happened. I think that > the final sync calls will need to get spread out regardless, and since > doing that requires a fairly small amount of code too that's why we > started with that. For a similar problem we had (kernel buffering too much) we had success using the fadvise and madvise WONTNEED syscalls to force the data to exit the cache much sooner than it would otherwise. This was on Linux and it had the side-effect that the data was deleted from the kernel cache, which we wanted, but probably isn't appropriate here. There is also sync_file_range, but that's linux specific, although close to what you want I think. It would allow you to work with blocks smaller than 1GB. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > Patriotism is when love of your own people comes first; nationalism, > when hate for people other than your own comes first. > - Charles de Gaulle
Re: [HACKERS] Spread checkpoint sync
Robert Haas wrote: Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway. I'm not horribly interested in optimizing for the ext3 case per se, as I consider that filesystem fundamentally broken from the perspective of its ability to deliver low-latency here. I wouldn't want a patch that improved behavior on filesystems with granular fsync to make the ext3 situation worse. That's as much as I'd want design to lean toward considering its quirks. Jeff Janes made a case downthread for "why not make it the admin/OS's job to worry about this?" In cases where there is a reasonable solution available, in the form of "switch to XFS or ext4", I'm happy to take that approach. Let me throw some numbers out to give a better idea of the shape and magnitude of the problem case I've been working on here. In the situation that leads to the near hour-long sync phase I've seen, checkpoints will start with about a 3GB backlog of data in the kernel write cache to deal with. That's about 4% of RAM, just under the 5% threshold set by dirty_background_ratio. Whether or not the 256MB write cache on the controller is also filled is a relatively minor detail I can't monitor easily. The checkpoint itself? <250MB each time. This proportion is why I didn't think to follow the alternate path of worrying about spacing the write and fsync calls out differently.
I shrunk shared_buffers down to make the actual checkpoints smaller, which helped to some degree; that's what got them down to smaller than the RAID cache size. But the amount of data cached by the operating system is the real driver of total sync time here. Whether or not you include all of the writes from the checkpoint itself before you start calling fsync didn't actually matter very much; in the case I've been chasing, those are getting cached anyway. The write storm from the fsync calls themselves forcing things out seems to be the driver on I/O spikes, which is why I started with spacing those out. Writes go out at a rate of around 5MB/s, so clearing the 3GB backlog takes a minimum of 10 minutes of real time. There are about 300 1GB relation files involved in the case I've been chasing. This is where the 3 second delay number came from; 300 files, 3 seconds each, 900 seconds = 15 minutes of sync spread. You can turn that math around to figure out how much delay per relation you can afford while still keeping checkpoints to a planned end time, which isn't done in the patch I submitted yet. Ultimately what I want to do here is some sort of smarter write-behind sync operation, perhaps with a LRU on relations with pending fsync requests. The idea would be to sync relations that haven't been touched in a while in advance of the checkpoint even. I think that's similar to the general idea Robert is suggesting here, to get some sync calls flowing before all of the checkpoint writes have happened. I think that the final sync calls will need to get spread out regardless, and since doing that requires a fairly small amount of code too that's why we started with that. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: [HACKERS] Spread checkpoint sync
Jeff Janes wrote:
> And for very large memory systems, even 1% may be too much to cache (dirty*_ratio can only be set in integer percent points), so recent kernels introduced dirty*_bytes parameters. I like these better because they do what they say. With the dirty*_ratio, I could never figure out what it was a ratio of, and the results were unpredictable without extensive experimentation.

Right, you can't set dirty_background_ratio low enough to make this problem go away. Even attempts to set it to 1%, back when that was the right size for it, seem to be defeated by other mechanisms within the kernel. Last time I looked at the related source code, the "congestion control" logic that kicks in to throttle writes seemed a likely suspect. This is why I'm not real optimistic about newer mechanisms like the dirty_background_bytes added in 2.6.29 helping here; that just gives a way to set lower values, while the same basic logic remains under the hood. Like Jeff, I've never seen dirty_expire_centisecs help at all, possibly due to the same congestion mechanism.

> Yes, but how much work do we want to put into redoing the checkpoint logic so that the sysadmin on a particular OS and configuration and FS can avoid having to change the kernel parameters away from their defaults? (Assuming of course I am correctly understanding the problem, always a dangerous assumption.)

I've been trying to make this problem go away using just the kernel tunables available since 2006. I adjusted them carefully on the server that ran into this problem so badly that it motivated the submitted patch, months before the issue got bad. It didn't help. Maybe if they were running a later kernel that supported dirty_background_bytes, that would have worked better. During the last few years, the only thing that has consistently helped in every case is the checkpoint spreading logic that went into 8.3.

I no longer expect that the kernel developers will ever make this problem go away given the way checkpoints are written out right now, whereas the last round of good PostgreSQL work in this area definitely helped. The basic premise of the current checkpoint code is that if you write all of the buffers out early enough, by the time the syncs execute enough of the data should have gone out that they don't take very long to process. That was usually true for the last few years on systems with a battery-backed cache, where the amount of memory cached by the OS was small relative to the RAID cache size. That's not the case anymore, and the divergence is growing. The idea that the checkpoint sync code can run in a relatively tight loop, without stopping to do the normal background writer cleanup work, is also busted by that observation.
Re: [HACKERS] Spread checkpoint sync
On Sat, Nov 20, 2010 at 5:17 PM, Robert Haas wrote:
> On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes wrote:
>>> Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway. It's much better to dump the cache at the beginning of the checkpoint and then again after every file than it is to spew many GB of dirty stuff into the cache and then drop the hammer.
>>
>> But the kernel has knobs to prevent that from happening: dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer kernels), dirty_expire_centisecs. Don't these knobs work? Also, ext3 is supposed to do a journal commit every 5 seconds under default mount conditions.
>
> I don't know in detail. dirty_expire_centisecs sounds useful; I think the problem with dirty_background_ratio and dirty_ratio is that the default ratios are large enough that on systems with a huge pile of memory, they allow more dirty data to accumulate than can be flushed without causing an I/O storm.

True, but I think that changing these from their defaults is not considered to be a dark art reserved for kernel hackers, i.e. they are something that sysadmins are expected to tweak to suit their workload, just like shmmax and such. And for very large memory systems, even 1% may be too much to cache (dirty*_ratio can only be set in integer percent points), so recent kernels introduced dirty*_bytes parameters. I like these better because they do what they say. With the dirty*_ratio, I could never figure out what it was a ratio of, and the results were unpredictable without extensive experimentation.

> I believe Greg Smith made a comment along the lines of: memory sizes are growing faster than I/O speeds; therefore a ratio that is OK for a low-end system with a modest amount of memory causes problems on a high-end system that has faster I/O but MUCH more memory.

Yes, but how much work do we want to put into redoing the checkpoint logic so that the sysadmin on a particular OS and configuration and FS can avoid having to change the kernel parameters away from their defaults? (Assuming of course I am correctly understanding the problem, always a dangerous assumption.)

Some experiments I have just done show that dirty_expire_centisecs does not seem reliable on ext3, but the dirty*_ratio and dirty*_bytes settings seem reliable on ext2, ext3, and ext4. That may not apply to RAID; I don't have one I can test.

Cheers,

Jeff
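To put rough numbers on why the integer-percent granularity of the dirty*_ratio knobs hurts on big-memory boxes, here's a back-of-the-envelope sketch; the RAM sizes are illustrative, and the 5MB/s write rate is borrowed from the case Greg described earlier in the thread:

```python
def min_dirty_threshold_bytes(ram_bytes):
    """dirty_background_ratio can't go below 1%, so this is the
    smallest dirty-data threshold the ratio knob can express."""
    return ram_bytes // 100

GB = 1024 ** 3

# On a 4GB machine, the 1% floor is roughly 40MB of dirty data: modest.
small = min_dirty_threshold_bytes(4 * GB)

# On a 128GB machine, the same 1% floor is about 1.3GB, which at a
# sustained 5MB/s takes over four minutes to flush.  The dirty*_bytes
# parameters can be set below that floor, which is why they
# "do what they say".
big = min_dirty_threshold_bytes(128 * GB)
flush_seconds = big / (5 * 1024 ** 2)
```

The point is that a knob whose minimum grows linearly with RAM can't cap the flush backlog on machines where RAM has outgrown I/O bandwidth.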
Re: [HACKERS] Spread checkpoint sync
On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes wrote:
>>> The thing to realize that complicates the design is that the actual sync execution may take a considerable period of time. It's much more likely for that to happen than in the case of an individual write, as the current spread checkpoint does, because those are usually cached. In the spread sync case, it's easy for one slow sync to make the rest turn into ones that fire in quick succession, to make up for lost time.
>>
>> I think the behavior of file systems and operating systems is highly relevant here. We seem to have a theory that allowing a delay between the write and the fsync should give the OS a chance to start writing the data out,
>
> I thought that the theory was that doing too many fsyncs in short order can lead to some kind of starvation of other IO.
>
> If the theory is that we want to wait between writes and fsyncs, then the current behavior is probably the best: spreading out the writes and then doing all the syncs at the end gives the best delay time between an average write and the sync of the file it was written to. Or, spread the writes out over 150 seconds, sleep for 140 seconds, then do the fsyncs. But I don't think that that is the theory.

Well, I've heard Bruce and, I think, possibly also Greg talk about wanting to wait after doing the writes in the hopes that the kernel will start to flush the dirty pages, but I'm wondering whether it wouldn't be better to just give up on that and do: a small batch of writes, fsync those writes, another small batch of writes, fsync that batch, and so on.

>> but do we have any evidence indicating whether and under what circumstances that actually occurs? For example, if we knew that it's important to wait at least 30 s but waiting 60 s is no better, that would be useful information.
>>
>> Another question I have is about how we're actually going to know when any given fsync can be performed. For any given segment, there are a certain number of pages A that are already dirty at the start of the checkpoint.
>
> Dirty in the shared pool, or dirty in the OS cache?

OS cache, sorry.

>> Then there are a certain number of additional pages B that are going to be written out during the checkpoint. If it so happens that B = 0, we can call fsync() at the beginning of the checkpoint without losing anything (in fact, we gain something: any pages dirtied by cleaning scans or backend writes during the checkpoint won't need to hit the disk;
>
> Aren't those pages written out by cleaning scans and backend writes while the checkpoint is occurring exactly what you defined to be page set B, and then to be zero?

No, sorry; I'm referring to cases where all the dirty pages in a segment have been written out to the OS but we have not yet issued the necessary fsync.

>> and if the filesystem dumps more of its cache than necessary on fsync, we may as well take that hit before dirtying a bunch more stuff). But if B > 0, then we shouldn't attempt the fsync() until we've written them all; otherwise we'll end up having to fsync() that segment twice.
>>
>> Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway. It's much better to dump the cache at the beginning of the checkpoint and then again after every file than it is to spew many GB of dirty stuff into the cache and then drop the hammer.
>
> But the kernel has knobs to prevent that from happening: dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer kernels), dirty_expire_centisecs. Don't these knobs work? Also, ext3 is supposed to do a journal commit every 5 seconds under default mount conditions.

I don't know in detail. dirty_expire_centisecs sounds useful; I think the problem with dirty_background_ratio and dirty_ratio is that the default ratios are large enough that on systems with a huge pile of memory, they allow more dirty data to accumulate than can be flushed without causing an I/O storm. I believe Greg Smith made a comment along the lines of: memory sizes are growing faster than I/O speeds; therefore a ratio that is OK for a low-end system with a modest amount of memory causes problems on a high-end system that has faster I/O but MUCH more memory. As a kernel developer, I suspect the tendency is to try to set the ratio so that you keep enough free memory around to service future allocations.
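The "small batch of writes, fsync that batch, repeat" pattern suggested above can be sketched in a few lines; this is an illustration of the I/O pattern only, not the checkpoint code itself, and the function name is invented:

```python
import os

def write_in_fsynced_batches(path, pages, batch_size):
    """Write `pages` (a list of byte strings) to `path`, issuing an
    fsync after every `batch_size` pages so the dirty backlog behind
    any single fsync stays small."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for i, page in enumerate(pages, start=1):
            os.write(fd, page)
            if i % batch_size == 0:
                os.fsync(fd)   # drop the hammer on a small pile only
        os.fsync(fd)           # catch any trailing partial batch
    finally:
        os.close(fd)
```

Compared with writing everything and syncing at the end, each fsync here only has one batch's worth of cached data to force out, at the cost of giving the OS no window to write the batch out on its own first.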
Re: [HACKERS] Spread checkpoint sync
On Mon, Nov 15, 2010 at 6:15 PM, Robert Haas wrote:
> On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith wrote:
>> The second issue is that the delay between sync calls is currently hard-coded, at 3 seconds. I believe the right path here is to consider the current checkpoint_completion_target to still be valid, then work back from there. That raises the question of what percentage of the time writes should now be compressed into relative to that, to leave some time to spread the sync calls. If we're willing to say "writes finish in first 1/2 of target, syncs execute in second 1/2", then I could implement that here. Maybe that ratio needs to be another tunable. Still thinking about that part, and it's certainly open to community debate.

I would speculate that the answer is likely to be nearly binary. The best option would either be to do the writes as fast as possible and spread out the fsyncs, or to spread out the writes and do the fsyncs as fast as possible, depending on the system setup.

>> The thing to realize that complicates the design is that the actual sync execution may take a considerable period of time. It's much more likely for that to happen than in the case of an individual write, as the current spread checkpoint does, because those are usually cached. In the spread sync case, it's easy for one slow sync to make the rest turn into ones that fire in quick succession, to make up for lost time.
>
> I think the behavior of file systems and operating systems is highly relevant here. We seem to have a theory that allowing a delay between the write and the fsync should give the OS a chance to start writing the data out,

I thought that the theory was that doing too many fsyncs in short order can lead to some kind of starvation of other IO.

If the theory is that we want to wait between writes and fsyncs, then the current behavior is probably the best: spreading out the writes and then doing all the syncs at the end gives the best delay time between an average write and the sync of the file it was written to. Or, spread the writes out over 150 seconds, sleep for 140 seconds, then do the fsyncs. But I don't think that that is the theory.

> but do we have any evidence indicating whether and under what circumstances that actually occurs? For example, if we knew that it's important to wait at least 30 s but waiting 60 s is no better, that would be useful information.
>
> Another question I have is about how we're actually going to know when any given fsync can be performed. For any given segment, there are a certain number of pages A that are already dirty at the start of the checkpoint.

Dirty in the shared pool, or dirty in the OS cache?

> Then there are a certain number of additional pages B that are going to be written out during the checkpoint. If it so happens that B = 0, we can call fsync() at the beginning of the checkpoint without losing anything (in fact, we gain something: any pages dirtied by cleaning scans or backend writes during the checkpoint won't need to hit the disk;

Aren't those pages written out by cleaning scans and backend writes while the checkpoint is occurring exactly what you defined to be page set B, and then to be zero?

> and if the filesystem dumps more of its cache than necessary on fsync, we may as well take that hit before dirtying a bunch more stuff). But if B > 0, then we shouldn't attempt the fsync() until we've written them all; otherwise we'll end up having to fsync() that segment twice.
>
> Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway. It's much better to dump the cache at the beginning of the checkpoint and then again after every file than it is to spew many GB of dirty stuff into the cache and then drop the hammer.

But the kernel has knobs to prevent that from happening: dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer kernels), dirty_expire_centisecs. Don't these knobs work? Also, ext3 is supposed to do a journal commit every 5 seconds under default mount conditions.

Cheers,

Jeff
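The "writes in the first 1/2 of the target, syncs in the second 1/2" split quoted above is easy to express as arithmetic; here's a sketch, with `write_fraction` standing in for the possible extra tunable Greg mentions (the function itself is mine, not proposed code):

```python
def checkpoint_phase_budgets(checkpoint_timeout_s, completion_target,
                             write_fraction=0.5):
    """Split a checkpoint's time budget between its write phase and
    its sync phase.  completion_target is the fraction of the
    checkpoint interval the whole checkpoint may occupy, as with
    checkpoint_completion_target in postgresql.conf."""
    total = checkpoint_timeout_s * completion_target
    write_budget = total * write_fraction
    sync_budget = total - write_budget
    return write_budget, sync_budget

# e.g. checkpoint_timeout = 300s, checkpoint_completion_target = 0.5:
# a 150 second overall budget, split into 75s of writes and 75s of syncs.
w, s = checkpoint_phase_budgets(300, 0.5)
```

If the answer really is "nearly binary", as Jeff speculates, then `write_fraction` would tend toward 0 or 1 rather than sitting at 0.5.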
Re: [HACKERS] Spread checkpoint sync
On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith wrote:
> The second issue is that the delay between sync calls is currently hard-coded, at 3 seconds. I believe the right path here is to consider the current checkpoint_completion_target to still be valid, then work back from there. That raises the question of what percentage of the time writes should now be compressed into relative to that, to leave some time to spread the sync calls. If we're willing to say "writes finish in first 1/2 of target, syncs execute in second 1/2", then I could implement that here. Maybe that ratio needs to be another tunable. Still thinking about that part, and it's certainly open to community debate. The thing to realize that complicates the design is that the actual sync execution may take a considerable period of time. It's much more likely for that to happen than in the case of an individual write, as the current spread checkpoint does, because those are usually cached. In the spread sync case, it's easy for one slow sync to make the rest turn into ones that fire in quick succession, to make up for lost time.

I think the behavior of file systems and operating systems is highly relevant here. We seem to have a theory that allowing a delay between the write and the fsync should give the OS a chance to start writing the data out, but do we have any evidence indicating whether and under what circumstances that actually occurs? For example, if we knew that it's important to wait at least 30 s but waiting 60 s is no better, that would be useful information.

Another question I have is about how we're actually going to know when any given fsync can be performed. For any given segment, there are a certain number of pages A that are already dirty at the start of the checkpoint. Then there are a certain number of additional pages B that are going to be written out during the checkpoint. If it so happens that B = 0, we can call fsync() at the beginning of the checkpoint without losing anything (in fact, we gain something: any pages dirtied by cleaning scans or backend writes during the checkpoint won't need to hit the disk; and if the filesystem dumps more of its cache than necessary on fsync, we may as well take that hit before dirtying a bunch more stuff). But if B > 0, then we shouldn't attempt the fsync() until we've written them all; otherwise we'll end up having to fsync() that segment twice.

Doing all the writes and then all the fsyncs meets this requirement trivially, but I'm not so sure that's a good idea. For example, given files F1 ... Fn with dirty pages needing checkpoint writes, we could do the following: first, do any pending fsyncs for files not among F1 .. Fn; then, write all pages for F1 and fsync, write all pages for F2 and fsync, write all pages for F3 and fsync, etc. This might seem dumb because we're not really giving the OS a chance to write anything out before we fsync, but think about the ext3 case where the whole filesystem cache gets flushed anyway. It's much better to dump the cache at the beginning of the checkpoint and then again after every file than it is to spew many GB of dirty stuff into the cache and then drop the hammer.

I'm just brainstorming here; feel free to tell me I'm all wet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
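The ordering proposed here (retire pending fsyncs for files outside the checkpoint set first, since their B = 0, then write and immediately fsync each checkpoint file in turn) can be sketched as follows; the `pending_syncs` and `dirty_pages` structures and the callbacks are invented for illustration:

```python
def checkpoint_per_file(pending_syncs, dirty_pages, write_page, fsync_file):
    """pending_syncs: set of files with queued fsync requests.
    dirty_pages: dict mapping each checkpoint file F1..Fn to the pages
    that must be written for this checkpoint.
    write_page / fsync_file: callbacks doing the actual I/O."""
    # First, retire fsyncs for files that get no checkpoint writes:
    # for those, B = 0 and syncing early costs nothing.
    for f in sorted(pending_syncs - set(dirty_pages)):
        fsync_file(f)
    # Then handle each checkpoint file with write-then-immediate-fsync,
    # so each fsync only has one file's worth of dirty cache behind it.
    order = []
    for f, pages in dirty_pages.items():
        for page in pages:
            write_page(f, page)
        fsync_file(f)
        order.append(f)
    return order
```

A file that is both in `pending_syncs` and among F1..Fn is deliberately not synced up front; doing so would mean fsyncing that segment twice, which is exactly the B > 0 case discussed above.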