Re: [PATCHES] Load Distributed Checkpoints, take 3
Gregory Stark wrote:
> I can imagine a scenario where you have a system that's very busy for 60s and
> then idle for 60s repeatedly. And for some reason you configure a
> checkpoint_timeout on the order of 20m or so (assuming you're travelling
> precisely 60mph).

Is that Scottish m?

--
Alvaro Herrera   http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] Load Distributed Checkpoints, take 3
"Greg Smith" <[EMAIL PROTECTED]> writes: > If you write them twice, so what? You didn't even get to that point as an > option until all the important stuff was taken care of and the system was > near idle. Well even if it's near idle you were still occupying the i/o system for a few milliseconds. If someone else came in with a block request at that time you extended their response time by that much. > The elimination of the all-scan background writer means that true hot and > dirty > spots in the buffer cache, like popular index blocks on a heavily updated > table > that never get a zero usage_count, are never going to be written out other > than > as part of the checkpoint process. If they're really popular blocks on a heavily updated table then they really don't buy us anything to write them out unnecessarily. The case where they help us is when they weren't really popular but we're not doing enough to get around to writing them out and then when we do need to write it out the system's busy. In that case we wasted a chance to write them out when the system was more idle. But don't forget that we're still going through the OS's buffers. That will smooth out a lot of the i/o we generate anyways. Just because Postgres is idle doesn't mean the OS isn't busy flushing the buffers we wrote out when it was busy. > That's OK for now, but I'd like it to be the case that one day the > database's I/O scheduling would eventually get to those, in order to > optimize performance in the kind of bursty scenarios I've been mentioning > lately. I think the feeling is that the bursty scenarios are really corner cases. Heikki described a case where you're dirtying tons of blocks without triggering any WAL. That can happen but it's pretty peculiar. I can imagine a scenario where you have a system that's very busy for 60s and then idle for 60s repeatedly. And for some reason you configure a checkpoint_timeout on the order of 20m or so (assuming you're travelling precisely 60mph). In that scenario bgwriter's lru scan has to fight to keep up for 60s while it's mostly writing out dirty pages that it could have flushed out during the idle time. Effectively you're only making use of half the i/o bandwidth since bgwriter doesn't do any work for half the duty cycle. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Tue, 26 Jun 2007, Tom Lane wrote:
> I'm not impressed with the idea of writing buffers because we might need
> them someday; that just costs extra I/O due to re-dirtying in too many
> scenarios.

This is kind of an interesting statement to me because it really highlights the difference in how I think about this problem from how you see it. As far as I'm concerned, there's a hierarchy of I/O the database needs to finish that goes like this:

1) Client back-end writes (blocked until a buffer appears)
2) LRU writes so (1) doesn't happen
3) Checkpoint writes
4) Dirty pages with a non-zero usage count

In my view of the world, there should be one parameter for a target rate of how much I/O you can stand under normal use, and the background writer should work its way as far down this chain as it can until it meets that. If there's plenty of clean buffers for the expected new allocations and there's no checkpoint going on, by all means write out some buffers we might re-dirty if there's I/O to spare. If you write them twice, so what? You didn't even get to that point as an option until all the important stuff was taken care of and the system was near idle.

The elimination of the all-scan background writer means that true hot and dirty spots in the buffer cache, like popular index blocks on a heavily updated table that never get a zero usage_count, are never going to be written out other than as part of the checkpoint process. That's OK for now, but I'd like it to be the case that one day the database's I/O scheduling would eventually get to those, in order to optimize performance in the kind of bursty scenarios I've been mentioning lately.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
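The hierarchy above lends itself to a small illustration. The following standalone C sketch is editor-added and purely illustrative (every name and number in it is invented, and it is not PostgreSQL code): a single per-cycle I/O budget is spent top-down through the priority list, so "might be re-dirtied" buffers are only touched with whatever budget is left after the more urgent classes are satisfied.

/*
 * Toy sketch of the "single I/O budget" idea: spend a per-cycle write
 * budget on the most urgent buffer classes first, and only touch
 * "might be re-dirtied" buffers with whatever budget remains.
 * All names and numbers here are made up for illustration.
 */
#include <stdio.h>

#define NUM_CLASSES 4

/* Priority order follows the hierarchy in the message above. */
static const char *class_name[NUM_CLASSES] = {
    "buffers a backend is blocked on",      /* 1: client back-end writes */
    "buffers the LRU scan wants clean",     /* 2: avoid (1) happening    */
    "checkpoint writes",                    /* 3 */
    "dirty buffers with usage_count > 0"    /* 4: only if I/O to spare   */
};

int
main(void)
{
    int pending[NUM_CLASSES] = {0, 12, 40, 200};  /* hypothetical backlog      */
    int budget = 100;                             /* writes allowed this cycle */

    for (int c = 0; c < NUM_CLASSES && budget > 0; c++)
    {
        int n = pending[c] < budget ? pending[c] : budget;

        budget -= n;
        printf("wrote %3d of class %d (%s), budget left %d\n",
               n, c + 1, class_name[c], budget);
    }
    return 0;
}

Under a light load the loop reaches class 4 with budget to spare; under a heavy load it runs out before getting there, which is the behavior being argued for.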
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Tue, 26 Jun 2007, Tom Lane wrote:
> I have no doubt that there are scenarios such as you are thinking about,
> but it definitely seems like a corner case that doesn't justify keeping
> the all-buffers scan. That scan is costing us extra I/O in ordinary
> non-corner cases, so it's not free to keep it.

And scenarios I'm concerned about but can't diagram as easily fall into this category as well. I agree that an LDC-enabled config would ship with the all-buffers scan turned off as redundant and wasteful, in which case the only cost of keeping it is code baggage. But the fact that there are corner cases floating around this area is what makes me feel that removing it altogether is still a bit premature.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > To recap, the sequence is: > 1. COPY FROM > 2. checkpoint > 3. VACUUM > Now you have buffer cache full of dirty buffers with usage_count=1, Well, it won't be very full, because VACUUM works in a limited number of buffers (and did even before the BufferAccessStrategy patch). I have no doubt that there are scenarios such as you are thinking about, but it definitely seems like a corner case that doesn't justify keeping the all-buffers scan. That scan is costing us extra I/O in ordinary non-corner cases, so it's not free to keep it. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Tom Lane wrote:
> right now, BgBufferSync starts over from the current clock-sweep point on
> each call --- that is, each bgwriter cycle. So it can't really be made to
> write very many buffers without excessive CPU work. Maybe we should
> redefine it to have some static state carried across bgwriter cycles

The LRU portion restarts like that, and it's certainly not optimal. But the auto-tuning LRU patch that's already near application makes this much less of an issue, because it only does real work when buffers have been allocated, so the sweep point will have moved along. I'll add this idea to my list of things that would be nice to have as part of a larger rewrite; I think it's more trouble than it's worth to chase right now.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Who's "we"? AFAICS, CVS HEAD will treat a large copy the same as any
>>> other large heapscan.
>> Umm, I'm talking about populating a table with COPY *FROM*. That's not a
>> heap scan at all.
>
> No wonder we're failing to communicate. I assumed you were talking about
> copy-to-file. Copy-from-file is going to be creating WAL entries hence the
> no-checkpoint case doesn't apply anyway, no?

We are indeed having trouble communicating :(. No, I'm not talking about the new non-WAL-logged COPY optimization. COPY FROM *would* create WAL entries, and the next checkpoint would clean them. So far so good. But then you run VACUUM, as you often do after loading a table, to set all hint bits. That will *not* generate WAL, and the next checkpoint is skipped.

To recap, the sequence is:
1. COPY FROM
2. checkpoint
3. VACUUM

Now you have a buffer cache full of dirty buffers with usage_count=1, and no WAL activity since the last checkpoint. They will not be flushed until:
a) WAL activity happens and the next checkpoint comes
b) the database is shut down, or a manual CHECKPOINT is issued
c) the clock hand advances and decrements the usage_counts

It's a corner case. Probably not a problem in practice; you usually run CREATE INDEX after loading a table, for example. But it exists.

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Who's "we"? AFAICS, CVS HEAD will treat a large copy the same as any >> other large heapscan. > Umm, I'm talking about populating a table with COPY *FROM*. That's not a > heap scan at all. No wonder we're failing to communicate. I assumed you were talking about copy-to-file. Copy-from-file is going to be creating WAL entries hence the no-checkpoint case doesn't apply anyway, no? [ thinks ... ] Oh, you must be positing the case where the recently added skip-WAL-if-table-is-new-in-this-transaction optimization applies. Well, that thing could probably do with some more work anyway (I wonder why it's using shared buffers at all anymore). I don't really think that case should be allowed to drive our thinking about how the bgwriter should work. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> (Note that COPY per se will not trigger this behavior anyway, since it
>>> will act in a limited number of buffers because of the recent buffer
>>> access strategy patch.)
>> Actually we dropped it from COPY, because it didn't seem to improve
>> performance in the tests we ran.
>
> Who's "we"? AFAICS, CVS HEAD will treat a large copy the same as any
> other large heapscan.

Umm, I'm talking about populating a table with COPY *FROM*. That's not a heap scan at all.

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> (Note that COPY per se will not trigger this behavior anyway, since it >> will act in a limited number of buffers because of the recent buffer >> access strategy patch.) > Actually we dropped it from COPY, because it didn't seem to improve > performance in the tests we ran. Who's "we"? AFAICS, CVS HEAD will treat a large copy the same as any other large heapscan. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> (Note that COPY per se will not trigger this behavior anyway, since it
> will act in a limited number of buffers because of the recent buffer
> access strategy patch.)

Actually we dropped it from COPY, because it didn't seem to improve performance in the tests we ran.

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> This argument supposes that the bgwriter will do nothing while the COPY >> is proceeding. > It will clean buffers ahead of the COPY, but it won't write the buffers > COPY leaves behind since they have usage_count=1. Yeah, and they don't *need* to be written until the clock sweep has passed over them once. I'm not impressed with the idea of writing buffers because we might need them someday; that just costs extra I/O due to re-dirtying in too many scenarios. (Note that COPY per se will not trigger this behavior anyway, since it will act in a limited number of buffers because of the recent buffer access strategy patch.) regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> One pathological case is a COPY of a table slightly smaller than
>> shared_buffers. That will fill the buffer cache. If you then have a
>> checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer
>> cache will be full of pages with just hint-bit-updates, but no WAL
>> activity since last checkpoint.
>
> This argument supposes that the bgwriter will do nothing while the COPY
> is proceeding.

It will clean buffers ahead of the COPY, but it won't write the buffers COPY leaves behind since they have usage_count=1. Let me demonstrate this with an imaginary example with shared_buffers=4:

  buf_id  usage_count  dirty
  1       0            f
  2       0            f
  3       0            f
  4       0            f

After COPY:

  buf_id  usage_count  dirty
  1       1            t
  2       1            t
  3       1            t
  4       1            t

CHECKPOINT:

  buf_id  usage_count  dirty
  1       1            f
  2       1            f
  3       1            f
  4       1            f

VACUUM:

  buf_id  usage_count  dirty
  1       1            t
  2       1            t
  3       1            t
  4       1            t

As soon as a backend asks for a buffer, the situation is defused, as the backend will do a full clock sweep and decrement the usage_count of each buffer to 0, letting the bgwriter lru-scan clean them. Having the buffer cache full of dirty buffers is not a problem on its own, so this only becomes a performance issue if you then issue another large COPY etc. that needs those buffers, and you now have to write them at a busy time.

This is a corner case that might not be worth worrying about. It's also mitigated by the fact that the OS cache is most likely clean after a period of idle time, and should be able to absorb the write burst.

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
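Heikki's walkthrough above hinges on how the clock sweep "defuses" the situation. The standalone C sketch below is editor-added and purely a toy model (not the actual freelist.c/bufmgr.c code): the first backend that needs a victim buffer decrements each usage_count from 1 to 0 as the hand passes, which is what makes those buffers candidates for replacement and for the bgwriter LRU scan again.

/*
 * Simplified clock sweep: a backend looking for a victim buffer decrements
 * usage_count as the hand passes each buffer, so after the
 * COPY/CHECKPOINT/VACUUM sequence above one full sweep drops every
 * usage_count from 1 to 0.  Toy model for illustration only.
 */
#include <stdio.h>

#define NBUFFERS 4

typedef struct { int usage_count; int dirty; } Buffer;

static Buffer buf[NBUFFERS];
static int clock_hand = 0;

/* Return the index of a victim buffer, decrementing usage counts as we go. */
static int
clock_sweep(void)
{
    for (;;)
    {
        Buffer *b = &buf[clock_hand];
        int     victim = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (b->usage_count == 0)
            return victim;          /* candidate for replacement */
        b->usage_count--;           /* skipped this time, but now "colder" */
    }
}

int
main(void)
{
    /* State after COPY FROM + checkpoint + VACUUM in the example above. */
    for (int i = 0; i < NBUFFERS; i++)
    {
        buf[i].usage_count = 1;
        buf[i].dirty = 1;
    }

    int victim = clock_sweep();
    printf("victim = buffer %d\n", victim);
    for (int i = 0; i < NBUFFERS; i++)
        printf("buffer %d: usage_count=%d dirty=%d\n",
               i, buf[i].usage_count, buf[i].dirty);
    return 0;
}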
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> ... that's what the LRU scan is for. > Yeah, except the LRU scan is not doing a very good job at that. It will > ignore buffers with usage_count > 0, and it only scans > bgwriter_lru_percent buffers ahead of the clock hand. Which is exactly in sync with which buffers are actually candidates for replacement. It could be looking further ahead of the clock hand, as per my previous suggestion, but there is no need to push out buffers with usage_count > 0 until after the sweep has decremented that to 0. Doing so would tend to create excess I/O --- it makes it more likely that multiple page-dirtying events on a page will result in separate writes instead of a single write. > One pathological case is a COPY of a table slightly smaller than > shared_buffers. That will fill the buffer cache. If you then have a > checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer > cache will be full of pages with just hint-bit-updates, but no WAL > activity since last checkpoint. This argument supposes that the bgwriter will do nothing while the COPY is proceeding. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Anyway, if there are no XLOG records since the last checkpoint, there's
> probably nothing in shared buffers that needs flushing. There might be
> some dirty hint-bits, but the only reason to push those out is to make
> some free buffers available, and doing that is not checkpoint's job (nor
> the all-buffers scan's job); that's what the LRU scan is for.

Yeah, except the LRU scan is not doing a very good job at that. It will ignore buffers with usage_count > 0, and it only scans bgwriter_lru_percent buffers ahead of the clock hand.

One pathological case is a COPY of a table slightly smaller than shared_buffers. That will fill the buffer cache. If you then have a checkpoint, and after that a SELECT COUNT(*), or a VACUUM, the buffer cache will be full of pages with just hint-bit updates, but no WAL activity since the last checkpoint.

But let's fix the LRU scan, rather than work around its deficiencies.

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
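The limitation Heikki describes can be sketched in a few lines of standalone C. This is an editor-added, simplified model with made-up values (not the real bgwriter code): with the clock hand parked, each cycle re-examines the same small window of buffers ahead of the hand, and since they all still have usage_count > 0 nothing ever gets written.

/*
 * Simplified model of the LRU scan's blind spot: it only looks
 * bgwriter_lru_percent of shared_buffers ahead of the clock hand and skips
 * anything with usage_count > 0.  If the hand never advances, the same
 * recently-used (hence skipped) buffers are rescanned forever.
 */
#include <stdio.h>

#define NBUFFERS 100

typedef struct { int usage_count; int dirty; } Buffer;

static Buffer buf[NBUFFERS];
static int    clock_hand = 0;               /* does not move while idle  */
static double bgwriter_lru_percent = 1.0;   /* 1.0 = 1% of the pool      */

/* One bgwriter cycle: returns the number of buffers written. */
static int
lru_scan(void)
{
    int to_scan = (int) (NBUFFERS * bgwriter_lru_percent / 100.0);
    int written = 0;

    for (int i = 0; i < to_scan; i++)
    {
        Buffer *b = &buf[(clock_hand + i) % NBUFFERS];

        if (b->dirty && b->usage_count == 0)
        {
            b->dirty = 0;                   /* "write" it */
            written++;
        }
        /* usage_count > 0: ignored, even though it may be dirty */
    }
    return written;
}

int
main(void)
{
    /* Buffer cache full of dirty, recently-used pages after a burst. */
    for (int i = 0; i < NBUFFERS; i++)
    {
        buf[i].usage_count = 1;
        buf[i].dirty = 1;
    }

    /* With no allocations the hand never moves: every cycle scans the same
     * window ahead of the hand and writes nothing. */
    for (int cycle = 0; cycle < 3; cycle++)
        printf("cycle %d: wrote %d buffers\n", cycle, lru_scan());
    return 0;
}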
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> Hmm. But if we're going to do that, we might as well have a checkpoint
>> for our troubles, no? The reason for the current design is the
>> assumption that a bgwriter_all scan is less burdensome than a
>> checkpoint, but that is no longer true given this rewrite.
> Per comments in CreateCheckPoint, another reason to skip the extra
> checkpoints is to avoid writing two checkpoints to the same page, risking
> losing both on a crash:

That's not the reason for the current design --- that logic existed long before bgwriter did.

Anyway, if there are no XLOG records since the last checkpoint, there's probably nothing in shared buffers that needs flushing. There might be some dirty hint-bits, but the only reason to push those out is to make some free buffers available, and doing that is not checkpoint's job (nor the all-buffers scan's job); that's what the LRU scan is for.

The only reason that the all-buffers scan was put in was to try to make the next checkpoint cheaper. That reason falls to the ground with LDC. What we have left is merely excess I/O.

regards, tom lane

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Greg Smith <[EMAIL PROTECTED]> writes:
>> The way transitions between completely idle and all-out bursts happen were
>> one problematic area I struggled with. Since the LRU point doesn't move
>> during the idle parts, and the lingering buffers have a usage_count>0, the
>> LRU scan won't touch them; the only way to clear out a bunch of dirty
>> buffers leftover from the last burst is with the all-scan.
>
> One thing that might be worth changing is that right now, BgBufferSync
> starts over from the current clock-sweep point on each call --- that is,
> each bgwriter cycle. So it can't really be made to write very many buffers
> without excessive CPU work. Maybe we should redefine it to have some
> static state carried across bgwriter cycles, such that it would write at
> most N dirty buffers per call, but scan through X percent of the buffers,
> possibly across several calls, before returning to the (by now probably
> advanced) clock-sweep point. This would allow a larger value of X to be
> used than is currently practical. You might wish to recheck the clock
> sweep point on each iteration just to make sure the scan hasn't fallen
> behind it, but otherwise I don't see any downside. The scenario where
> somebody re-dirties a buffer that was cleaned by the bgwriter scan isn't a
> problem, because that buffer will also have had its usage_count increased
> and thereby not be a candidate for replacement.

Something along those lines could be useful. I've thought of that before, but it never occurred to me that if a page in front of the clock hand is re-dirtied, it's no longer a candidate for replacement anyway...

I'm going to leave the all- and lru- bgwriter scans alone for now to get this LDC patch finished. We still have the bgwriter autotuning patch in the queue. Let's think about this in the context of that patch.

>> As a general comment on this subject, a lot of the work in LDC presumes
>> you have an accurate notion of how close the next checkpoint is.
>
> Yeah; this is one reason I was interested in carrying some write-speed
> state across checkpoints instead of having the calculation start from
> scratch each time. That wouldn't help systems that sit idle a long time
> and suddenly go nuts, but it seems to me that smoothing the write rate
> across more than one checkpoint could help if the fluctuations occur over
> a timescale of a few checkpoints.

Hmm. This problem only applies to checkpoints triggered by checkpoint_segments; time tends to move forward at a constant rate. I'm not worried about small fluctuations or bursts. As I argued earlier, the OS still buffers the writes and should give some extra smoothing of the physical writes. I believe bursts of say 50-100 MB would easily fit in OS cache, as long as there's enough gap between them. I haven't tested that, though.

Here's a proposal for an algorithm to smooth bigger bursts. The basic design is the same as before: we keep track of elapsed time and elapsed xlogs, and based on them we estimate how much progress in flushing the buffers we should've made by now, and then we catch up until we reach that. The estimate for time is the same. The estimate for xlogs gets more complicated. Let's have a few definitions first:

Ro = elapsed segments / elapsed time, from the previous checkpoint cycle. For example, 1.25 means that the checkpoint was triggered by checkpoint_segments, and we had spent 1/1.25 = 80% of checkpoint_timeout when we reached checkpoint_segments. 0.25 would mean that the checkpoint was triggered by checkpoint_timeout, and we had spent 25% of checkpoint_segments by then.

Rn = elapsed segments / elapsed time, so far in the current in-progress checkpoint cycle.

t = elapsed time, as a fraction of checkpoint_timeout (0.0 - 1.0, though it could be > 1 if the next checkpoint is already due)

s = elapsed xlog segments, as a fraction of checkpoint_segments (0.0 - 1.0, though again it could be > 1 if the next checkpoint is already due)

R = estimate for WAL segment consumption rate, as checkpoint_segments / checkpoint_timeout:

R = Ro * (1 - t) + Rn * t

In other words, at the beginning of the checkpoint, we give more weight to the state carried over from the previous checkpoint. As we go forward, more weight is given to the rate calculated from the current cycle. From R, we extrapolate how much progress we should've made by now:

Target progress = R * t

This would require saving just one number from the previous cycle (Rn), and there is no requirement to call the estimation function at steady time intervals, for example. It gives a pretty steady I/O rate even if there are big bursts in WAL activity, but still reacts to changes in the rate.

I'm not convinced this is worth the effort, though. First of all, this is only a problem if you use checkpoint_segments to control your checkpoints, so you can lower checkpoint_timeout to do more work during the idle periods. Secondly, with the optimization of not flushing buffers during checkpoint that were dirtied after the start of
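A small standalone C sketch of the estimation scheme above may help; it is editor-added and illustrative only. It uses the weighting as described in the prose (the previous cycle's rate Ro dominating early, the current cycle's rate Rn = s/t taking over as t grows), and the sample values in main() are invented, with Ro = 1.25 borrowed from the example in the message.

/*
 * Sketch of the WAL-rate estimate described above.  t and s are the elapsed
 * fractions of checkpoint_timeout and checkpoint_segments for the
 * in-progress checkpoint, prev_rate is Ro carried over from the previous
 * cycle, and the return value is the progress fraction the buffer flushing
 * should have reached by now.  Illustration only, not patch code.
 */
#include <stdio.h>

static double
target_progress(double prev_rate, double t, double s)
{
    double cur_rate = (t > 0.0) ? s / t : prev_rate;   /* Rn = s / t      */
    double r = prev_rate * (1.0 - t) + cur_rate * t;   /* blended rate R  */

    return r * t;                                      /* progress target */
}

int
main(void)
{
    /* Previous checkpoint was segment-driven: Ro = 1.25.  Simulate a bursty
     * current cycle: no WAL for the first 40% of the timeout, then a burst. */
    double ro = 1.25;

    printf("t=0.2 s=0.00 -> %.3f\n", target_progress(ro, 0.2, 0.00));
    printf("t=0.4 s=0.00 -> %.3f\n", target_progress(ro, 0.4, 0.00));
    printf("t=0.6 s=0.50 -> %.3f\n", target_progress(ro, 0.6, 0.50));
    printf("t=0.8 s=0.90 -> %.3f\n", target_progress(ro, 0.8, 0.90));
    return 0;
}

The target climbs smoothly (0.20, 0.30, 0.60, 0.92 for these inputs) even though all of the WAL arrives in a burst during the second half of the cycle, which is the smoothing effect being described.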
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Hmm. But if we're going to do that, we might as well have a checkpoint
> for our troubles, no? The reason for the current design is the
> assumption that a bgwriter_all scan is less burdensome than a
> checkpoint, but that is no longer true given this rewrite.

Per comments in CreateCheckPoint, another reason to skip the extra checkpoints is to avoid writing two checkpoints to the same page, risking losing both on a crash:

 * If this isn't a shutdown or forced checkpoint, and we have not inserted
 * any XLOG records since the start of the last checkpoint, skip the
 * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
 * when the system is idle. That wastes log space, and more importantly it
 * exposes us to possible loss of both current and previous checkpoint
 * records if the machine crashes just as we're writing the update.
 * (Perhaps it'd make even more sense to checkpoint only when the previous
 * checkpoint record is in a different xlog page?)

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith <[EMAIL PROTECTED]> writes: > The way transitions between completely idle and all-out bursts happen were > one problematic area I struggled with. Since the LRU point doesn't move > during the idle parts, and the lingering buffers have a usage_count>0, the > LRU scan won't touch them; the only way to clear out a bunch of dirty > buffers leftover from the last burst is with the all-scan. One thing that might be worth changing is that right now, BgBufferSync starts over from the current clock-sweep point on each call --- that is, each bgwriter cycle. So it can't really be made to write very many buffers without excessive CPU work. Maybe we should redefine it to have some static state carried across bgwriter cycles, such that it would write at most N dirty buffers per call, but scan through X percent of the buffers, possibly across several calls, before returning to the (by now probably advanced) clock-sweep point. This would allow a larger value of X to be used than is currently practical. You might wish to recheck the clock sweep point on each iteration just to make sure the scan hasn't fallen behind it, but otherwise I don't see any downside. The scenario where somebody re-dirties a buffer that was cleaned by the bgwriter scan isn't a problem, because that buffer will also have had its usage_count increased and thereby not be a candidate for replacement. > As a general comment on this subject, a lot of the work in LDC presumes > you have an accurate notion of how close the next checkpoint is. Yeah; this is one reason I was interested in carrying some write-speed state across checkpoints instead of having the calculation start from scratch each time. That wouldn't help systems that sit idle a long time and suddenly go nuts, but it seems to me that smoothing the write rate across more than one checkpoint could help if the fluctuations occur over a timescale of a few checkpoints. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
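Tom's suggestion above can be sketched as follows. This is an editor-added toy model in standalone C with invented names (not a patch against bufmgr.c): the scan position and the amount scanned since the last reset live in static state, each call writes at most a fixed number of dirty buffers, and the scan only jumps back to the (presumably advanced) clock-sweep point after it has covered its target fraction of the pool.

/*
 * Sketch of a bgwriter scan with static state carried across cycles:
 * write at most max_writes dirty buffers per call, and only restart from
 * the current clock-sweep point once scan_whole_percent of the pool has
 * been covered.  Illustration only; names are hypothetical.
 */
#include <stdio.h>

#define NBUFFERS 1000

typedef struct { int usage_count; int dirty; } Buffer;

static Buffer buf[NBUFFERS];
static int clock_sweep_point = 0;       /* where backends are allocating */

/* Static state carried across calls, as suggested in the message above. */
static int scan_pos = 0;
static int scanned_since_reset = 0;

static int
bg_scan(int max_writes, int scan_whole_percent)
{
    int scan_target = NBUFFERS * scan_whole_percent / 100;
    int written = 0;

    if (scanned_since_reset >= scan_target)
    {
        /* Start over from the current (probably advanced) sweep point. */
        scan_pos = clock_sweep_point;
        scanned_since_reset = 0;
    }

    while (written < max_writes && scanned_since_reset < scan_target)
    {
        Buffer *b = &buf[scan_pos];

        scan_pos = (scan_pos + 1) % NBUFFERS;
        scanned_since_reset++;

        if (b->dirty && b->usage_count == 0)
        {
            b->dirty = 0;               /* issue the write */
            written++;
        }
    }
    return written;
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
    {
        buf[i].usage_count = 0;         /* already swept past */
        buf[i].dirty = (i % 3 == 0);
    }

    /* Successive cycles keep making progress through the pool instead of
     * rescanning the same window, even with only 50 writes per call. */
    for (int cycle = 0; cycle < 4; cycle++)
        printf("cycle %d: wrote %d\n", cycle, bg_scan(50, 50));
    return 0;
}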
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:
> Please describe the class of transactions and the service guarantees so
> that we can reproduce that, and figure out what's the best solution.

I'm confident you're already moving in that direction by noticing how the 90th percentile numbers were kind of weird with your 150 warehouse DBT2 tests, and I already mentioned how that could be usefully fleshed out by more tests during beta. That number is the kind of service guarantee I'm talking about--if before 90% of transactions were <4.5ms, but now that number is closer to 6ms, that could be considered worse performance by some service metrics even if the average and worst-case performance were improved.

The only thing I can think of, if you wanted to make the problem more like what I was seeing, would be switching the transaction mix on that around to do more UPDATEs relative to the other types of transactions; having more of those seemed to aggravate my LDC-related issues because they leave a different pattern of dirty+used buffers around than other operations.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:
> It only scans bgwriter_lru_percent buffers ahead of the clock hand. If
> the hand isn't moving, it keeps scanning the same buffers over and over
> again. You can crank it all the way up to 100%, though, in which case it
> would work, but that starts to get expensive CPU-wise.

In addition to being a CPU pig, that still won't necessarily work because of the way the LRU writer ignores things with a non-zero usage_count. If it's dirty, it's probably been used recently as well.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Heikki Linnakangas wrote:
> Greg, is this the kind of workload you're having, or is there some other
> scenario you're worried about?

The way transitions between completely idle and all-out bursts happen were one problematic area I struggled with. Since the LRU point doesn't move during the idle parts, and the lingering buffers have a usage_count>0, the LRU scan won't touch them; the only way to clear out a bunch of dirty buffers leftover from the last burst is with the all-scan. Ideally, you want those to write during idle periods so you're completely clean when the next burst comes.

My plan for the code I wanted to put into 8.4 one day was to have something like the current all-scan that defers to the LRU and checkpoint, such that if neither of them are doing anything it would go searching for buffers it might blow out. Because the all-scan mainly gets in the way under heavy load right now I've only found mild settings helpful, but if it had a bit more information about what else was going on it could run much harder during slow spots. That's sort of the next stage to the auto-tuning LRU writer code in the grand design floating through my head.

As a general comment on this subject, a lot of the work in LDC presumes you have an accurate notion of how close the next checkpoint is. On systems that can dirty buffers and write WAL really fast, I've found hyper bursty workloads are a challenge for it to cope with. You can go from thinking you have all sorts of time to stream the data out to discovering the next checkpoint is coming up fast in only seconds. In that situation, you'd have been better off had you been writing faster during the period preceding the burst when the code thought it should be "smooth"[1]. That falls into the category of things I haven't found a good way for other people to test (I happened to have an internal bursty app that aggravated this area to use).

[1] This is actually a reference to "Yacht Rock", one of my favorite web sites: http://www.channel101.com/shows/show.php?show_id=152

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Heikki Linnakangas <[EMAIL PROTECTED]> writes: >>> If you have a system with a very bursty transaction rate, it's possible >>> that when it's time for a checkpoint, there hasn't been any WAL logged >>> activity since last checkpoint, so we skip it. When that happens, the >>> buffer cache might still be full of dirty pages, because of hint bit >>> updates. That still isn't a problem on it's own, but if you then do a >>> huge batch update, you have to flush those dirty pages at that point. It >>> would be better to trickle flush those dirty pages during the idle period. >> >> But wouldn't the LRU-based scan accomplish that? > It only scans bgwriter_lru_percent buffers ahead of the clock hand. If > the hand isn't moving, it keeps scanning the same buffers over and over > again. You can crank it all the way up to 100%, though, in which case it > would work, but that starts to get expensive CPU-wise. Hmm. But if we're going to do that, we might as well have a checkpoint for our troubles, no? The reason for the current design is the assumption that a bgwriter_all scan is less burdensome than a checkpoint, but that is no longer true given this rewrite. AFAICS all the bgwriter_all scan will accomplish is induce extra I/O in most scenarios. And it's certainly extra code in an area that's already been rendered pretty darn baroque by this patch. Also, the design assumption here is that the bgwriter_lru scan is supposed to keep enough buffers clean to satisfy the system's need for new buffers. If it's not able to do so, it's not apparent to me that the bgwriter_all scan helps --- the latter is most likely flushing the wrong buffers, ie, those not close in front of the clock sweep. You'd be well advised to increase lru_percent anyway to keep more buffers clean, if this scenario is worth optimizing rather than others for your usage. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> On further thought, there is one workload where removing the non-LRU
>> part would be counterproductive: If you have a system with a very bursty
>> transaction rate, it's possible that when it's time for a checkpoint,
>> there hasn't been any WAL-logged activity since last checkpoint, so we
>> skip it. When that happens, the buffer cache might still be full of
>> dirty pages, because of hint bit updates. That still isn't a problem on
>> its own, but if you then do a huge batch update, you have to flush those
>> dirty pages at that point. It would be better to trickle flush those
>> dirty pages during the idle period.
>
> But wouldn't the LRU-based scan accomplish that?

It only scans bgwriter_lru_percent buffers ahead of the clock hand. If the hand isn't moving, it keeps scanning the same buffers over and over again. You can crank it all the way up to 100%, though, in which case it would work, but that starts to get expensive CPU-wise.

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > On further thought, there is one workload where removing the non-LRU > part would be counterproductive: > If you have a system with a very bursty transaction rate, it's possible > that when it's time for a checkpoint, there hasn't been any WAL logged > activity since last checkpoint, so we skip it. When that happens, the > buffer cache might still be full of dirty pages, because of hint bit > updates. That still isn't a problem on it's own, but if you then do a > huge batch update, you have to flush those dirty pages at that point. It > would be better to trickle flush those dirty pages during the idle period. But wouldn't the LRU-based scan accomplish that? regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote:
> I agree with removing the non-LRU part of the bgwriter's write logic
> though; that should simplify matters a bit and cut down the overall I/O
> volume.

On further thought, there is one workload where removing the non-LRU part would be counterproductive: If you have a system with a very bursty transaction rate, it's possible that when it's time for a checkpoint, there hasn't been any WAL-logged activity since the last checkpoint, so we skip it. When that happens, the buffer cache might still be full of dirty pages, because of hint bit updates. That still isn't a problem on its own, but if you then do a huge batch update, you have to flush those dirty pages at that point. It would be better to trickle flush those dirty pages during the idle period.

So we might still want to keep the non-LRU scan. Or alternatively, when the checkpoint is a no-op, we call BufferSync nevertheless. That's effectively the same thing, except that BufferSync would be controlled by the logic to estimate the deadline until the next checkpoint, instead of the current bgwriter_all_* settings.

Greg, is this the kind of workload you're having, or is there some other scenario you're worried about?

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 2007-06-25 at 12:56 +0200, Magnus Hagander wrote:
> Didn't we already add other features that make recovery much *faster* than
> before? In that case, are they fast enough to neutralise this increased
> time (a guestimate, of course)?
>
> Or did I mess that up with stuff we added for 8.2? :-)

Only with full_page_writes on, but yes you're right, the predicted recovery times are changing for 8.3 anyway.

--
Simon Riggs EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] Load Distributed Checkpoints, take 3
Magnus Hagander wrote:
> On Mon, Jun 25, 2007 at 10:15:07AM +0100, Simon Riggs wrote:
>> As you say, we can put comments in the release notes to advise people of
>> 50% increase in recovery time if the parameters stay the same. That
>> would be balanced by the comment that checkpoints are now considerably
>> smoother than before and more frequent checkpoints are unlikely to be a
>> problem, so it is OK to reduce the parameters from the settings you used
>> in previous releases.
>
> Didn't we already add other features that make recovery much *faster* than
> before?

Yes, we no longer read in a page just to overwrite it with a full page image.

> In that case, are they fast enough to neutralise this increased time (a
> guestimate, of course)?

It depends a lot on your workload. Under the right circumstances, it will more than offset any increase caused by LDC, but sometimes it won't make any difference at all (for example with full_page_writes=off).

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, Jun 25, 2007 at 10:15:07AM +0100, Simon Riggs wrote:
> On Mon, 2007-06-25 at 01:33 -0400, Greg Smith wrote:
>> On Sun, 24 Jun 2007, Simon Riggs wrote:
>>> Greg can't choose to use checkpoint_segments as the limit and then
>>> complain about unbounded recovery time, because that was clearly a
>>> conscious choice.
>>
>> I'm complaining
>
> I apologise for using that emotive phrase.
>
>> only because everyone seems content to wander in a
>> direction where the multiplier on checkpoint_segments for how many
>> segments are actually active at once will go up considerably, which can
>> make a known problem (recovery time) worse.
>
> +50% more. Recovery time is a consideration that can be adjusted for. We
> have done nothing to make recovery rate worse; the additional WAL leads
> to an increased recovery time *only* if you keep the same parameter
> settings. There is no reason to keep them the same, nor do we promise
> that parameters will keep the exact meaning they had previous releases.
>
> As you say, we can put comments in the release notes to advise people of
> 50% increase in recovery time if the parameters stay the same. That
> would be balanced by the comment that checkpoints are now considerably
> smoother than before and more frequent checkpoints are unlikely to be a
> problem, so it is OK to reduce the parameters from the settings you used
> in previous releases.

Didn't we already add other features that make recovery much *faster* than before? In that case, are they fast enough to neutralise this increased time (a guestimate, of course)?

Or did I mess that up with stuff we added for 8.2? :-)

//Magnus

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 2007-06-25 at 01:33 -0400, Greg Smith wrote: > On Sun, 24 Jun 2007, Simon Riggs wrote: > > > Greg can't choose to use checkpoint_segments as the limit and then > > complain about unbounded recovery time, because that was clearly a > > conscious choice. > > I'm complaining I apologise for using that emotive phrase. > only because everyone seems content to wander in a > direction where the multiplier on checkpoint_segments for how many > segments are actually active at once will go up considerably, which can > make a known problem (recovery time) worse. +50% more. Recovery time is a consideration that can be adjusted for. We have done nothing to make recovery rate worse; the additional WAL leads to an increased recovery time *only* if you keep the same parameter settings. There is no reason to keep them the same, nor do we promise that parameters will keep the exact meaning they had previous releases. As you say, we can put comments in the release notes to advise people of 50% increase in recovery time if the parameters stay the same. That would be balanced by the comment that checkpoints are now considerably smoother than before and more frequent checkpoints are unlikely to be a problem, so it is OK to reduce the parameters from the settings you used in previous releases. I don't see any reason why we would want unsmooth checkpoints; similarly we don't want unsmooth bg writing, unsmooth archiving etc.. Performance will increase, response times will drop and people will be happy. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith wrote:
> LDC certainly makes things better in almost every case. My "allegiance"
> comes from having seen a class of transactions where LDC made things
> worse on a fast/overloaded system, in that it made some types of service
> guarantees harder to meet, and I just don't know who else might run into
> problems in that area.

Please describe the class of transactions and the service guarantees so that we can reproduce that, and figure out what's the best solution.

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Mon, 25 Jun 2007, Tom Lane wrote:
> I'm not sure why you hold such strong allegiance to the status quo. We
> know that the status quo isn't working very well.

Don't get me wrong here; I am a big fan of this patch, think it's an important step forward, and it's exactly the fact that I'm so shell shocked from abuse by the status quo that I'm still mixed up in this mess (I really should be ignoring the lot of you and writing new code instead).

LDC certainly makes things better in almost every case. My "allegiance" comes from having seen a class of transactions where LDC made things worse on a fast/overloaded system, in that it made some types of service guarantees harder to meet, and I just don't know who else might run into problems in that area. I'm worried that if it's not adjustable, you're introducing a risk that you'll take a step backward for some of this code's users, and that will be hard to undo given the way releases are structured here.

I spent some time trading stocks for a living. There are sometimes situations you can get into there where there is a tiny chance that something very bad can happen with a trade, and many people get wiped out by such things. If it's possible in that situation to remove that risk with something inexpensive, you do it, even though the net expected value of the change might be slightly negative. This seems like such a situation to me. If it's possible to take away the risk of other people running into an unforeseen problem with the LDC patch just by keeping a knob that's already there, unless that's an expensive operation my opinion is that you should pick a good default but not remove it yet.

> And if you think that the current code had enormous amounts of testing
> before it went in, I've got to disillusion you :-(

It's having been on the painful receiving end of that fact that makes me so paranoid now :)

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Sun, 24 Jun 2007, Simon Riggs wrote:
> Greg can't choose to use checkpoint_segments as the limit and then
> complain about unbounded recovery time, because that was clearly a
> conscious choice.

I'm complaining only because everyone seems content to wander in a direction where the multiplier on checkpoint_segments for how many segments are actually active at once will go up considerably, which can make a known problem (recovery time) worse.

I'm thinking about things like how the release notes for 8.3 are going to address this. Is the plan to say something like "whatever you set checkpoint_segments to before so you were happy with the crash recovery time, make sure to re-run all those tests again because we made a big change there, it may take a lot longer now with the same value, and too bad if you don't like it because there's no way to get the old behavior back. Suck it up, decrease checkpoint_segments, and get used to having more checkpoints if you have to keep the same recovery time"?

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith <[EMAIL PROTECTED]> writes: > I am not a fan of introducing a replacement feature based on what I > consider too limited testing, and I don't feel this one has been beat on > long yet enough to start pruning features that would allow better backward > compatibility/transitioning. I think that's introducing an unnecessary > risk to the design. While it's certainly true that we could do with wider testing of this patch, I'm not sure why you hold such strong allegiance to the status quo. We know that the status quo isn't working very well. And if you think that the current code had enormous amounts of testing before it went in, I've got to disillusion you :-( regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Sun, 24 Jun 2007, Simon Riggs wrote:
> I can't see why anyone would want to turn off smoothing: If they are
> doing many writes, then they will be affected by the sharp dive at
> checkpoint, which happens *every* checkpoint.

There are service-level agreement situations where a short and sharp disruption is more acceptable than a more prolonged one. As some of the overloaded I/O tests are starting to show, LDC may be a backward step for someone in that sort of environment. I am not a fan of introducing a replacement feature based on what I consider too limited testing, and I don't feel this one has been beaten on long enough yet to start pruning features that would allow better backward compatibility/transitioning. I think that's introducing an unnecessary risk to the design.

> We won't need to set checkpoint_segments so high, since performance is
> smoothed across checkpoints by LDC and it's OK to allow them more
> frequently. So this concern need not apply with LDC.

Performance *should* be smoothed across checkpoints by LDC and my concern *may* not apply. I think assuming it will always help based on the limited number of test results presented so far is extrapolation.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 2007-06-22 at 16:57 -0400, Greg Smith wrote:
> If you're not, I think you should be. Keeping that replay interval
> time down was one of the reasons why the people I was working with
> were displeased with the implications of the very spread out style of
> some LDC tunings. They were already unhappy with the implied recovery
> time of how high they had to set checkpoint_segments for good
> performance, and making it that much bigger aggravates the issue.
> Given a knob where the LDC can be spread out a bit but not across the
> entire interval, that makes it easier to control how much expansion
> there is relative to the current behavior.

We won't need to set checkpoint_segments so high, since performance is smoothed across checkpoints by LDC and it's OK to allow them more frequently. So this concern need not apply with LDC.

--
Simon Riggs EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 2007-06-22 at 16:21 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> 3. Recovery will take longer, because the distance last committed redo
>> ptr will lag behind more.
>
> True, you'd have to replay 1.5 checkpoint intervals on average instead
> of 0.5 (more or less, assuming checkpoints had been short). I don't
> think we're in the business of optimizing crash recovery time though.

Agreed, though only for crash recovery. Most archive recoveries will not change noticeably - half a checkpoint isn't very long in PITR terms.

> It might be that getting rid of checkpoint_smoothing is an overreaction,
> and that we should keep that parameter. IIRC Greg had worried about
> being able to turn this behavior off, so we'd still need at least a
> bool, and we might as well expose the fraction instead. (Still don't
> like the name "smoothing" though.)

ISTM we don't need the checkpoint_smoothing parameter.

I can't see why anyone would want to turn off smoothing: If they are doing many writes, then they will be affected by the sharp dive at checkpoint, which happens *every* checkpoint. The increase in crash recovery time is an easily acceptable trade-off for what is gained; if not, they can always make checkpoint_timeout smaller.

If they aren't doing many writes they don't care what the setting of checkpoint_smoothing is and recovery will be very fast anyway.

> I agree with removing the non-LRU part of the bgwriter's write logic
> though; that should simplify matters a bit and cut down the overall I/O
> volume.

Yes, agreed.

--
Simon Riggs EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PATCHES] Load Distributed Checkpoints, take 3
Simon Riggs wrote:
> On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote:
>> However, I think shortening the checkpoint interval is a perfectly valid
>> solution to that.
>
> Agreed. That's what checkpoint_timeout is for. Greg can't choose to use
> checkpoint_segments as the limit and then complain about unbounded
> recovery time, because that was clearly a conscious choice.

No, by checkpoint interval, I meant both checkpoint_timeout and checkpoint_segments. You can decrease checkpoint_segments as well to shorten recovery time.

--
Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 2007-06-22 at 22:19 +0100, Heikki Linnakangas wrote:
> However, I think shortening the checkpoint interval is a perfectly valid
> solution to that.

Agreed. That's what checkpoint_timeout is for. Greg can't choose to use checkpoint_segments as the limit and then complain about unbounded recovery time, because that was clearly a conscious choice.

> In any case, while people sometimes complain that we have a large WAL
> footprint, it's not usually a problem.

IMHO it's a huge problem. Turning on full_page_writes means that the amount of WAL generated varies fairly linearly with the number of blocks touched, which means large databases become a problem. Suzuki-san's team had results that showed this was a problem also.

> This is off-topic, but at PGCon in May, Itagaki-san and his colleagues
> whose names I can't remember, pointed out to me very clearly that our
> recovery is *slow*. So slow, that in the benchmarks they were running,
> their warm stand-by slave couldn't keep up with the master generating
> the WAL, even though both are running on the same kind of hardware.
> The reason is simple: There can be tens of backends doing I/O and
> generating WAL, but in recovery we serialize them. If you have decent
> I/O hardware that could handle for example 10 concurrent random I/Os, at
> recovery we'll be issuing them one at a time. That's a scalability
> issue, and doesn't show up on a laptop or a small server with a single disk.

The results showed that the current recovery isn't scalable beyond a certain point, not that it was slow per se. The effect isn't noticeable on systems generating fewer writes (of any size) or ones where the cache hit ratio is high. It isn't accurate to say laptops and small servers only, but it would be accurate to say high-volume, I/O-bound OLTP is the place where the scalability of recovery does not match the performance scalability of the master server.

Yes, we need to make recovery more scalable by de-serializing I/O. Slony also serializes changes onto the slave nodes, so the general problem of scalability of recovery solutions needs to be tackled.

> That's one of the first things I'm planning to tackle when the 8.4 dev
> cycle opens. And I'm planning to look at recovery times in general; I've
> never even measured it before so who knows what comes up.

I'm planning on working on recovery also, as is Florian. Let's make sure we coordinate what we do to avoid patch conflicts.

--
Simon Riggs EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
This message is going to come off as kind of angry, and I hope you don't take that personally. I'm very frustrated with this whole area right now but am unable to do anything to improve that situation. On Fri, 22 Jun 2007, Tom Lane wrote: If you've got specific evidence why any of these things need to be parameterized, let's see it. All I'm trying to suggest here is that you might want to pause and consider whether you want to make a change that might break existing, happily working installations based just on the small number of tests that have been done on this patch so far. A nice stack of DBT2 results is very informative, but the DBT2 workload is not everybody's workload. Did you see anybody else predicting issues with the LDC patch on overloaded systems as are starting to be seen in the 150 warehouse/90% latency figures in Heikki's most recent results? The way I remember that, it was just me pushing to expose that problem, because I knew it was there from my unfortunately private tests, but it was difficult to encounter the issue on other types of benchmarks (thanks again to Greg Stark and Heikki for helping with that). But that's fine, if you want to blow off the rest of my suggestions now just because the other things I'm worried about are also very hard problem to expose and I can't hand you over a smoking gun, that's your decision. Personally I think that we have a bad track record of exposing GUC variables as a substitute for understanding performance issues at the start, and this approach isn't doing any favors for DBAs. I think this project has an awful track record of introducing new GUC variables and never having a plan to follow through with a process to figure out how they should be set. The almost complete lack of standardization and useful tools for collecting performance information about this database boggles my mind, and you're never going to get the performance related sections of the GUC streamlined without it. We were just talking about the mess that is effective_cache_size recently. As a more topical example here, the background writer was officially released in early 2005, with a bizarre collection of tunables. I had to help hack on that code myself, over two years later, to even start exposing the internal statistics data needed to optimize it correctly. The main reason I can't prove some of my concerns is that I got so side-tracked adding the infrastructure needed to even show they exist that I wasn't able to nail down exactly what was going on well enough to generate a public test case before the project that exposed the issues wrapped up. Right at the moment the best thing to do seems to be to enable LDC with a low minimum write rate and a high target duration, and remove the thereby-obsoleted "all buffers" scan of the existing bgwriter logic. I have reason to believe there's a set of use cases where a more accelerated LDC approach than everyone seems to be learning toward is appropriate, which would then reinvigorate the need for the all-scan BGW component. I have a whole new design for the non-LRU background writer that fixes most of what's wrong with it I'm waiting for 8.4 to pass out and get feedback on, but if everybody is hell bent on just yanking the whole thing out in preference to these really lazy checkpoints go ahead and do what you want. My life would be easier if I just tossed all that out and forgot about the whole thing, and I'm real close to doing just that right now. 
> > Did anyone else ever notice that when a new xlog segment is created,
> > the write to clear it out doesn't happen via direct I/O like the rest
> > of the xlog writes do?
> It's not supposed to matter, because that path isn't supposed to be taken often.

Yes, but during the situation it does happen in--when checkpoints take so much longer than expected that more segments have to be created, or in an archive logger failure--it badly impacts an already unpleasant situation.

> > there's a whole class of issues involving recycling xlog segments this
> > would introduce I would be really unhappy with the implications of.
> Really? Name one.

You already mentioned expansion of the log segments used, which is a primary issue. Acting like all the additional segments used for some of the more extreme checkpoint spreading approaches are without cost is completely unrealistic IMHO. In the situation I just described above, I also noticed that the way O_DIRECT sync writes get mixed with buffered WAL writes seems to cause some weird I/O scheduling issues in Linux that can make worst-case latency degrade. But since I can't prove that, I guess I might as well not even mention that either. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
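For anyone trying to picture the code path being discussed, here is a bare POSIX sketch of zero-filling a new 16 MB WAL segment through the ordinary buffered write path. It is not the PostgreSQL source; the file name and structure are made up for the example.

    /* Illustrative POSIX sketch (not the PostgreSQL source): zero-filling a
     * freshly created 16 MB segment with plain buffered writes, which is the
     * case being discussed above. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define SEG_SIZE   (16 * 1024 * 1024)
    #define BLOCK_SIZE 8192

    int main(void)
    {
        char buf[BLOCK_SIZE];
        int  fd = open("000000010000000000000042.tmp",   /* hypothetical name */
                       O_WRONLY | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
            return 1;
        memset(buf, 0, sizeof(buf));

        /* Buffered writes: each write() lands in the OS page cache first, so a
         * burst of 2048 of them competes with other I/O when the kernel flushes,
         * unlike an O_DIRECT write that bypasses the cache. */
        for (int i = 0; i < SEG_SIZE / BLOCK_SIZE; i++)
            if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE)
                return 1;

        if (fsync(fd) != 0)     /* the flush is where the latency hit shows up */
            return 1;
        close(fd);
        return 0;
    }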
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith wrote:
> On Fri, 22 Jun 2007, Tom Lane wrote:
> > Greg had worried about being able to turn this behavior off, so we'd
> > still need at least a bool, and we might as well expose the fraction
> > instead. I agree with removing the non-LRU part of the bgwriter's
> > write logic though
> If you accept that being able to turn LDC off is important, people who
> aren't turning it on may still want the existing bgwriter_all_* settings
> so they can keep running things the way they are now. It's certainly
> reasonable to skip that code path when doing things the LDC way.

Well, if you don't want anything to change, don't upgrade. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith wrote:
> > True, you'd have to replay 1.5 checkpoint intervals on average instead
> > of 0.5 (more or less, assuming checkpoints had been short). I don't
> > think we're in the business of optimizing crash recovery time though.
> If you're not, I think you should be. Keeping that replay interval time
> down was one of the reasons why the people I was working with were
> displeased with the implications of the very spread out style of some
> LDC tunings. They were already unhappy with the implied recovery time of
> how high they had to set checkpoint_segments for good performance, and
> making it that much bigger aggravates the issue. Given a knob where the
> LDC can be spread out a bit but not across the entire interval, that
> makes it easier to control how much expansion there is relative to the
> current behavior.

I agree on that one: we *should* optimize crash recovery time. It may not be the most important thing on earth, but it's a significant consideration for some systems. However, I think shortening the checkpoint interval is a perfectly valid solution to that. It does lead to more full page writes, but in 8.3 more full page writes can actually make the recovery go faster, not slower, because we no longer read in the previous contents of the page when we restore it from a full page image. In any case, while people sometimes complain that we have a large WAL footprint, it's not usually a problem.

This is off-topic, but at PGCon in May, Itagaki-san and his colleagues whose names I can't remember, pointed out to me very clearly that our recovery is *slow*. So slow, that in the benchmarks they were running, their warm stand-by slave couldn't keep up with the master generating the WAL, even though both are running on the same kind of hardware. The reason is simple: There can be tens of backends doing I/O and generating WAL, but in recovery we serialize them. If you have decent I/O hardware that could handle for example 10 concurrent random I/Os, at recovery we'll be issuing them one at a time. That's a scalability issue, and doesn't show up on a laptop or a small server with a single disk.

That's one of the first things I'm planning to tackle when the 8.4 dev cycle opens. And I'm planning to look at recovery times in general; I've never even measured it before so who knows what comes up. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 22 Jun 2007, Tom Lane wrote:
> Greg had worried about being able to turn this behavior off, so we'd
> still need at least a bool, and we might as well expose the fraction
> instead. I agree with removing the non-LRU part of the bgwriter's write
> logic though

If you accept that being able to turn LDC off is important, people who aren't turning it on may still want the existing bgwriter_all_* settings so they can keep running things the way they are now. It's certainly reasonable to skip that code path when doing things the LDC way.

> True, you'd have to replay 1.5 checkpoint intervals on average instead
> of 0.5 (more or less, assuming checkpoints had been short). I don't
> think we're in the business of optimizing crash recovery time though.

If you're not, I think you should be. Keeping that replay interval time down was one of the reasons why the people I was working with were displeased with the implications of the very spread out style of some LDC tunings. They were already unhappy with the implied recovery time of how high they had to set checkpoint_segments for good performance, and making it that much bigger aggravates the issue. Given a knob where the LDC can be spread out a bit but not across the entire interval, that makes it easier to control how much expansion there is relative to the current behavior. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith <[EMAIL PROTECTED]> writes: > As the person who was complaining about corner cases I'm not in a position > to talk more explicitly about, I can at least summarize my opinion of how > I feel everyone should be thinking about this patch and you can take what > you want from that. Sorry, but we're not going to make decisions on the basis of evidence you won't show us. Right at the moment the best thing to do seems to be to enable LDC with a low minimum write rate and a high target duration, and remove the thereby-obsoleted "all buffers" scan of the existing bgwriter logic. If you've got specific evidence why any of these things need to be parameterized, let's see it. Personally I think that we have a bad track record of exposing GUC variables as a substitute for understanding performance issues at the start, and this approach isn't doing any favors for DBAs. > Whether or not I think this is an awesome idea, the very idea of a change > that big at this point gives me the willies. Just off the top of my head, > there's a whole class of issues involving recycling xlog segments this > would introduce I would be really unhappy with the implications of. Really? Name one. AFAICS this should not affect xlog recycling in the slightest (other than that we have to bump up the target number of live segments from 2x to 3x the checkpoint distance). > Did anyone else ever notice that when a new xlog segment is created, > the write to clear it out doesn't happen via direct I/O like the rest > of the xlog writes do? It's not supposed to matter, because that path isn't supposed to be taken often. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Ok, if we approach this from the idea that there will be *no* GUC > variables at all to control this, and we remove the bgwriter_all_* > settings as well, does anyone see a reason why that would be bad? Here's > the ones mentioned this far: > 1. we need to keep 2x as much WAL segments around as before. No, only 50% more. We've always kept WAL back to the checkpoint-before-last, so the peak usage has been about 2 checkpoint distances (plus a bit), and now it'd tend to be about 3. > 2. pg_start_backup will need to wait for a long time. > 3. Recovery will take longer, because the distance last committed redo > ptr will lag behind more. True, you'd have to replay 1.5 checkpoint intervals on average instead of 0.5 (more or less, assuming checkpoints had been short). I don't think we're in the business of optimizing crash recovery time though. It might be that getting rid of checkpoint_smoothing is an overreaction, and that we should keep that parameter. IIRC Greg had worried about being able to turn this behavior off, so we'd still need at least a bool, and we might as well expose the fraction instead. (Still don't like the name "smoothing" though.) I agree with removing the non-LRU part of the bgwriter's write logic though; that should simplify matters a bit and cut down the overall I/O volume. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
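As a concrete illustration of the "2 versus 3 checkpoint distances" arithmetic above, here is a small back-of-the-envelope sketch. The checkpoint_segments = 10 setting is just an assumed example, and the precise formula in the documentation carries an extra "+ 1" segment that is ignored here.

    /* Rough peak pg_xlog size before and after spread checkpoints, per the
     * "about 2 vs. about 3 checkpoint distances" point above. Illustrative
     * numbers only. */
    #include <stdio.h>

    int main(void)
    {
        const int    checkpoint_segments = 10;   /* assumed setting */
        const double seg_mb = 16.0;              /* default segment size */

        double before = 2.0 * checkpoint_segments * seg_mb;  /* back to checkpoint-before-last */
        double after  = 3.0 * checkpoint_segments * seg_mb;  /* one extra distance when spread out */

        printf("peak pg_xlog before: ~%.0f MB, after: ~%.0f MB (about 50%% more)\n",
               before, after);
        return 0;
    }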
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 22 Jun 2007, Tom Lane wrote:
> Yeah, I'm not sure that we've thought through the interactions with the
> existing bgwriter behavior.

The entire background writer mess needs a rewrite, and the best way to handle that is going to shift considerably with LDC applied.

As the person who was complaining about corner cases I'm not in a position to talk more explicitly about, I can at least summarize my opinion of how I feel everyone should be thinking about this patch and you can take what you want from that. In the context of getting an 8.3 release finalized, I think you should be building an LDC implementation that accomplishes the main goal of spreading the checkpoints out, which is clearly working, and erring on the side of more knobs in cases where it's unsure whether they're needed or not. It's clear what my position is on which non-obvious knobs I think are important for some people. Which is worse: putting in a tuning setting that it's discovered everybody just uses the same value for, or discovering after release that there's a use-case for that setting and by hard-coding it you've made the DBAs for a class of applications you didn't think about miserable? In the cases where there's good evidence so far of the right setting, just make that the default, and the only harm is GUC bloat. Nothing should be done that changes the existing behavior if the LDC feature is turned off, so anything more obtrusive to the background writer is right out.

Make reducing the knobs, optimizing the default behavior, and rewriting the background writer to better fit into its new context a major goal of 8.4. I know I've got a whole notebook full of stuff on that topic I've been ignoring so as not to distract you guys from getting 8.3 done. That's the low-risk plan, and the design/beta/release period here is short enough that I think going too experimental beyond that is a bad idea. To pick an example, when I read this idea from Heikki:

> You would have a checkpoint running practically all the time, and you
> would use checkpoint_timeout/checkpoint_segments to control how long it
> takes... If we do that, we should remove bgwriter_all_* settings

Whether or not I think this is an awesome idea, the very idea of a change that big at this point gives me the willies. Just off the top of my head, there's a whole class of issues involving recycling xlog segments this would introduce I would be really unhappy with the implications of. Did anyone else ever notice that when a new xlog segment is created, the write to clear it out doesn't happen via direct I/O like the rest of the xlog writes do? That write goes through the regular buffered path instead. The checkpoint logging patch I submitted logs when this happens specifically because that particular issue messed with some operations and I found it important to be aware of. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: Maybe I misread the patch, but I thought that if someone requested an immediate checkpoint, the checkpoint-in-progress would effectively flip to immediate mode. So that could be handled by offering an immediate vs extended checkpoint option in pg_start_backup. I'm not sure it's a problem though, since as previously noted you probably want pg_start_backup to be noninvasive. Also, one could do a manual CHECKPOINT command then immediately pg_start_backup if one wanted as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?) Yeah, that's possible. and recovery would need to process on average 1.5 as much WAL as before. Though with LDC, you should get away with shorter checkpoint intervals than before, because the checkpoints aren't as invasive. No, you still want a pretty long checkpoint interval, because of the increase in WAL traffic due to more page images being dumped when the interval is short. If we do that, we should remove bgwriter_all_* settings. They wouldn't do much because we would have checkpoint running all the time, writing out dirty pages. Yeah, I'm not sure that we've thought through the interactions with the existing bgwriter behavior. I searched the archives a bit for the discussions when the current bgwriter settings were born, and found this thread: http://archives.postgresql.org/pgsql-hackers/2004-12/msg00784.php The idea of Load Distributed Checkpoints certainly isn't new :). Ok, if we approach this from the idea that there will be *no* GUC variables at all to control this, and we remove the bgwriter_all_* settings as well, does anyone see a reason why that would be bad? Here's the ones mentioned this far: 1. we need to keep 2x as much WAL segments around as before. 2. pg_start_backup will need to wait for a long time. 3. Recovery will take longer, because the distance last committed redo ptr will lag behind more. 1. and 3. can be alleviated by using a smaller checkpoint_timeout/segments though as you pointed out that leads to higher WAL traffic. 2. is not a big deal, and we can add an 'immediate' parameter to pg_start_backup if necessary. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> I still think you've not demonstrated a need to expose this parameter. > Greg Smith wanted to explicitly control the I/O rate, and let the > checkpoint duration vary. I personally think that fixing the checkpoint > duration is better because it's easier to tune. > But if we only do that, you might end up with ridiculously long > checkpoints when there's not many dirty pages. If we want to avoid that, > we need some way of telling what's a safe minimum rate to write at, > because that can vary greatly depending on your hardware. As long as the minimum rate is at least 1/bgdelay, I don't think this is a big problem. > But maybe we don't care about prolonging checkpoints, and don't really > need any GUCs at all. We could then just hardcode writes_per_nap to some > low value, and target duration close to 1.0. You would have a checkpoint > running practically all the time, and you would use > checkpoint_timeout/checkpoint_segments to control how long it takes. I'm > a bit worried about jumping to such a completely different regime, > though. For example, pg_start_backup needs to create a new checkpoint, > so it would need to wait on average 1.5 * checkpoint_timeout/segments, Maybe I misread the patch, but I thought that if someone requested an immediate checkpoint, the checkpoint-in-progress would effectively flip to immediate mode. So that could be handled by offering an immediate vs extended checkpoint option in pg_start_backup. I'm not sure it's a problem though, since as previously noted you probably want pg_start_backup to be noninvasive. Also, one could do a manual CHECKPOINT command then immediately pg_start_backup if one wanted as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?) > and recovery would need to process on average 1.5 as much WAL as before. > Though with LDC, you should get away with shorter checkpoint intervals > than before, because the checkpoints aren't as invasive. No, you still want a pretty long checkpoint interval, because of the increase in WAL traffic due to more page images being dumped when the interval is short. > If we do that, we should remove bgwriter_all_* settings. They wouldn't > do much because we would have checkpoint running all the time, writing > out dirty pages. Yeah, I'm not sure that we've thought through the interactions with the existing bgwriter behavior. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: Tom Lane wrote: (BTW, the patch seems a bit schizoid about whether checkpoint_rate is int or float.) Yeah, I've gone back and forth on the data type. I wanted it to be a float, but guc code doesn't let you specify a float in KB, so I switched it to int. I seriously question trying to claim that it's blocks at all, seeing that the *actual* units are pages per unit time. Pretending that it's a memory unit does more to obscure the truth than document it. Hmm. What I wanted to avoid is that the I/O rate you get then depends on your bgwriter_delay, so you if you change that you need to change checkpoint_min_rate as well. Now we already have that issue with bgwriter_all/lru_maxpages, and I don't like it there either. If you think it's better to let the user define it directly as pages/bgwriter_delay, fine. And checkpoint_rate really needs to be named checkpoint_min_rate, if it's going to be a minimum. However, I question whether we need it at all, Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, using the non-broken formula above, we get: (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. So I think that's fine. "Fine?" That's 12x the value you have actually tested. That's enough of a change to invalidate all your performance testing IMHO. I'll reschedule the tests to be sure, after we settle on how we want to control this feature. I still think you've not demonstrated a need to expose this parameter. Greg Smith wanted to explicitly control the I/O rate, and let the checkpoint duration vary. I personally think that fixing the checkpoint duration is better because it's easier to tune. But if we only do that, you might end up with ridiculously long checkpoints when there's not many dirty pages. If we want to avoid that, we need some way of telling what's a safe minimum rate to write at, because that can vary greatly depending on your hardware. But maybe we don't care about prolonging checkpoints, and don't really need any GUCs at all. We could then just hardcode writes_per_nap to some low value, and target duration close to 1.0. You would have a checkpoint running practically all the time, and you would use checkpoint_timeout/checkpoint_segments to control how long it takes. I'm a bit worried about jumping to such a completely different regime, though. For example, pg_start_backup needs to create a new checkpoint, so it would need to wait on average 1.5 * checkpoint_timeout/segments, and recovery would need to process on average 1.5 as much WAL as before. Though with LDC, you should get away with shorter checkpoint intervals than before, because the checkpoints aren't as invasive. If we do that, we should remove bgwriter_all_* settings. They wouldn't do much because we would have checkpoint running all the time, writing out dirty pages. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> (BTW, the patch seems >> a bit schizoid about whether checkpoint_rate is int or float.) > Yeah, I've gone back and forth on the data type. I wanted it to be a > float, but guc code doesn't let you specify a float in KB, so I switched > it to int. I seriously question trying to claim that it's blocks at all, seeing that the *actual* units are pages per unit time. Pretending that it's a memory unit does more to obscure the truth than document it. >> What's bugging me about this is that we are either going to be writing >> at checkpoint_rate if ahead of schedule, or max possible rate if behind; >> that's not "smoothing" to me, that's introducing some pretty bursty >> behavior. > That sounds a lot more complex. The burstiness at that level shouldn't > matter much. The OS is still going to cache the writes, and should even > out the bursts. With the changes you're proposing here, the burstiness would be quite severe. OTOH, if writes_per_nap is always 1, then bufmgr is going to recheck the delay situation after every page, so what you have actually tested is as granular as it could get. >> And checkpoint_rate really needs to be named checkpoint_min_rate, if >> it's going to be a minimum. However, I question whether we need it at >> all, > Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, > using the non-broken formula above, we get: > (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. > So I think that's fine. "Fine?" That's 12x the value you have actually tested. That's enough of a change to invalidate all your performance testing IMHO. I still think you've not demonstrated a need to expose this parameter. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: > And checkpoint_rate really needs to be named checkpoint_min_rate, if > it's going to be a minimum. However, I question whether we need it at > all, because as the code stands, with the default BgWriterDelay you > would have to increase checkpoint_rate to 4x its proposed default before > writes_per_nap moves off its minimum of 1. This says to me that the > system's tested behavior has been so insensitive to checkpoint_rate > that we probably need not expose such a parameter at all; just hardwire > the minimum writes_per_nap at 1. I agree on starting with as simple a GUC as possible and expand it as there is need in the field. 99% of people are never going to test this value as much as you are. -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: 1. checkpoint_rate is used thusly: writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay); where writes_per_nap is the max number of dirty blocks to write before taking a bgwriter nap. Now surely this is completely backward: if BgWriterDelay is increased, the number of writes to allow per nap decreases? If you think checkpoint_rate is expressed in some kind of physical bytes/sec unit, that cannot be right; the number of blocks per nap has to increase if the naps get longer. Uh, you're right, that's just plain wrong. checkpoint_rate is in pages/s, so that line should be writes_per_nap = Min(1, checkpoint_rate * BgWriterDelay / 1000) (BTW, the patch seems a bit schizoid about whether checkpoint_rate is int or float.) Yeah, I've gone back and forth on the data type. I wanted it to be a float, but guc code doesn't let you specify a float in KB, so I switched it to int. 2. checkpoint_smoothing is used thusly: /* scale progress according to CheckPointSmoothing */ progress *= CheckPointSmoothing; where the progress value being scaled is the fraction so far completed of the total number of dirty pages we expect to have to write. This is then compared against estimates of the total fraction of the time- between-checkpoints that has elapsed; if less, we are behind schedule and should not nap, if more, we are ahead of schedule and may nap. (This is a bit odd, but I guess it's all right because it's equivalent to dividing the elapsed-time estimate by CheckPointSmoothing, which seems a more natural way of thinking about what needs to happen.) Yeah, it's a bit unnatural. I did it that way so we don't have to divide all three of the estimates: cached_elapsed, progress_in_time and progress_in_xlog. Maybe it's not worth micro-optimizing that. What's bugging me about this is that we are either going to be writing at checkpoint_rate if ahead of schedule, or max possible rate if behind; that's not "smoothing" to me, that's introducing some pretty bursty behavior. ISTM that actual "smoothing" would involve adjusting writes_per_nap up or down according to whether we are ahead or behind schedule, so as to have a finer degree of control over the I/O rate. (I'd also consider saving the last writes_per_nap value across checkpoints so as to have a more nearly accurate starting value next time.) That sounds a lot more complex. The burstiness at that level shouldn't matter much. The OS is still going to cache the writes, and should even out the bursts. Assuming time/xlogs elapse at a steady rate, we will write some multiple of writes_per_nap pages between each sleep. With a small writes_per_nap, writing just writes_per_nap isn't enough to catch up after we fall behind, so we'll write more than that between each sleep. That means that on each iteration, we'll write either N*writes_per_nap, or (N+1)*writes_per_nap. At worst, that means either writes_per_nap or 2*writes_per_nap pages on each iteration. That's not too bad, I think. In any case I still concur with Takahiro-san that "smoothing" doesn't seem the most appropriate name for the parameter. Something along the lines of "checkpoint_completion_target" would convey more about what it does, I think. Ok, I'm not wedded to smoothing. And checkpoint_rate really needs to be named checkpoint_min_rate, if it's going to be a minimum. However, I question whether we need it at all, because as the code stands, with the default BgWriterDelay you would have to increase checkpoint_rate to 4x its proposed default before writes_per_nap moves off its minimum of 1. 
Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, using the non-broken formula above, we get: (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. So I think that's fine. (looking at the patch, I see that the default in guc.c is actually 100 pages / s, contrary to documentation; needs to be fixed) This says to me that the system's tested behavior has been so insensitive to checkpoint_rate that we probably need not expose such a parameter at all; just hardwire the minimum writes_per_nap at 1. I've set checkpoint_rate to a small value in my tests on purpose to control the feature with the other parameter. That must be why I haven't noticed the bogus calculation of it before. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
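To make the arithmetic above easy to check, here is a small sketch of the rate-to-pages-per-nap conversion with the corrected direction (pages per nap has to grow as the nap gets longer), using the same example numbers: 512 kB/s, 8 kB pages, and a 200 ms bgwriter_delay. It is only an illustration of the calculation, not the patch's code; the variable names are made up.

    /* Rate -> pages-per-nap conversion from the example above; prints 12. */
    #include <stdio.h>

    int main(void)
    {
        const double checkpoint_min_rate_kb = 512.0;  /* kB/s */
        const double block_size_kb = 8.0;
        const int    bgwriter_delay_ms = 200;

        double pages_per_sec  = checkpoint_min_rate_kb / block_size_kb;
        int    writes_per_nap = (int) (pages_per_sec * bgwriter_delay_ms / 1000.0);

        if (writes_per_nap < 1)          /* the minimum of 1 mentioned above */
            writes_per_nap = 1;

        printf("writes_per_nap = %d\n", writes_per_nap);
        return 0;
    }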
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> So the question is, why in the heck would anyone want the behavior that >> "checkpoints take exactly X time"?? > Because it's easier to tune. You don't need to know how much checkpoint > I/O you can tolerate. The system will use just enough I/O bandwidth to > meet the deadline, but not more than that. Uh, not really. After consuming more caffeine and reading the patch more carefully, I think there are several problems here: 1. checkpoint_rate is used thusly: writes_per_nap = Min(1, checkpoint_rate / BgWriterDelay); where writes_per_nap is the max number of dirty blocks to write before taking a bgwriter nap. Now surely this is completely backward: if BgWriterDelay is increased, the number of writes to allow per nap decreases? If you think checkpoint_rate is expressed in some kind of physical bytes/sec unit, that cannot be right; the number of blocks per nap has to increase if the naps get longer. (BTW, the patch seems a bit schizoid about whether checkpoint_rate is int or float.) 2. checkpoint_smoothing is used thusly: /* scale progress according to CheckPointSmoothing */ progress *= CheckPointSmoothing; where the progress value being scaled is the fraction so far completed of the total number of dirty pages we expect to have to write. This is then compared against estimates of the total fraction of the time- between-checkpoints that has elapsed; if less, we are behind schedule and should not nap, if more, we are ahead of schedule and may nap. (This is a bit odd, but I guess it's all right because it's equivalent to dividing the elapsed-time estimate by CheckPointSmoothing, which seems a more natural way of thinking about what needs to happen.) What's bugging me about this is that we are either going to be writing at checkpoint_rate if ahead of schedule, or max possible rate if behind; that's not "smoothing" to me, that's introducing some pretty bursty behavior. ISTM that actual "smoothing" would involve adjusting writes_per_nap up or down according to whether we are ahead or behind schedule, so as to have a finer degree of control over the I/O rate. (I'd also consider saving the last writes_per_nap value across checkpoints so as to have a more nearly accurate starting value next time.) In any case I still concur with Takahiro-san that "smoothing" doesn't seem the most appropriate name for the parameter. Something along the lines of "checkpoint_completion_target" would convey more about what it does, I think. And checkpoint_rate really needs to be named checkpoint_min_rate, if it's going to be a minimum. However, I question whether we need it at all, because as the code stands, with the default BgWriterDelay you would have to increase checkpoint_rate to 4x its proposed default before writes_per_nap moves off its minimum of 1. This says to me that the system's tested behavior has been so insensitive to checkpoint_rate that we probably need not expose such a parameter at all; just hardwire the minimum writes_per_nap at 1. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
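A minimal sketch of the scheduling comparison described in point 2 above: progress through the checkpoint's dirty pages, scaled by the smoothing fraction, is compared with the elapsed fraction of the checkpoint interval, and the writer naps only when it is ahead of schedule. The names here are illustrative, not the identifiers used in the patch.

    #include <stdbool.h>
    #include <stdio.h>

    /* Returns true if the checkpoint writer may nap, per the scheme above. */
    bool
    ahead_of_schedule(int pages_written, int pages_total,
                      double elapsed_fraction,  /* of time/xlog since last checkpoint, 0..1 */
                      double smoothing)         /* target completion fraction, e.g. 0.3 */
    {
        double progress = (double) pages_written / pages_total;

        /* Equivalent to comparing progress against elapsed_fraction / smoothing. */
        return progress * smoothing > elapsed_fraction;
    }

    int main(void)
    {
        /* Halfway through the dirty pages when 10% of the interval has elapsed:
         * 0.5 * 0.3 = 0.15 > 0.10, so the writer is ahead and may nap. */
        printf("may nap: %d\n", ahead_of_schedule(500, 1000, 0.10, 0.3));
        return 0;
    }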
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: The main tuning knob is checkpoint_smoothing, which is defined as a fraction of the checkpoint interval (both checkpoint_timeout and checkpoint_segments are taken into account). Normally, the write phase of a checkpoint takes exactly that much time So the question is, why in the heck would anyone want the behavior that "checkpoints take exactly X time"?? The useful part of this whole patch is to cap the write rate at something that doesn't interfere too much with foreground queries. I don't see why people wouldn't prefer "checkpoints can take any amount of time up to the checkpoint interval, but we do our best not to exceed Y writes/second". Because it's easier to tune. You don't need to know how much checkpoint I/O you can tolerate. The system will use just enough I/O bandwidth to meet the deadline, but not more than that. Basically I don't see what useful values checkpoint_smoothing would have other than 0 and 1. You might as well make it a bool. Well that's one option. It feels like a good thing to be able to control how much headroom you have until the next checkpoint, but maybe we can just hardcode it close to 1. It's also good to avoid spreading the checkpoints unnecessarily, to keep recovery times lower, but you can control that with the min rate setting as well. There's another possible strategy: keep the I/O rate constant, but vary the length of the checkpoint. checkpoint_rate allows you to do that. But only from the lower side. Now how would you replace checkpoint_smoothing with a max I/O rate? I don't see why you think that's hard. It looks to me like the components of the decision are the same numbers in any case: you have to estimate your progress towards checkpoint completion, your available time till next checkpoint, and your write rate. Then you either delay or not. Let me put it this way: If you define a min and a max I/O rate, when would the max I/O rate limit take effect? If there's few dirty buffers in the pool, so that you'll finish the checkpoint in time before the next one is due writing at the min rate, that's what you'll use. If there's more, you'll need to write fast enough that you'll finish the checkpoint in time, regardless of the max rate. Or would you let the next checkpoint slip and keep writing at the max rate? That seems like a footgun if someone sets it to a too low value. Or are you thinking that we have just one setting: checkpoint_rate? You describe it as a maximum, but I've been thinking of it as a minimum because you *will* exceed it if the next checkpoint is due soon. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > I don't think you understand how the settings work. Did you read the > documentation? If you did, it's apparently not adequate. I did read the documentation, and I'm not complaining that I don't understand it. I'm complaining that I don't like the presented API because it's self-inconsistent. You've got two parameters that are in effect upper and lower bounds for the checkpoint write rate, but they are named inconsistently and not even measured in the same kind of unit. Nor do I agree that the inconsistency buys any ease of use. > The main tuning knob is checkpoint_smoothing, which is defined as a > fraction of the checkpoint interval (both checkpoint_timeout and > checkpoint_segments are taken into account). Normally, the write phase > of a checkpoint takes exactly that much time. So the question is, why in the heck would anyone want the behavior that "checkpoints take exactly X time"?? The useful part of this whole patch is to cap the write rate at something that doesn't interfere too much with foreground queries. I don't see why people wouldn't prefer "checkpoints can take any amount of time up to the checkpoint interval, but we do our best not to exceed Y writes/second". Basically I don't see what useful values checkpoint_smoothing would have other than 0 and 1. You might as well make it a bool. > There's another possible strategy: keep the I/O rate constant, but vary > the length of the checkpoint. checkpoint_rate allows you to do that. But only from the lower side. > Now how would you replace checkpoint_smoothing with a max I/O rate? I don't see why you think that's hard. It looks to me like the components of the decision are the same numbers in any case: you have to estimate your progress towards checkpoint completion, your available time till next checkpoint, and your write rate. Then you either delay or not. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: Tom Lane wrote: I tend to agree with whoever said upthread that the combination of GUC variables proposed here is confusing and ugly. It'd make more sense to have min and max checkpoint rates in KB/s, with the max checkpoint rate only honored when we are predicting we'll finish before the next checkpoint time. Really? I thought everyone is happy with the current combination, and that it was just the old trio of parameters controlling the write, nap and sync phases that was ugly. One particularly nice thing about defining the duration as a fraction of checkpoint interval is that we can come up with a reasonable default value that doesn't depend on your hardware. That argument would hold some water if you weren't introducing a hardware-dependent min rate in the same patch. Do we need the min rate at all? If so, why can't it be in the same units as the max (ie, a fraction of checkpoint)? How would a min and max rate work? Pretty much the same as the code does now, no? You either delay, or not. I don't think you understand how the settings work. Did you read the documentation? If you did, it's apparently not adequate. The main tuning knob is checkpoint_smoothing, which is defined as a fraction of the checkpoint interval (both checkpoint_timeout and checkpoint_segments are taken into account). Normally, the write phase of a checkpoint takes exactly that much time. So the length of a checkpoint stays the same regardless of how many dirty buffers there is (assuming you don't exceed the bandwidth of your hardware), but the I/O rate varies. There's another possible strategy: keep the I/O rate constant, but vary the length of the checkpoint. checkpoint_rate allows you to do that. I'm envisioning we set the defaults so that checkpoint_smoothing is the effective control in a relatively busy system, and checkpoint_rate ensures that we don't unnecessarily prolong checkpoints on an idle system. Now how would you replace checkpoint_smoothing with a max I/O rate? The only real alternative way of defining it is directly as a time and/or segments, similar to checkpoint_timeout and checkpoint_segments, but then we'd really need two knobs instead of one. Though maybe we could just hard-code it to 0.8, for example, and tell people to tune checkpoint_rate if they want more aggressive checkpoints. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
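Spelling out the two regimes described above with some assumed numbers may help: checkpoint_smoothing fixes the target duration (a fraction of the checkpoint interval) and lets the write rate float with the number of dirty buffers, while checkpoint_rate puts a floor under the rate so a nearly clean buffer cache doesn't stretch the checkpoint out needlessly. The values below use the defaults mentioned upthread (0.3 and 512 kB/s) with an assumed 5-minute checkpoint_timeout; the dirty-page counts are made-up examples, and this is not the patch's control loop.

    #include <stdio.h>

    int main(void)
    {
        const double interval_s  = 300.0;   /* assumed checkpoint_timeout */
        const double smoothing   = 0.3;     /* target duration fraction */
        const double min_rate_ps = 64.0;    /* 512 kB/s at 8 kB pages */
        const int    dirty_pages[] = {500, 50000};

        for (int i = 0; i < 2; i++)
        {
            double target_s   = smoothing * interval_s;
            double paced_rate = dirty_pages[i] / target_s;  /* rate needed to finish on time */
            double rate       = paced_rate > min_rate_ps ? paced_rate : min_rate_ps;
            double duration   = dirty_pages[i] / rate;

            /* Idle system: min rate governs, checkpoint finishes early.
             * Busy system: smoothing governs, checkpoint takes the target time. */
            printf("%6d dirty pages: ~%.0f pages/s for ~%.0f s\n",
                   dirty_pages[i], rate, duration);
        }
        return 0;
    }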
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> I tend to agree with whoever said upthread that the combination of GUC >> variables proposed here is confusing and ugly. It'd make more sense to >> have min and max checkpoint rates in KB/s, with the max checkpoint rate >> only honored when we are predicting we'll finish before the next >> checkpoint time. > Really? I thought everyone is happy with the current combination, and > that it was just the old trio of parameters controlling the write, nap > and sync phases that was ugly. One particularly nice thing about > defining the duration as a fraction of checkpoint interval is that we > can come up with a reasonable default value that doesn't depend on your > hardware. That argument would hold some water if you weren't introducing a hardware-dependent min rate in the same patch. Do we need the min rate at all? If so, why can't it be in the same units as the max (ie, a fraction of checkpoint)? > How would a min and max rate work? Pretty much the same as the code does now, no? You either delay, or not. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: In fact, I think there's a small race condition in CVS HEAD: Yeah, probably --- the original no-locking design didn't have any side flags. The reason you need the lock is for a backend to be sure that a newly-started checkpoint is using its requested flags. But the detection of checkpoint termination is still the same. Actually, the race condition I outlined isn't related to the flags. It's possible because RequestCheckpoint doesn't guarantee that a checkpoint is performed when there's been no WAL activity since last one. I did use a new force-flag to fix it, but I'm sure there is other ways. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: I added a spinlock to protect the signaling fields between bgwriter and backends. The current non-locking approach gets really difficult as the patch adds two new flags, and both are more important than the existing ckpt_time_warn flag. That may be, but you could minimize the cost and notational ugliness by not using the spinlock where you don't have to. Put the sig_atomic_t fields back the way they were, and many of the uses of the spinlock go away. All you really need it for is the places where the additional flags are set or read. I find it easier to understand if it's used whenever any of the fields are accessed. You don't need to read and write ckpt_failed and ckpt_started/ckpt_done in specific order anymore, for example. Some other comments: I tend to agree with whoever said upthread that the combination of GUC variables proposed here is confusing and ugly. It'd make more sense to have min and max checkpoint rates in KB/s, with the max checkpoint rate only honored when we are predicting we'll finish before the next checkpoint time. Really? I thought everyone is happy with the current combination, and that it was just the old trio of parameters controlling the write, nap and sync phases that was ugly. One particularly nice thing about defining the duration as a fraction of checkpoint interval is that we can come up with a reasonable default value that doesn't depend on your hardware. How would a min and max rate work? Anyone else have an opinion on the parameters? The flow of control and responsibility between xlog.c, bgwriter.c and bufmgr.c seems to have reached the breaking point of unintelligibility. Can you think of a refactoring that would simplify it? We might have to resign ourselves to this much messiness, but I'd rather not... The problem we're trying to solve is doing a checkpoint while running the normal bgwriter activity at the same time. The normal way to do two things simultaneously is to have two different processes (or threads). I thought about having a separate checkpoint process, but I gave up on that thought because the communication needed between backends, bgwriter and the checkpointer seems like a mess. The checkpointer would need the pendingOpsTable so that it can do the fsyncs, and it would also need to receive the forget-messages to that table. We could move that table entirely to the checkpointer, but since bgwriter is presumably doing most of the writes, there would be a lot of chatter between bgwriter and the checkpointer. The current approach is like co-operative multitasking. BufferSyncs yields control to bgwriter every now and then. The division of labor between xlog.c and other modules is not that bad, IMO. CreateCheckPoint is the main entry point to create a checkpoint, and it calls other modules to do their stuff, including bufmgr.c. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
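For readers trying to follow the signaling discussion, here is a rough sketch of the request/acknowledge protocol being described: a backend notes the current ckpt_started counter, sets its request flags, signals the bgwriter, waits for ckpt_started to advance (proof its request was picked up), and then waits for ckpt_done to catch up, with one lock guarding the counters and the new flags together. This is only an illustration, using a pthread mutex in place of the spinlock; the names are not the patch's identifiers.

    #include <pthread.h>
    #include <stdbool.h>
    #include <unistd.h>

    typedef struct CheckpointShmem
    {
        pthread_mutex_t lock;               /* stands in for the spinlock */
        int             ckpt_started;       /* bumped when a checkpoint begins */
        int             ckpt_done;          /* bumped when it finishes */
        int             ckpt_failed;        /* bumped on failure */
        bool            request_immediate;  /* the kind of new flag under discussion */
        bool            request_force;      /* force even if no WAL since last one */
    } CheckpointShmem;

    /* Backend side: ask for a checkpoint and wait for one that started after
     * the request to complete, re-reading the counters under the lock. */
    void
    request_checkpoint(CheckpointShmem *cp, bool immediate, bool force)
    {
        int old_started, new_started, done;

        pthread_mutex_lock(&cp->lock);
        old_started = cp->ckpt_started;
        cp->request_immediate = immediate;
        cp->request_force = force;
        pthread_mutex_unlock(&cp->lock);

        /* (real code would signal the bgwriter process here) */

        do {                                /* wait for a new checkpoint to begin */
            usleep(10000);
            pthread_mutex_lock(&cp->lock);
            new_started = cp->ckpt_started;
            pthread_mutex_unlock(&cp->lock);
        } while (new_started == old_started);

        do {                                /* then wait for it to finish */
            usleep(10000);
            pthread_mutex_lock(&cp->lock);
            done = cp->ckpt_done;
            pthread_mutex_unlock(&cp->lock);
        } while (done < new_started);
    }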
Re: [PATCHES] Load Distributed Checkpoints, take 3
ITAGAKI Takahiro wrote:
> The only thing I don't understand is the naming of 'checkpoint_smoothing'.
> Can users imagine the unit of 'smoothing' is a fraction? You explain the
> parameter with the word 'fraction'. Why don't you simply name it
> 'checkpoint_fraction'?
> | Specifies the target length of checkpoints, as a fraction of
> | the checkpoint interval. The default is 0.3.

I chose checkpoint_smoothing because it tells you what the parameter is for. If you want more smoothing, tune it up, and if you want less, tune it down. checkpoint_fraction makes you wonder what you can do with it and why you would change it.

> Sorry if I'm missing discussions about the naming.

No, I chose _smoothing on my own. People didn't like checkpoint_write_percent either (including me). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
> Here's an updated WIP patch for load distributed checkpoints.
> Since last patch, I did some clean up and refactoring, and added a bunch
> of comments, and user documentation.

The only thing I don't understand is the naming of 'checkpoint_smoothing'. Can users imagine the unit of 'smoothing' is a fraction? You explain the parameter with the word 'fraction'. Why don't you simply name it 'checkpoint_fraction'?

| Specifies the target length of checkpoints, as a fraction of
| the checkpoint interval. The default is 0.3.

Sorry if I'm missing discussions about the naming. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > I added a spinlock to protect the signaling fields between bgwriter and > backends. The current non-locking approach gets really difficult as the > patch adds two new flags, and both are more important than the existing > ckpt_time_warn flag. That may be, but you could minimize the cost and notational ugliness by not using the spinlock where you don't have to. Put the sig_atomic_t fields back the way they were, and many of the uses of the spinlock go away. All you really need it for is the places where the additional flags are set or read. > In fact, I think there's a small race condition in CVS HEAD: Yeah, probably --- the original no-locking design didn't have any side flags. The reason you need the lock is for a backend to be sure that a newly-started checkpoint is using its requested flags. But the detection of checkpoint termination is still the same. Some other comments: I tend to agree with whoever said upthread that the combination of GUC variables proposed here is confusing and ugly. It'd make more sense to have min and max checkpoint rates in KB/s, with the max checkpoint rate only honored when we are predicting we'll finish before the next checkpoint time. The flow of control and responsibility between xlog.c, bgwriter.c and bufmgr.c seems to have reached the breaking point of unintelligibility. Can you think of a refactoring that would simplify it? We might have to resign ourselves to this much messiness, but I'd rather not... regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[PATCHES] Load Distributed Checkpoints, take 3
Here's an updated WIP patch for load distributed checkpoints. I added a spinlock to protect the signaling fields between bgwriter and backends. The current non-locking approach gets really difficult as the patch adds two new flags, and both are more important than the existing ckpt_time_warn flag. In fact, I think there's a small race condition in CVS HEAD: 1. pg_start_backup() is called, which calls RequestCheckpoint 2. RequestCheckpoint takes note of the old value of ckpt_started 3. bgwriter wakes up from pg_usleep, and sees that we've exceeded checkpoint_timeout. 4. bgwriter increases ckpt_started to note that a new checkpoint has started 5. RequestCheckpoint signals bgwriter to start a new checkpoint 6. bgwriter calls CreateCheckpoint, with the force-flag set to false because this checkpoint was triggered by timeout 7. RequestCheckpoint sees that ckpt_started has increased, and starts to wait for ckpt_done to reach the new value. 8. CreateCheckpoint finishes immediately, because there was no XLOG activity since last checkpoint. 9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns. 10. pg_start_backup() continues, with potentially the same redo location and thus history filename as previous backup. Now I admit that the chances for that to happen are extremely small, people don't usually do two pg_start_backup calls without *any* WAL logged activity in between them, for example. But as we add the new flags, avoiding scenarios like that becomes harder. Since last patch, I did some clean up and refactoring, and added a bunch of comments, and user documentation. I haven't yet changed GetInsertRecPtr to use the almost up-to-date value protected by the info_lck per Simon's suggestion, and I need to do some correctness testing. After that, I'm done with the patch. Ps. In case you wonder what took me so long since last revision, I've spent a lot of time reviewing HOT. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: doc/src/sgml/config.sgml === RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/config.sgml,v retrieving revision 1.126 diff -c -r1.126 config.sgml *** doc/src/sgml/config.sgml 7 Jun 2007 19:19:56 - 1.126 --- doc/src/sgml/config.sgml 19 Jun 2007 14:24:31 - *** *** 1565,1570 --- 1565,1608 + + checkpoint_smoothing (floating point) + +checkpoint_smoothing configuration parameter + + + + Specifies the target length of checkpoints, as a fraction of + the checkpoint interval. The default is 0.3. + + This parameter can only be set in the postgresql.conf + file or on the server command line. + + + + + checkpoint_rate (floating point) + +checkpoint_rate configuration parameter + + + + Specifies the minimum I/O rate used to flush dirty buffers during a + checkpoint, when there's not many dirty buffers in the buffer cache. + The default is 512 KB/s. + + Note: the accuracy of this setting depends on + bgwriter_delaypostgresql.conf + file or on the server command line. + + + + checkpoint_warning (integer) Index: doc/src/sgml/wal.sgml === RCS file: /home/hlinnaka/pgcvsrepository/pgsql/doc/src/sgml/wal.sgml,v retrieving revision 1.43 diff -c -r1.43 wal.sgml *** doc/src/sgml/wal.sgml 31 Jan 2007 20:56:19 - 1.43 --- doc/src/sgml/wal.sgml 19 Jun 2007 14:26:45 - *** *** 217,225 There will be at least one WAL segment file, and will normally not be more than 2 * checkpoint_segments + 1 !files. Each segment file is normally 16 MB (though this size can be altered when building the server). You can use this to estimate space requirements for WAL. 
Ordinarily, when old log segment files are no longer needed, they --- 217,245 +If there is a lot of dirty buffers in the buffer cache, flushing them +all at checkpoint will cause a heavy burst of I/O that can disrupt other +activity in the system. To avoid that, the checkpoint I/O can be distributed +over a longer period of time, defined with +checkpoint_smoothing. It's given as a fraction of the +checkpoint interval, as defined by checkpoint_timeout +and checkpoint_segments. The WAL segment consumption +and elapsed time is monitored and the I/O rate is adjusted during +checkpoint so that it's finished when the given fraction of elapsed time +or WAL segments has passed, whichever is sooner. However, that could lead +to unnecessarily prolonged checkpoints when there's not many dirty buffers +in the cache. To avoid that, checkpoint_rate can be used