Re: [HACKERS] Bgwriter strategies
"Pavan Deolasee" <[EMAIL PROTECTED]> writes: > I think you are assuming that the next write of the same block won't > use another OS cache block. I doubt if thats the way writes are handled > by the kernel. Each write would typically end up being queued up in the > kernel > where each write will have its own copy of the block to the written. Isn't > it ? A kernel that worked like that would have a problem doing read(), ie, it'd have to search to find the latest version of the block. So I'd expect that most systems would prefer to keep only one in-memory copy of any given block and overwrite it at write() time. No sane kernel designer will optimize write() at the expense of read() performance, especially when you consider that a design as above really pessimizes write() too --- it does more I/O than is necessary when the same block is modified repeatedly in a short time. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Bgwriter strategies
Pavan Deolasee wrote:
> On 7/11/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
>> I was able to reproduce the phenomenon with a simple C program that
>> writes 8k blocks in random order to a fixed size file. I've attached
>> it along with output of running it on my test server. The output
>> shows how the writes start to periodically block after a while. I was
>> able to reproduce the problem on my laptop as well. Can anyone
>> explain what's going on?
>
> I think you are assuming that the next write of the same block won't
> use another OS cache block. I doubt if that's the way writes are
> handled by the kernel. Each write would typically end up being queued
> up in the kernel, where each write will have its own copy of the block
> to be written. Isn't it?

I don't think so -- at least not on Linux. See
https://ols2006.108.redhat.com/2007/Reprints/zijlstra-Reprint.pdf
where he talks about a patch to the page cache. He describes the
current page cache there; each page is kept on a tree, so a second
write to the same page would "overwrite" the page of the original
write.

--
Alvaro Herrera      http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Women are like slings: the more resistance they have, the farther you
can get with them"                  (Jonas Nightingale, Leap of Faith)
Re: [HACKERS] Bgwriter strategies
On 7/11/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
> I was able to reproduce the phenomenon with a simple C program that
> writes 8k blocks in random order to a fixed size file. I've attached
> it along with output of running it on my test server. The output
> shows how the writes start to periodically block after a while. I was
> able to reproduce the problem on my laptop as well. Can anyone explain
> what's going on?

I think you are assuming that the next write of the same block won't
use another OS cache block. I doubt if that's the way writes are
handled by the kernel. Each write would typically end up being queued
up in the kernel, where each write will have its own copy of the block
to be written. Isn't it?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Re: [HACKERS] Bgwriter strategies
In the last couple of days, I've been running a lot of DBT-2 tests and
smaller microbenchmarks with different bgwriter settings and
experimental patches, but I have not been able to produce a repeatable
test case where any of the bgwriter configurations performs better than
having no bgwriter at all.

I encountered a strange phenomenon that I don't understand. I ran a
small test case of DELETEs in random order, using an index, on a ~300MB
table, with shared_buffers smaller than that. I expected the test to be
dominated by the speed at which postgres can swap pages in and out of
the shared buffer cache, but surprisingly it starts to block on write
I/O, even though the table fits completely in OS cache.

I was able to reproduce the phenomenon with a simple C program that
writes 8k blocks in random order to a fixed size file. I've attached it
along with output of running it on my test server. The output shows how
the writes start to periodically block after a while. I was able to
reproduce the problem on my laptop as well. Can anyone explain what's
going on?

Anyone out there have a repeatable test case where bgwriter helps?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>

int
main(int argc, char **argv)
{
	int		fd;
	off_t	len;
	char	buf[8192];
	int		i;
	int		size;
	struct timeval begin_t;

	if (argc != 3)
	{
		printf("Usage: writetest <filename> <size in MB>\n");
		exit(1);
	}

	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
	if (fd == -1)
	{
		perror(NULL);
		exit(1);
	}

	size = atoi(argv[2]) * 1024 * 1024;

	/* fill the file sequentially up to the requested size */
	for (i = 0; i < size;)
		i += write(fd, buf, sizeof(buf));
	len = i;
	fsync(fd);

	gettimeofday(&begin_t, NULL);

	/* rewrite 8k blocks at random offsets, timing each batch of 4 */
	for (i = 0; i < 1000; i++)
	{
		lseek(fd, (random() % (len / sizeof(buf))) * sizeof(buf), SEEK_SET);
		write(fd, buf, sizeof(buf));

		if (i % 4 == 0)
		{
			struct timeval t;
			long	msecs;

			gettimeofday(&t, NULL);
			msecs = (t.tv_sec - begin_t.tv_sec) * 1000 +
				(t.tv_usec - begin_t.tv_usec) / 1000;
			printf("%d blocks written, time=%ld ms\n", i, msecs);
			begin_t = t;
		}
	}
	return 0;
}

./writetest /mnt/data/writetest-data 80
0 blocks written, time=0 ms
4 blocks written, time=251 ms
8 blocks written, time=241 ms
12 blocks written, time=241 ms
16 blocks written, time=241 ms
20 blocks written, time=242 ms
24 blocks written, time=242 ms
28 blocks written, time=241 ms
32 blocks written, time=241 ms
36 blocks written, time=242 ms
40 blocks written, time=241 ms
44 blocks written, time=241 ms
48 blocks written, time=241 ms
52 blocks written, time=242 ms
56 blocks written, time=241 ms
60 blocks written, time=241 ms
64 blocks written, time=242 ms
68 blocks written, time=242 ms
72 blocks written, time=242 ms
76 blocks written, time=241 ms
80 blocks written, time=242 ms
84 blocks written, time=4579 ms
88 blocks written, time=244 ms
92 blocks written, time=242 ms
96 blocks written, time=4752 ms
100 blocks written, time=241 ms
104 blocks written, time=4618 ms
108 blocks written, time=242 ms
112 blocks written, time=4614 ms
116 blocks written, time=246 ms
120 blocks written, time=243 ms
124 blocks written, time=4619 ms
128 blocks written, time=242 ms
132 blocks written, time=242 ms
136 blocks written, time=4605 ms
140 blocks written, time=242 ms
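The program takes a file name and a size in megabytes. A hypothetical
way to build and run it (compiler flags are illustrative only):

gcc -O2 -o writetest writetest.c
./writetest /tmp/writetest-data 80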
Re: [HACKERS] Bgwriter strategies
On Fri, 2007-07-06 at 10:55 +0100, Heikki Linnakangas wrote:
> We need to get the requirements straight.
>
> One goal of bgwriter is clearly to keep just enough buffers clean in
> front of the clock hand so that backends don't need to do writes
> themselves until the next bgwriter iteration. But not any more than
> that, otherwise we might end up doing more writes than necessary if
> some of the buffers are redirtied.

The purpose of the WAL/shared buffer cache is to avoid having to write
all of the data blocks touched by a transaction to disk before end of
transaction, which would otherwise increase request response time. That
purpose is only fulfilled iff using the shared buffer cache does not
require us to write out someone else's dirty buffers while avoiding our
own. The bgwriter exists specifically to clean the dirty buffers, so
that users do not have to clean theirs or anybody else's dirty buffers.

> To deal with bursty workloads, for example a batch of 2 GB worth of
> inserts coming in every 10 minutes, it seems we want to keep doing a
> little bit of cleaning even when the system is idle, to prepare for
> the next burst. The idea is to smoothen the physical I/O bursts; if we
> don't clean the dirty buffers left over from the previous burst during
> the idle period, the I/O system will be bottlenecked during the
> bursts, and sit idle otherwise.

In short, bursty workloads are the normal situation. When capacity is
not saturated, the bgwriter can utilise the additional capacity to
reduce statement response times.

It is standard industry practice to avoid running a system at peak
throughput for long periods of time, so DBT-2 does not represent a
normal situation. This is because response times are only predictable
on a non-saturated system, and most apps have some implicit or explicit
service level objective. However, the server needs to cope with periods
of saturation, so it must be able to perform efficiently during those
times.

So I see two modes of operation:
i) dirty block write offload when capacity is available
ii) efficient operation when the server is saturated

DBT-2 represents only the second mode of operation; the two modes are
equally important, yet mode i) is the ideal situation.

> To strike a balance between cleaning buffers ahead of possible bursts
> in the future and not doing unnecessary I/O when no such bursts come,
> I think a reasonable strategy is to write buffers with usage_count=0
> at a slow pace when there's no buffer allocations happening.

Agreed.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bgwriter strategies
On Thu, 2007-07-05 at 21:50 +0100, Heikki Linnakangas wrote:
> All test runs were also patched to count the # of buffer allocations,
> and the # of buffer flushes performed by bgwriter and backends. Here's
> those results (I hope the indentation gets through properly):
>
>                        imola-336   imola-337   imola-340
> writes by checkpoint       38302       30410       39529
> writes by bgwriter        350113     2205782     1418672
> writes by backends       1834333      265755      787633
> writes total             2222748     2501947     2245834
> allocations              2683170     2657896     2699974

These results suggest that even the minimum bgwriter_delay of 10ms may
be too large for these workloads: whatever the strategy used, the
bgwriter spends too much time sleeping when it should be working.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
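For reference, the bgwriter knobs being discussed, as a postgresql.conf
sketch; the values are illustrative, not recommendations:

# 8.2-era background writer settings
bgwriter_delay = 10ms           # sleep between rounds; 10ms is the minimum
bgwriter_lru_percent = 10.0     # share of buffers scanned ahead of the clock hand
bgwriter_lru_maxpages = 200     # cap on LRU-side writes per round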
Re: [HACKERS] Bgwriter strategies
On Fri, 6 Jul 2007, Heikki Linnakangas wrote:
> To strike a balance between cleaning buffers ahead of possible bursts
> in the future and not doing unnecessary I/O when no such bursts come,
> I think a reasonable strategy is to write buffers with usage_count=0
> at a slow pace when there's no buffer allocations happening.

One idea I had there was to always scan max_pages buffers each time,
even if there were fewer allocations than needed for that. That number
is usually relatively small compared to the size of the buffer cache,
so it would creep through the buffer cache at a bounded pace during
idle periods (see the sketch below). It's actually nice to watch the
LRU cleaner get so far ahead during idle spots that it catches the
strategy point, so that when the next burst comes, it doesn't have to
do anything until there's a full lap by the clock sweep.

Anyway, completely with you on the rest of this post; everything you
said matches the direction I've been trudging toward.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
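A minimal sketch of that scan-floor idea (names here are hypothetical,
not from any actual patch):

static int
buffers_to_scan(int allocations_since_last, int max_pages)
{
	/*
	 * Scan at least max_pages per round even when the system is idle,
	 * so the cleaner creeps through the buffer cache at a bounded pace
	 * and can catch up to the strategy point during quiet periods.
	 */
	return (allocations_since_last > max_pages) ?
		allocations_since_last : max_pages;
}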
Re: [HACKERS] Bgwriter strategies
"Tom Lane" <[EMAIL PROTECTED]> writes: >> That would be overly aggressive on a workload that's steady on average, >> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 >> 100. You'd end up writing ~100 pages on every bgwriter round, but you >> only need an average of 20 pages per round. > > No, you wouldn't be *writing* that many, you'd only be keeping that many > *clean*; which only costs more work if any of them get re-dirtied > between writing and use. Which is a fairly small probability if we're > talking about a small difference in the number of buffers to keep clean. > So I think the average number of writes is hardly different, it's just > that the backends are far less likely to have to do any of them. Well Postgres's hint bits tends to redirty pages precisely once at just about the time when they're ready to be paged out. But I think there are things we can do to tackle that head-on. Bgwriter could try to set hint bits before cleaning these pages for example. Or we could elect in selected circumstances not to write out a page that is hint-bit-dirty-only. Or some combination of those options depending on the circumstances. Figuring out the circumstances is the hard part. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bgwriter strategies
On Fri, 6 Jul 2007, Heikki Linnakangas wrote:
> I've been running these tests with bgwriter_delay of 10 ms, which is
> probably too aggressive.

Even on relatively high-end hardware, I've found it hard to get good
results out of the BGW with the delay under 50ms--particularly when
trying to do some more complicated smoothing.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [HACKERS] Bgwriter strategies
Heikki Linnakangas wrote:
> I scheduled a test with the moving average method as well, we'll see
> how that fares.

Not too well :(. Strange. The total # of writes is on par with having
bgwriter disabled, but the physical I/O graphs show more I/O (on par
with the aggressive bgwriter), and the response times are higher.

I just noticed that on the tests with the moving average, or the simple
"just enough" method, there's a small bump in the CPU usage during the
ramp-up period. I believe that's because bgwriter scans through the
whole buffer cache without finding enough buffers to clean. I ran some
tests earlier with unpatched bgwriter tuned to the maximum, and it used
~10% of CPU, which is the same level that the bump rises to.
Unfortunately I haven't been taking pg_buffercache snapshots until
after the ramp-up; it should've shown up there.

I've been running these tests with bgwriter_delay of 10 ms, which is
probably too aggressive. I used that to test the idea of starting the
scan from where it left off, instead of always starting from the clock
hand.

If someone wants to have a look, the # of writes are collected to a
separate log file in /server/buf_alloc_stats.log. There's no link to it
from the html files. There's also summary snapshots of pg_buffercache
every 30 seconds in /server/bufcache.log.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bgwriter strategies
On Fri, 6 Jul 2007, Tom Lane wrote:
> The problem is that it'd be very hard to track how far ahead of the
> recycling sweep hand we are, because that number has to be measured in
> usage-count-zero pages. I see no good way to know how many of the
> pages we scanned before have been touched (and given nonzero usage
> counts) unless we rescan them.

I've actually been working on how to address that specific problem
without expressly tracking the contents of the buffer cache. When the
background writer is called, it finds out how many buffers were
allocated and how far the sweep point moved since the last call. From
that, you can calculate how many buffers on average need to be scanned
per allocation, which tells you something about the recently
encountered density of 0-usage-count buffers. My thought was to use
that as an input to the computation for how far ahead to stay.

> I've been doing moving averages for years and years, and I find that
> the multiplication approach works at least as well as explicitly
> storing the last K observations. It takes a lot less storage and
> arithmetic too.

I was simplifying the description just to comment on the range for K; I
was using a multiplication approach for the computation.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
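A sketch of that density estimate under assumed names (this is one
reading of the description above, not code from the actual patch):

/* smoothed buffers-scanned-per-allocation; starts at the ideal 1.0 */
static double scan_per_alloc = 1.0;

static void
update_density(int sweep_moved, int allocations)
{
	if (allocations > 0)
	{
		double observed = (double) sweep_moved / allocations;

		/* exponential moving average to smooth per-round noise */
		scan_per_alloc += (observed - scan_per_alloc) * 0.1;
	}
}

/*
 * How far ahead of the sweep to stay, in total buffers: the expected
 * allocation demand scaled by how hard usage_count=0 buffers are to
 * find near the sweep point.
 */
static int
lookahead_target(int expected_allocations)
{
	return (int) (expected_allocations * scan_per_alloc);
}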
Re: [HACKERS] Bgwriter strategies
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>>                        imola-336   imola-337   imola-340
>> writes by checkpoint       38302       30410       39529
>> writes by bgwriter        350113     2205782     1418672
>> writes by backends       1834333      265755      787633
>> writes total             2222748     2501947     2245834
>> allocations              2683170     2657896     2699974
>
>> It looks like Tom's idea is not a winner; it leads to more writes
>> than necessary.
>
> The incremental number of writes is not that large; only about 10%
> more. The interesting thing is that those "extra" writes must
> represent buffers that were re-touched after their usage_count went to
> zero, but before they could be recycled by the clock sweep. While
> you'd certainly expect some of that, I'm surprised it is as much as
> 10%. Maybe we need to play with the buffer allocation strategy some
> more.
>
> The very small difference in NOTPM among the three runs says that
> either this whole area is unimportant, or DBT2 isn't a good test case
> for it; or maybe that there's something wrong with the patches?

The small difference in NOTPM is because the I/O still wasn't
saturated, even with 10% extra writes. I ran more tests with a higher
number of warehouses, and the extra writes start to show up in the
response times. See tests 341-344:
http://community.enterprisedb.com/bgwriter/

I scheduled a test with the moving average method as well; we'll see
how that fares.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bgwriter strategies
Greg Smith <[EMAIL PROTECTED]> writes:
> On Thu, 5 Jul 2007, Tom Lane wrote:
>> This would give us a safety margin such that buffers_to_clean is not
>> less than the largest demand observed in the last 100 iterations...
>> and it takes quite a while for the memory of a demand spike to be
>> forgotten completely.
>
> If you tested this strategy even on a steady load, I'd expect you'll
> find there are large spikes in allocations during the occasional
> period where everything is just right to pull a bunch of buffers in,
> and if you let that max linger around for 100 iterations you'll write
> a large number of buffers more than you need.

You seem to have the same misunderstanding as Heikki. What I was
proposing was not a target for how many to *write* on each cycle, but a
target for how far ahead of the clock sweep hand to look. If say the
target is 100, we'll scan forward from the sweep hand until we have
seen 100 clean zero-usage-count buffers; but we only have to write
whichever of them weren't already clean.

This is actually not so different from my previous proposal, in that
the idea is to keep ahead of the sweep by a particular distance. The
previous idea was that that distance was "all the buffers", whereas
this idea is "a moving average of the actual demand rate". The excess
writes created by the previous proposal were because of the probability
of re-dirtying buffers between cleaning and recycling. We reduce that
probability by not trying to keep so many of 'em clean. But I think
that we can meet the goal of having backends do hardly any of the
writes with a relatively small increase in the target distance, and
thus a relatively small differential in the number of wasted writes.
Heikki's test showed that Itagaki-san's patch wasn't doing that well in
eliminating writes by backends, so we need a more aggressive target for
how many buffers to keep clean than it has; but I think not a huge
amount more, and thus my proposal.

BTW, somewhere upthread you suggested combining the target-distance
idea with the idea that the cleaning work use a separate sweep hand and
thus not re-examine the same buffers on every bgwriter iteration. The
problem is that it'd be very hard to track how far ahead of the
recycling sweep hand we are, because that number has to be measured in
usage-count-zero pages. I see no good way to know how many of the pages
we scanned before have been touched (and given nonzero usage counts)
unless we rescan them. We could approximate it maybe: try to keep the
cleaning hand N total buffers ahead of the recycling hand, where N is
the target number of clean usage-count-zero buffers scaled by the
average fraction of count-zero buffers (which we can track a moving
average of as we advance the recycling hand). However, I'm not sure the
complexity and uncertainty is worth it. What I took away from Heikki's
experiment is that trying to stay a large distance in front of the
recycle sweep isn't actually so useful, because you get too many wasted
writes due to re-dirtying. So restructuring the algorithm to make it
cheap CPU-wise to stay well ahead is not so useful either.

> I ended up settling on max(moving average of the last 16, most recent
> allocation), and that seemed to work pretty well without being too
> wasteful from excessive writes.

I've been doing moving averages for years and years, and I find that
the multiplication approach works at least as well as explicitly
storing the last K observations. It takes a lot less storage and
arithmetic too.

			regards, tom lane
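A one-accumulator sketch of the "multiplication approach" Tom
describes, versus storing the last K observations; K = 16 here is
arbitrary:

static double moving_avg = 0.0;

static void
observe(double x)
{
	/*
	 * Equivalent in effect to averaging roughly the last K samples,
	 * but O(1) space and one multiply-add per observation.
	 */
	moving_avg += (x - moving_avg) / 16.0;
}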
Re: [HACKERS] Bgwriter strategies
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> buffers_to_clean = Max(buffers_used * 1.1,
>>>                        buffers_to_clean * 0.999);
>
>> That would be overly aggressive on a workload that's steady on
>> average, but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0
>> 100 0 0 0 0 100. You'd end up writing ~100 pages on every bgwriter
>> round, but you only need an average of 20 pages per round.
>
> No, you wouldn't be *writing* that many, you'd only be keeping that
> many *clean*; which only costs more work if any of them get re-dirtied
> between writing and use. Which is a fairly small probability if we're
> talking about a small difference in the number of buffers to keep
> clean. So I think the average number of writes is hardly different,
> it's just that the backends are far less likely to have to do any of
> them.

Ah, OK, I misunderstood what you were proposing. Yes, that seems like a
good algorithm then.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bgwriter strategies
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> buffers_to_clean = Max(buffers_used * 1.1,
>>                        buffers_to_clean * 0.999);
>
> That would be overly aggressive on a workload that's steady on
> average, but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0
> 100 0 0 0 0 100. You'd end up writing ~100 pages on every bgwriter
> round, but you only need an average of 20 pages per round.

No, you wouldn't be *writing* that many, you'd only be keeping that
many *clean*; which only costs more work if any of them get re-dirtied
between writing and use. Which is a fairly small probability if we're
talking about a small difference in the number of buffers to keep
clean. So I think the average number of writes is hardly different,
it's just that the backends are far less likely to have to do any of
them.

			regards, tom lane
Re: [HACKERS] Bgwriter strategies
Greg Smith wrote:
> On Fri, 6 Jul 2007, Heikki Linnakangas wrote:
>> There's something wrong with that. The number of buffer allocations
>> shouldn't depend on the bgwriter strategy at all.
>
> I was seeing a smaller (closer to 5%) increase in buffer allocations
> switching from no background writer to using the stock one before I
> did any code tinkering, so it didn't strike me as odd. I believe it's
> related to the TPS numbers. When there are more transactions being
> executed per unit time, it's more likely the useful blocks will stay
> in memory because their usage_count is getting tickled faster, and
> therefore there's less of the most useful blocks being swapped out
> only to be re-allocated again later.

Did you run the test for a constant number of transactions? If you did,
the access pattern and the number of allocations should be *exactly*
the same with 1 client, assuming the initial state and the seed used
for the random number generator are the same.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bgwriter strategies
On Fri, 6 Jul 2007, Heikki Linnakangas wrote:
> There's something wrong with that. The number of buffer allocations
> shouldn't depend on the bgwriter strategy at all.

I was seeing a smaller (closer to 5%) increase in buffer allocations
switching from no background writer to using the stock one before I did
any code tinkering, so it didn't strike me as odd. I believe it's
related to the TPS numbers. When there are more transactions being
executed per unit time, it's more likely the useful blocks will stay in
memory because their usage_count is getting tickled faster, and
therefore there's less of the most useful blocks being swapped out only
to be re-allocated again later.

Since the bad bgwriter tunings reduce TPS, I believe that's the
mechanism by which there are more allocations needed. I'll try to keep
an eye on this now that you've brought it up.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [HACKERS] Bgwriter strategies
Greg Smith wrote:
> As you can see, I achieved the goal of almost never having a backend
> write its own buffer, so yay for that. That's the only good thing I
> can say about it though. The TPS results take a moderate dive, and
> there's about 10% more buffer allocations. The big and obvious issue
> is that I'm writing almost 75% more buffers this way--way worse even
> than the 10% extra overhead Heikki was seeing. But since I've gone out
> of my way to find a worst-case for this code, I consider mission
> accomplished there.

There's something wrong with that. The number of buffer allocations
shouldn't depend on the bgwriter strategy at all.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Bgwriter strategies
I just got my own first set of useful tests of the new "remember where
you last scanned to" BGW implementation suggested by Tom. What I did
was keep the existing % to scan, but cut back the number to scan when
so close to a complete lap ahead of the strategy point that I'd cross
it if I scanned that much (see the sketch below). So when the system
was idle, it would very quickly catch up with the strategy point, but
if the %/max numbers were low it was possible for it to fall behind.

My workload was just the UPDATE statement out of pgbench with a
database of scale 25 (~400MB, picked so most operations were in
memory), which pushes lots of things in and out of the buffer cache as
fast as possible. Here's some data with no background writer at all:

clients   tps   buf_clean   buf_backend   buf_alloc
      1  1340           0         72554       96846
      2  1421           0         73969       88879
      3  1418           0         71452       86339
      4  1344           0         75184       90187
      8  1361           0         73063       88099
     15  1348           0         71861       86923

And here's what I got with the new approach, using 10% for the scan
percentage and a maximum of 200 buffers written out. I picked those
numbers after some experimentation because they were the first I found
where the background writer was almost always riding right behind the
strategy point; with lower numbers, when the background writer woke up
it often found it had already fallen behind the strategy point and had
to start cleaning forward the old way instead, which wasn't what I
wanted to test.

clients   tps   buf_clean   buf_backend   buf_alloc
      1  1261      122917           150      105655
      2  1186      126663            26       97586
      3  1154      127780            21       98077
      4  1181      127685            19       98068
      8  1076      128597             2       98229
     15  1065      128399             5       98143

As you can see, I achieved the goal of almost never having a backend
write its own buffer, so yay for that. That's the only good thing I can
say about it though. The TPS results take a moderate dive, and there's
about 10% more buffer allocations. The big and obvious issue is that
I'm writing almost 75% more buffers this way--way worse even than the
10% extra overhead Heikki was seeing. But since I've gone out of my way
to find a worst-case for this code, I consider mission accomplished
there.

Anyway, will have more detailed reports to post after I collect some
more data; for now I just wanted to join Heikki in confirming that the
strategy of trying to get the LRU cleaner to ride right behind the
strategy point can really waste a whole lot of writes.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
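A sketch of the "cut back when close to a full lap" clamp described
above, under hypothetical names; the actual patch may differ:

static int
clamped_scan_target(int pct_target, int cleaner_pos,
					int strategy_pos, int nbuffers)
{
	/*
	 * Forward distance from the cleaning hand to the strategy point;
	 * scanning further than this would mean lapping the clock sweep.
	 */
	int gap = (strategy_pos - cleaner_pos + nbuffers) % nbuffers;

	return (pct_target < gap) ? pct_target : gap;
}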
Re: [HACKERS] Bgwriter strategies
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>>                        imola-336   imola-337   imola-340
>> writes by checkpoint       38302       30410       39529
>> writes by bgwriter        350113     2205782     1418672
>> writes by backends       1834333      265755      787633
>> writes total             2222748     2501947     2245834
>> allocations              2683170     2657896     2699974
>
>> It looks like Tom's idea is not a winner; it leads to more writes
>> than necessary.
>
> The incremental number of writes is not that large; only about 10%
> more. The interesting thing is that those "extra" writes must
> represent buffers that were re-touched after their usage_count went to
> zero, but before they could be recycled by the clock sweep. While
> you'd certainly expect some of that, I'm surprised it is as much as
> 10%. Maybe we need to play with the buffer allocation strategy some
> more.
>
> The very small difference in NOTPM among the three runs says that
> either this whole area is unimportant, or DBT2 isn't a good test case
> for it; or maybe that there's something wrong with the patches?
>
>> On imola-340, there's still a significant amount of backend writes.
>> I'm still not sure what we should be aiming at. Is 0 backend writes
>> our goal?
>
> Well, the lower the better, but not at the cost of a very large
> increase in total writes.
>
>> Imola-340 was with a patch along the lines of Itagaki's original
>> patch, ensuring that there's as many clean pages in front of the
>> clock head as were consumed by backends since the last bgwriter
>> iteration.
>
> This seems intuitively wrong, since in the presence of bursty request
> behavior it'll constantly be getting caught short of buffers. I think
> you need a safety margin and a moving-average decay factor. Possibly
> something like
>
>     buffers_to_clean = Max(buffers_used * 1.1,
>                            buffers_to_clean * 0.999);
>
> where buffers_used is the current observation of demand. This would
> give us a safety margin such that buffers_to_clean is not less than
> the largest demand observed in the last 100 iterations (0.999 ^ 100 is
> about 0.90, cancelling out the initial 10% safety margin), and it
> takes quite a while for the memory of a demand spike to be forgotten
> completely.

That would be overly aggressive on a workload that's steady on average,
but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0
0 100. You'd end up writing ~100 pages on every bgwriter round, but you
only need an average of 20 pages per round. That'd be effectively the
same as keeping all buffers with usage_count=0 clean.

BTW, I believe that kind of workload is actually very common. That's
what you get if one transaction causes, say, 10-100 buffer allocations,
and you execute one such transaction every few seconds.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
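Heikki's burst example is easy to check numerically; this standalone
sketch (not PostgreSQL code) applies Tom's formula to the demand
pattern 0 0 0 0 100 repeating:

#include <stdio.h>

int
main(void)
{
	double	target = 0.0;
	int		round;

	for (round = 0; round < 1000; round++)
	{
		int		demand = (round % 5 == 4) ? 100 : 0;
		double	decayed = target * 0.999;
		double	fresh = demand * 1.1;

		target = (fresh > decayed) ? fresh : decayed;
	}

	/*
	 * Settles near 110: the 0.999 decay forgets almost nothing across
	 * four idle rounds, so the target tracks the burst peak rather than
	 * the average demand of 20 pages per round.
	 */
	printf("steady-state target ~ %.1f\n", target);
	return 0;
}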
Re: [HACKERS] Bgwriter strategies
Greg Smith wrote:
> On Thu, 5 Jul 2007, Heikki Linnakangas wrote:
>> It looks like Tom's idea is not a winner; it leads to more writes
>> than necessary.
>
> What I came away with as the core of Tom's idea is that the
> cleaning/LRU writer shouldn't ever scan the same section of the buffer
> cache twice, because anything that resulted in a new dirty buffer will
> be unwritable by it until the clock sweep passes over it. I never took
> that to mean the idea necessarily had to be implemented as "trying to
> aggressively keep all pages with usage_count=0 clean".
>
> I've been making slow progress on this myself, and the question I've
> been trying to answer is whether this fundamental idea really matters
> or not. One clear benefit of that alternate implementation is that it
> should allow setting a lower value for the interval without being as
> concerned that you're wasting resources by doing so, which I've found
> to be a problem with the current implementation--it will consume a lot
> of CPU scanning the same section right now if you lower it too much.

Yes; in fact, ignoring the CPU overhead of scanning the same section
over and over again, Tom's proposal is the same as setting both
bgwriter_lru_* settings all the way up to the max. I ran a DBT-2 test
like that as well, and the # of writes was indeed the same, just with
higher CPU usage. It's clear that scanning the same section over and
over again has been a waste of time in previous releases.

As a further data point, I constructed a smaller test case that
performs random DELETEs on a table using an index. I varied the # of
shared_buffers, and ran the test with bgwriter disabled, or tuned all
the way up to the maximum. Here's the results from that:

 shared_buffers | writes (off) | writes (max) | writes_ratio
----------------+--------------+--------------+-------------------
           2560 |        86936 |        88023 |  1.01250345081439
           5120 |        81207 |        84551 |  1.04117871612053
           7680 |        75367 |        80603 |  1.06947337694216
          10240 |        69772 |        74533 |  1.06823654187926
          12800 |        64281 |        69237 |  1.07709898725907
          15360 |        58515 |        64735 |  1.10629753054772
          17920 |        53231 |        58635 |  1.10151979109917
          20480 |        48128 |        54403 |  1.13038148271277
          23040 |        43087 |        49949 |  1.15925917330053
          25600 |        39062 |        46477 |  1.1898264297783
          28160 |        35391 |        43739 |  1.23587917832217
          30720 |        32713 |        37480 |  1.14572188426619
          33280 |        31634 |        31677 |  1.00135929695897
          35840 |        31668 |        31717 |  1.00154730327144
          38400 |        31696 |        31693 |  0.999905350832913
          40960 |        31685 |        31730 |  1.00142023039293
          43520 |        31694 |        31650 |  0.998611724616647
          46080 |        31661 |        31650 |  0.999652569407157

The first writes column is the # of writes with bgwriter disabled, the
second is with the aggressive bgwriter. Once shared_buffers exceeds the
working set of the test (somewhere above 30720 pages, judging by where
the ratio levels off), the whole table fits in cache and the bgwriter
strategy makes no difference.

> As far as your results, first off I'm really glad to see someone else
> comparing checkpoint/backend/bgwriter writes the same way I've been
> doing, so I finally have someone else's results to compare against. I
> expect that the optimal approach here is a hybrid one that structures
> scanning the buffer cache the new way Tom suggests, but limits the
> number of writes to "just enough". I happen to be fond of the "just
> enough" computation based on a weighted moving average I wrote before,
> but there's certainly room for multiple implementations of that part
> of the code to evolve.

We need to get the requirements straight.

One goal of bgwriter is clearly to keep just enough buffers clean in
front of the clock hand that backends don't need to do writes
themselves until the next bgwriter iteration. But not any more than
that, otherwise we might end up doing more writes than necessary if
some of the buffers are redirtied.

To deal with bursty workloads, for example a batch of 2 GB worth of
inserts coming in every 10 minutes, it seems we want to keep doing a
little bit of cleaning even when the system is idle, to prepare for the
next burst. The idea is to smoothen the physical I/O bursts; if we
don't clean the dirty buffers left over from the previous burst during
the idle period, the I/O system will be bottlenecked during the bursts,
and sit idle otherwise.

To strike a balance between cleaning buffers ahead of possible bursts
in the future and not doing unnecessary I/O when no such bursts come, I
think a reasonable strategy is to write buffers with usage_count=0 at a
slow pace when there's no buffer allocations happening (a sketch
follows below). To smoothen the small variations on a relatively steady
workload, the weighted average sounds good.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
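A minimal sketch of that idle-trickle strategy, under assumed names
(the pace constant is a placeholder, not a tuned value):

#include <stdio.h>

/* stand-in for the routine that writes out usage_count=0 buffers */
static void
clean_zero_usage_buffers(int npages, const char *why)
{
	printf("cleaning %d buffers (%s)\n", npages, why);
}

#define IDLE_TRICKLE_PAGES 5	/* deliberately slow idle pace */

static void
bgwriter_round(int allocations_since_last)
{
	if (allocations_since_last == 0)
		clean_zero_usage_buffers(IDLE_TRICKLE_PAGES,
								 "idle trickle, preparing for next burst");
	else
		clean_zero_usage_buffers(allocations_since_last,
								 "keeping up with demand");
}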
Re: [HACKERS] Bgwriter strategies
On Thu, 5 Jul 2007, Tom Lane wrote:
> This would give us a safety margin such that buffers_to_clean is not
> less than the largest demand observed in the last 100 iterations...
> and it takes quite a while for the memory of a demand spike to be
> forgotten completely.

If you tested this strategy even on a steady load, I'd expect you'll
find there are large spikes in allocations during the occasional
periods where everything is just right to pull a bunch of buffers in,
and if you let that max linger around for 100 iterations you'll write a
large number of buffers more than you need. That's what I saw when I
tried to remember too much information about allocation history in the
version of the auto-LRU tuner I worked on. For example, with 32000
buffers, with pgbench trying to UPDATE as fast as possible, I sometimes
hit 1500 allocations in an interval, but the steady-state allocation
level was closer to 500.

I ended up settling on max(moving average of the last 16, most recent
allocation), and that seemed to work pretty well without being too
wasteful from excessive writes (sketched below). Playing with multiples
of 2, 8 was definitely not enough memory to smooth usefully, while 32
seemed a little sluggish on the entry and wasteful on the exit ends. At
the default interval, 16 iterations is looking back at the previous 3.2
seconds. I have a feeling the proper tuning for this should be
time-based, where you would decide how far back to consider looking and
compute the iterations from that.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
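A sketch of that estimator (names here are invented, not from the
actual tuner code): a ~16-round moving average of allocations, floored
by the most recent observation so a fresh spike is reacted to
immediately:

static double alloc_avg = 0.0;

static int
buffers_to_clean(int recent_allocations)
{
	/* weighted moving average over roughly the last 16 rounds */
	alloc_avg += (recent_allocations - alloc_avg) / 16.0;

	return (recent_allocations > alloc_avg) ?
		recent_allocations : (int) alloc_avg;
}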
Re: [HACKERS] Bgwriter strategies
On Thu, 5 Jul 2007, Heikki Linnakangas wrote:
> It looks like Tom's idea is not a winner; it leads to more writes than
> necessary.

What I came away with as the core of Tom's idea is that the
cleaning/LRU writer shouldn't ever scan the same section of the buffer
cache twice, because anything that resulted in a new dirty buffer will
be unwritable by it until the clock sweep passes over it. I never took
that to mean the idea necessarily had to be implemented as "trying to
aggressively keep all pages with usage_count=0 clean".

I've been making slow progress on this myself, and the question I've
been trying to answer is whether this fundamental idea really matters
or not. One clear benefit of that alternate implementation is that it
should allow setting a lower value for the interval without being as
concerned that you're wasting resources by doing so, which I've found
to be a problem with the current implementation--it will consume a lot
of CPU scanning the same section right now if you lower it too much.

As far as your results, first off I'm really glad to see someone else
comparing checkpoint/backend/bgwriter writes the same way I've been
doing, so I finally have someone else's results to compare against. I
expect that the optimal approach here is a hybrid one that structures
scanning the buffer cache the new way Tom suggests, but limits the
number of writes to "just enough". I happen to be fond of the "just
enough" computation based on a weighted moving average I wrote before,
but there's certainly room for multiple implementations of that part of
the code to evolve.

--
* Greg Smith  [EMAIL PROTECTED]  http://www.gregsmith.com  Baltimore, MD
Re: [HACKERS] Bgwriter strategies
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>                        imola-336   imola-337   imola-340
> writes by checkpoint       38302       30410       39529
> writes by bgwriter        350113     2205782     1418672
> writes by backends       1834333      265755      787633
> writes total             2222748     2501947     2245834
> allocations              2683170     2657896     2699974

> It looks like Tom's idea is not a winner; it leads to more writes than
> necessary.

The incremental number of writes is not that large; only about 10%
more. The interesting thing is that those "extra" writes must represent
buffers that were re-touched after their usage_count went to zero, but
before they could be recycled by the clock sweep. While you'd certainly
expect some of that, I'm surprised it is as much as 10%. Maybe we need
to play with the buffer allocation strategy some more.

The very small difference in NOTPM among the three runs says that
either this whole area is unimportant, or DBT2 isn't a good test case
for it; or maybe that there's something wrong with the patches?

> On imola-340, there's still a significant amount of backend writes.
> I'm still not sure what we should be aiming at. Is 0 backend writes
> our goal?

Well, the lower the better, but not at the cost of a very large
increase in total writes.

> Imola-340 was with a patch along the lines of Itagaki's original
> patch, ensuring that there's as many clean pages in front of the clock
> head as were consumed by backends since the last bgwriter iteration.

This seems intuitively wrong, since in the presence of bursty request
behavior it'll constantly be getting caught short of buffers. I think
you need a safety margin and a moving-average decay factor. Possibly
something like

	buffers_to_clean = Max(buffers_used * 1.1,
	                       buffers_to_clean * 0.999);

where buffers_used is the current observation of demand. This would
give us a safety margin such that buffers_to_clean is not less than the
largest demand observed in the last 100 iterations (0.999 ^ 100 is
about 0.90, cancelling out the initial 10% safety margin), and it takes
quite a while for the memory of a demand spike to be forgotten
completely.

			regards, tom lane
[HACKERS] Bgwriter strategies
I ran some DBT-2 tests to compare different bgwriter strategies:
http://community.enterprisedb.com/bgwriter/

imola-336 was run with minimal bgwriter settings, so that most writes
are done by backends. imola-337 was patched with an implementation of
Tom's bgwriter idea, trying to aggressively keep all pages with
usage_count=0 clean. Imola-340 was with a patch along the lines of
Itagaki's original patch, ensuring that there's as many clean pages in
front of the clock head as were consumed by backends since the last
bgwriter iteration.

All test runs were also patched to count the # of buffer allocations,
and the # of buffer flushes performed by bgwriter and backends. Here's
those results (I hope the indentation gets through properly):

                       imola-336   imola-337   imola-340
writes by checkpoint       38302       30410       39529
writes by bgwriter        350113     2205782     1418672
writes by backends       1834333      265755      787633
writes total             2222748     2501947     2245834
allocations              2683170     2657896     2699974

It looks like Tom's idea is not a winner; it leads to more writes than
necessary. But the OS caches the writes, so let's look at the actual
I/O performed to be sure, from iostat:

http://community.enterprisedb.com/bgwriter/writes-336-337-340.jpg

The graph shows that on imola-337, there was indeed more write traffic
than on the other two test runs.

On imola-340, there's still a significant amount of backend writes. I'm
still not sure what we should be aiming at. Is 0 backend writes our
goal?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com