[HACKERS] Checkpoints and buffers that are hint-bit-dirty
When we checkpoint we write out all dirty buffers. But ISTM we don't really need to write out buffers which are dirty but which have an LSN older than the previous checkpoint. Those represent buffers which were dirtied by a non-wal-logged modification, ie, hint bit setting. The other non-wal-logged operations will sync the buffer themselves when they're done. I guess it doesn't really buy much, probably just a slight delay in writing out the page until bgwriter gets around to it. Conceivably you could have a hot buffer with many missing hint bits which will get written out on several checkpoints but how many of those can you have? And extending the checkpoint doesn't seem like much of a concern. On the other hand it wouldn't be hard to check would it? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
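A rough sketch of the check being suggested above (identifiers below are invented stand-ins, not the actual bufmgr code): during the checkpoint's sweep over the buffer pool, a dirty buffer whose page LSN predates the previous checkpoint's redo pointer can only have been dirtied by non-WAL-logged changes such as hint-bit setting, so it could be skipped and left for the bgwriter.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;              /* stand-in for XLogRecPtr */

typedef struct BufSketch
{
    bool  is_dirty;                  /* stand-in for the BM_DIRTY flag */
    lsn_t page_lsn;                  /* LSN stamped by the last WAL-logged change */
} BufSketch;

/*
 * Sketch only: decide whether a checkpoint must write this buffer.  If the
 * page LSN is older than the previous checkpoint's redo pointer, no
 * WAL-logged change has touched the page since that checkpoint, so any
 * dirtiness must be hint-bit-only and the write could be deferred.
 */
static bool
checkpoint_must_write(const BufSketch *buf, lsn_t prev_checkpoint_redo)
{
    if (!buf->is_dirty)
        return false;
    if (buf->page_lsn < prev_checkpoint_redo)
        return false;                /* hint-bit-dirty only: leave it for bgwriter */
    return true;
}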
Re: [HACKERS] usleep feature for pgbench
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Jan Wieck wrote: >> You mean as a second, optional argument? Good idea. >> >> us = microseconds >> ms = milliseconds >> s = seconds (default) >> >> \sleep {value|:variable} [us|ms|s] >> >> Is that okay with everyone? > I won't object, but is it really worth the trouble? If we don't get it right now, we'll be stuck with the API later. +1 for the definition suggested above. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bgwriter strategies
"Tom Lane" <[EMAIL PROTECTED]> writes: >> That would be overly aggressive on a workload that's steady on average, >> but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 >> 100. You'd end up writing ~100 pages on every bgwriter round, but you >> only need an average of 20 pages per round. > > No, you wouldn't be *writing* that many, you'd only be keeping that many > *clean*; which only costs more work if any of them get re-dirtied > between writing and use. Which is a fairly small probability if we're > talking about a small difference in the number of buffers to keep clean. > So I think the average number of writes is hardly different, it's just > that the backends are far less likely to have to do any of them. Well Postgres's hint bits tends to redirty pages precisely once at just about the time when they're ready to be paged out. But I think there are things we can do to tackle that head-on. Bgwriter could try to set hint bits before cleaning these pages for example. Or we could elect in selected circumstances not to write out a page that is hint-bit-dirty-only. Or some combination of those options depending on the circumstances. Figuring out the circumstances is the hard part. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Updated tsearch documentation
FYI, I have massively reorganized the text search documentation and it is getting closer to something I am happy with: http://momjian.us/expire/fulltext/HTML/textsearch.html -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Bgwriter strategies
On Fri, 6 Jul 2007, Heikki Linnakangas wrote: I've been running these test with bgwriter_delay of 10 ms, which is probably too aggressive. Even on relatively high-end hardware, I've found it hard to get good results out of the BGW with the delay under 50ms--particularly when trying to do some more complicated smoothing. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Still recommending daily vacuum...
Kevin Grittner wrote: > This all started with the question about whether the documentation should > say anything about vacuum schedules other than "enable autovacuum." > My point was that I have a use case where I think that a scheduled vacuum > will be better than leaving everything to autovacuum. I may not be the only > one, so I'm thinking the documentation should discuss where an explicit > schedule might be useful. Fair enough, you are correct that this scenario may be a useful one to describe in the docs. There are plans to add some sort of scheduling to autovacuum so that it chooses different settings based on time/date. It may be useful for you as well (though of course they are 8.4 material at best ...), and we may consider shrinking the docs (or at least reshaping them) a bit at that time. > The changes sound good, but I don't see the point of having any vacuum > activity during the work week on the big tables in the database I was > describing. Of course. -- Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4" "A wizard is never late, Frodo Baggins, nor is he early. He arrives precisely when he means to." (Gandalf, en LoTR FoTR) ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Still recommending daily vacuum...
>>> On Fri, Jul 6, 2007 at 2:19 PM, in message <[EMAIL PROTECTED]>, Alvaro Herrera <[EMAIL PROTECTED]> wrote: > Kevin Grittner wrote: > 2. The point of autovacuum is to get rid of maintenance burden, not add > to it. If you know which tables are small and frequently updated, then > configure those to specific settings that you've found to be optimal, > and then you don't need to worry about vacuuming them any longer. We have 72 counties using the same schema, which has over 300 tables. (Each has their own server, located in their county, with their data.) Rather than trying to fine-tune autovacuum for each table in all locations, we find it more convenient to use general settings which are aggressive enough for the small, high-update tables, but lax enough to let the big ones go until a nightly database vacuum. (That time is pretty slack anyway, so why not off-load the overhead of the vacuum to those hours?) > If you didn't tune it to match specific tables, > most likely your biggest tables never met the formula's condition, which > is why you were seeing it affecting only the small tables (which met the > condition under the values you configured server-wide). > > The extra I/O I was talking about would come from vacuuming one of your > biggest tables, which could cause the amount of I/O to swamp everything > else the server was doing at the time. Is there something better about having the autovacuum compete with load during the week, rather than doing a database vacuum during otherwise idle weekend hours, immediately after the weekly delete of almost 2% of the rows? At the time we run the database vacuum analyze, there is nothing else running to be swamped. >> Our tables tend to fall into one of four categories, small tables with high >> update rates, medium tables (millions or tens of millions of rows) with >> thousands or tens of thousands of updates per day, static tables of various >> sizes that are only modified as part of a software release, and big honking >> tables (100s of GB) which are either insert-only or are insert with >> periodic purge of old rows. Only the first group has a chance of being >> autovacuumed in normal operations. Even the purges don't cause it to kick >> in. > > It could certainly vacuum all your tables. Well, sure, if we weren't doing a nightly database vacuum. (Weekly for the database containing the largest tables, mentioned above.) This all started with the question about whether the documentation should say anything about vacuum schedules other than "enable autovacuum." My point was that I have a use case where I think that a scheduled vacuum will be better than leaving everything to autovacuum. I may not be the only one, so I'm thinking the documentation should discuss where an explicit schedule might be useful. The changes sound good, but I don't see the point of having any vacuum activity during the work week on the big tables in the database I was describing. It seems to me that it would result in at least some performance degradation for the interactive users, and bloat the table, since we might start inserting before the post-delete vacuum. -Kevin ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Still recommending daily vacuum...
Alvaro Herrera wrote: Matthew T. O'Connor wrote: Well, if a table has 10 rows, and we keep the current threshold of 1000 rows, then this table must have 1002 dead tuples (99% dead tuples, 1002 dead + 10 live) before being vacuumed. This seems wasteful because there are 500 dead tuples on it and only 10 live tuples. So each scan must wade through all the dead tuples. Another small table with 100 tuples will be vacuumed on every iteration as well, even if there are just two dead tuples. So you are right -- maybe dropping it all the way to 0 is too much. But a small value of 10 is reasonable? That will make the 10 tuple table be vacuumed when there are 10 dead tuples (50% of dead tuples), and the 100 tuple table when there are 11 (11% of dead tuples). It decreases quickly to the scale factor (2%, or do we want to decrease it to 1%?) I think it's probably fine. I think that the optimal number for the base_threshold is probably dependent on the width of the row; for a very narrow row where you might have many on the same page, 20 or 50 might be right, but for a very wide table a smaller number might be optimal. However, I think it probably doesn't matter much anyway. Reducing the default to 10 seems fine, and perhaps even removing it as a tuning knob. I think there are too many autovacuum knobs and it confuses people. Is it too late to possibly remove this GUC altogether? ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bgwriter strategies
Heikki Linnakangas wrote: I scheduled a test with the moving average method as well, we'll see how that fares. Not too well :(. Strange. The total # of writes is on par with having bgwriter disabled, but the physical I/O graphs show more I/O (on par with the aggressive bgwriter), and the response times are higher. I just noticed that on the tests with the moving average, or the simple "just enough" method, there's a small bump in the CPU usage during the ramp up period. I believe that's because bgwriter scans through the whole buffer cache without finding enough buffers to clean. I ran some tests earlier with unpatched bgwriter tuned to the maximum, and it used ~10% of CPU, which is the same level that the bump rises to. Unfortunately I haven't been taking pg_buffercache snapshots until after the ramp up; it should've shown up there. I've been running these tests with bgwriter_delay of 10 ms, which is probably too aggressive. I used that to test the idea of starting the scan from where it left off, instead of always starting from the clock hand. If someone wants to have a look, the # of writes are collected to a separate log file in /server/buf_alloc_stats.log. There's no link to it from the html files. There are also summary snapshots of pg_buffercache every 30 seconds in /server/bufcache.log. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] CurrentMemoryContext is NULL
Hi guys, I've played with PostgreSQL for a couple of weeks. Currently, I'm trying to develop an extension on a Windows system with VC++ 2005. I installed the full package of PostgreSQL 8.3 with include files and libraries. Compilation is OK in VC2005, aside from some warnings. But when I tried to run my function in psql, I always got a run-time error. After I got the PostgreSQL source code, I compiled my own libpostgres.lib and then linked my code against this lib. Running it, I got the same error. I found that CurrentMemoryContext is NULL, which was supposed to be initialized by the backend itself. Can anyone tell me what happened here? Is it just because it's Windows, or should I use Cygwin? I noticed CurrentMemoryContext is supposed to be imported into my DLL by the macro DLLIMPORT. It seems like PostgreSQL can't pass this variable to my DLL... Thank you for helping me! The snippet of my code looks like this:

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(foobar);

Datum
foobar(PG_FUNCTION_ARGS)
{
    /* Error occurs here: CurrentMemoryContext is a NULL pointer! */
    void *ptr = palloc(10);      /* or pgport_palloc(10) */

    PG_RETURN_NULL();
}
Re: [HACKERS] Still recommending daily vacuum...
Kevin Grittner wrote: > >>> On Tue, Jul 3, 2007 at 5:34 PM, in message > <[EMAIL PROTECTED]>, Alvaro Herrera > <[EMAIL PROTECTED]> wrote: > > Kevin Grittner wrote: > > > >> Autovacuum is enabled with very aggressive settings, to cover small > >> tables, including one with about 75 rows that can be updated 100 or more > >> times per second. Even with these settings there is zero chance of any > >> table of even moderate size hitting the autovacuum threshold between our > >> scheduled vacuums. > > > > Sounds like you would be served by setting those specific tables to a > > lower vacuum scale factor (keeping a more normal default for the rest of > > the tables), and having a non-zero vacuum delay setting (to avoid > > excessive I/O consumption). Have you tried that? > > I did play with that, but it doesn't seem to make sense in our environment. > We have about 100 databases, most of them scattered around the state, and > any extra maintenance like that has a cost, particularly with the daily > cluster changing the oid. Both from doing the math and from experience, > I can say that the autovacuum only affects the small, frequently updated > tables, so I could see no benefit. Am I missing something? (I can't see > where this causes any extra I/O.) There seems to be a misunderstanding here. 1. Cluster does not change the OID. It only changes the relfilenode. The file on disk is named something else, but the OID used in the database remains unchanged. So if you insert something into pg_autovacuum it continues to work after a CLUSTER; you don't need to update the OID. 2. The point of autovacuum is to get rid of maintenance burden, not add to it. If you know which tables are small and frequently updated, then configure those to specific settings that you've found to be optimal, and then you don't need to worry about vacuuming them any longer. You already know this, but: autovacuum uses a formula to determine which tables to vacuum. The formula is based on the number of dead tuples, the size of the table and two factors that you can configure per table as well as globally. If you didn't tune it to match specific tables, most likely your biggest tables never met the formula's condition, which is why you were seeing it affecting only the small tables (which met the condition under the values you configured server-wide). The extra I/O I was talking about would come from vacuuming one of your biggest tables, which could cause the amount of I/O to swamp everything else the server was doing at the time. Since it never actually touched the big tables, this hasn't happened to you yet. Do note that autovacuum uses the vacuum_cost_delay if autovacuum_vacuum_cost_delay is set to the default value of -1. > Our tables tend to fall into one of four categories, small tables with high > update rates, medium tables (millions or tens of millions of rows) with > thousands or tens of thousands of updates per day, static tables of various > sizes that are only modified as part of a software release, and big honking > tables (100s of GB) which are either insert-only or are insert with > periodic purge of old rows. Only the first group has a chance of being > autovacuumed in normal operations. Even the purges don't cause it to kick > in. It could certainly vacuum all your tables. But one thing to keep in mind is that as of 8.2, only one autovacuum process can be running. So if and when it decides to vacuum the big tables, it will be a long time before it is able to go back and check the small tables. This is fixed in 8.3.
> >> Oh, the tiny, high-update tables occasionally bloat to hundreds or > >> thousands of pages because of long-running transactions, so we schedule > >> a daily cluster on those, just to keep things tidy. > > > > If you can afford the cluster then there's no problem. I don't expect > > that to change in 8.3. > > Here also we're talking 10 to 20 milliseconds. I understand that in 8.2 > that leaves a chance of an error, but we seem to have dodged that bullet > so far. Has that gotten any safer in 8.3? Yes, it did, assuming I understood what error you are talking about (cluster not leaving dead tuples possibly seen by concurrent transactions). -- Alvaro Herrera   http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Still recommending daily vacuum...
Matthew T. O'Connor wrote: > Alvaro Herrera wrote: > >Jim C. Nasby wrote: > >>FWIW, I normally go with the 8.2 defaults, though I could see dropping > >>vacuum_scale_factor down to 0.1 or 0.15. I also think the thresholds > >>could be decreased further, maybe divide by 10. > > > >How about pushing thresholds all the way down to 0? > > As long as it handles small (or zero row) tables ok then yes. The > base_threshold in the original contrib autovacuum was just an easy way > to not vacuum really small tables too often. If a table has only 10 > rows, it's going to get vacuumed every time one row is updated. I guess > that's not a big problem with a table that small but still seems excessive. Well, if a table has 10 rows, and we keep the current threshold of 1000 rows, then this table must have 1002 dead tuples (99% dead tuples, 1002 dead + 10 live) before being vacuumed. This seems wasteful because there are 500 dead tuples on it and only 10 live tuples. So each scan must wade through all the dead tuples. Another small table with 100 tuples will be vacuumed on every iteration as well, even if there are just two dead tuples. So you are right -- maybe dropping it all the way to 0 is too much. But a small value of 10 is reasonable? That will make the 10 tuple table be vacuumed when there are 10 dead tuples (50% of dead tuples), and the 100 tuple table when there are 11 (11% of dead tuples). It decreases quickly to the scale factor (2%, or do we want to decrease it to 1%?) Does this sound acceptable? -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php) ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
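For reference, the arithmetic above follows autovacuum's vacuum condition (a table is vacuumed once dead tuples exceed base threshold + scale factor * live tuples). The figures below assume the base threshold of 10 being proposed and a 1% scale factor, which is what makes the 10-dead / 11-dead numbers in the message come out; treat it as a worked illustration under those assumptions, not the shipped defaults.

#include <stdio.h>

/* Autovacuum's vacuum trigger: a table is vacuumed once its dead-tuple
 * count exceeds base_threshold + scale_factor * reltuples. */
static double
vacuum_threshold(double base_threshold, double scale_factor, double reltuples)
{
    return base_threshold + scale_factor * reltuples;
}

int
main(void)
{
    /* Assumed values from the discussion: base threshold 10, scale factor 1%. */
    printf("10-row table:  vacuumed after ~%.0f dead tuples (about half the table)\n",
           vacuum_threshold(10, 0.01, 10));
    printf("100-row table: vacuumed after ~%.0f dead tuples (roughly the 11%% figure)\n",
           vacuum_threshold(10, 0.01, 100));
    return 0;
}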
Re: [HACKERS] Bgwriter strategies
On Fri, 6 Jul 2007, Tom Lane wrote: The problem is that it'd be very hard to track how far ahead of the recycling sweep hand we are, because that number has to be measured in usage-count-zero pages. I see no good way to know how many of the pages we scanned before have been touched (and given nonzero usage counts) unless we rescan them. I've actually been working on how to address that specific problem without expressly tracking the contents of the buffer cache. When the background writer is called, it finds out how many buffers were allocated and how far the sweep point moved since the last call. From that, you can calculate how many buffers on average need to be scanned per allocation, which tells you something about the recently encountered density of 0-usage count buffers. My thought was to use that as an input to the computation for how far ahead to stay. I've been doing moving averages for years and years, and I find that the multiplication approach works at least as well as explicitly storing the last K observations. It takes a lot less storage and arithmetic too. I was simplifying the description just to comment on the range for K; I was using a multiplication approach for the computation. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 6: explain analyze is your friend
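A sketch of the calculation described above, with invented names: the ratio of buffers swept to buffers allocated since the last bgwriter call estimates how many buffers must be scanned to find one reusable (usage-count-zero) buffer, and that density can then size the look-ahead.

/*
 * Sketch only (names are illustrative, not from any patch): estimate how
 * many buffers the cleaning scan should cover, based on the recently
 * observed density of usage-count-zero buffers.
 */
static int
estimate_scan_target(int buffers_allocated,   /* allocations since the last call */
                     int sweep_distance,      /* how far the clock hand moved */
                     int clean_target)        /* reusable buffers we want clean */
{
    double scanned_per_alloc;

    if (buffers_allocated <= 0)
        return clean_target;                  /* no recent demand; fall back */

    /* On average this many buffers were scanned to satisfy one allocation. */
    scanned_per_alloc = (double) sweep_distance / buffers_allocated;

    /* To keep clean_target reusable buffers ahead of the hand, scan roughly
     * that many times more buffers in total. */
    return (int) (clean_target * scanned_per_alloc + 0.5);
}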
Re: [HACKERS] usleep feature for pgbench
On 7/6/2007 1:32 PM, Heikki Linnakangas wrote: Jan Wieck wrote: On 7/6/2007 10:44 AM, Peter Eisentraut wrote: Am Donnerstag, 5. Juli 2007 21:12 schrieb Jan Wieck: To test some changes in Slony I needed a \usleep [microseconds|:variable] in pgbench's scripting language to be able to have hundreds of concurrent running transactions without totally swamping the system. I was wondering if anyone would object to permanently adding this to the pgbench code? Or maybe a \sleep command that takes units, if it's not too much work. You mean as a second, optional argument? Good idea. us = microseconds ms = milliseconds s = seconds (default) \sleep {value|:variable} [us|ms|s] Is that okay with everyone? I won't object, but is it really worth the trouble? Can you do microsecond precision sleeps, and on what platforms? How much overhead is there? IIRC, on Linux the minimum time you can sleep depends on CONFIG_HZ, and the default was 10 ms until recently. It is what you tell select(2) in the struct timeval. Why limit it to some arbitrary precision? Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
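For what it's worth, the select(2) mechanism Jan mentions looks roughly like this (a sketch of the idea, not the actual pgbench patch); the unit suffix only scales the value into microseconds before filling in the struct timeval.

#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* Sketch: sleep for `value` in the given unit ("us", "ms" or "s",
 * defaulting to seconds) using select(2) with no file descriptors. */
static void
sleep_with_unit(long value, const char *unit)
{
    long           usec;
    struct timeval tv;

    if (unit && strcmp(unit, "us") == 0)
        usec = value;
    else if (unit && strcmp(unit, "ms") == 0)
        usec = value * 1000L;
    else
        usec = value * 1000000L;     /* plain seconds by default */

    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    (void) select(0, NULL, NULL, NULL, &tv);
}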
Re: [HACKERS] Bgwriter strategies
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes:

                        imola-336   imola-337   imola-340
  writes by checkpoint      38302       30410       39529
  writes by bgwriter       350113     2205782     1418672
  writes by backends      1834333      265755      787633
  writes total            2222748     2501947     2245834
  allocations             2683170     2657896     2699974

It looks like Tom's idea is not a winner; it leads to more writes than necessary. The incremental number of writes is not that large; only about 10% more. The interesting thing is that those "extra" writes must represent buffers that were re-touched after their usage_count went to zero, but before they could be recycled by the clock sweep. While you'd certainly expect some of that, I'm surprised it is as much as 10%. Maybe we need to play with the buffer allocation strategy some more. The very small difference in NOTPM among the three runs says that either this whole area is unimportant, or DBT2 isn't a good test case for it; or maybe that there's something wrong with the patches? The small difference in NOTPM is because the I/O still wasn't saturated even with 10% extra writes. I ran more tests with a higher number of warehouses, and the extra writes start to show in the response times. See tests 341-344: http://community.enterprisedb.com/bgwriter/. I scheduled a test with the moving average method as well, we'll see how that fares. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] usleep feature for pgbench
Jan Wieck wrote: On 7/6/2007 10:44 AM, Peter Eisentraut wrote: Am Donnerstag, 5. Juli 2007 21:12 schrieb Jan Wieck: To test some changes in Slony I needed a \usleep [microseconds|:variable] in pgbench's scripting language to be able to have hundreds of concurrent running transactions without totally swamping the system. I was wondering if anyone would object to permanently adding this to the pgbench code? Or maybe a \sleep command that takes units, if it's not too much work. You mean as a second, optional argument? Good idea. us = microseconds ms = milliseconds s = seconds (default) \sleep {value|:variable} [us|ms|s] Is that okay with everyone? I won't object, but is it really worth the trouble? Can you do microsecond precision sleeps, and on what platforms? How much overhead is there? IIRC, on Linux the minimum time you can sleep depends on CONFIG_HZ, and the default was 10 ms until recently. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] usleep feature for pgbench
On 7/6/2007 10:44 AM, Peter Eisentraut wrote: Am Donnerstag, 5. Juli 2007 21:12 schrieb Jan Wieck: To test some changes in Slony I needed a \usleep [microseconds|:variable] in pgbench's scripting language to be able to have hundreds of concurrent running transactions without totally swamping the system. I was wondering if anyone would object to permanently adding this to the pgbench code? Or maybe a \sleep command that takes units, if it's not too much work. You mean as a second, optional argument? Good idea. us = microseconds ms = milliseconds s = seconds (default) \sleep {value|:variable} [us|ms|s] Is that okay with everyone? Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pg_autovacuum -> pg_class.reloptions?
Alvaro Herrera <[EMAIL PROTECTED]> writes: > A long time ago, Tom proposed moving the pg_autovacuum settings into > reloptions. I know it's really late in the devel cycle but I wonder if > such a move would be acceptable at this time? I think it's too late to be considering essentially-cosmetic changes for 8.3, especially when you've not even started on a patch. If we don't start saying "no" on a regular basis, we'll never get this thing out the door. We are already more than two months behind the intended schedule, and I see absolutely nothing getting done on several of the major patches. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bgwriter strategies
Greg Smith <[EMAIL PROTECTED]> writes: > On Thu, 5 Jul 2007, Tom Lane wrote: >> This would give us a safety margin such that buffers_to_clean is not >> less than the largest demand observed in the last 100 iterations...and >> it takes quite a while for the memory of a demand spike to be forgotten >> completely. > If you tested this strategy even on a steady load, I'd expect you'll find > there are large spikes in allocations during the occasional period where > everything is just right to pull a bunch of buffers in, and if you let > that max linger around for 100 iterations you'll write a large number of > buffers more than you need. You seem to have the same misunderstanding as Heikki. What I was proposing was not a target for how many to *write* on each cycle, but a target for how far ahead of the clock sweep hand to look. If say the target is 100, we'll scan forward from the sweep until we have seen 100 clean zero-usage-count buffers; but we only have to write whichever of them weren't already clean. This is actually not so different from my previous proposal, in that the idea is to keep ahead of the sweep by a particular distance. The previous idea was that that distance was "all the buffers", whereas this idea is "a moving average of the actual demand rate". The excess writes created by the previous proposal were because of the probability of re-dirtying buffers between cleaning and recycling. We reduce that probability by not trying to keep so many of 'em clean. But I think that we can meet the goal of having backends do hardly any of the writes with a relatively small increase in the target distance, and thus a relatively small differential in the number of wasted writes. Heikki's test showed that Itagaki-san's patch wasn't doing that well in eliminating writes by backends, so we need a more aggressive target for how many buffers to keep clean than it has; but I think not a huge amount more, and thus my proposal. BTW, somewhere upthread you suggested combining the target-distance idea with the idea that the cleaning work uses a separate sweep hand and thus doesn't re-examine the same buffers on every bgwriter iteration. The problem is that it'd be very hard to track how far ahead of the recycling sweep hand we are, because that number has to be measured in usage-count-zero pages. I see no good way to know how many of the pages we scanned before have been touched (and given nonzero usage counts) unless we rescan them. We could approximate it maybe: try to keep the cleaning hand N total buffers ahead of the recycling hand, where N is the target number of clean usage-count-zero buffers scaled by the average fraction of count-zero buffers (which we can track a moving average of as we advance the recycling hand). However I'm not sure the complexity and uncertainty is worth it. What I took away from Heikki's experiment is that trying to stay a large distance in front of the recycle sweep isn't actually so useful because you get too many wasted writes due to re-dirtying. So restructuring the algorithm to make it cheap CPU-wise to stay well ahead is not so useful either. > I ended up settling on max(moving average of the last 16,most recent > allocation), and that seemed to work pretty well without being too > wasteful from excessive writes. I've been doing moving averages for years and years, and I find that the multiplication approach works at least as well as explicitly storing the last K observations. It takes a lot less storage and arithmetic too. 
regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
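A sketch of the approximation outlined above (names invented): keep the cleaning hand N total buffers ahead of the recycling hand, where N is the desired number of clean usage-count-zero buffers divided by a moving-average estimate of the fraction of buffers found with zero usage count, with the average maintained by multiplication rather than by storing the last K observations.

/* Sketch only: convert a target expressed in usage-count-zero buffers into
 * a total-buffer lead distance, using an estimated zero-count fraction. */
static int
cleaning_lead_distance(int target_clean_zero_buffers, double zero_fraction_avg)
{
    if (zero_fraction_avg <= 0.0)
        zero_fraction_avg = 1.0;     /* avoid division by zero; assume all reusable */

    return (int) (target_clean_zero_buffers / zero_fraction_avg + 0.5);
}

/* The multiplication-style moving average mentioned above: no window of
 * stored observations, just a decayed blend of the old value and the new. */
static double
update_moving_average(double avg, double observation, double decay)
{
    return avg * decay + observation * (1.0 - decay);   /* e.g. decay = 0.9 */
}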
[HACKERS] pg_autovacuum -> pg_class.reloptions?
Hi, A long time ago, Tom proposed moving the pg_autovacuum settings into reloptions. I know it's really late in the devel cycle but I wonder if such a move would be acceptable at this time? I feel it would be a good move, for example per http://thread.gmane.org/gmane.comp.db.postgresql.general/92643/ and the third item in http://thread.gmane.org/gmane.comp.db.postgresql.bugs/14175 I admit I haven't even tried yet, so I would need to start working on a patch. This would have the advantage that 1. users would not need to come up with "disable" values for variables they are not interested in touching, and 2. we could have more control in detecting invalid values. Opinions? -- Alvaro Herrera   http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bgwriter strategies
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes: Tom Lane wrote: buffers_to_clean = Max(buffers_used * 1.1, buffers_to_clean * 0.999); That would be overly aggressive on a workload that's steady on average, but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 100. You'd end up writing ~100 pages on every bgwriter round, but you only need an average of 20 pages per round. No, you wouldn't be *writing* that many, you'd only be keeping that many *clean*; which only costs more work if any of them get re-dirtied between writing and use. Which is a fairly small probability if we're talking about a small difference in the number of buffers to keep clean. So I think the average number of writes is hardly different, it's just that the backends are far less likely to have to do any of them. Ah, ok, I misunderstood what you were proposing. Yes, that seems like a good algorithm then. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bgwriter strategies
Heikki Linnakangas <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> buffers_to_clean = Max(buffers_used * 1.1, >> buffers_to_clean * 0.999); > That would be overly aggressive on a workload that's steady on average, > but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 > 100. You'd end up writing ~100 pages on every bgwriter round, but you > only need an average of 20 pages per round. No, you wouldn't be *writing* that many, you'd only be keeping that many *clean*; which only costs more work if any of them get re-dirtied between writing and use. Which is a fairly small probability if we're talking about a small difference in the number of buffers to keep clean. So I think the average number of writes is hardly different, it's just that the backends are far less likely to have to do any of them. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] usleep feature for pgbench
Am Donnerstag, 5. Juli 2007 21:12 schrieb Jan Wieck: > To test some changes in Slony I needed a > > \usleep [microseconds|:variable] > > in pgbench's scripting language to be able to have hundreds of > concurrent running transactions without totally swamping the system. I > was wondering if anyone would object to permanently adding this to the > pgbench code? Or maybe a \sleep command that takes units, if it's not too much work. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bgwriter strategies
Greg Smith wrote: On Fri, 6 Jul 2007, Heikki Linnakangas wrote: There's something wrong with that. The number of buffer allocations shouldn't depend on the bgwriter strategy at all. I was seeing a smaller (closer to 5%) increase in buffer allocations switching from no background writer to using the stock one before I did any code tinkering, so it didn't strike me as odd. I believe it's related to the TPS numbers. When there are more transactions being executed per unit time, it's more likely the useful blocks will stay in memory because their usage_count is getting tickled faster, and therefore there's less of the most useful blocks being swapped out only to be re-allocated again later. Did you run the test for a constant number of transactions? If you did, the access pattern and the number of allocations should be *exactly* the same with 1 client, assuming the initial state and the seed used for the random number generator is the same. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bgwriter strategies
On Fri, 6 Jul 2007, Heikki Linnakangas wrote: There's something wrong with that. The number of buffer allocations shouldn't depend on the bgwriter strategy at all. I was seeing a smaller (closer to 5%) increase in buffer allocations switching from no background writer to using the stock one before I did any code tinkering, so it didn't strike me as odd. I believe it's related to the TPS numbers. When there are more transactions being executed per unit time, it's more likely the useful blocks will stay in memory because their usage_count is getting tickled faster, and therefore there's less of the most useful blocks being swapped out only to be re-allocated again later. Since the bad bgwriter tunings reduce TPS, I believe that's the mechanism by which there are more allocations needed. I'll try to keep an eye on this now that you've brought it up. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bgwriter strategies
Greg Smith wrote: As you can see, I achieved the goal of almost never having a backend write its own buffer, so yeah for that. That's the only good thing I can say about it though. The TPS results take a moderate dive, and there's about 10% more buffer allocations. The big and obvious issues is that I'm writing almost 75% more buffers this way--way worse even than the 10% extra overhead Heikki was seeing. But since I've going out of my way to find a worse-case for this code, I consider mission accomplished there. There's something wrong with that. The number of buffer allocations shouldn't depend on the bgwriter strategy at all. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bgwriter strategies
I just got my own first set of useful tests of using the new "remember where you last scanned to" BGW implementation suggested by Tom. What I did was keep the existing % to scan, but cut back the number to scan when so close to a complete lap ahead of the strategy point that I'd cross it if I scanned that much. So when the system was idle, it would very quickly catch up with the strategy point, but if the %/max numbers were low it's possible for it to fall behind. My workload was just the UPDATE statement out of pgbench with a database of scale 25 (~400MB, picked so most operations were in memory), which pushes lots of things in and out of the buffer cache as fast as possible. Here's some data with no background writer at all:

clients    tps   buf_clean   buf_backend   buf_alloc
      1   1340           0         72554       96846
      2   1421           0         73969       88879
      3   1418           0         71452       86339
      4   1344           0         75184       90187
      8   1361           0         73063       88099
     15   1348           0         71861       86923

And here's what I got with the new approach, using 10% for the scan percentage and a maximum of 200 buffers written out. I picked those numbers after some experimentation because they were the first I found where the background writer was almost always riding right behind the strategy point; with lower numbers, when the background writer woke up it often found it had already fallen behind the strategy point and had to start cleaning forward the old way instead, which wasn't what I wanted to test.

clients    tps   buf_clean   buf_backend   buf_alloc
      1   1261      122917           150      105655
      2   1186      126663            26       97586
      3   1154      127780            21       98077
      4   1181      127685            19       98068
      8   1076      128597             2       98229
     15   1065      128399             5       98143

As you can see, I achieved the goal of almost never having a backend write its own buffer, so yeah for that. That's the only good thing I can say about it though. The TPS results take a moderate dive, and there's about 10% more buffer allocations. The big and obvious issue is that I'm writing almost 75% more buffers this way--way worse even than the 10% extra overhead Heikki was seeing. But since I've gone out of my way to find a worst case for this code, I consider mission accomplished there. Anyway, will have more detailed reports to post after I collect some more data; for now I just wanted to join Heikki in confirming that the strategy of trying to get the LRU cleaner to ride right behind the strategy point can really waste a whole lot of writes. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
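A sketch of the adjustment described at the top of the message above (names invented, not the actual code): keep the configured percentage to scan, but clamp it so the cleaning scan stops just short of completing a lap and crossing the strategy point.

/* Sketch only: how many buffers the cleaning scan should cover this round. */
static int
buffers_to_scan(int nbuffers,          /* total buffers in the pool */
                double scan_percent,   /* configured percentage to scan, e.g. 10.0 */
                int distance_ahead)    /* how far the cleaning hand already is
                                        * ahead of the strategy point */
{
    int want = (int) (nbuffers * scan_percent / 100.0);
    int room = nbuffers - distance_ahead;   /* any more would cross the strategy point */

    return (want < room) ? want : room;
}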
Re: [HACKERS] Bgwriter strategies
Tom Lane wrote: Heikki Linnakangas <[EMAIL PROTECTED]> writes:

                        imola-336   imola-337   imola-340
  writes by checkpoint      38302       30410       39529
  writes by bgwriter       350113     2205782     1418672
  writes by backends      1834333      265755      787633
  writes total            2222748     2501947     2245834
  allocations             2683170     2657896     2699974

It looks like Tom's idea is not a winner; it leads to more writes than necessary. The incremental number of writes is not that large; only about 10% more. The interesting thing is that those "extra" writes must represent buffers that were re-touched after their usage_count went to zero, but before they could be recycled by the clock sweep. While you'd certainly expect some of that, I'm surprised it is as much as 10%. Maybe we need to play with the buffer allocation strategy some more. The very small difference in NOTPM among the three runs says that either this whole area is unimportant, or DBT2 isn't a good test case for it; or maybe that there's something wrong with the patches? On imola-340, there's still a significant amount of backend writes. I'm still not sure what we should be aiming at. Is 0 backend writes our goal? Well, the lower the better, but not at the cost of a very large increase in total writes. Imola-340 was with a patch along the lines of Itagaki's original patch, ensuring that there are as many clean pages in front of the clock head as were consumed by backends since the last bgwriter iteration. This seems intuitively wrong, since in the presence of bursty request behavior it'll constantly be getting caught short of buffers. I think you need a safety margin and a moving-average decay factor. Possibly something like buffers_to_clean = Max(buffers_used * 1.1, buffers_to_clean * 0.999); where buffers_used is the current observation of demand. This would give us a safety margin such that buffers_to_clean is not less than the largest demand observed in the last 100 iterations (0.999 ^ 100 is about 0.90, cancelling out the initial 10% safety margin), and it takes quite a while for the memory of a demand spike to be forgotten completely. That would be overly aggressive on a workload that's steady on average, but consists of small bursts. Like this: 0 0 0 0 100 0 0 0 0 100 0 0 0 0 100. You'd end up writing ~100 pages on every bgwriter round, but you only need an average of 20 pages per round. That'd be effectively the same as keeping all buffers with usage_count=0 clean. BTW, I believe that kind of workload is actually very common. That's what you get if one transaction causes say 10-100 buffer allocations, and you execute one such transaction every few seconds. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
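In code, the rule quoted above is a slowly decaying maximum of observed demand (a sketch; the constants are the ones from the message):

/* Keep at least 10% more buffers clean than the demand just observed, and
 * let the memory of past spikes decay by 0.1% per iteration, so a spike is
 * remembered for on the order of 100 iterations (0.999^100 is about 0.90). */
static double
next_buffers_to_clean(double buffers_to_clean,   /* previous target */
                      double buffers_used)       /* demand observed this round */
{
    double from_demand = buffers_used * 1.10;
    double from_decay  = buffers_to_clean * 0.999;

    return (from_demand > from_decay) ? from_demand : from_decay;
}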
Re: [HACKERS] Bgwriter strategies
Greg Smith wrote: On Thu, 5 Jul 2007, Heikki Linnakangas wrote: It looks like Tom's idea is not a winner; it leads to more writes than necessary. What I came away with as the core of Tom's idea is that the cleaning/LRU writer shouldn't ever scan the same section of the buffer cache twice, because anything that resulted in a new dirty buffer will be unwritable by it until the clock sweep passes over it. I never took that to mean that idea necessarily had to be implemented as "trying to aggressively keep all pages with usage_count=0 clean". I've been making slow progress on this myself, and the question I've been trying to answer is whether this fundamental idea really matters or not. One clear benefit of that alternate implementation is that it should allow setting a lower value for the interval without being as concerned that you're wasting resources by doing so, which I've found to be a problem with the current implementation--it will consume a lot of CPU scanning the same section right now if you lower that too much. Yes, in fact ignoring the CPU overhead of scanning the same section over and over again, Tom's proposal is the same as setting both bgwriter_lru_* settings all the way up to the max. In fact I ran a DBT-2 test like that as well, and the # of writes was indeed the same, just with a much higher CPU usage. It's clear that scanning the same section over and over again has been a waste of time in previous releases. As a further data point, I constructed a smaller test case that performs random DELETEs on a table using an index. I varied the # of shared_buffers, and ran the test with bgwriter disabled, or tuned all the way up to the maximum. Here are the results from that:

 shared_buffers | writes | writes | writes_ratio
----------------+--------+--------+--------------------
           2560 |  86936 |  88023 |   1.01250345081439
           5120 |  81207 |  84551 |   1.04117871612053
           7680 |  75367 |  80603 |   1.06947337694216
          10240 |  69772 |  74533 |   1.06823654187926
          12800 |  64281 |  69237 |   1.07709898725907
          15360 |  58515 |  64735 |   1.10629753054772
          17920 |  53231 |  58635 |   1.10151979109917
          20480 |  48128 |  54403 |   1.13038148271277
          23040 |  43087 |  49949 |   1.15925917330053
          25600 |  39062 |  46477 |   1.1898264297783
          28160 |  35391 |  43739 |   1.23587917832217
          30720 |  32713 |  37480 |   1.14572188426619
          33280 |  31634 |  31677 |   1.00135929695897
          35840 |  31668 |  31717 |   1.00154730327144
          38400 |  31696 |  31693 |   0.999905350832913
          40960 |  31685 |  31730 |   1.00142023039293
          43520 |  31694 |  31650 |   0.998611724616647
          46080 |  31661 |  31650 |   0.999652569407157

The first writes-column is the # of writes with bgwriter disabled, the 2nd column is with the aggressive bgwriter. The table size is 4 pages, so after that the table fits in cache and the bgwriter strategy makes no difference. As far as your results, first off I'm really glad to see someone else comparing checkpoint/backend/bgwriter writes the same way I've been doing, so I finally have someone else's results to compare against. I expect that the optimal approach here is a hybrid one that structures scanning the buffer cache the new way Tom suggests, but limits the number of writes to "just enough". I happen to be fond of the "just enough" computation based on a weighted moving average I wrote before, but there's certainly room for multiple implementations of that part of the code to evolve. We need to get the requirements straight. One goal of bgwriter is clearly to keep just enough buffers clean in front of the clock hand so that backends don't need to do writes themselves until the next bgwriter iteration.
But not any more than that, otherwise we might end up doing more writes than necessary if some of the buffers are redirtied. To deal with bursty workloads, for example a batch of 2 GB worth of inserts coming in every 10 minutes, it seems we want to keep doing a little bit of cleaning even when the system is idle, to prepare for the next burst. The idea is to smoothen the physical I/O bursts; if we don't clean the dirty buffers left over from the previous burst during the idle period, the I/O system will be bottlenecked during the bursts, and sit idle otherwise. To strike a balance between cleaning buffers ahead of possible bursts in the future and not doing unnecessary I/O when no such bursts come, I think a reasonable strategy is to write buffers with usage_count=0 at a slow pace when there's no buffer allocations happening. To smoothen the small variations on a relatively steady workload, the weighted average sounds good. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Bgwriter strategies
On Thu, 5 Jul 2007, Tom Lane wrote: This would give us a safety margin such that buffers_to_clean is not less than the largest demand observed in the last 100 iterations...and it takes quite a while for the memory of a demand spike to be forgotten completely. If you tested this strategy even on a steady load, I'd expect you'll find there are large spikes in allocations during the occasional period where everything is just right to pull a bunch of buffers in, and if you let that max linger around for 100 iterations you'll write a large number of buffers more than you need. That's what I saw when I tried to remember too much information about allocation history in the version of the auto LRU tuner I worked on. For example, with 32000 buffers, with pgbench trying to UPDATE as fast as possible I sometimes hit 1500 allocations in an interval, but the steady-state allocation level was closer to 500. I ended up settling on max(moving average of the last 16, most recent allocation), and that seemed to work pretty well without being too wasteful from excessive writes. Playing with multiples of 2, 8 was definitely not enough memory to smooth usefully, while 32 seemed a little sluggish on the entry and wasteful on the exit ends. At the default interval, 16 iterations is looking back at the previous 3.2 seconds. I have a feeling the proper tuning for this should be time-based, where you would decide how long ago to consider looking back for and compute the iterations based on that. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
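A sketch of the settling point described above and of the time-based variant suggested at the end (the smoothing form and names are illustrative): take the larger of a roughly 16-sample smoothed allocation rate and the most recent observation, and derive the sample count from a look-back interval divided by bgwriter_delay.

/* Sketch: derive the look-back length in iterations from a time window,
 * e.g. 3200 ms / 200 ms (the default bgwriter_delay) = 16 samples. */
static int
lookback_iterations(int lookback_ms, int bgwriter_delay_ms)
{
    int n;

    if (bgwriter_delay_ms <= 0)
        return 1;
    n = lookback_ms / bgwriter_delay_ms;
    return (n > 0) ? n : 1;
}

/* max(moving average over roughly n_samples iterations, latest allocation);
 * *smoothed carries the weighted average from call to call. */
static double
allocation_target(double *smoothed, int recent_alloc, int n_samples)
{
    double weight = 1.0 / n_samples;

    *smoothed = *smoothed * (1.0 - weight) + recent_alloc * weight;
    return (*smoothed > recent_alloc) ? *smoothed : (double) recent_alloc;
}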
Re: [HACKERS] usleep feature for pgbench
On Thu, 5 Jul 2007, Jan Wieck wrote: Original pgbench reported 39, 37 and 33 TPS. Having my patch applied it reported 40, 38 and 33 TPS. Inserting a "\usleep 1" after the update to accounts of a default equivalent script changed those numbers to 40, 37 and 33. I interpret that as "does not change observed performance". Tell you what: put your work into a patch, send it to the list, and I'll test that it doesn't degrade results for you. If your pgbench results are in the <40 TPS range even with that low of a scale, you're not in a position to tell whether it has a negative performance impact. That select statement you're fiddling with can turn into a bottleneck at high client loads, and from your description I can't tell if you've made that worse, but you'll never see it unless you're pushing, say, 1000 TPS and >50 clients. Also: 3 pgbench results at one client load is quite a bit short of proving no impact on performance; I'll queue up 50 or so, which is where I start to trust results from that unruly tool. This is actually a feature I'd be kind of interested to have, because it would allow you to pass two (or more) script files to pgbench and adjust the transaction mix. What happens when you do that right now is that inevitably all the clients get blocked at once on whatever the hardest to execute transaction is, and the results are kind of deceptive as a result. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] tsearch2: language or encoding
Tatsuo Ishii <[EMAIL PROTECTED]> writes: > I'm wondering if a tsearch configuration is bound to a language or > an encoding. If it's bound to a language, there's a serious design > problem, I would think. An encoding or charset is not necessarily > bound to a single language. We can find such examples everywhere (I'm > not talking about Unicode here). LATIN1 includes English and several > European languages. EUC-JP includes English and Japanese, etc. And > we specify encoding as a property of char, not language, so I would say the > configuration should be bound to an encoding. Surely not, because then what do you do with utf8, which (allegedly) represents every language on earth? As far as the word-stemming part goes, that is very clearly bound to a language, not an encoding. There may be some other parts of the code that really are better attached to an encoding --- Oleg, Teodor, your thoughts? regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings