Re: [HACKERS] OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already
On Sun, Aug 14, 2011 at 12:16:39AM -0400, Robert Haas wrote:
> On Fri, Aug 12, 2011 at 7:19 PM, daveg wrote:
> > This seems to be bug month for my client. Now they are seeing periods
> > where all new connections fail immediately with the error:
> >
> >   FATAL: lock AccessShareLock on object 0/1260/0 is already held
> >
> > This happens on postgresql 8.4.7 on a large (512GB, 32 core) system
> > that has been up for months. It started happening sporadically a few
> > days ago. It will do this for a period of several minutes to an hour
> > and then go back to normal for hours or days.
> >
> > One complete failing session out of several hundred around that time:
> >
> > 2011-08-09 00:01:04.446 8823 [unknown] [unknown] LOG: connection received: host=op05.xxx port=34067
> > 2011-08-09 00:01:04.446 8823 c77 apps LOG: connection authorized: user=apps database=c77
> > 2011-08-09 00:01:04.449 8823 c77 apps FATAL: lock AccessShareLock on object 0/1260/0 is already held
> >
> > What can I do to help track this down?
>
> I've seen that error (though not that exact fact pattern) caused by
> bad RAM. It's unclear to me what else could cause it.

I'll look into that. I think it is only happening on one host, so that
might make sense. On the other hand, these are pretty fancy hosts, all
ECC and such, so I'd hope they would have squeaked about bad RAM.

> In terms of debugging, it seems like it might be sensible to start by
> injecting some debugging code that dumps out the contents of the LOCK
> and LOCALLOCK structures at the point the error occurs.

Hmm, we will update to 9.0 next week on these hosts, so I'll try to
hold off on this part at least until then.

-dg

--
David Gould    da...@sonic.net    510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
Re: [HACKERS] OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already
On Fri, Aug 12, 2011 at 7:19 PM, daveg wrote:
> This seems to be bug month for my client. Now they are seeing periods
> where all new connections fail immediately with the error:
>
>   FATAL: lock AccessShareLock on object 0/1260/0 is already held
>
> This happens on postgresql 8.4.7 on a large (512GB, 32 core) system
> that has been up for months. It started happening sporadically a few
> days ago. It will do this for a period of several minutes to an hour
> and then go back to normal for hours or days.
>
> One complete failing session out of several hundred around that time:
>
> 2011-08-09 00:01:04.446 8823 [unknown] [unknown] LOG: connection received: host=op05.xxx port=34067
> 2011-08-09 00:01:04.446 8823 c77 apps LOG: connection authorized: user=apps database=c77
> 2011-08-09 00:01:04.449 8823 c77 apps FATAL: lock AccessShareLock on object 0/1260/0 is already held
>
> What can I do to help track this down?

I've seen that error (though not that exact fact pattern) caused by bad
RAM. It's unclear to me what else could cause it.

In terms of debugging, it seems like it might be sensible to start by
injecting some debugging code that dumps out the contents of the LOCK
and LOCALLOCK structures at the point the error occurs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
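For concreteness, here is a minimal sketch of the kind of instrumentation
described above, placed next to the code that raises the "already held"
error. The field names are recalled from src/include/storage/lock.h and
should be checked against the actual 8.4 sources; treat this as
illustrative, not a drop-in patch:

    /*
     * Hypothetical debugging aid: dump the state of the LOCK and
     * LOCALLOCK being examined when "already held" is detected.
     */
    elog(LOG, "LOCK %u/%u/%u: grantMask=%x waitMask=%x nRequested=%d nGranted=%d",
         lock->tag.locktag_field1,
         lock->tag.locktag_field2,
         lock->tag.locktag_field3,
         lock->grantMask, lock->waitMask,
         lock->nRequested, lock->nGranted);
    elog(LOG, "LOCALLOCK: nLocks=%ld numLockOwners=%d",
         (long) locallock->nLocks, locallock->numLockOwners);

Obviously impossible values in output like this (say, a grantMask with
stray bits set) would tend to support the bad-RAM theory.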
Re: [HACKERS] vacuum scheduling (was: index-only scans)
On Sat, Aug 13, 2011 at 5:31 PM, Heikki Linnakangas wrote:
> Yeah, I'm not excited about making the planner and statistics more
> dynamic. Feedback loops and plan instability are not fun. I think we
> should rather be more aggressive in setting the visibility map bits.

As far as I can see, the only way we're really going to make any
headway setting visibility map bits more aggressively is to get
autovacuum to do it for us. I mean, we could consider doing it when we
perform a HOT cleanup; and we could also consider having the background
writer do it when it evicts buffers. But frankly, those sorts of things
are a drop in the bucket. They're just not going to occur frequently
enough to really make a dent. Maybe you could make it work for the case
where the working set fits in shared_buffers, because then the
background writer isn't doing anything anyway and so, hey, why not set
visibility map bits? But you're not going to get much of a performance
win out of index-only scans on that workload anyway. If the working set
does NOT fit in shared_buffers, you're going to be evicting pages like
crazy. I can't really see us deciding to dirty all of those pages and
write and flush XLOG for them instead of just booting them out. Even if
you have the background writer try to do some of that, other backends
that are just throwing things over the fence to get foreground work
done are going to run circles around it, so it's going to take a long,
long time to make much headway by this route.

Assuming you believe the above analysis (which you may not, so feel
free to argue if you see something I'm missing), that leaves autovacuum.
There are many things to like about running autovacuum more
aggressively, and even on insert-only tables. Setting visibility map
bits isn't only useful for index-only scans: it also significantly
speeds up sequential scans that are not I/O-bound, because it enables
the visibility checks on those pages to be skipped. On the flip side,
setting visibility map bits (and hint bits) has to be considered a
lower priority than vacuuming tables that contain dead tuples, because
bloat is a far, far bigger problem than a few unset visibility map
bits. There are already situations where vacuum can't keep up,
especially with exciting values of vacuum_cost_delay, and I don't think
we can afford to stir any more work into that pot without prioritizing
it.

We already have this parameter called vacuum_cost_delay that is
intensely problematic. If you set it too low, vacuum can consume too
much I/O bandwidth and you suffer a performance drop. On the other
hand, if you set it too high (which is easy to do), autovacuum can't
keep up and you get bloat. In the worst case, this sends the system
into a tailspin: as all the tables get bigger, it gets harder and
harder for vacuum to keep up. What would be really nice is to make this
parameter self-tuning, at least to some degree. We used to run
checkpoints at full speed, but we now have this nifty parameter called
checkpoint_completion_target that makes them run slowly, but not so
slowly that they don't complete fast enough. It would be nice to do the
same thing for autovacuum. If we can estimate that a table will next
need vacuuming in 37 minutes and that at the current vacuum_cost_delay
setting the currently-in-progress vacuum will finish in 83 minutes (or
83 hours), we should realize that we have a problem and vacuum faster.
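A rough sketch of that estimate-and-adjust computation, in the spirit of
checkpoint_completion_target, might look like this; every name here is
invented for illustration:

    #include <math.h>

    /*
     * Hypothetical pacing helper: if the vacuum in progress is projected
     * to finish after the moment the next vacuum becomes due, scale the
     * cost delay down proportionally so it catches up.
     */
    static double
    adjust_vacuum_cost_delay(double cost_delay_ms,
                             double pages_done, double total_pages,
                             double elapsed_secs, double secs_until_next_due)
    {
        double      rate;
        double      secs_left;

        if (elapsed_secs <= 0 || pages_done <= 0)
            return cost_delay_ms;       /* too early to estimate a rate */

        rate = pages_done / elapsed_secs;               /* pages per second */
        secs_left = (total_pages - pages_done) / rate;  /* projected finish */

        if (secs_left > secs_until_next_due)
            cost_delay_ms *= secs_until_next_due / secs_left;   /* speed up */

        return fmax(cost_delay_ms, 0.0);    /* already overdue: full speed */
    }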
Similarly, if we notice that tables are coming due for vacuuming faster
than autovacuum workers are processing them, we should try to speed up.
You can imagine trying to solve this problem by teaching autovacuum to
estimate, for every table in the cluster, the time remaining until the
next vacuum needs to start; and the time at which we will be able to
start the next few vacuums based on when we expect to have enough
autovacuum workers available. We could then detect whether or not we're
keeping up, and adjust accordingly. This would allow us, in turn, to
run vacuum with a higher cost delay by default, and let it decide for
itself when it needs to speed up. And if we want vacuum to do
lower-priority tasks, such as setting visibility map bits, we could
teach it to do so only when it's keeping up.

This all seems like a lot of work, but at the moment I'm not seeing
another way to deliver on the suggested goal of "setting visibility map
bits more aggressively". Does anyone else?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Ignore lost+found when checking if a directory is empty
Tom Lane wrote:
> Alvaro Herrera writes:
> > Excerpts from Jeff Davis's message of mar ago 09 16:03:26 -0400 2011:
> >> I think I agree with Peter here that it's not a very good idea, and
> >> I don't see a big upside. With tablespaces it seems to make a little
> >> bit more sense, but I'd still lean away from that idea.
> >
> > What if the init script tries to start postmaster before the
> > filesystems are mounted? ISTM requiring a subdir is a good sanity
> > check that the system is ready to run. Not creating stuff directly
> > on the mountpoint ensures consistency.
>
> I went looking in the archives for previous discussions of this idea.
> Most of them seem to focus on tablespaces rather than the primary data
> directory, but the objections to doing it are pretty much the same.

FYI, the 9.0+ code will create a subdirectory under the tablespace
directory named after the catversion number, and it doesn't check that
the directory is empty, particularly so pg_upgrade can do its magic.
So, I believe lost+found would work in such a case, but again, the
security issues are real.

--
Bruce Momjian    http://momjian.us
EnterpriseDB     http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] VACUUM FULL versus TOAST
On 14.08.2011 01:13, Tom Lane wrote:
> On investigation, this turns out to occur when the planner is trying
> to fetch the value of a toasted attribute in a cached pg_statistic
> tuple, and a concurrent "vacuum full pg_statistic" has just finished.
> The problem of course is that vacuum full reassigned all the toast
> item OIDs in pg_statistic, so the one we have our hands on is no
> longer correct.
>
> In general, *any* access to a potentially toasted attribute value in a
> catcache entry is at risk here. I don't think it's going to be
> feasible, either from a notational or efficiency standpoint, to insist
> that callers always re-lock the source catalog before fetching a
> catcache entry from which we might wish to extract a potentially
> toasted attribute.
>
> I am thinking that the most reasonable solution is instead to fix
> VACUUM FULL/CLUSTER so that they don't change existing toast item OIDs
> when vacuuming a system catalog. They already do some pretty ugly
> things to avoid changing the toast table's OID in this case, and
> locking down the item OIDs too doesn't seem that much harder. (Though
> I've not actually looked at the code yet...)

How about detoasting all datums before caching them? It's surprising
that a datum that is supposedly in a catalog cache actually needs disk
access to use.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
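A sketch of that detoast-before-caching idea, at the point where a
catcache entry gets built. toast_flatten_tuple() does exist in the
tree, but whether this is the right call site, and the variable names,
are assumptions:

    /*
     * Sketch: before storing a fetched tuple in the catcache, replace
     * any out-of-line TOAST pointers with inline values, so later cache
     * hits never have to visit the toast table (whose item OIDs a
     * concurrent VACUUM FULL may have reassigned).
     */
    if (HeapTupleHasExternal(ntp))
        ntp = toast_flatten_tuple(ntp, cache->cc_tupdesc);

The trade-off would be larger catcache entries in exchange for never
needing toast-table access on a cache hit.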
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark wrote:
> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas wrote:
> > and possibly we ought to put them all in a
> > linked list so that the next guy who needs a buffer can just pop one
>
> The whole point of the clock sweep algorithm is to approximate an LRU
> without needing to maintain a linked list. The problem with a linked
> list is that you need to serialize access to it, so every time you
> reference a buffer you need to wait on a lock for the list so you can
> move that buffer around in the list.

Right, but the same doesn't apply to what I'm talking about. You just
put each guy into the linked list when his reference count goes down to
0. When you want to allocate a buffer, you pop somebody off. If his
reference count is no longer 0, you forget about him and pop the next
guy, and keep going until you find a suitable victim. The problem with
just running the clock sweep every time you need a victim buffer is
that you may have to pass over a large number of unevictable buffers
before you find someone you can actually kick out. That's work that
should really be getting done in the background, not at the absolute
last moment when we discover, hey, we need a buffer.

> It does kind of seem like your numbers indicate we're missing part of
> the picture though. The idea with the clock sweep algorithm is that
> you keep approximately 1/nth of the buffers with each of the n values.
> If we're allowing nearly all the buffers to reach a reference count of
> 5 then you're right that we've lost any information about which
> buffers have been referenced most recently.

It doesn't seem right to have 1/nth of the buffers at each of n values,
because that might not match the actual workload. For example, you
might have lightning-hot buffers that fill 50% of your available cache.
Trying to artificially force some of those down to usage count 4 or 3
doesn't actually get you anywhere. In fact, AFAICS, there's no direct
reason to care about how many buffers have usage count 1 vs. usage
count 5. What IS important for performance is that there are enough
with usage count 0. Any other distinction is only useful to the extent
that it allows us to make a better decision about which buffers we
should push down to 0 next.

> I wonder if we're supposed to be doing something like advancing the
> clock hand each time we increment a reference count as well?

That precise thing seems far too expensive, but I agree that
something's missing. Right now we decrease usage counts when the clock
hand moves (which is driven by buffer allocation), and we increase them
when buffers are pinned (which is driven by access to already-resident
buffers). I feel like that's comparing apples and oranges. If some
subset of shared_buffers is very hot relative to the uncached pages,
then everything's going to converge to 5 (or whatever arbitrary maximum
we care to set in lieu of 5).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
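A minimal sketch of the allocation side of that scheme, with invented
helper names; in the real buffer manager the header spinlock would also
have to be taken before the counts could be trusted:

    /*
     * Hypothetical victim selection.  A background task is assumed to
     * push buffers onto a freelist whenever their usage count reaches
     * zero; allocation pops entries and re-verifies them, because a
     * buffer may have been referenced again after it was listed.
     */
    static BufferDesc *
    get_victim_buffer(void)
    {
        BufferDesc *buf;

        while ((buf = freelist_pop()) != NULL)  /* invented helper */
        {
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;             /* still evictable: use it */
            /* otherwise it was touched since being listed; skip it */
        }
        return run_clock_sweep();       /* last-resort inline sweep */
    }

The key property is that contention on the list happens only when
counts cross zero or a buffer is allocated, not on every reference as
in a strict LRU list.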
[HACKERS] VACUUM FULL versus TOAST
So I've gotten things fixed to the point where the regression tests
seem to not fall over when contending with concurrent "vacuum full
pg_class", and now expanded the scope of the testing to all the system
catalogs. What's failing for me now is this chunk in opr_sanity:

*** 209,219 ****
    NOT p1.proisagg AND NOT p2.proisagg AND
    (p1.proargtypes[3] < p2.proargtypes[3])
  ORDER BY 1, 2;
!  proargtypes | proargtypes
! -------------+-------------
!         1114 |        1184
! (1 row)
!
  SELECT DISTINCT p1.proargtypes[4], p2.proargtypes[4]
  FROM pg_proc AS p1, pg_proc AS p2
  WHERE p1.oid != p2.oid AND
--- 209,215 ----
    NOT p1.proisagg AND NOT p2.proisagg AND
    (p1.proargtypes[3] < p2.proargtypes[3])
  ORDER BY 1, 2;
! ERROR:  missing chunk number 0 for toast value 23902886 in pg_toast_2619
  SELECT DISTINCT p1.proargtypes[4], p2.proargtypes[4]
  FROM pg_proc AS p1, pg_proc AS p2
  WHERE p1.oid != p2.oid AND

On investigation, this turns out to occur when the planner is trying to
fetch the value of a toasted attribute in a cached pg_statistic tuple,
and a concurrent "vacuum full pg_statistic" has just finished. The
problem of course is that vacuum full reassigned all the toast item
OIDs in pg_statistic, so the one we have our hands on is no longer
correct.

In general, *any* access to a potentially toasted attribute value in a
catcache entry is at risk here. I don't think it's going to be
feasible, either from a notational or efficiency standpoint, to insist
that callers always re-lock the source catalog before fetching a
catcache entry from which we might wish to extract a potentially
toasted attribute.

I am thinking that the most reasonable solution is instead to fix
VACUUM FULL/CLUSTER so that they don't change existing toast item OIDs
when vacuuming a system catalog. They already do some pretty ugly
things to avoid changing the toast table's OID in this case, and
locking down the item OIDs too doesn't seem that much harder. (Though
I've not actually looked at the code yet...)

The main potential drawback here is that if any varlena items that had
not previously been toasted got toasted, they would require additional
OIDs to be assigned, possibly leading to a duplicate-OID failure. This
should not happen unless somebody decides to play with the attstorage
properties of a system catalog, and I don't feel too bad about a small
possibility of VAC FULL failing after that. (Note it should eventually
succeed if you keep trying, since the generated OIDs would keep
changing.)

Thoughts?

			regards, tom lane
Re: [HACKERS] Further news on Clang - spurious warnings
I realize it's a bit late to jump in here with the patch already having
been committed. But I think there's a point that was missed in the
discussion. One reason to do the test as Tom recommended is that the
warning probably indicates that the test as written was just going to
be optimized away as dead code. I think the cast to unsigned is the
idiom least likely to be optimized away, whereas any of the
formulations based on comparing the enum with enum labels is quite
likely to be.
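To illustrate with a hypothetical enum: when the enum's underlying type
is unsigned, a compiler may warn that half of the label-based range
check is always true and fold that part of the test away, while the
cast form gives it nothing tautological to remove:

    typedef enum { STATE_A, STATE_B, STATE_C } State;   /* hypothetical */

    /* May draw "comparison >= 0 is always true" and be partly folded. */
    int valid_cmp(State s)  { return s >= STATE_A && s <= STATE_C; }

    /* The cast-to-unsigned idiom: one comparison, nothing tautological. */
    int valid_cast(State s) { return (unsigned) s <= STATE_C; }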
Re: [HACKERS] index-only scans
On 13.08.2011 23:35, Kääriäinen Anssi wrote:
> """
> Now, suppose that we know that 50% of the heap pages have their
> visibility map bits set. What's the chance that this query won't need
> a heap fetch? Well, the chance is 50% *if* you assume that a row which
> has been quiescent for a long time is just as likely to be queried as
> one that has been recently inserted or updated. However, in many
> real-world use cases, nothing could be farther from the truth. What do
> we do about that?
> """
>
> The example is much more realistic if the query is a fetch of the N
> latest rows from a table. Very common use case, and the whole
> relation's visibility statistics are completely wrong for that query.

That is somewhat compensated by the fact that tuples that are accessed
more often are also more likely to be in cache. Fetching the heap tuple
to check visibility is very cheap when the tuple is in cache. I'm not
sure how far that compensates it, though. I'm sure there's typically
nevertheless a fairly wide range of pages that have been modified since
the last vacuum, but are not in cache anymore.

> Wouldn't it be great if there was something like pg_stat_statements
> that would know the statistics per query, derived from usage... Even
> if the statistics are not available per query, the statistics could be
> calculated from the relation usage: the weighted visibility of the
> pages would be pages_visible_when_read / total_pages_read for the
> relation. That percentage would minimize the average cost of the plans
> much better than just the non-weighted visibility percentage.
>
> For the above example, if the usage is 90% reads of the N latest rows
> and we assume they are never visible, the weighted visibility
> percentage would be 10% while the non-weighted visibility percentage
> could be 90%. Even if the visibility percentage would be incorrect for
> the queries reading old rows, by definition of the weighted visibility
> percentage there would not be too many of them.
>
> The same idea could of course be used to calculate the effective cache
> hit ratio for each table. Cache hit ratio would have the problem of
> feedback loops, though.

Yeah, I'm not excited about making the planner and statistics more
dynamic. Feedback loops and plan instability are not fun. I think we
should rather be more aggressive in setting the visibility map bits.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas wrote:
> and possibly we ought to put them all in a
> linked list so that the next guy who needs a buffer can just pop one

The whole point of the clock sweep algorithm is to approximate an LRU
without needing to maintain a linked list. The problem with a linked
list is that you need to serialize access to it, so every time you
reference a buffer you need to wait on a lock for the list so you can
move that buffer around in the list.

It does kind of seem like your numbers indicate we're missing part of
the picture though. The idea with the clock sweep algorithm is that you
keep approximately 1/nth of the buffers with each of the n values. If
we're allowing nearly all the buffers to reach a reference count of 5
then you're right that we've lost any information about which buffers
have been referenced most recently.

I wonder if we're supposed to be doing something like advancing the
clock hand each time we increment a reference count as well?

--
greg
Re: [HACKERS] index-only scans
""" Now, suppose that we know that 50% of the heap pages have their visibility map bits set. What's the chance that this query won't need a heap fetch? Well, the chance is 50% *if* you assume that a row which has been quiescent for a long time is just as likely to be queried as one that has been recently inserted or updated. However, in many real-world use cases, nothing could be farther from the truth. What do we do about that? """ The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query. Wouldn't it be great if there was something like pg_stat_statements that would know the statistics per query, derived from usage... Even if the statistics are not available per query, the statistics could be calculated from the relation usage: the weighted visibility of the pages would be pages_visible_when_read / total_pages_read for the relation. That percentage would minimize the average cost of the plans much better than just the non-weighted visibility percentage. For the above example, if the usage is 90% read the N latest rows and we assume they are never visible, the weighted visibility percentage would be 10% while the non-weighted visibility percentage could be 90%. Even if the visibility percentage would be incorrect for the queries reading old rows, by definition of the weighted visibility percentage there would not be too many of them. The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though. Of course, keeping such statistic could be more expensive than the benefit it gives. On the other hand, page hit percentage is already available... - Anssi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql: bogus descriptions displayed by \d+
Robert Haas writes:
> On the other hand, I am also not entirely sure such a change in
> terminology would be a net improvement in clarity, even though it does
> seem better in some cases. For example, the CREATE TABLE command does
> not create a viewed table; nor is there any CREATE VIEWED TABLE
> command. On the flip side, for something like CLUSTER, "ERROR: %s is
> not a base table" does seem like it could be more clear than just
> "ERROR: %s is not a table".
>
> So I'm not sure what to do.

Yeah. Even if the standard is consistent about using "base table"
versus just "table" when they mean a plain table, I do not believe that
that distinction is commonly understood among users. I think people
would tend to think that "base table" means some subset of plain
tables, and would get confused. One fairly likely interpretation would
be that "base table" means the parent table of an inheritance tree, for
instance. If we have to stop and specify what we mean by "base table"
every other time we use the phrase, we have achieved nothing except
pedantry.

On the whole I prefer "table" for plain tables and "relation" for
everything table-ish. We can quibble about whether indexes, say, are
relations within that meaning ... but the PG code and docs have long
used "relation" in the more general meaning, and I doubt that we'll get
far if we try to redefine its meaning now.

			regards, tom lane
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 10:51 PM, Greg Stark wrote:
> On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas wrote:
> > Only 96 of the 14286 buffers in sample_data are in shared_buffers,
> > despite the fact that we have 37,218 *completely unused* buffers
> > lying around. That sucks, because it means that the sample query did
> > a whole lot of buffer eviction that was completely needless. The
> > ring buffer is there to prevent trashing the shared buffer arena,
> > but here it would have been fine to trash the arena: there wasn't
> > anything there we would have minded losing (or, indeed, anything at
> > all).
>
> I don't disagree with the general thrust of your point, but I just
> wanted to point out one case where not using free buffers even though
> they're available might make sense:
>
> If you execute a large batch delete or update or even just set lots of
> hint bits you'll dirty a lot of buffers. The ring buffer forces the
> query that is actually dirtying all these buffers to also do the i/o
> to write them out. Otherwise you leave them behind to slow down other
> queries. This was one of the problems with the old vacuum code which
> the ring buffer replaced. It left behind lots of dirtied buffers for
> other queries to do i/o on.

Interesting point.

After thinking about this some more, I'm coming around to the idea that
we need to distinguish between:

1. Ensuring a sufficient supply of evictable buffers, and
2. Evicting a buffer.

The second obviously needs to be done only when needed, but the first
one should really be done as background work. Currently, the clock
sweep serves both functions, and that's not good. We shouldn't ever let
ourselves get to the point where there are no buffers at all with
reference count zero, so that the next guy who needs a buffer has to
spin the clock hand around until the reference counts get low enough.
Maintaining a sufficient supply of refcount-zero buffers should be done
as a background task; and possibly we ought to put them all in a linked
list so that the next guy who needs a buffer can just pop one off.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] psql: bogus descriptions displayed by \d+
On Sat, Aug 13, 2011 at 1:56 PM, Peter Eisentraut wrote:
> On fre, 2011-08-12 at 16:14 -0400, Robert Haas wrote:
> > On Fri, Aug 12, 2011 at 4:11 PM, Peter Eisentraut wrote:
> > > "A table is either a base table, a derived table, a transient
> > > table, or a viewed table." (SQL/MED adds "foreign table".)
> > >
> > > Just FYI.
> >
> > Base table seems clear enough, and a transient table sounds like a
> > temporary table, but what is a derived table? Is a viewed table a
> > view?
>
> A base table is either a permanent base table or (one of various kinds
> of) a temporary base table. A derived table is the result of a table
> expression, so this is more of a notional syntactic term. A transient
> table is, roughly speaking, OLD or NEW in a trigger. A viewed table is
> a view.

Ah. Well, I guess I'm somewhat open to the idea of making all the
places where we mean, specifically, a table say "base table". Then we
could substitute the term "table" where we now say "relation". On the
other hand, I am also not entirely sure such a change in terminology
would be a net improvement in clarity, even though it does seem better
in some cases. For example, the CREATE TABLE command does not create a
viewed table; nor is there any CREATE VIEWED TABLE command. On the flip
side, for something like CLUSTER, "ERROR: %s is not a base table" does
seem like it could be more clear than just "ERROR: %s is not a table".

So I'm not sure what to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] index-only scans
On Fri, Aug 12, 2011 at 5:39 PM, Kevin Grittner wrote:
> Robert Haas wrote:
> > That's one of the points I asked for feedback on in my original
> > email. "How should the costing be done?"
>
> It seems pretty clear that there should be some cost adjustment. If
> you can't get good numbers somehow on what fraction of the heap
> accesses will be needed, I would suggest using a magic number based on
> the assumption that half the heap accesses otherwise necessary will be
> done. It wouldn't be the worst magic number in the planner. Of course,
> real numbers are always better if you can get them.

It wouldn't be that difficult (I think) to make VACUUM and/or ANALYZE
gather some statistics; what I'm worried about is that we'd have
correlation problems. Consider a wide table with an index on
(id, name), and the query:

SELECT name FROM tab WHERE id = 12345

Now, suppose that we know that 50% of the heap pages have their
visibility map bits set. What's the chance that this query won't need a
heap fetch? Well, the chance is 50% *if* you assume that a row which
has been quiescent for a long time is just as likely to be queried as
one that has been recently inserted or updated. However, in many
real-world use cases, nothing could be farther from the truth. What do
we do about that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
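A hedged sketch of how such a discount might enter the costing, using
Kevin's suggested 0.5 as the fallback; every name here is invented, not
the planner's actual code:

    /*
     * Hypothetical index-only-scan cost adjustment: charge for heap
     * fetches only on the fraction of accesses expected to land on
     * pages whose visibility map bit is unset.  With no statistics,
     * fall back to assuming half the heap fetches are avoided.
     */
    double  heap_fetch_frac = have_visibility_estimate
        ? (1.0 - frac_pages_all_visible)
        : 0.5;

    run_cost += heap_fetch_frac * tuples_fetched *
                (spc_random_page_cost + cpu_tuple_cost);

The correlation worry above is precisely that frac_pages_all_visible,
measured over the whole relation, may badly misestimate the fraction
for the pages a given query actually touches.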
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
On Sat, Aug 13, 2011 at 12:18 PM, Tom Lane wrote:
> I wrote:
> > Yeah. Also, to my mind this is only a fix that will be used in 9.0
> > and 9.1 --- now that it's occurred to me that we could use tuple
> > xmin/xmax to invalidate catcaches instead of recording individual
> > TIDs, I'm excited about doing that instead for 9.2 and beyond. I
> > believe that that could result in a significant reduction in sinval
> > traffic, which would be a considerable performance win.
>
> On closer inspection this idea doesn't seem workable. I was imagining
> that after a transaction commits, we could find obsoleted catcache
> entries by looking for tuples with xmax equal to the transaction's
> XID. But a catcache entry made before the transaction had done the
> update wouldn't contain the correct xmax, so we'd fail to remove it.
> The only apparent way to fix that would be to go out to disk and
> consult the current on-disk xmax, which would hardly be any cheaper
> than just dropping the cache entry and then reloading it when/if
> needed.
>
> All is not entirely lost, however: there's still some possible
> performance benefit to be gained here, if we go to the scheme of
> identifying victim catcache entries by hashvalue only. Currently, each
> heap_update in a cached catalog has to issue two sinval messages (per
> cache!): one against the old TID and one against the new TID. We'd be
> able to reduce that to one message in the common case where the
> hashvalue remains the same because the cache key columns didn't
> change.

Cool.

> Another thing we could consider doing, if one-in-2^32 hash collisions
> seems too high, is widening the hash values to 64 bits. I'm not
> terribly excited about that, because a lot of the caches are on OID
> columns for which there'd be zero benefit.

Yeah, and I can't get excited about the increased memory usage, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] psql: bogus descriptions displayed by \d+
On fre, 2011-08-12 at 16:14 -0400, Robert Haas wrote:
> On Fri, Aug 12, 2011 at 4:11 PM, Peter Eisentraut wrote:
> > "A table is either a base table, a derived table, a transient table,
> > or a viewed table." (SQL/MED adds "foreign table".)
> >
> > Just FYI.
>
> Base table seems clear enough, and a transient table sounds like a
> temporary table, but what is a derived table? Is a viewed table a
> view?

A base table is either a permanent base table or (one of various kinds
of) a temporary base table. A derived table is the result of a table
expression, so this is more of a notional syntactic term. A transient
table is, roughly speaking, OLD or NEW in a trigger. A viewed table is
a view.
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
I wrote:
> Yeah. Also, to my mind this is only a fix that will be used in 9.0 and
> 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax
> to invalidate catcaches instead of recording individual TIDs, I'm
> excited about doing that instead for 9.2 and beyond. I believe that
> that could result in a significant reduction in sinval traffic, which
> would be a considerable performance win.

On closer inspection this idea doesn't seem workable. I was imagining
that after a transaction commits, we could find obsoleted catcache
entries by looking for tuples with xmax equal to the transaction's XID.
But a catcache entry made before the transaction had done the update
wouldn't contain the correct xmax, so we'd fail to remove it. The only
apparent way to fix that would be to go out to disk and consult the
current on-disk xmax, which would hardly be any cheaper than just
dropping the cache entry and then reloading it when/if needed.

All is not entirely lost, however: there's still some possible
performance benefit to be gained here, if we go to the scheme of
identifying victim catcache entries by hashvalue only. Currently, each
heap_update in a cached catalog has to issue two sinval messages (per
cache!): one against the old TID and one against the new TID. We'd be
able to reduce that to one message in the common case where the
hashvalue remains the same because the cache key columns didn't change.

Another thing we could consider doing, if one-in-2^32 hash collisions
seems too high, is widening the hash values to 64 bits. I'm not
terribly excited about that, because a lot of the caches are on OID
columns for which there'd be zero benefit.

			regards, tom lane
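A sketch of that single-message case; the function names resemble the
real catcache/inval routines, but the exact signatures and call
sequence here are assumptions:

    /*
     * On heap_update of a cached catalog row: if the cache key columns
     * did not change, the old and new tuples hash to the same value,
     * so one sinval message covers entries for both tuple versions.
     */
    uint32  oldhash = CatalogCacheComputeTupleHashValue(cache, oldtup);
    uint32  newhash = CatalogCacheComputeTupleHashValue(cache, newtup);

    RegisterCatcacheInvalidation(cache->id, oldhash, dbid);
    if (newhash != oldhash)
        RegisterCatcacheInvalidation(cache->id, newhash, dbid);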
Re: [HACKERS] Inserting heap tuples in bulk in COPY
Heikki Linnakangas writes:
> The patch is WIP, mainly because I didn't write the WAL replay
> routines yet, but please let me know if you see any issues.

Why do you need new WAL replay routines? Can't you just use the
existing XLOG_HEAP_NEWPAGE support? By and large, I think we should be
avoiding special-purpose WAL entries as much as possible.

Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation. That will likely have bad
consequences for the performance of every other operation.

			regards, tom lane
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
Simon Riggs writes:
> With HOT, there is very little need to perform a VACUUM FULL on any
> shared catalog table. Look at the indexes...

This is not really a useful argument. People do do VAC FULL on
catalogs, whether we think they should or not. Also, it's not only
"shared" catalogs that are at stake. We could avoid fixing the bug if
we forbade both VAC FULL and CLUSTER on all system catalogs, period, no
exceptions. That doesn't seem like a workable answer though. Some usage
patterns do see bloat on the catalogs, especially if you disable or
dial down autovacuum.

> In the unlikely event we do actually have to VACUUM FULL a shared
> catalog table, nuke any cache entry for the whole shared catalog.

You're still not getting the point. We *do* nuke all cache entries
after a VAC FULL. That does not avoid this bug.

			regards, tom lane
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On 12 August 2011 23:19, Heikki Linnakangas wrote:
> Triggers complicate this. I believe it is only safe to group tuples
> together like this if the table has no triggers. A BEFORE ROW trigger
> might run a SELECT on the table being copied to, and check if some of
> the tuples we're about to insert exist. If we run BEFORE ROW triggers
> for a bunch of tuples first, and only then insert them, none of the
> trigger invocations will see the other rows as inserted yet.
> Similarly, if we run AFTER ROW triggers after inserting a bunch of
> tuples, the trigger for each of the insertions would see all the
> inserted rows. So at least for now, the patch simply falls back to
> inserting one row at a time if there are any triggers on the table.

I guess DEFAULT values could also suffer from a similar problem to
BEFORE ROW triggers, though (in theory a DEFAULT could be based on a
function that SELECTs from the table being copied to).

Regards,
Dean
Re: [HACKERS] PL/Perl Returned Array
On fre, 2011-08-12 at 20:09 -0700, Darren Duncan wrote:
> David E. Wheeler wrote:
> > On Aug 12, 2011, at 6:17 PM, Alex Hunsaker wrote:
> > > Anyway, the attached patch fixes it for me. That is, when we don't
> > > have an array state, just return an empty array. (Also adds some
> > > additional comments)
> >
> > Fix confirmed, thank you!
> >
> > +1 to getting this committed before the next beta/RC.
>
> Policy question. If this problem hadn't been detected until after 9.1
> was declared production, would it have been considered a bug to fix in
> 9.1.x or would it have been delayed to 9.2 since that fix might be
> considered an incompatibility?

Surely this would be a bug fix.