Re: [HACKERS] WIP: Fast GiST index build
On 11.08.2011 23:30, Alexander Korotkov wrote: On Thu, Aug 11, 2011 at 2:28 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 10.08.2011 22:44, Alexander Korotkov wrote: Manual and readme updates. Thanks, I'm reviewing these now. Do we want to expose the level-step and buffersize parameters to users? They've been useful during testing, but I'm thinking we should be able to guess good enough values for them automatically, and just remove the options. It's pretty much impossible for a user to tune them correctly; it would require deep knowledge of the buffering algorithm. I'm thinking that even when you explicitly turn buffering on, we should still process the first 1 or so tuples with simple inserts. That way we always have a sample of tuples to calculate the average tuple size from. It's plausible that if the input data is ordered, looking at the first N tuples will give a skewed sample, but I don't think there's much danger of that in practice. Even if the data is ordered, the length of GiST tuples shouldn't vary much. What happens if we get the levelstep and pagesPerBuffer estimates wrong? How sensitive is the algorithm to that? Or will we run out of memory? Would it be feasible to adjust those in the middle of the index build, if we e.g. greatly exceed the estimated memory usage?

I see the following risks.

For levelstep:
Too small: less IO benefit than there could be.
Too large: 1) If a sub-tree doesn't fit in effective_cache_size, much more IO than there should be (because of cache misses during buffer emptying). 2) If the last pages of buffers don't fit in maintenance_work_mem, possible OOM.

Hmm, we could avoid running out of memory if we used an LRU cache replacement policy on the buffer pages, instead of explicitly unloading the buffers. 1) would still apply, though.

For buffersize:
Too small: less IO benefit, because the buffer size is relatively small in comparison with the sub-tree size.
Too large: greater CPU overhead (because of more penalty calls) than is needed for the same IO benefit.

Thereby I propose the following.
1) A too-large levelstep is the greatest risk. Let's use a pessimistic estimate for it. The pessimistic estimate has the following logic: the largest sub-tree arises from the maximal number of tuples per page, which arises from the minimal tuple size. Thereby, by always using the minimal tuple size in the levelstep calculation, we exclude the greatest risks.
2) The risks of buffersize are comparable and not too critical. That's why I propose to use the size of the first 1 tuples for the estimate.

Yep, sounds reasonable. I think it would also be fairly simple to decrease levelstep and/or adjust buffersize on-the-fly. The trick would be in figuring out the heuristics on when to do that. Another thing occurred to me while looking at the buffer emptying process: At the moment, we stop emptying after we've flushed 1/2 buffer size worth of tuples. The point of that is to avoid overfilling a lower-level buffer, in the case that the tuples we emptied all landed on the same lower-level buffer. Wouldn't it be fairly simple to detect that case explicitly, and stop the emptying process only if one of the lower-level buffers really fills up? That should be more efficient, as you would have to swap between different subtrees less often. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
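The pessimistic estimate Alexander describes can be sketched roughly as follows. This is an illustrative model, not code from the patch; the function and parameter names are invented, and it ignores the space taken by upper tree levels.

```c
#include <assert.h>

/*
 * Illustrative model of the pessimistic levelstep estimate: assume every
 * page is packed with minimal-size tuples, which maximises the fanout and
 * therefore the size of a sub-tree of a given depth; then pick the
 * largest levelstep whose worst-case sub-tree still fits in the cache.
 */
static int pessimistic_levelstep(double cache_size, double page_size,
                                 double page_free_space, double min_tuple_size)
{
    double fanout = page_free_space / min_tuple_size;  /* worst-case fanout */
    double subtree_pages = 1.0;
    int    levelstep = 0;

    /* grow the sub-tree one level at a time until it would overflow the cache */
    while (subtree_pages * fanout * page_size <= cache_size)
    {
        subtree_pages *= fanout;
        levelstep++;
    }
    return levelstep < 1 ? 1 : levelstep;
}
```

With a 1 GB cache, 8 kB pages with ~8000 usable bytes, and 40-byte minimal tuples, this yields levelstep 2; plugging in a larger "average" tuple size would give a smaller fanout and thus a larger, riskier levelstep, which is exactly the risk the minimal-size assumption avoids.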
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote: On the other hand, the buffer manager has *no problem at all* trashing the buffer arena if we're faulting in pages for an index scan rather than a sequential scan. If you manage to get all of sample_data into memory (by running many copies of the above query in parallel, you can get each one to allocate its own ring buffer, and eventually pull in all the pages), and then run some query that probes an index which is too large to fit in shared_buffers, it cheerfully blows the whole sample_data table out without a second thought. Had you sequentially scanned a big table, of course, there would be some protection, but an index scan can stomp all over everything with complete impunity.

That's a good observation, and I think we should do this:
* Make an IndexScan use a ring buffer once it has used 32 blocks. The vast majority won't do that, so we avoid overhead on the common path.
* Make a BitmapIndexScan use a ring buffer when we know that the index is larger than 32 blocks. (Ignore the upper parts of the tree for that calc.)

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
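Simon's first bullet amounts to a per-scan counter with a one-way switch. A minimal sketch (the struct and names are invented for illustration, not patch code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the proposed heuristic: an index scan starts on the normal
 * shared-buffers path and switches to a small ring buffer only once it
 * has touched more than 32 blocks, so short scans pay no extra cost.
 */
#define INDEXSCAN_RING_THRESHOLD 32

typedef struct
{
    int  blocks_used;   /* blocks this scan has touched so far */
    bool use_ring;      /* true once the scan switched to a ring buffer */
} IndexScanCache;

/* Call once per block touched; returns true once reads should go
 * through a ring buffer. */
static bool indexscan_note_block(IndexScanCache *scan)
{
    if (!scan->use_ring && ++scan->blocks_used > INDEXSCAN_RING_THRESHOLD)
        scan->use_ring = true;
    return scan->use_ring;
}
```

The switch is one-way by design: once a scan has proven itself large, it stays on the ring buffer for its remaining lifetime.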
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
 usagecount |  sum
------------+-------
          1 |   136
          2 |    12
          3 |    72
          4 |     7
          5 | 13755
            | 37218
(6 rows)

Only 96 of the 14286 buffers in sample_data are in shared_buffers, despite the fact that we have 37,218 *completely unused* buffers lying around. That sucks, because it means that the sample query did a whole lot of buffer eviction that was completely needless. The ring buffer is there to prevent trashing the shared buffer arena, but here it would have been fine to trash the arena: there wasn't anything there we would have minded losing (or, indeed, anything at all).

You're missing an important point. The SeqScan is measurably faster when using the ring buffer because of the effects of L2 caching on the buffers. Also, your logic is a little off, since you're doing the scan on an otherwise quiet machine soon after startup and there are some completely unused buffers. That isn't the typical state of the buffer cache, and so spoiling the cache is not acceptable in the general case. The above case doesn't suck, because it carefully reserved parts of shared buffers for other users when spoiling the cache would be of no benefit to the user, so it worked great from my perspective. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] plpython crash
On 11/08/11 18:01, Jean-Baptiste Quenot wrote: Hi there, plpython crashes on me on various 64-bit Ubuntu hosts, see the gdb backtrace at: https://gist.github.com/1140005 Do you believe there were recent bugfixes regarding PLyMapping_ToTuple()? This is PG 9.0.4 with HEAD of plpython taken in March 2011 and backported. Please tell me if you need more information. Hi, there were no changes to that area of plpython after March 2011. Could you try to see if the error also appears if you run your app with current PostgreSQL HEAD (both the server and plpython)? Which Python version is that? You can get that info by running: do $$ import sys; plpy.info(sys.version) $$ language plpythonu; Could you try to extract a self-contained example of how to reproduce it? If the bug only appears under your application's specific workload, perhaps you could try running it with Postgres compiled with -O0, since compiling with -O2 leaves the gdb backtrace with optimised-out values and missing inlined functions. Cheers, Jan
Re: [HACKERS] index-only scans
2011/8/12 Robert Haas robertmh...@gmail.com: On Thu, Aug 11, 2011 at 5:39 PM, Cédric Villemain cedric.villemain.deb...@gmail.com wrote: 2011/8/11 Robert Haas robertmh...@gmail.com: Please find attached a patch implementing a basic version of index-only scans. This patch is the work of my colleague Ibrar Ahmed and myself, and also incorporates some code from previous patches posted by Heikki Linnakangas. Great! Glad you like it...! Can this faux heap tuple be appended with the data from another index once it has been created? (if the query involves those 2 indexes) I don't see how to make that work. In general, a query like SELECT a, b FROM foo WHERE a = 1 AND b = 1 can only use both indexes if we use a bitmap index scan on each followed by a bitmapand and then a bitmap heap scan. However, this patch only touches the index-scan path, which only knows how to use one index for any given query. I thought of something like that: 'select a,b from foo where a=1 order by b limit 100' (or: where a=1 and b > now() ) Actually, I can see a possible way to allow an index-only type optimization to be used for bitmap scans. As you scan the index, any tuples that can be handled index-only get returned immediately; the rest are thrown into a bitmap. Once you're done examining the index, you then do a bitmap heap scan to get the tuples that couldn't be handled index-only. This seems like it might be our best hope for a fast count(*) type optimization, especially if you could combine it with some method of scanning the index in physical order rather than logical order. IIRC we exposed some ideas around that, yes (optimizing bitmap). Maybe a question that will tell me more about the feature's limitations (if any): will an index-only scan, when the table has no vismap bits set, cost more (in duration, IO, ...) than a normal index scan? But even that trick would only work with a single index. 
I'm not sure there's any good way to assemble pieces of tuples from two different indexes, at least not efficiently. okay. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
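Robert's hybrid idea above — return index-only tuples immediately, throw the rest into a bitmap for a later heap pass — can be modelled very roughly like this. The types, the visibility-map array, and the "bitmap" (just an array of page numbers here) are all stand-ins for illustration, not the actual executor machinery:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the hybrid scan: index entries whose heap page is marked
 * all-visible can be answered from the index alone; the rest are
 * collected for a later bitmap heap pass.
 */
typedef struct
{
    int page;   /* heap page this index entry points at */
} IndexEntry;

static int hybrid_scan(const IndexEntry *entries, size_t n,
                       const bool *vismap,
                       int *bitmap_pages, size_t *bitmap_n)
{
    int index_only = 0;

    *bitmap_n = 0;
    for (size_t i = 0; i < n; i++)
    {
        if (vismap[entries[i].page])
            index_only++;   /* safe to return straight from the index */
        else
            bitmap_pages[(*bitmap_n)++] = entries[i].page;  /* needs heap visit */
    }
    return index_only;
}
```

In the real system the second phase would be an ordinary bitmap heap scan over the collected pages; the point of the model is only that one index pass partitions the entries into "answerable now" and "needs the heap".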
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote: The general problem here is that we are not very smart about handling workloads with weak locality - i.e. the working set is larger than shared buffers. If the working set fits in shared_buffers, we will keep it there, and it will be strongly protected against eviction. But if the working set *doesn't* fit in shared buffers, then we fall back on some not-very-smart heuristics to decide what to do: keep pages involved in index scans, ditch those involved in sequential scans. It seems to me that we need a smarter algorithm. It seems that NetBSD has adopted the use of Clock-Pro, and there are some discussions of it being used in Linux as well, though I'm not sure whether that's actually (still?) the case. Paper is here, although I find the description of the algorithm to be approximately clear as mud: http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf One of the important notions which many of the algorithms in the literature seem to hit on in one way or another is that you mustn't go and decide that all of your buffers - or nearly all your buffers - are hot. That's really equivalent to saying that none of your buffers are hot; and it's also pretty easy to see that such a scenario would be disastrous for our current algorithm, which - if everything in shared_buffers has usage-count 5 all the time - will have to clock sweep a long way each time it needs to allocate a buffer. In fact, the more of your buffers want to be hot (and thus hard to evict), the fewer of them should actually be allowed to acquire that status. This is something I've been actively working on. I was on the trail of possible reasons that increasing the size of shared buffers would slow down PostgreSQL. The worst case behaviour of the current freelist code is that it can take up to 5 * shared_buffers checks before identifying a victim buffer. 
That occurs when we have a working set exactly matching the size of shared buffers. There are problems when large portions of shared buffers are very popular, so that the free-ish buffers are a small proportion of the total buffers - this case gets worse if the distribution of buffer allocations is non-random so you get say 10 free-ish blocks together, followed by N-10 very popular ones. That makes every 11th freelist request take ~O(shared_buffers). These problems will show themselves as sporadic BufFreelistLock contention. Sporadic is the problem here since it may make the contention hard to measure in aggregate, yet with real impact on performance. For us to benefit from increased shared_buffers we need to have an algorithm that is O(k) in its worst case behaviour and O(1) in its best case behaviour - the latter aspect is provided by a correctly functioning bgwriter. At the moment, we do nothing to enforce O(k) behaviour. The following patch implements O(k) behaviour using a heuristic limit. My first attempt was a fixed heuristic limit, but that was less suitable because there are actually 2 requirements. The value of k must be small to have a measurable impact on the smoothness of StrategyGetBuffer(), but k must also be fairly large to ensure the choice of victim is a relatively good one. So the way to balance those conflicting objectives is to have a progressively increasing search window. And just to repeat, k is not a % of shared buffers, which would still give O(N) behaviour. I've not been reading the literature, given the problems we had in 2004/5 regarding patents in this area. I also think that since we rely on the underlying filesystem for caching, we don't have exactly the same problem as other systems. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services freelist_ok.v2.patch Description: Binary data
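The O(k) bound Simon describes can be illustrated with a toy clock sweep. This sketch uses invented names and shows only the fixed-window core; the actual freelist_ok.v2.patch additionally grows the search window progressively, which is omitted here. The idea: bound the sweep at `window` buffers, and if no usage-count-zero buffer turns up within the window, settle for the least-used candidate seen so far.

```c
#include <assert.h>

/*
 * Toy clock sweep with a bounded search window.  Instead of sweeping
 * until a buffer with usage_count == 0 is found (worst case
 * O(shared_buffers)), give up after `window` buffers and evict the best
 * candidate seen, keeping the worst case at O(k).
 */
static int sweep_victim(const int *usage_count, int nbuffers,
                        int start, int window)
{
    int best = start;
    int best_usage = usage_count[start];
    int i = start;

    for (int scanned = 0; scanned < nbuffers; scanned++)
    {
        if (usage_count[i] == 0)
            return i;               /* ideal victim: take it at once */
        if (usage_count[i] < best_usage)
        {
            best = i;
            best_usage = usage_count[i];
        }
        if (scanned + 1 >= window)
            return best;            /* window exhausted: settle */
        i = (i + 1) % nbuffers;
    }
    return best;
}
```

The trade-off Simon names is visible directly: a small window keeps each StrategyGetBuffer() call smooth, while a larger one improves the quality of the victim it picks.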
Re: [HACKERS] WIP: Fast GiST index build
On Fri, Aug 12, 2011 at 12:23 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: I think it would also be fairly simple to decrease levelstep and/or adjust buffersize on-the-fly. The trick would be in figuring out the heuristics on when to do that. It would be simple to decrease levelstep to one of its divisors. It seems quite hard to decrease it otherwise, for example, from 3 to 2. Also, it's pretty hard to detect that a sub-tree actually doesn't fit in the cache. I don't see much difficulty in tuning buffersize at runtime. Another thing occurred to me while looking at the buffer emptying process: At the moment, we stop emptying after we've flushed 1/2 buffer size worth of tuples. The point of that is to avoid overfilling a lower-level buffer, in the case that the tuples we emptied all landed on the same lower-level buffer. Wouldn't it be fairly simple to detect that case explicitly, and stop the emptying process only if one of the lower-level buffers really fills up? That should be more efficient, as you would have to swap between different subtrees less often. Yes, it seems reasonable to me. -- With best regards, Alexander Korotkov.
[HACKERS] bgwriter and checkpoints
The bgwriter has been responsible for two main activities: incremental page cleaning and checkpointing. We've worked out the code to smooth checkpointing, but that now means we have periods where we do both page cleaning and checkpointing, and other times when we do just page cleaning. That means the processing loops are complex, and that makes latch-based processing more difficult to introduce. Moreover, when we perform fsyncs at the end of a checkpoint we stop doing page cleaning. I propose to introduce a new process dedicated to checkpointing, leaving the bgwriter process to perform page cleaning, if that is requested. Note that on smaller systems the bgwriter process would not even be required to exist, if page cleaning was not used. This will simplify the code, so that checkpointing happens fairly cleanly - and can then be extended in other ways. The checkpointer would be the important process from a shutdown perspective. It will also simplify the bgwriter, potentially allowing us to consider having it perform other tasks, such as HOT vacuuming or hint bit setting. Nothing is added in the first version, but this clears the way to allow these things to be considered and prototyped. So the likelihood is this change will improve performance in itself on large systems, but the main benefit is code simplicity, which is required for a range of potential futures. Are there objections to this work beginning? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] bgwriter and checkpoints
2011/8/12 Simon Riggs si...@2ndquadrant.com: The bgwriter has been responsible for two main activities: incremental page cleaning and checkpointing. We've worked out the code to smooth checkpointing, but that now means we have periods where we do both page cleaning and checkpointing, and other times when we do just page cleaning. That means the processing loops are complex and that makes latch based processing more difficult to introduce. Moreover, when we perform fsyncs at the end of checkpoint we stop doing pagecleaning. I propose to introduce a new process dedicated to checkpointing, leaving the bgwriter process to perform pagecleaning, if that is requested. Note that on smaller systems the bgwriter process would not even be required to exist, if page cleaning was not used. This will simplify the code, so that checkpoint happens fairly cleanly - and can then be extended in other ways. The checkpointer would be the important process from a shutdown perspective. It will also simplify the bgwriter, potentially allowing us to consider that it performs other tasks, such as HOT vacuuming or hint bit setting. Nothing added in first version, but this clears the way to allow these things to be considered and prototyped. So the likelihood is this change will improve performance of itself in large systems, but the main benefit is code simplicity which is required for a range of potential futures. Are there objections to this work beginning? No objections. (+100 for doing that) -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpython crash
Here is the same with -O0: https://gist.github.com/1140005 sys.version reports this: INFO: 2.6.6 (r266:84292, Sep 15 2010, 16:41:53) [GCC 4.4.5] -- Jean-Baptiste Quenot
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 11:53 AM, Simon Riggs si...@2ndquadrant.com wrote: I've not been reading the literature, given the problems we had in 2004/5 regarding patents in this area. I also think that since we rely on the underlying filesystem for caching, we don't have exactly the same problem as other systems. Reading the link you gave, I'm unimpressed by ClockPro. The measurements made are all in low numbers of MB, and the results show that Clock is roughly the same or better for large caches, though one might read them multiple ways. The zone of interest for us is above 1GB and the issues we care about are contention, so even if we think ClockPro has a chance of improving things, we really don't have any measurements that support the cases we care about. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] sha1, sha2 functions into core?
On Thu, Aug 11, 2011 at 5:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: Marko Kreen mark...@gmail.com writes: On Wed, Aug 10, 2011 at 9:19 PM, Tom Lane t...@sss.pgh.pa.us wrote: ... which this approach would create, because digest() isn't restricted to just those algorithms. I think it'd be better to just invent two new functions, which also avoids issues for applications that currently expect the digest functions to be installed in pgcrypto's schema. I would suggest digest() with a fixed list of algorithms: md5, sha1, sha2. The uncommon/obsolete algorithms that can be used from digest() if compiled with openssl are not something we need to worry over. In fact we have never supported them, as no testing has been done. Hmm ... they may be untested by us, but I feel sure that if we remove that functionality from pgcrypto, *somebody* is gonna complain. Well, if you are worried about that, you can duplicate current pgcrypto behaviour - if postgres is compiled against openssl, you get whatever algorithms are available in that particular version of openssl. My point was that giving such an open-ended list of algorithms was a bad idea, but there is no problem keeping the old behaviour. I don't see anything much wrong with sha1(bytea/text) -> bytea. There's no law that says it has to work exactly like md5() does. The problem is that the list of must-have algorithms is getting quite long: md5, sha1, sha224, sha256, sha384, sha512, + at least 4 from the upcoming sha3. Another problem is that generic hashes are a bad way to store passwords - identical passwords will look identical, and it's easy to brute-force passwords as the algorithms are very fast. So the question is: is there an actual *good* reason to add each algorithm separately, without a uniform API to core functions? If the user requests are about storing passwords, and we want to make that easier, then we should import crypt() also, as that is the secure way for password storage. Then md5(), md5_b() plus a bunch of sha-s will look silly. 
-- marko
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote: You're missing an important point. The SeqScan is measurably faster when using the ring buffer because of the effects of L2 cacheing on the buffers. I hadn't thought of that, but I think that's only true if the relation won't fit in shared_buffers (or whatever portion of shared_buffers is reasonably available, given the other activity on the machine). In this particular case, it's almost 20% faster if the relation is all in shared_buffers; I tested it. I think what's going on here is that initscan() has a heuristic that tries to use a BufferAccessStrategy if the relation is larger than 1/4 of shared_buffers. That's probably a pretty good heuristic in general, but in this case I have a relation which just so happens to be 27.9% of shared_buffers but will still fit. As you say below, that may not be typical in real life, although there are probably data warehousing systems where it's normal to have only one big query running at a time. Also, your logic is a little off, since you're doing the scan on an otherwise quiet machine soon after startup and there are some completely unused buffers. That isn't the typical state of the buffer cache and so spoiling the cache is not acceptable in the general case. The above case doesn't suck because it carefully reserved parts of shared buffers for other users when spoiling the cache would be of no benefit to the user, so it worked great from my perspective. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
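The heuristic Robert refers to amounts to roughly the following check (a sketch, not the exact initscan() source; NBUFFERS here is an illustrative stand-in for the NBuffers global, 16384 pages = 128MB of shared_buffers):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the initscan() heuristic under discussion: a sequential scan
 * uses a bulk-read BufferAccessStrategy (a small ring buffer) only when
 * the relation is bigger than a quarter of shared_buffers.
 */
#define NBUFFERS 16384

static bool use_bulkread_strategy(int rel_nblocks)
{
    return rel_nblocks > NBUFFERS / 4;
}
```

Robert's 27.9%-of-shared_buffers relation trips this threshold even though it would have fit entirely in cache, which is exactly the edge case being debated.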
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 4:36 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote: On the other hand, the buffer manager has *no problem at all* trashing the buffer arena if we're faulting in pages for an index scan rather than a sequential scan. If you manage to get all of sample_data into memory (by running many copies of the above query in parallel, you can get each one to allocate its own ring buffer, and eventually pull in all the pages), and then run some query that probes an index which is too large to fit in shared_buffers, it cheerfully blows the whole sample_data table out without a second thought. Had you sequentially scanned a big table, of course, there would be some protection, but an index scan can stomp all over everything with complete impunity. That's a good observation and I think we should do this * Make an IndexScan use a ring buffer once it has used 32 blocks. The vast majority won't do that, so we avoid overhead on the common path. * Make an BitmapIndexScan use a ring buffer when we know that the index is larger than 32 blocks. (Ignore upper parts of tree for that calc). We'd need to think about what happens to the internal pages of the btree, the leaf pages, and then the heap pages from the underlying relation; those probably shouldn't all be treated the same. Also, I think the tricky part is figuring out when to apply the optimization in the first place. Once we decide we need a ring buffer, a very small one (like 32 blocks) is probably the way to go. But it will be a loser to apply the optimization to data sets that would otherwise have fit in shared_buffers. This is a classic case of the LRU/MRU problem. You want to evict buffers in LRU fashion until the working set gets larger than you can cache; and then you want to switch to MRU to avoid uselessly caching pages that you'll never manage to revisit before they're evicted. 
The point of algorithms like Clock-Pro is to try to have the system work that out on the fly, based on the actual workload, rather than using heuristics. I agree with you there's no convincing evidence that Clock-Pro would be better for us; I mostly thought it was interesting because it seems that the NetBSD and Linux guys find it interesting, and they're managing much larger caches than the ones we're dealing with. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote: You're missing an important point. The SeqScan is measurably faster when using the ring buffer because of the effects of L2 caching on the buffers. I hadn't thought of that, but I think that's only true if the relation won't fit in shared_buffers (or whatever portion of shared_buffers is reasonably available, given the other activity on the machine). In this particular case, it's almost 20% faster if the relation is all in shared_buffers; I tested it. I think what's going on here is that initscan() has a heuristic that tries to use a BufferAccessStrategy if the relation is larger than 1/4 of shared_buffers. That's probably a pretty good heuristic in general, but in this case I have a relation which just so happens to be 27.9% of shared_buffers but will still fit. As you say below, that may not be typical in real life, although there are probably data warehousing systems where it's normal to have only one big query running at a time.

I think there are reasonable arguments to make:
* prefer_cache = off (default) | on: a table-level storage parameter; =on will disable the use of a BufferAccessStrategy
* make cache_spoil_threshold a parameter, with default 0.25

Considering the world of very large RAMs in which we now live, some control of the above makes sense. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 6:53 AM, Simon Riggs si...@2ndquadrant.com wrote: The worst case behaviour of the current freelist code is that it can take up to 5 * shared_buffers checks before identifying a victim buffer. That occurs when we have a working set exactly matching the size of shared buffers. There are problems when large portions of shared buffers are very popular, so that the free-ish buffers are a small proportion of the total buffers - this case gets worse if the distribution of buffer allocations is non-random so you get say 10 free-ish blocks together, followed by N-10 very popular ones. That makes every 11th freelist request take ~O(shared_buffers). These problems will show themselves as sporadic BufFreelistLock contention. Sporadic is the problem here since it may make the contention hard to measure in aggregate, yet with real impact on performance. I've been thinking about this exact same set of problems. For us to benefit from increased shared_buffers we need to have an algorithm that is O(k) in its worst case behaviour and O(1) in its best case behaviour - the latter aspect is provided by a correctly functioning bgwriter. At the moment, we do nothing to enforce O(k) behaviour. Check. The following patch implements O(k) behaviour using a heuristic limit. My first attempt was a fixed heuristic limit, but that was less suitable because there are actually 2 requirements. The value of k must be small to have a measurable impact on the smoothness of StrategyGetBuffer(), but k must also be fairly large to ensure the choice of victim is a relatively good one. So the way to balance those conflicting objectives is to have a progressively increasing search window. And just to repeat, k is not a % of shared buffers, which would still give O(N) behaviour. This is a very interesting idea, and I think it has a lot of potential. 
One additional thought I had is that right now I think it's far too easy for buffers to get pushed all the way up to usage count 5, because we bump the reference count every time the buffer is pinned, which can easily happen many times per clock sweep. I think we would be better off having a used flag on each buffer. Each pin sets the used bit, but does nothing if it is already set. The usage count is only changed by the clock sweep, which does the following on each buffer: - If the used flag is set, then the flag is cleared and the usage count increases by one to a maximum of 5. - If the used flag is not set, then the usage count decreases by one. That way, buffers can't get hot because of a fast flurry of access followed by nothing - to get up to usage_count 5, they've actually got to stay interesting for a period of time. As a side effect, I believe we could get rid of the spinlock that we currently take and release for every buffer the clock sweep hits, because we could regard the usage_count as protected by the BufFreelistLock rather than by the buffer header spinlock; and the used flag doesn't require perfect synchronization. We'd only pin the buffer we actually plan to evict (and could recheck the used flag before doing so, in case a memory ordering effect made us miss the fact that it had been recently set). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
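The used-flag scheme Robert sketches above can be written out concretely. This is an illustrative model with invented names, not PostgreSQL code, and it ignores the locking questions he raises:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the "used flag" scheme: pinning only sets a flag, and the
 * clock sweep converts the flag into usage-count movement, so any number
 * of pins within one sweep interval advances the count by at most one.
 */
typedef struct
{
    bool used;          /* set by pin, cleared by the sweep */
    int  usage_count;   /* 0..5, changed only by the sweep */
} BufDesc;

static void pin_buffer(BufDesc *buf)
{
    buf->used = true;   /* idempotent: a flurry of pins counts once */
}

static void sweep_buffer(BufDesc *buf)
{
    if (buf->used)
    {
        buf->used = false;
        if (buf->usage_count < 5)
            buf->usage_count++;
    }
    else if (buf->usage_count > 0)
        buf->usage_count--;
}
```

The key property is visible in the model: a buffer can only climb to usage_count 5 by staying interesting across five separate sweep intervals, not by a burst of pins within one.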
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas robertmh...@gmail.com wrote: But it will be a loser to apply the optimization to data sets that would otherwise have fit in shared_buffers. Spoiling the cache is a bad plan, even if it makes the current query faster. I think we should make the optimisation stronger still and use POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling the OS cache is a problem as well. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 8:28 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote: You're missing an important point. The SeqScan is measurably faster when using the ring buffer because of the effects of L2 caching on the buffers. I hadn't thought of that, but I think that's only true if the relation won't fit in shared_buffers (or whatever portion of shared_buffers is reasonably available, given the other activity on the machine). In this particular case, it's almost 20% faster if the relation is all in shared_buffers; I tested it. I think what's going on here is that initscan() has a heuristic that tries to use a BufferAccessStrategy if the relation is larger than 1/4 of shared_buffers. That's probably a pretty good heuristic in general, but in this case I have a relation which just so happens to be 27.9% of shared_buffers but will still fit. As you say below, that may not be typical in real life, although there are probably data warehousing systems where it's normal to have only one big query running at a time. I think there are reasonable arguments to make:
* prefer_cache = off (default) | on, a table-level storage parameter; =on will disable the use of BufferAccessStrategy
* make cache_spoil_threshold a parameter, with default 0.25
Yeah, something like that might make sense. Of course, a completely self-tuning system would be better, but I'm not sure we're going to find one of those. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
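The heuristic under discussion, together with the two proposed knobs, might look roughly like the sketch below. `prefer_cache` and `cache_spoil_threshold` are Simon's proposed (hypothetical) parameters; the hard-coded quarter-of-shared_buffers comparison is what initscan() effectively does today:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical knobs from the proposal above; current PostgreSQL
 * hard-codes the NBuffers / 4 threshold inside initscan(). */
static bool   prefer_cache = false;          /* per-table storage parameter */
static double cache_spoil_threshold = 0.25;  /* fraction of shared_buffers */

/* Decide whether a sequential scan should use a ring-buffer
 * BufferAccessStrategy or be allowed to populate the whole cache. */
static bool
use_bulk_read_strategy(int rel_pages, int shared_buffers_pages)
{
    if (prefer_cache)
        return false;   /* user asked to let this table live in cache */
    return rel_pages > (int) (shared_buffers_pages * cache_spoil_threshold);
}
```

With the defaults, a relation at 27.9% of shared_buffers (as in the case reported above) gets the ring buffer even though it would fit; setting `prefer_cache = on` for that table would restore full caching.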
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 8:35 AM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas robertmh...@gmail.com wrote: But it will be a loser to apply the optimization to data sets that would otherwise have fit in shared_buffers. Spoiling the cache is a bad plan, even if it makes the current query faster. I think we should make the optimisation stronger still and use POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling the OS cache is a problem as well. That would probably be better for really big tables, but it will be worse for those where the entire table fits in the OS cache. Cache spoiling means you're putting things into the cache which won't still be there the next time they're needed. If the data in question fits in cache (and I don't think we can regard that as uncommon, particularly for the OS cache, which can be half a terabyte) then you don't want the anti-spoiling logic to kick in. One thing we could consider for large sequential scans is doing POSIX_FADV_SEQUENTIAL before beginning the scan (and maybe one more time if the scan wraps). That's basically throwing the problem of whether or not to go LRU or MRU back on the OS, but the OS may well have a better idea whether there's enough cache available to hold the whole thing than we do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Robert, I imagine we store positional information in gin index and return tuples in relevant order - instant full-text search ! Great work, guys ! Oleg On Thu, 11 Aug 2011, Robert Haas wrote: Please find attached a patch implementing a basic version of index-only scans. This patch is the work of my colleague Ibrar Ahmed and myself, and also incorporates some code from previous patches posted by Heikki Linnakangas. I'm able to demonstrate a significant performance benefit from this patch, even in a situation where it doesn't save any disk I/O, due to reduced thrashing of shared_buffers. Configuration settings:
max_connections = 100
shared_buffers = 400MB
maintenance_work_mem = 256MB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
checkpoint_warning = 90s
seq_page_cost = 1.0
random_page_cost = 1.0
effective_cache_size = 3GB
Test setup:
pgbench -i -s 50
create table sample_data as select (random()*500)::int as aid, repeat('a', 1000) as filler from generate_series(1,10);
Test queries:
select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
select sum(aid) from sample_data a1 where exists (select * from pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
On my laptop, the first query executes in about 555 ms, while the second one takes about 1125 ms. Inspection via pg_buffercache reveals that the second one thrashes shared_buffers something fierce, while the first one does not. You can actually get the time for the first query down to about 450 ms if you can persuade PostgreSQL to cache the entire sample_data table - which is difficult, due to the BufferAccessStrategy stuff - and as soon as you run the second query, it blows the table out of cache, so practically speaking you're not going to get that faster time very often. 
I expect that you could get an even larger benefit from this type of query if you could avoid actual disk I/O, rather than just buffer cache thrashing, but I haven't come up with a suitable test case for that yet (volunteers?). There are several things about this patch that probably need further thought and work, or at least some discussion.
1. The way that nodeIndexscan.c builds up the faux heap tuple is perhaps susceptible to improvement. I thought about building a virtual tuple, but then what do I do with an OID column, if I have one? Or maybe this should be done some other way altogether.
2. Suppose we scan one tuple on a not-all-visible page followed by 99 tuples on all-visible pages. The code as written will hold the pin on the first heap page for the entire scan. As soon as we hit the end of the scan or another tuple where we have to actually visit the page, the old pin will be released, but until then we hold onto it. This isn't totally stupid: the next tuple that's on a not-all-visible page could easily be on the same not-all-visible page we already have pinned. And in 99% of cases holding the pin for slightly longer is probably completely harmless. On the flip side, it could increase the chances of interfering with VACUUM. Then again, a non-index-only scan would still interfere with the same VACUUM, so maybe we don't care.
3. The code in create_index_path() builds up a bitmapset of heap attributes that get used for any purpose anywhere in the query, and hangs it on the RelOptInfo so it doesn't need to be rebuilt for every index under consideration. However, if it were somehow possible to have the rel involved without using any attributes at all, we'd rebuild the cache over and over, since it would never become non-NULL. I don't think that can happen right now, but future planner changes might make it possible.
4. There are a couple of cases that use index-only scans even though the EXPLAIN output sort of makes it look like they shouldn't. 
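The pin-retention behaviour described in point 2 can be modelled with a couple of toy functions (all names invented; the real code tracks a pinned Buffer in the scan state): the pin taken for a not-all-visible page is kept across subsequent all-visible tuples and only swapped when a different page must actually be visited:

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PAGE (-1)

/* Miniature model of the pin-retention behaviour in point 2: track
 * which heap page (if any) is currently pinned and count how many
 * pin acquisitions a scan performs.  All names are illustrative. */
static int pinned_page = NO_PAGE;
static int pin_count = 0;

static void
visit_tuple(int heap_page, bool page_all_visible)
{
    if (page_all_visible)
        return;                 /* answered from the index; old pin kept */
    if (pinned_page == heap_page)
        return;                 /* same not-all-visible page: reuse the pin */
    /* release the previously pinned page (if any), pin the new one */
    pinned_page = heap_page;
    pin_count++;
}

static void
end_scan(void)
{
    pinned_page = NO_PAGE;      /* scan end releases the last pin */
}
```

The trade-off in the text falls out directly: the stale pin costs nothing while tuples are all-visible, but it keeps that heap page pinned (and thus visible to VACUUM's interlock) longer than strictly necessary.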
For example, in the above queries, an index-only scan is chosen even though the query does SELECT * from the table being scanned. Even though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems that the target list of an EXISTS query is in fact discarded, e.g.: create or replace function error() returns int as $$begin select 1/0; end$$ language plpgsql; select * from pgbench_accounts a where exists (select error() from pgbench_branches b where b.bid = a.aid); -- doesn't error out! Along similar lines, COUNT(*) does not preclude an index-only scan, because the * is apparently just window dressing. You'll still get just a seq-scan unless you have an indexable qual in the query somewhere, because... 5. We haven't made any planner changes at all, not even for cost estimation. It is not clear to me what the right way to do cost estimation here is. It seems that it would be useful to know what percentage of the table is all-visible at planning time, but even if we did, there could (and probably often will) be correlation between the pages the query is
Re: [HACKERS] index-only scans
On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain cedric.villemain.deb...@gmail.com wrote: Can this faux heap tuple be appended by the data from another index once it has been created ? ( if the query involves those 2 index) I don't see how to make that work. In general, a query like SELECT a, b FROM foo WHERE a = 1 AND b = 1 can only use both indexes if we use a bitmap index scan on each followed by a bitmapand and then a bitmap heap scan. However, this patch only touches the index-scan path, which only knows how to use one index for any given query. I thought of something like that: 'select a,b from foo where a=1 order by b limit 100' (or: where a=1 and b > now() ) Well... PostgreSQL can only use the index on a or the index on b, not both. This patch doesn't change that. I'm not trying to use indexes in some completely new way; I'm just trying to make them faster by optimizing away the heap access. Actually, I can see a possible way to allow an index-only type optimization to be used for bitmap scans. As you scan the index, any tuples that can be handled index-only get returned immediately; the rest are thrown into a bitmap. Once you're done examining the index, you then do a bitmap heap scan to get the tuples that couldn't be handled index-only. This seems like it might be our best hope for a fast count(*) type optimization, especially if you could combine it with some method of scanning the index in physical order rather than logical order. IIRC we expose some ideas around that, yes. (optimizing bitmap) Maybe a question that will explain me more about the feature limitation (if any): Does an index-only scan used when the table has no vismap set will cost (in duration, IO, ...) more than a normal Index scan ? Yeah, it'll do a bit of extra work - the btree AM will cough up the tuple uselessly, and we'll check the visibility map, also uselessly. Then we'll end up doing it the regular way anyhow. I haven't measured that effect yet; hopefully it's fairly small. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bgwriter and checkpoints
On Fri, Aug 12, 2011 at 7:09 AM, Simon Riggs si...@2ndquadrant.com wrote: The bgwriter has been responsible for two main activities: incremental page cleaning and checkpointing. We've worked out the code to smooth checkpointing, but that now means we have periods where we do both page cleaning and checkpointing, and other times when we do just page cleaning. That means the processing loops are complex and that makes latch based processing more difficult to introduce. I've thought about this before, and I think it's probably a good idea. At the very least that code needs some refactoring, and breaking it into two completely separate processes may be the way to go. One possible trouble spot is that this will have the effect of increasing traffic on BgWriterCommLock. Right now, page-cleaning activity doesn't need to do that, because the bgwriter is both the source of all the fsync requests and the guy who eventually executes them, so it can just update its local state directly. If the two functions are separated, that gets a bit more complicated: the checkpoint process, even when not checkpointing, will need to absorb fsync requests every so often (maybe that could be driven off a latch, so we wake it up when the request queue is 25% full, or something like that?). Or else the cleaning process will have to be in charge of issuing the fsyncs and the checkpoint process will have to request that it do so at the appropriate time. Or maybe there's some third solution I'm not thinking of. Anyway, it may not be a big problem in practice, just something I was worrying about... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
2011/8/12 Robert Haas robertmh...@gmail.com: On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain cedric.villemain.deb...@gmail.com wrote: Can this faux heap tuple be appended by the data from another index once it has been created ? ( if the query involves those 2 index) I don't see how to make that work. In general, a query like SELECT a, b FROM foo WHERE a = 1 AND b = 1 can only use both indexes if we use a bitmap index scan on each followed by a bitmapand and then a bitmap heap scan. However, this patch only touches the index-scan path, which only knows how to use one index for any given query. I thought of something like that: 'select a,b from foo where a=1 order by b limit 100' (or: where a=1 and b > now() ) Well... PostgreSQL can only use the index on a or the index on b, not both. This patch doesn't change that. I'm not trying to use indexes in some completely new way; I'm just trying to make them faster by optimizing away the heap access. For this kind of plan :
Bitmap Heap Scan
  Recheck Cond
  -> BitmapAnd
       -> Bitmap Index Scan
       -> Bitmap Index Scan
It may prevent useless Heap Fetch during Bitmap Heap Scan, isn't it ? Actually, I can see a possible way to allow an index-only type optimization to be used for bitmap scans. As you scan the index, any tuples that can be handled index-only get returned immediately; the rest are thrown into a bitmap. Once you're done examining the index, you then do a bitmap heap scan to get the tuples that couldn't be handled index-only. This seems like it might be our best hope for a fast count(*) type optimization, especially if you could combine it with some method of scanning the index in physical order rather than logical order. IIRC we expose some ideas around that, yes. (optimizing bitmap) Maybe a question that will explain me more about the feature limitation (if any): Does an index-only scan used when the table has no vismap set will cost (in duration, IO, ...) more than a normal Index scan ? 
Yeah, it'll do a bit of extra work - the btree AM will cough up the tuple uselessly, and we'll check the visibility map, also uselessly. Then we'll end up doing it the regular way anyhow. I haven't measured that effect yet; hopefully it's fairly small. If it is small, or if we can reduce it to be near absent. Then... why do we need to distinguish Index Scan at all ? (or Index*-Only* scan, which aren't 100% 'Only' btw) It is then just an internal optimisation on how we can access/use an index. (really good and promising one) -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bgwriter and checkpoints
On Fri, Aug 12, 2011 at 2:19 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Aug 12, 2011 at 7:09 AM, Simon Riggs si...@2ndquadrant.com wrote: The bgwriter has been responsible for two main activities: incremental page cleaning and checkpointing. We've worked out the code to smooth checkpointing, but that now means we have periods where we do both page cleaning and checkpointing, and other times when we do just page cleaning. That means the processing loops are complex and that makes latch based processing more difficult to introduce. I've thought about this before, and I think it's probably a good idea. At the very least that code needs some refactoring, and breaking it into two completely separate processes may be the way to go. One possible trouble spot is that this will have the effect of increasing traffic on BgWriterCommLock. Right now, page-cleaning activity doesn't need to do that, because the bgwriter is both the source of all the fsync requests and the guy who eventually executes them, so it can just update its local state directly. If the two functions are separated, that gets a bit more complicated: the checkpoint process, even when not checkpointing, will need to absorb fsync requests every so often (maybe that could be driven off a latch, so we wake it up when the request queue is 25% full, or something like that?). Or else the cleaning process will have to be in charge of issuing the fsyncs and the checkpoint process will have to request that it do so at the appropriate time. Or maybe there's some third solution I'm not thinking of. Anyway, it may not be a big problem in practice, just something I was worrying about... Yes, they would still need to talk. But the good news is that they only actually need to talk once per checkpoint cycle so we can buffer them to a certain extent in shared memory to remove the worst part of such contention. 
Checkpointing needs a little more time in its diary to receive those messages than it has right now, so there's no easy route. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bgwriter and checkpoints
On Fri, Aug 12, 2011 at 9:33 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, they would still need to talk. But the good news is that they only actually need to talk once per checkpoint cycle so we can buffer them to a certain extent in shared memory to remove the worst part of such contention. Yeah, some kind of special-purpose communication method between the cleaning scan and the checkpoint process might help, if the lock contention turns out to be a problem in practice. Then again, maybe I'm overthinking things: there's zero sign in any profiling I've done that BgWriterCommLock is even mildly contended, so even worrying about it at this point might be a waste of time. Checkpointing needs a little more time in its diary to receive those messages than it has right now, so there's no easy route. Yeah. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Fast GiST index build
On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov aekorot...@gmail.com wrote: [ new patch ] Some random comments: - It appears that the noFollowFight flag is really supposed to be called noFollowRight. - In gist_private.h you've written halt-filled where you really mean half-filled. - It seems like you're using reloptions to set parameters that are only going to do anything at index creation time. IIUC, BUFFERING, LEVELSTEP and BUFFERSIZE have no permanent meaning for that index; they're just used ephemerally while constructing it. If we're going to expose such things as options, maybe they should be GUCs, not reloptions. - Function names should begin with gist or some other, appropriate prefix, especially if they are non-static. decreasePathRefcount(), getNodeBuffer(), relocateBuildBuffersOnSplit(), and getNodeBufferBusySize() violate this rule, and it might be good to change the static functions to follow it, too, just for consistency, and to avoid renaming things if something that's currently static later needs to be made non-static. - validateBufferOption needs to use ereport(), not elog(). - This needs a bit of attention:
+       /* TODO: Write the WAL record */
+       if (RelationNeedsWAL(state->r))
+               recptr = gistXLogSplit(state->r->rd_node, blkno, is_leaf,
+                               dist, oldrlink, oldnsn, InvalidBuffer, true);
+       else
+               recptr = GetXLogRecPtrForTemp();
+
I don't think the comment matches the code, since gistXLogSplit() does in fact call XLogInsert(). Also, you should probably move the RelationNeedsWAL() test inside gistXLogSplit(). Otherwise, every single caller of gistXLogSplit() is going to need to do the same dance. - In gistBufferingPlaceToPage, you've got a series of loops that look like this:
+       for (ptr = dist; ptr; ptr = ptr->next)
The first loop allocates a bunch of buffers. The second loop sets up downlink pointers. Then there's some other code. Then there's a third loop, which adds items to each page in turn and sets up right links. 
Then there's a fourth loop, which marks all those buffers dirty. Then you write XLOG. Then there's a fifth loop, which sets all the LSNs and TLIs, and a sixth loop, which does UnlockReleaseBuffer() on each valid buffer in the list. All of this seems like it could be simplified. In particular, the third and fourth loops can certainly be merged - you should set the dirty bit at the same time you're adding items to the page. And the fifth and sixth loops can also be merged. You certainly don't need to set all the LSNs and TLIs before releasing any of the buffer locks and pins. I'm not sure if there's any more merging that can be done than that, but you might want to have a look. I'm also wondering how long this linked list can be. It's not good to have large numbers of buffers locked for a long period of time. At the very least, some comments are in order here. Another general comment about this function is that it seems like it is backwards. The overall flow of the function is:
if (is_split)
{
    /* complicated stuff */
}
else
{
    /* simple case */
}
It seems like it might be better to flip that around and do this:
if (!is_split)
{
    /* simple case */
    return result;
}
/* complicated stuff */
It's easier to read and avoids having the indentation get too deep. - As I look at this more, I see that a lot of the logic in gistBufferingBuildPlaceToPage is copied from gistplacetopage(). It would be nice to move the common bits to common subroutines that both functions can call, instead of duplicating the code. - On a related note, gistBufferingBuildPlaceToPage needs to do START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the sequence, as gistplacetopage() does. - gistFindCorrectParent() seems to rely heavily on the assumption that there's no concurrent activity going on in this index. Otherwise, it's got to be unsafe to release the buffer lock before using the answer the function computes. Some kind of comment seems like it would be a good idea. 
- On a more algorithmic note, I don't really understand why we attach buffers to all pages on a level or none of them. If it isn't necessary to have buffers on every internal page in the tree, why do we have them on every other level or every third level rather than, say, creating them on the fly in whatever parts of the tree end up busy? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Fri, Aug 12, 2011 at 9:31 AM, Cédric Villemain cedric.villemain.deb...@gmail.com wrote: Well... PostgreSQL can only use the index on a or the index on b, not both. This patch doesn't change that. I'm not trying to use indexes in some completely new way; I'm just trying to make them faster by optimizing away the heap access. For this kind of plan :
Bitmap Heap Scan
  Recheck Cond
  -> BitmapAnd
       -> Bitmap Index Scan
       -> Bitmap Index Scan
It may prevent useless Heap Fetch during Bitmap Heap Scan, isn't it ? The input to the bitmap heap scan is just a TID bitmap. It's too late to decide we want the index tuples at that point; we've forgotten where they are, and even if we remembered it, it wouldn't necessarily be any cheaper than checking the heap. We could optimize this to avoid a heap fetch in the case where we don't need any of the tuple data at all, but that's going to be somewhat rare, I think. Actually, I can see a possible way to allow an index-only type optimization to be used for bitmap scans. As you scan the index, any tuples that can be handled index-only get returned immediately; the rest are thrown into a bitmap. Once you're done examining the index, you then do a bitmap heap scan to get the tuples that couldn't be handled index-only. This seems like it might be our best hope for a fast count(*) type optimization, especially if you could combine it with some method of scanning the index in physical order rather than logical order. IIRC we expose some ideas around that, yes. (optimizing bitmap) Maybe a question that will explain me more about the feature limitation (if any): Does an index-only scan used when the table has no vismap set will cost (in duration, IO, ...) more than a normal Index scan ? Yeah, it'll do a bit of extra work - the btree AM will cough up the tuple uselessly, and we'll check the visibility map, also uselessly. Then we'll end up doing it the regular way anyhow. I haven't measured that effect yet; hopefully it's fairly small. 
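The hybrid scan Robert sketches could look something like this toy version, with a plain array standing in for the TID bitmap and a precomputed all-visible flag standing in for the visibility-map lookup (all structures invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the hybrid scan sketched above: index entries whose
 * heap page is all-visible are emitted straight from the index; the
 * rest are collected (here in a plain array standing in for a TID
 * bitmap) for a later bitmap heap scan.  All names are illustrative. */
typedef struct { int heap_page; int value; bool page_all_visible; } IndexEntry;

static int
hybrid_scan(const IndexEntry *entries, int n,
            int *emitted, int *bitmap_pages)
{
    int n_emitted = 0, n_bitmap = 0;

    for (int i = 0; i < n; i++)
    {
        if (entries[i].page_all_visible)
            emitted[n_emitted++] = entries[i].value;   /* index-only path */
        else
            bitmap_pages[n_bitmap++] = entries[i].heap_page;
    }
    /* caller would now run a bitmap heap scan over bitmap_pages */
    return n_emitted;
}
```

For count(*), the emitted values would just be counted, and only the pages that fell into the bitmap would ever be fetched from the heap.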
If it is small, or if we can reduce it to be near absent. Then... why do we need to distinguish Index Scan at all ? (or Index*-Only* scan, which aren't 100% 'Only' btw) It is then just an internal optimisation on how we can access/use an index. (really good and promising one) Right, that's kind of what I was going for. But the overhead isn't going to be exactly zero, so I think it makes sense to disable it in the cases where it clearly can't work. The question is really more whether we need to get more fine-grained than that, and actually do a cost-based comparison of an index-only scan vs. a regular index scan. I hope not, but I haven't tested it yet. One other thing to think about is that the choice to use an index-scan is not free of externalities. The index-only scan is hopefully faster than touching all the heap pages, but even if it weren't, not touching all of the heap pages potentially means avoiding evicting things from shared_buffers that some *other* query might need. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible Bug in pg_upgrade
On 08/11/2011 12:02 AM, Peter Eisentraut wrote: On ons, 2011-08-10 at 18:53 -0400, Tom Lane wrote: Dave Byrne dby...@mdb.com writes: Attached is a patch that skips orphaned temporary relations in pg_upgrade if they are lingering around. It works for 9.0 - 9.1 upgrades, however I wasn't able to tell when pg_class.relistemp was added so if it was unavailable in versions prior to 9.0 an additional check will have to be added. I'm inclined to think the correct fix is to revert the assumption that the old and new databases contain exactly the same number of tables ... that seems to have a lot of potential failure modes besides this one. It's basically checking whether pg_dump -s worked. That doesn't seem like a good use of time. If anyone has a suggestion for a better approach I'm happy to work on it and amend the patch. It certainly is a corner case but it bit me when preparing for the upcoming 9.1 release. I would imagine others will hit it as well. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Reworking the writing of WAL
I present a number of connected proposals
1. Earlier, I suggested that the sync rep code would allow us to redesign the way we write WAL, using ideas from group commit. My proposal is that when a backend needs to flush WAL to local disk it will be added to a SHMQUEUE exactly the same as when we flush WAL to sync standby. The WALWriter will be woken by latch and then perform the actual work. When complete WALWriter will wake the queue in order, so there is a natural group commit effect. The WAL queue will be protected by a new lock WALFlushRequestLock, which should be much less heavily contended than the way we do things now. Notably this approach will mean that all waiters get woken quickly, without having to wait for the queue of WALWriteLock requests to drain down, so commit will be marginally quicker. On almost idle systems this will give very nearly the same response time as having each backend write WAL directly. On busy systems this will give optimal efficiency by having WALWriter working in a very tight loop to perform the I/O instead of queuing itself to get the WALWriteLock with all the other backends. It will also allow piggybacking of commits even when WALInsertLock is not available.
2. A further idea is to use the same queue to reduce contention on accessing the ProcArray and Clog at end of transaction also. That would not be part of the initial work, but I'd want to bear in mind that possibility in the design stage at least if there were any choices to make.
3. In addition, we will send the WAL to standby servers as soon as it has been written, not flushed. As part of the chunk header the WALSender would include the known WAL flush ptr. So we would be sending WAL data to the standby ahead of it being flushed, but then only applying data up to the flush ptr. 
This would mean we don't flush WAL fully and then send it, we partially overlap those operations to give us the option of saying we don't want to fsync remotely for additional speed (DRBD 'B' mode).
4. I'm tempted by the thought to make backends write their commit records but not flush them, which fits in with the above.
5. And we would finally get rid of the group commit parameters. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpython crash
On 12/08/11 13:55, Jean-Baptiste Quenot wrote: Here is the same with -O0: https://gist.github.com/1140005 sys.version reports this: INFO: 2.6.6 (r266:84292, Sep 15 2010, 16:41:53) [GCC 4.4.5] I'm still at a loss. Did you reproduce it with git HEAD? I see that the query being executed is select * from myfunc(); would it be possible to share the code of myfunc? Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sha1, sha2 functions into core?
On Aug 12, 2011, at 5:02 AM, Marko Kreen wrote: My point was that giving such open-ended list of algorithms was bad idea, but there is no problem keeping old behaviour. I don't see anything much wrong with sha1(bytea/text) -> bytea. There's no law that says it has to work exactly like md5() does. The problem is that list of must-have algorithms is getting quite long: md5, sha1, sha224, sha256, sha384, sha512, + at least 4 from upcoming sha3. +1 I think some sort of digest() function that takes a parameter naming the algorithm would be the way to go. That's not to say that the existing named functions couldn't continue to exist -- md5() in core and sha1() in pgcrypto. But it sure seems to me like we ought to have just one function for digests (or 2, if we also have hexdigest()). Best, David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reworking the writing of WAL
On Fri, Aug 12, 2011 at 11:34 AM, Simon Riggs si...@2ndquadrant.com wrote: 1. Earlier, I suggested that the sync rep code would allow us to redesign the way we write WAL, using ideas from group commit. My proposal is that when a backend needs to flush WAL to local disk it will be added to a SHMQUEUE exactly the same as when we flush WAL to sync standby. The WALWriter will be woken by latch and then perform the actual work. When complete WALWriter will wake the queue in order, so there is a natural group commit effect. The WAL queue will be protected by a new lock WALFlushRequestLock, which should be much less heavily contended than the way we do things now. Notably this approach will mean that all waiters get woken quickly, without having to wait for the queue of WALWriteLock requests to drain down, so commit will be marginally quicker. On almost idle systems this will give very nearly the same response time as having each backend write WAL directly. On busy systems this will give optimal efficiency by having WALWriter working in a very tight loop to perform the I/O instead of queuing itself to get the WALWriteLock with all the other backends. It will also allow piggybacking of commits even when WALInsertLock is not available. I like the idea of putting all the backends that are waiting for xlog flush on a SHM_QUEUE, and having a single process do the flush and then wake them all up. That seems like a promising approach, and should avoid quite a bit of context-switching and spinlocking that would otherwise be necessary. However, I think it's possible that the overhead in the single-client case might be pretty significant, and I'm wondering whether we might be able to set things up so that backends can flush their own WAL in the uncontended case.
What I'm imagining is something like this: struct { slock_t mutex; XLogRecPtr CurrentFlushLSN; XLogRecPtr HighestFlushLSN; SHM_QUEUE WaitersForCurrentFlush; SHM_QUEUE WaitersForNextFlush; }; To flush, you first acquire the mutex. If the CurrentFlushLSN is not InvalidXLogRecPtr, then there's a flush in progress, and you add yourself to either WaitersForCurrentFlush or WaitersForNextFlush, depending on whether your LSN is lower or higher than CurrentFlushLSN. If you queue on WaitersForNextFlush you advance HighestFlushLSN to the LSN you need flushed. You then release the spinlock and sleep on your semaphore. But if you get the mutex and find that CurrentFlushLSN is InvalidXLogRecPtr, then you know that no flush is in progress. In that case, you set CurrentFlushLSN to the maximum of the LSN you need flushed and HighestFlushLSN and move all WaitersForNextFlush over to WaitersForCurrentFlush. You then release the spinlock and perform the flush. After doing so, you reacquire the spinlock, remove everyone from WaitersForCurrentFlush, note whether there are any WaitersForNextFlush, and release the spinlock. If there were any WaitersForNextFlush, you set the WAL writer latch. You then wake up anyone you removed from WaitersForCurrentFlush. Every time the WAL writer latch is set, the WAL writer wakes up and performs any needed flush, unless there's already one in progress. This allows processes to flush their own WAL when there's no contention, but as more contention develops the work moves to the WAL writer which will then run in a tight loop, as in your proposal. 5. And we would finally get rid of the group commit parameters. That would be great, and I think the performance will be quite a bit better, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
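[Editor's note] The protocol Robert sketches above can be modeled in a few lines. The following is a single-threaded Python toy of that state machine (the spinlock, semaphores, and latch are elided, and names like `request_flush` are invented for illustration); it shows how requests arriving during a flush piggyback on the current flush or queue for the next one:

```python
INVALID_LSN = None  # stands in for InvalidXLogRecPtr

class FlushState:
    """Shared state from Robert's sketch (mutex omitted: single-threaded demo)."""
    def __init__(self):
        self.current_flush_lsn = INVALID_LSN  # LSN of the flush in progress, if any
        self.highest_flush_lsn = 0            # highest LSN any queued waiter needs
        self.waiters_current = []             # woken when the current flush finishes
        self.waiters_next = []                # need a flush beyond current_flush_lsn

def request_flush(state, backend, lsn, do_flush):
    """Under the mutex: queue behind an in-progress flush, or flush ourselves."""
    if state.current_flush_lsn is not INVALID_LSN:
        if lsn <= state.current_flush_lsn:
            state.waiters_current.append(backend)
        else:
            state.waiters_next.append(backend)
            state.highest_flush_lsn = max(state.highest_flush_lsn, lsn)
        return False  # in the real design the backend now sleeps on its semaphore
    # No flush in progress: absorb queued next-flush requests and do it ourselves.
    state.current_flush_lsn = max(lsn, state.highest_flush_lsn)
    state.waiters_current += state.waiters_next
    state.waiters_next = []
    do_flush(state.current_flush_lsn)
    return True

def finish_flush(state):
    """After flushing: wake current waiters; latch the WAL writer if more wait."""
    woken, state.waiters_current = state.waiters_current, []
    state.current_flush_lsn = INVALID_LSN
    return woken, bool(state.waiters_next)

flushed = []
s = FlushState()
did_flush = request_flush(s, "A", 100, flushed.append)  # uncontended: A flushes itself
request_flush(s, "B", 90, flushed.append)    # arrives mid-flush, LSN already covered
request_flush(s, "C", 150, flushed.append)   # arrives mid-flush, needs a later flush
woken, set_walwriter_latch = finish_flush(s)  # wakes B; latches WAL writer for C
```

On wakeup the WAL writer would simply call `request_flush` with `highest_flush_lsn`, which is how the work migrates to it under contention.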
[HACKERS] VACUUM FULL versus system catalog cache invalidation
I've been testing the problem reported by Dave Gould by running make installcheck-parallel together with a tight loop of vacuum full pg_class: while true; do psql -c "vacuum full pg_class" regression; usleep 10; done Even after fixing the cache-reset-recovery-order problem I described yesterday, there are still occasional bizarre failures, for example in this bit of truncate.sql: BEGIN; SELECT * FROM trunc_f; TRUNCATE trunc_f; SELECT * FROM trunc_f; ROLLBACK; BEGIN; SELECT * FROM trunc_f; TRUNCATE ONLY trunc_f; ERROR: attempted to update invisible tuple I'm not sure that there's only one bug remaining here, but I've identified this bug at least. In the first transaction above, the TRUNCATE has to update trunc_f's pg_class tuple with a new relfilenode value. Say the update invalidates the tuple at tid A and inserts the updated version at tid B. Then vacuum full gets control and rewrites pg_class. It correctly keeps both versions of the trunc_f tuple, but now they will be at new tid locations, say C and D. The original transaction meanwhile was blocked trying to re-open pg_class to reload trunc_f's pg_class tuple into its catcaches. Now it successfully does that, and the tuple it reads has tid D. It finishes out the transaction, and rolls back as per script. In the rollback, we know that we have to flush catcache entries for tuple versions inserted by the failed transaction ... but what inval.c has stored is an inval event against pg_class tid B. There is no such entry in the catcaches, so nothing happens, and the entry with tid D stays valid. Ooops. The relcache entry for trunc_f does get rebuilt correctly (without consulting the bogus catcache entry), so the next SELECT goes through all right. But when the second TRUNCATE wants to again update the pg_class tuple, it finds it in the catcaches ... and what it finds has tid D, so it attempts to heap_update that TID instead of the actually live tuple at tid C.
In this particular case, the incorrectly targeted tuple is the just-invalidated version of trunc_f, so heap_update sees it as XMIN_INVALID and fails with attempted to update invisible tuple. The potential consequences are hugely worse than that, though: with a bit more time between the two operations, it'd be possible for someone else to reclaim the dead tuple and replace it with something else. As long as the TID is live when we get to it, heap_update will blindly replace it, whether or not it has anything to do with the tuple we think we're replacing. I suspect that some of the other bizarre cases I'm seeing are artifacts of that type of outcome, but they're hard to reproduce so I've not been able to pin them down yet. In principle this same type of failure could have occurred before 9.0, since catcache inval has always been TID-based. It turns out we were saved by the fact that the old VACUUM FULL implementation would chicken out and not do any tuple moving if it found any INSERT_IN_PROGRESS or DELETE_IN_PROGRESS tuples. The pre-9.0 versions of CLUSTER reject such tuples too, so I don't think we need to do anything in those branches. But in 9.0 and up, we have a problem. So far I've thought of two possible avenues to fix it: 1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC FULL or CLUSTER just finished on a system catalog), enter that message into the transaction's local inval list as well as executing it immediately. On transaction abort, redo the resulting cache flushes a second time. This would ensure we flushed any bad entries even though they were logged with obsolete TIDs elsewhere in the inval list. (We might need to do this on commit too, though right now I don't think so. Also, a cache reset event would a fortiori have to queue a similar entry to blow things away a second time at transaction end.) 2. Forget about targeting catcache invals by TID, and instead just use the key hash value to determine which cache entries to drop. 
Approach #2 seems a lot less invasive and more trustworthy, but it has the disadvantage that cache invals would become more likely to blow away entries unnecessarily (because of chance hashvalue collisions), even without any VACUUM FULL being done. If we could make approach #1 work reliably, it would result in more overhead during VACUUM FULL but less at other times --- or at least we could hope so. In an environment where lots of sinval overflows and consequent resets happen, we might come out behind due to doubling the number of catcache flushes forced by a reset event. Right at the moment I'm leaning to approach #2. I wonder if anyone sees it differently, or has an idea for a third approach? BTW, going forward it might be interesting to think about invalidating the catcaches based on xmin/xmax rather than specific TIDs. That would reduce the sinval traffic to one message per transaction that had affected the catalogs, instead of one per TID. But that clearly is not going to lead to something we'd dare back-patch.
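[Editor's note] Tom's failure sequence is easy to model. Below is an illustrative Python toy (not PostgreSQL code; the TIDs and names simply mirror the narrative above) showing why a TID-keyed flush misses once VACUUM FULL has relocated the tuples, while keying the flush on the lookup-key hash — approach #2 — still hits:

```python
# Model: the catcache maps TID -> cached tuple; inval events carry the TID
# that was recorded at update time.
catcache = {}
inval_queue = []

# TRUNCATE updates trunc_f's pg_class row: the old version at tid A dies and
# the new version goes in at tid B; the inval event records tid B.
inval_queue.append('B')

# VACUUM FULL rewrites pg_class: both versions survive, but move (A->C, B->D).
tid_map = {'A': 'C', 'B': 'D'}

# The transaction reloads the tuple into its catcache -- at its *new* TID.
catcache[tid_map['B']] = 'trunc_f (uncommitted version)'

# On rollback, flush entries named by the recorded TIDs ... but tid B no
# longer matches anything, so the entry for the aborted version survives at D.
for tid in inval_queue:
    catcache.pop(tid, None)
assert 'D' in catcache  # the stale entry Tom describes

# Approach #2: key the flush on the hash of the tuple's lookup key instead of
# its TID. The key hash is stable across VACUUM FULL, so the flush still hits.
hash_keyed_cache = {hash('trunc_f'): 'trunc_f (uncommitted version)'}
hash_keyed_cache.pop(hash('trunc_f'), None)
assert hash_keyed_cache == {}
```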
Re: [HACKERS] Reworking the writing of WAL
On Fri, Aug 12, 2011 at 7:02 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Aug 12, 2011 at 11:34 AM, Simon Riggs si...@2ndquadrant.com wrote: 1. Earlier, I suggested that the sync rep code would allow us to redesign the way we write WAL, using ideas from group commit. My proposal is that when a backend needs to flush WAL to local disk it will be added to a SHMQUEUE exactly the same as when we flush WAL to sync standby. The WALWriter will be woken by latch and then perform the actual work. When complete WALWriter will wake the queue in order, so there is a natural group commit effect. The WAL queue will be protected by a new lock WALFlushRequestLock, which should be much less heavily contended than the way we do things now. Notably this approach will mean that all waiters get woken quickly, without having to wait for the queue of WALWriteLock requests to drain down, so commit will be marginally quicker. On almost idle systems this will give very nearly the same response time as having each backend write WAL directly. On busy systems this will give optimal efficiency by having WALWriter working in a very tight loop to perform the I/O instead of queuing itself to get the WALWriteLock with all the other backends. It will also allow piggybacking of commits even when WALInsertLock is not available. I like the idea of putting all the backends that are waiting for xlog flush on a SHM_QUEUE, and having a single process do the flush and then wake them all up. That seems like a promising approach, and should avoid quite a bit of context-switching and spinlocking that would otherwise be necessary. OK, good. This work builds on Peter's latch-for-walwriter patch, so this won't appear for a while yet, since it needs to happen after that. Hopefully for Nov CF.
However, I think it's possible that the overhead in the single-client case might be pretty significant, and I'm wondering whether we might be able to set things up so that backends can flush their own WAL in the uncontended case. I think we should measure that and see. I don't think it will be that bad, but I accept it might be wishful thinking on my part. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane t...@sss.pgh.pa.us wrote: But in 9.0 and up, we have a problem. So far I've thought of two possible avenues to fix it: 1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC FULL or CLUSTER just finished on a system catalog), enter that message into the transaction's local inval list as well as executing it immediately. On transaction abort, redo the resulting cache flushes a second time. This would ensure we flushed any bad entries even though they were logged with obsolete TIDs elsewhere in the inval list. (We might need to do this on commit too, though right now I don't think so. Also, a cache reset event would a fortiori have to queue a similar entry to blow things away a second time at transaction end.) 2. Forget about targeting catcache invals by TID, and instead just use the key hash value to determine which cache entries to drop. Approach #2 seems a lot less invasive and more trustworthy, but it has the disadvantage that cache invals would become more likely to blow away entries unnecessarily (because of chance hashvalue collisions), even without any VACUUM FULL being done. If we could make approach #1 work reliably, it would result in more overhead during VACUUM FULL but less at other times --- or at least we could hope so. In an environment where lots of sinval overflows and consequent resets happen, we might come out behind due to doubling the number of catcache flushes forced by a reset event. Right at the moment I'm leaning to approach #2. I wonder if anyone sees it differently, or has an idea for a third approach? I don't think it really matters whether we occasionally blow away an entry unnecessarily due to a hash-value collision. IIUC, we'd only need to worry about hash-value collisions between rows in the same catalog; and the number of entries that we have cached had better be many orders of magnitude less than 2^32. 
If the cache is large enough that we're having hash value collisions more than once in a great while, we probably should have flushed some entries out of it a whole lot sooner and a whole lot more aggressively, because we're likely eating memory like crazy. On the other hand, having to duplicate sinval resets seems like a bad thing. So +1 for #2. BTW, going forward it might be interesting to think about invalidating the catcaches based on xmin/xmax rather than specific TIDs. That would reduce the sinval traffic to one message per transaction that had affected the catalogs, instead of one per TID. But that clearly is not going to lead to something we'd dare back-patch. To be honest, I am not real keen on back-patching either of the options you list above, either. I think we should probably do something about the problem Dave Gould reported (to wit: sinval resets need to be smarter about the order they do things in), but I am not sure it's a good idea to back-patch what seems like a fairly radical overhaul of the sinval mechanism to fix a problem nobody's actually complained about hitting in production, especially given there's no certainty that this is the last bug. Perhaps we should just fix this one in master and consider back-patching it if and when we get some plausibly related bug reports. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
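[Editor's note] For a rough sense of how unlikely the collisions Robert is dismissing are, here is a back-of-the-envelope birthday-bound calculation (an illustrative sketch; the cache sizes are invented examples, and the approximation assumes uniformly distributed 32-bit hash values):

```python
import math

def collision_prob(n, space=2 ** 32):
    """Birthday approximation: P(some two of n uniform hash values collide)."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))

p_typical = collision_prob(100)    # a modest catcache: about one in a million
p_huge = collision_prob(10_000)    # an implausibly large cache: still only ~1%
```

And since a collision under approach #2 merely causes a spurious cache flush — not incorrect results — even that ~1% is an inconvenience, not a hazard.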
Re: [HACKERS] Extra check in 9.0 exclusion constraint unintended consequences
On Thu, Aug 11, 2011 at 2:25 PM, Jeff Davis pg...@j-davis.com wrote: On Thu, 2011-08-11 at 11:58 -0400, Robert Haas wrote: I'm OK with adding a note either to the 9.0 docs only (which means it might be missed by a 9.0 user who only looks at the current docs) or with adding a note to all versions mentioning the difference in behavior with 9.0, but I'm not really sure which way to go with it. Or we could just not do anything at all. Anyone else have an opinion? It seems to be somewhat of a burden to carry a version-specific note indefinitely... more clutter than helpful. So I'd vote for just changing the 9.0 docs. Or, we could add the 9.0-specific note to 9.0 and 9.1 docs, but leave it out of 'master'. That way it sticks around for a while but we don't have to remember to remove it later. Having thought about this a bit further, I'm coming around to the view that if it isn't worth adding this in master, it's not worth adding at all. I just don't think it's going to get any visibility as a back-branch only doc patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Change format of FDW options used in \d* commands (was: Re: [HACKERS] per-column FDW options, v5)
2011/8/12 Shigeru Hanada shigeru.han...@gmail.com: (2011/08/12 1:05), Robert Haas wrote: On Thu, Aug 11, 2011 at 12:04 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Robert Haas's message of jue ago 11 11:50:40 -0400 2011: 2011/8/9 Shigeru Hanada shigeru.han...@gmail.com: I'd like to pick #3, and also change per-column options format. In addition, I'd like to change options format for other FDW objects such as wrappers, servers and user mappings for consistency. Of course, only if it's acceptable to break backward compatibility... I think it's fine to change the display format. We haven't had these features for very long, so users hopefully shouldn't be expecting that everything is set in stone. We have made far bigger changes to backslash commands that have been around for far longer (\df, I'm looking at you). We've never promised that backslash commands behave identically across releases. I think they are more for human consumption than machine, so why would we care about changing one of them a bit? Yeah, I agree. Thanks for the comments. Attached patch changes various \d*+ commands to show FDW options in same format as OPTIONS clause. IMHO the new format is easier to read. Example) old: {delimiter=,,quote=\"} new: delimiter ',', quote '"' All regression tests including contrib's installcheck passed. I'll add this patch to CF app as new item. IMHO, the new format should put parentheses around the options list. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
On fre, 2011-08-12 at 11:46 -0700, David E. Wheeler wrote: I figure there might be warnings you haven't seen if you haven't been building with bonjour, perl, openssl, pam, libxml, or ossp-uuid, so I've attached the output. We have a build farm member (smew) that covers all of that except bonjour, but I can confirm that there are no additional warnings on account of that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
On Aug 12, 2011, at 12:09 PM, Peter Eisentraut wrote: I figure there might be warnings you haven't seen if you haven't been building with bonjour, perl, openssl, pam, libxml, or ossp-uuid, so I've attached the output. We have a build farm member (smew) that covers all of that except bonjour, but I can confirm that there are no additional warnings on account of that. Ah, great, thanks! David -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
Robert Haas robertmh...@gmail.com writes: On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane t...@sss.pgh.pa.us wrote: 2. Forget about targeting catcache invals by TID, and instead just use the key hash value to determine which cache entries to drop. Right at the moment I'm leaning to approach #2. I wonder if anyone sees it differently, or has an idea for a third approach? To be honest, I am not real keen on back-patching either of the options you list above, either. I think we should probably do something about the problem Dave Gould reported (to wit: sinval resets need to be smarter about the order they do things in), but I am not sure it's a good idea to back-patch what seems like a fairly radical overhaul of the sinval mechanism to fix a problem nobody's actually complained about hitting in production, especially given there's no certainty that this is the last bug. What radical overhaul? I'm talking about removing one if-test in CatalogCacheIdInvalidate, viz: nextelt = DLGetSucc(elt); if (hashValue != ct->hash_value) continue; /* ignore non-matching hash values */ - if (ct->negative || - ItemPointerEquals(pointer, &ct->tuple.t_self)) { if (ct->refcount > 0 || (ct->c_list && ct->c_list->refcount > 0)) { In any case, it is now clear to me that this bug is capable of eating peoples' databases, as in "what just happened to our most critical table?" "Uh, it's not there anymore, boss, but we seem to have duplicate pg_class entries for this other table." This may not have happened yet in the field, but that's probably only because 9.0 hasn't been in production that long. Doing nothing about it in the released branches is unacceptable IMO. People that that happens to will not be satisfied with a response like "you shouldn't have been using VACUUM FULL on system catalogs, it's buggy". For that matter, I'm not convinced that Gould hasn't seen some form of this --- he's mentioned seeing bizarre errors above and beyond the "could not find pg_class tuple for index 2662" one.
As for it not being the last bug, no, it's probably not. That's a pretty sucky excuse for not fixing it too. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sha1, sha2 functions into core?
On Thu, Aug 11, 2011 at 5:46 PM, Tom Lane t...@sss.pgh.pa.us wrote: Marko Kreen mark...@gmail.com writes: On Wed, Aug 10, 2011 at 9:19 PM, Tom Lane t...@sss.pgh.pa.us wrote: ... which this approach would create, because digest() isn't restricted to just those algorithms. I think it'd be better to just invent two new functions, which also avoids issues for applications that currently expect the digest functions to be installed in pgcrypto's schema. I would suggest digest() with a fixed list of algorithms: md5, sha1, sha2. The uncommon/obsolete algorithms that can be used from digest() if compiled with openssl are not something we need to worry over. In fact we have never supported them, as no testing has been done. Hmm ... they may be untested by us, but I feel sure that if we remove that functionality from pgcrypto, *somebody* is gonna complain. If you don't want to break digest() but do not want such behaviour in core, we could go with hash(data, algo) that has a fixed number of digests, but also a couple of non-cryptographic hashes like crc32, lookup2/3. This would also fix the problem of people using hashtext() in user code. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
Robert Haas robertmh...@gmail.com wrote: Perhaps we should just fix this one in master and consider back-patching it if and when we get some plausibly related bug reports. I'm not completely clear on what one would do to be vulnerable to hitting the bug, or what the impact of hitting it would be. Tom said: The potential consequences are hugely worse than that, though: with a bit more time between the two operations, it'd be possible for someone else to reclaim the dead tuple and replace it with something else. As long as the TID is live when we get to it, heap_update will blindly replace it, whether or not it has anything to do with the tuple we think we're replacing. What would a worst-case bug report based on hitting that look like? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Inserting heap tuples in bulk in COPY
COPY is slow. Let's make it faster. One obvious optimization is to insert heap tuples in bigger chunks, instead of calling heap_insert() separately for every tuple. That saves the overhead of pinning and locking the buffer for every tuple, and you only need to write one WAL record for all the tuples written to the same page, instead of one for each tuple. Attached is a WIP patch to do that. It adds a new function, heap_multi_insert, which does the same thing as heap_insert, but works in bulk. It takes an array of tuples as an argument, and tries to cram as many of them into the chosen target page as it can, and only writes a single WAL record of the operation. This gives a significant speedup to COPY, particularly for narrow tables, with small tuples. Grouping multiple tuples into one WAL record reduces the WAL volume significantly, and the time spent in writing that WAL. The reduced overhead of repeatedly locking the buffer is also most noticeable on narrow tables. On wider tables, the effects are smaller. See copytest-results.txt, containing test results with three tables of different widths. The scripts used to get those numbers are also attached. Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows. So at least for now, the patch simply falls back to inserting one row at a time if there are any triggers on the table. The patch is WIP, mainly because I didn't write the WAL replay routines yet, but please let me know if you see any issues.
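[Editor's note] The batching idea — pack the chosen target page with as many tuples as fit, then emit one WAL record per filled page — can be sketched abstractly. This Python toy (invented names; the real patch works in terms of heap pages and buffers, not bare byte counts) just shows the grouping, where each batch stands for one WAL record:

```python
def plan_batches(tuple_sizes, page_free_space):
    """Group consecutive tuples so that each batch fits on one page."""
    batches, batch, used = [], [], 0
    for size in tuple_sizes:
        if batch and used + size > page_free_space:
            batches.append(batch)  # page full: this batch becomes one WAL record
            batch, used = [], 0
        # A batch always takes at least one tuple, mirroring the patch's
        # "always inserts at least one tuple" rule.
        batch.append(size)
        used += size
    if batch:
        batches.append(batch)
    return batches

# Five 40-byte tuples onto pages with 100 bytes free: three batches, hence
# three WAL records instead of the five that per-tuple heap_insert() writes.
batches = plan_batches([40] * 5, 100)
```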
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 9f1bcf1..0594332 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -24,6 +24,7 @@ * heap_getnext - retrieve next tuple in scan * heap_fetch - retrieve tuple with given tid * heap_insert - insert tuple into a relation + * heap_multi_insert - insert multiple tuples into a relation * heap_delete - delete a tuple from a relation * heap_update - replace a tuple in a relation with another tuple * heap_markpos - mark scan position @@ -1860,11 +1861,39 @@ Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid, int options, BulkInsertState bistate) { + HeapTuple heaptup; + + heaptup = heap_prepare_insert(relation, tup, cid, options, bistate); + + heap_multi_insert(relation, &heaptup, 1, options, bistate); + + /* + * If heaptup is a private copy, release it. Don't forget to copy t_self + * back to the caller's image, too. + */ + if (heaptup != tup) + { + tup->t_self = heaptup->t_self; + heap_freetuple(heaptup); + } + + return HeapTupleGetOid(tup); +} + +/* + * Prepare a tuple for insertion with heap_multi_insert. This sets the + * tuple header fields, assigns an OID, and toasts the tuple if necessary. + * Returns a toasted version of the tuple if it was toasted, or the original + * tuple if not. + * + * This needs to be called for each tuple before calling heap_multi_insert(). + */ +HeapTuple +heap_prepare_insert(Relation relation, HeapTuple tup, CommandId cid, + int options, BulkInsertState bistate) +{ TransactionId xid = GetCurrentTransactionId(); - HeapTuple heaptup; - Buffer buffer; - Buffer vmbuffer = InvalidBuffer; - bool all_visible_cleared = false; + HeapTuple heaptup; if (relation->rd_rel->relhasoids) { @@ -1916,6 +1945,39 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, else heaptup = tup; + return heaptup; +} + +/* + * Inserts tuples to relation.
This always inserts at least one tuple, and + * opportunistically more if the chosen target page happens to have room, up + * to ntuples. Returns the number of tuples inserted. + */ +int +heap_multi_insert(Relation relation, HeapTuple *heaptuples, int ntuples, + int options, BulkInsertState bistate) +{ + HeapTuple heaptup = heaptuples[0]; + Buffer buffer; + Buffer vmbuffer = InvalidBuffer; + bool all_visible_cleared = false; + int i; + int ndone; + char *scratch = NULL; + int scratchused = 0; + Page page; + + /* + * Allocate some memory to use for constructing the WAL record. Using + * palloc() within a critical section is not safe, so we allocate this + * beforehand. + */ + if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation)) + scratch = palloc(BLCKSZ); + + if (IsSystemRelation(relation)) + Assert(ntuples == 1); + /* * Find buffer to insert this tuple into. If the page is all visible, * this
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
Tom Lane t...@sss.pgh.pa.us wrote: In any case, it is now clear to me that this bug is capable of eating peoples' databases, as in what just happened to our most critical table? Uh, it's not there anymore, boss, but we seem to have duplicate pg_class entries for this other table. Based on this, I don't think we want to wait for bug reports. One would be too many. +1 for backpatching a fix. As for it not being the last bug, no, it's probably not. That's a pretty sucky excuse for not fixing it too. I agree. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
On 12.08.2011 21:49, Robert Haas wrote: On Fri, Aug 12, 2011 at 2:09 PM, Tom Lane t...@sss.pgh.pa.us wrote: 2. Forget about targeting catcache invals by TID, and instead just use the key hash value to determine which cache entries to drop. Approach #2 seems a lot less invasive and more trustworthy, but it has the disadvantage that cache invals would become more likely to blow away entries unnecessarily (because of chance hashvalue collisions), even without any VACUUM FULL being done. If we could make approach #1 work reliably, it would result in more overhead during VACUUM FULL but less at other times --- or at least we could hope so. In an environment where lots of sinval overflows and consequent resets happen, we might come out behind due to doubling the number of catcache flushes forced by a reset event. Right at the moment I'm leaning to approach #2. I wonder if anyone sees it differently, or has an idea for a third approach? I don't think it really matters whether we occasionally blow away an entry unnecessarily due to a hash-value collision. IIUC, we'd only need to worry about hash-value collisions between rows in the same catalog; and the number of entries that we have cached had better be many orders of magnitude less than 2^32. If the cache is large enough that we're having hash value collisions more than once in a great while, we probably should have flushed some entries out of it a whole lot sooner and a whole lot more aggressively, because we're likely eating memory like crazy. What would suck, though, is if you have an application that repeatedly creates and drops a temporary table, and the hash value for that happens to match some other table in the database. catcache invalidation would keep flushing the entry for that other table too, and you couldn't do anything about it except for renaming one of the tables. Despite that, +1 for option #2.
The risk of collision seems acceptable, and the consequence of a collision wouldn't be too bad in most applications anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Further news on Clang - spurious warnings
On 12 August 2011 20:10, David E. Wheeler da...@kineticode.com wrote: Ah, great, thanks! Unfortunately, you'll need to build your own Clang to be up to speed on the work I've done here, because the Clang developers both fixed the spurious warning from assigning past what appears to be the end of an array, plus a significant performance issue when building gram.c and other grammar files. Building Clang is not difficult, but it's a little time-consuming. I imagine we'll have to wait until the release and wide availability of Clang 3 before we start to see people using it for day-to-day hacking on Postgres. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: On 12.08.2011 21:49, Robert Haas wrote: I don't think it really matters whether we occasionally blow away an entry unnecessarily due to a hash-value collision. IIUC, we'd only need to worry about hash-value collisions between rows in the same catalog; and the number of entries that we have cached had better be many orders of magnitude less than 2^32. If the cache is large enough that we're having hash value collisions more than once in a great while, we probably should have flushed some entries out of it a whole lot sooner and a whole lot more aggressively, because we're likely eating memory like crazy. What would suck, though, is if you have an application that repeatedly creates and drops a temporary table, and the hash value for that happens to match some other table in the database. catcache invalidation would keep flushing the entry for that other table too, and you couldn't do anything about it except for renaming one of the tables. Despite that, +1 for option #2. The risk of collision seems acceptable, and the consequence of a collision wouldn't be too bad in most applications anyway. Yeah. Also, to my mind this is only a fix that will be used in 9.0 and 9.1 --- now that it's occurred to me that we could use tuple xmin/xmax to invalidate catcaches instead of recording individual TIDs, I'm excited about doing that instead for 9.2 and beyond. I believe that that could result in a significant reduction in sinval traffic, which would be a considerable performance win. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: COPY is slow. No kidding! So at least for now, the patch simply falls back to inserting one row at a time if there are any triggers on the table. Maybe we want to change that to fall back to old ways if there are any FOR EACH ROW triggers, since FOR EACH STATEMENT triggers won't be bothered by this optimization. -- Gurjeet Singh EnterpriseDB Corporation The Enterprise PostgreSQL Company
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote: Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows. Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On 11.08.2011 23:06, Robert Haas wrote: Comments, testing, review appreciated... I would've expected this to use an index-only scan: postgres=# CREATE TABLE foo AS SELECT generate_series(1,10) AS id; SELECT 10 postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10; CREATE INDEX postgres=# VACUUM ANALYZE foo; VACUUM postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10; QUERY PLAN - Index Scan using i_foo on foo (cost=0.00..8.27 rows=1 width=4) Index Cond: (id = 10) (2 rows) If it's not a predicate index, then it works: postgres=# DROP INDEX i_foo; DROP INDEX postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10; QUERY PLAN --- Index Only Scan using i_foo2 on foo (cost=0.00..8.28 rows=1 width=4) Index Cond: (id = 10) (2 rows) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
On Fri, Aug 12, 2011 at 3:14 PM, Tom Lane t...@sss.pgh.pa.us wrote: In any case, it is now clear to me that this bug is capable of eating peoples' databases, as in what just happened to our most critical table? Uh, it's not there anymore, boss, but we seem to have duplicate pg_class entries for this other table. OK, I take your point. That's certainly scary enough to worry about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Fri, Aug 12, 2011 at 4:03 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 11.08.2011 23:06, Robert Haas wrote: Comments, testing, review appreciated... I would've expected this to use an index-only scan: postgres=# CREATE TABLE foo AS SELECT generate_series(1,10) AS id; Ugh. I think there's something wrong with this test: + if (index->amcanreturn && !index->predOK && !index->indexprs) I think that should probably say something like (ind->indpred == NIL || ind->predOK). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On 12.08.2011 22:57, Florian Pflug wrote: On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote: Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows. Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken. Um, yes, you're right. Now I feel silly. The above still applies to BEFORE ROW triggers, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql: bogus descriptions displayed by \d+
On tor, 2011-08-04 at 14:59 -0400, Robert Haas wrote: Well, the facts are: According to the SQL standard, table includes views and foreign tables. According to scientific-ish database literature, a table is a relation and vice versa. So what are you supposed to call it if you mean, specifically, a table? A table is either a base table, a derived table, a transient table, or a viewed table. (SQL/MED adds foreign table.) Just FYI. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql: bogus descriptions displayed by \d+
On Fri, Aug 12, 2011 at 4:11 PM, Peter Eisentraut pete...@gmx.net wrote: On tor, 2011-08-04 at 14:59 -0400, Robert Haas wrote: Well, the facts are: According to the SQL standard, table includes views and foreign tables. According to scientific-ish database literature, a table is a relation and vice versa. So what are you supposed to call it if you mean, specifically, a table? A table is either a base table, a derived table, a transient table, or a viewed table. (SQL/MED adds foreign table.) Just FYI. Base table seems clear enough, and a transient table sounds like a temporary table, but what is a derived table? Is a viewed table a view? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
On Aug 12, 2011, at 10:03 PM, Heikki Linnakangas wrote: On 11.08.2011 23:06, Robert Haas wrote: Comments, testing, review appreciated... I would've expected this to use an index-only scan: postgres=# CREATE TABLE foo AS SELECT generate_series(1,10) AS id; SELECT 10 postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10; CREATE INDEX postgres=# VACUUM ANALYZE foo; VACUUM postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10; QUERY PLAN - Index Scan using i_foo on foo (cost=0.00..8.27 rows=1 width=4) Index Cond: (id = 10) (2 rows) If it's not a predicate index, then it works: postgres=# DROP INDEX i_foo; DROP INDEX postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10; QUERY PLAN --- Index Only Scan using i_foo2 on foo (cost=0.00..8.28 rows=1 width=4) Index Cond: (id = 10) (2 rows) is there any plan to revise the cost for index only scans compared to what it is now? many thanks, hans -- Cybertec Schönig Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra check in 9.0 exclusion constraint unintended consequences
On Fri, 2011-08-12 at 14:58 -0400, Robert Haas wrote: Having thought about this a bit further, I'm coming around to the view that if it isn't worth adding this in master, it's not worth adding at all. I just don't think it's going to get any visibility as a back-branch only doc patch. Fine with me. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] New copyright program
Folks, I noticed that src/tools/copyright looks like it can only be run on Bruce's machine, so this translation to Perl is intended: 1. To make the script idempotent, which allows its safe use in automated tools that might run it many times. 2. To get the script to run on any machine a PostgreSQL developer would be using, as Perl is already required. 3. To make the script more efficient. As the copyright notice we need to munge only appears once per file, it stops once it has made a substitution. Please find attached a patch implementing same. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate

diff --git a/src/tools/copyright b/src/tools/copyright
deleted file mode 100755
index b0d61e1..0000000
--- a/src/tools/copyright
+++ /dev/null
@@ -1,14 +0,0 @@
-#!/bin/sh
-
-# src/tools/copyright
-
-echo "Using current year:" `date '+%Y'`
-
-rgrep -l 'Copyright.*PostgreSQL Global Development Group' | while read FILE
-do
-	pipe sed 's/^\(.*Copyright (c) [12][0-9][0-9][0-9]\) \?- \?[12][0-9][0-9][0-9] \?,\?\( PostgreSQL Global Development Group.*\)$/\1-'`date '+%Y'`',\2/' $FILE
-	# handle cases where only one year appears
-	pipe sed 's/^\(.*Copyright (c) [12][0-9][0-9][0-9]\) \?,\?\( PostgreSQL Global Development Group.*\)$/\1-'`date '+%Y'`',\2/' $FILE
-done
-
-echo "Manually update doc/src/sgml/legal.sgml and src/interfaces/libpq/libpq.rc.in too" 1>&2
diff --git a/src/tools/copyright.pl b/src/tools/copyright.pl
new file mode 100755
index 0000000..d981911
--- /dev/null
+++ b/src/tools/copyright.pl
@@ -0,0 +1,43 @@
+#!/usr/bin/perl
+#
+# copyright.pl -- update copyright notices throughout the source tree,
+# idempotently.
+#
+# Copyright (c) 2011, PostgreSQL Global Development Group
+#
+# src/tools/copyright.pl
+#
+
+use strict;
+use warnings;
+
+use File::Find;
+use Tie::File;
+
+my $pgdg = 'PostgreSQL Global Development Group';
+my $cc   = 'Copyright \(c\) ';
+# year-1900 is what localtime(time) puts in element 5
+my $year = 1900 + ${[localtime(time)]}[5];
+
+print "Using current year: $year\n";
+
+find({wanted => \&wanted, no_chdir => 1}, '.');
+
+sub wanted
+{
+	return unless -f $File::Find::name;
+
+	my @lines;
+	tie @lines, "Tie::File", $File::Find::name;
+
+	foreach my $line (@lines)
+	{
+		# We only care about lines with a copyright notice.
+		next unless $line =~ m/$cc.*$pgdg/;
+		# We stop when we've done one substitution.  This is both for
+		# efficiency and, at least in the case of this program, for
+		# correctness.
+		last if $line =~ m/$cc.*$year.*$pgdg/;
+		last if $line =~ s/($cc\d{4})(, $pgdg)/$1-$year$2/;
+		last if $line =~ s/($cc\d{4})-\d{4}(, $pgdg)/$1-$year$2/;
+	}
+	untie @lines;
+}
+
+print "Manually update doc/src/sgml/legal.sgml and src/interfaces/libpq/libpq.rc.in too\n";
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
On Fri, Aug 12, 2011 at 7:09 PM, Tom Lane t...@sss.pgh.pa.us wrote: Right at the moment I'm leaning to approach #2. I wonder if anyone sees it differently, or has an idea for a third approach? You are trying to solve the problem directly, which seems overkill. With HOT, there is very little need to perform a VACUUM FULL on any shared catalog table. Look at the indexes... I would suggest that VACUUM FULL perform only a normal VACUUM on shared catalog tables, then perform an actual VACUUM FULL only in dire need (some simple heuristic in size and density). This avoids doing a VACUUM FULL unless it is actually necessary to do so. That has the added advantage of not locking out essential tables, which is always a concern. In the unlikely event we do actually have to VACUUM FULL a shared catalog table, nuke any cache entry for the whole shared catalog. That way we absolutely and positively will never get any more bugs in this area, ever again. Sounds harsh, but these events are only actually needed very, very rarely and hygiene is more important than a few minor points of performance. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: COPY is slow. Let's make it faster. One obvious optimization is to insert heap tuples in bigger chunks, instead of calling heap_insert() separately for every tuple. That saves the overhead of pinning and locking the buffer for every tuple, and you only need to write one WAL record for all the tuples written to the same page, instead of one for each tuple. Attached is a WIP patch to do that. It adds a new function, heap_multi_insert, which does the same thing as heap_insert, but works in bulk. It takes an array of tuples as argument, and tries to cram as many of them into the chosen target page as it can, and only writes a single WAL record of the operation. This gives a significant speedup to COPY, particularly for narrow tables, with small tuples. Grouping multiple tuples into one WAL record reduces the WAL volume significantly, and the time spent in writing that WAL. The reduced overhead of repeatedly locking the buffer is also most noticeable on narrow tables. On wider tables, the effects are smaller. See copytest-results.txt, containing test results with three tables of different widths. The scripts used to get those numbers are also attached. Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows. So at least for now, the patch simply falls back to inserting one row at a time if there are any triggers on the table.
The patch is WIP, mainly because I didn't write the WAL replay routines yet, but please let me know if you see any issues. I thought about trying to do this at one point in the past, but I couldn't figure out exactly how to make it work. I think the approach you've taken here is good. Aside from the point already raised about needing to worry only about BEFORE ROW triggers, I don't see any showstoppers. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On 08/12/2011 04:57 PM, Robert Haas wrote: I thought about trying to do this at one point in the past, but I couldn't figure out exactly how to make it work. I think the approach you've taken here is good. Aside from the point already raised about needing to worry only about BEFORE ROW triggers, I don't see any showstoppers. Yeah, this looks very promising indeed. Well done. In fact, I'm asking myself how hard it will be to backport for one particular customer of ours, for whom the WAL load is a major hotspot. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enforcing that all WAL has been replayed after restoring from backup
Magnus Hagander mag...@hagander.net writes: Or add a signal handler in the pg_basebackup client emitting a warning about it? We don't have such a signal handler in pg_dump either. I don't think we should add it. Hmm. I guess an aborted pg_dump will also look ok but actually be corrupt (or incomplete). Good point. What about having the signal handler corrupt the backup by adding some garbage into it? Now the failure case is obvious… Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On Fri, Aug 12, 2011 at 8:16 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: COPY is slow. Let's make it faster. One obvious optimization is to insert heap tuples in bigger chunks, instead of calling heap_insert() separately for every tuple. That saves the overhead of pinning and locking the buffer for every tuple, and you only need to write one WAL record for all the tuples written to the same page, instead of one for each tuple. We don't pin the buffer for every tuple, that optimisation is already done... When we discussed this before you said that it wasn't worth trying to do this additional work - it was certainly a smaller gain than the one we achieved by removing the pinning overhead. Also, we discussed that you would work on buffering the index inserts, which is where the main problem lies. The main heap is only a small part of the overhead if we have multiple indexes already built on a table - which is the use case that causes the most problem. So I'm a little surprised to see you working on this and I'm guessing that the COPY improvement with indexes is barely noticeable. This would be a nice improvement, but not until the bulk index inserts are done. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
2011/8/12 PostgreSQL - Hans-Jürgen Schönig postg...@cybertec.at: is there any plan to revise the cost for index only scans compared to what it is now? That's one of the points I asked for feedback on in my original email. How should the costing be done? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: COPY is slow. Let's make it faster. One obvious optimization is to insert heap tuples in bigger chunks, instead of calling heap_insert() separately for every tuple. That saves the overhead of pinning and locking the buffer for every tuple, and you only need to write one WAL record for all the tuples written to the same page, instead of one for each tuple. Attached is a WIP patch to do that. It adds a new function, heap_multi_insert, which does the same thing as heap_insert, but works in bulk. It takes an array of tuples as argument, and tries to cram as many of them into the chosen targe page as it can, and only writes a single WAL record of the operation. This gives a significant speedup to COPY, particularly for narrow tables, with small tuples. Grouping multiple tuples into one WAL record reduces the WAL volume significantly, and the time spent in writing that WAL. The reduced overhead of repeatedly locking the buffer is also most noticeable on narrow tables. On wider tables, the effects are smaller. See copytest-results.txt, containing test results with three tables of different widths. The scripts used to get those numbers are also attached. Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows. So at least for now, the patch simply falls back to inserting one row at a time if there are any triggers on the table. But generic RI triggers would be ok, right? 
merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] USECS_* constants undefined with float8 timestamps?
Robert Haas wrote: On Fri, Jul 29, 2011 at 11:18 AM, Johann 'Myrkraverk' Oskarsson joh...@2ndquadrant.com wrote: Hi all, I just noticed that the USECS_* constants are not defined when the server is compiled without integer dates and timestamps. Explicitly, timestamp.h is #ifdef HAVE_INT64_TIMESTAMP #define USECS_PER_DAY INT64CONST(86400000000) #define USECS_PER_HOUR INT64CONST(3600000000) #define USECS_PER_MINUTE INT64CONST(60000000) #define USECS_PER_SEC INT64CONST(1000000) #endif Is there a particular reason for this? Even with float8 timestamps there are uses for these constants in extensions. I don't see any particular reason not to define them unconditionally. Well, they are only used by integer dates, so why expand their visibility? The define does make it clear how they are used. I suppose if someone wanted to use them outside that case, we could open them up. It is true that with integer dates now the default, we might find that someone introduces compile problems by using them outside the integer dates scope. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only scans
Robert Haas robertmh...@gmail.com wrote: That's one of the points I asked for feedback on in my original email. How should the costing be done? It seems pretty clear that there should be some cost adjustment. If you can't get good numbers somehow on what fraction of the heap accesses will be needed, I would suggest using a magic number based on the assumption that half the heap access otherwise necessary will be done. It wouldn't be the worst magic number in the planner. Of course, real numbers are always better if you can get them. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] USECS_* constants undefined with float8 timestamps?
Bruce Momjian br...@momjian.us writes: Robert Haas wrote: On Fri, Jul 29, 2011 at 11:18 AM, Johann 'Myrkraverk' Oskarsson joh...@2ndquadrant.com wrote: I just noticed that the USECS_* constants are not defined when the server is compiled without integer dates and timestamps. [snip] I don't see any particular reason not to define them unconditionally. Well, they are only used by integer dates, so why expand their visibility? The define does make it clear how they are used. I suppose if someone wanted to use them outside that case, we could open them up. It is true that with integer dates now the default, we might find that someone introduces compile problems by using them outside the integer dates scope. I found a use for them in PL/Java, which detects at run-time whether the server is using floating point or integer dates. The simplest way was just to use magic numbers instead, on the off chance it's compiled with a server using float dates. -- Johann Oskarsson http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services | Blog: http://my.opera.com/myrkraverk/blog/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On 13.08.2011 00:17, Simon Riggs wrote: Also, we discussed that you would work on buffering the index inserts, which is where the main problem lies. The main heap is only a small part of the overhead if we have multiple indexes already built on a table - which is the use case that causes the most problem. Sure, if you have indexes on the table already, then the overhead of updating them is more significant. I am actually working on that too, I will make a separate post about that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On 13.08.2011 00:26, Merlin Moncure wrote: On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows. So at least for now, the patch simply falls back to inserting one row at a time if there are any triggers on the table. But generic RI triggers would be ok, right? RI triggers are AFTER ROW triggers, which we concluded to be OK after all, so they would be ok. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM FULL versus system catalog cache invalidation
On Fri, Aug 12, 2011 at 09:26:02PM +0100, Simon Riggs wrote: With HOT, there is very little need to perform a VACUUM FULL on any shared catalog table. Look at the indexes... I would suggest that VACUUM FULL perform only a normal VACUUM on shared catalog tables, then perform an actual VACUUM FULL only in dire need (some simple heuristic in size and density). This avoids doing a VACUUM FULL unless it is actually necessary to do so. That has the added advantage of not locking out essential tables, which is always a concern. In the unlikely event we do actually have to VACUUM FULL a shared catalog table, nuke any cache entry for the whole shared catalog. That way we absolutely and positively will never get any more bugs in this area, ever again. Sounds harsh, but these events are only actually needed very, very rarely and hygiene is more important than a few minor points of performance. This is a very optimistic view. My client makes heavy use of temp tables. HOT and autovacuum are not sufficient to keep catalog bloat under control. We run a daily script that calculates the density of the catalog and only VACUUM FULLs those that are severely bloated. Here is a result from a recent bloat check on one db. 'packed' is the number of pages needed for the rows if they were packed, 'bloat' is the multiple of pages in use over the number really needed.

  relation   | tuples | pages | packed | bloat
-------------+--------+-------+--------+-------
 pg_class    |   4292 | 10619 |    114 |  93.2
 pg_depend   |  25666 |  7665 |    217 |  35.4
 pg_attrdef  |   6585 |  7595 |    236 |  32.2
 pg_type     |   4570 |  8177 |    416 |  19.6
 pg_shdepend |  52040 |  7968 |    438 |  18.2

-dg -- David Gould da...@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 01:28:49PM +0100, Simon Riggs wrote: I think there are reasonable arguments to make * prefer_cache = off (default) | on a table level storage parameter, =on will disable the use of BufferAccessStrategy * make cache_spoil_threshold a parameter, with default 0.25 Considering the world of very large RAMs in which we now live, some control of the above makes sense. As long as we are discussion cache settings for tables, I have a client who would like to be able to lock specific tables and indexes into cache as they have strict response time requirements for particular queries. At the moment they are running postgres with a tablespace on ramfs and taking frequent backups, but this is not optimal. -dg -- David Gould da...@sonic.net 510 536 1443510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] OperationalError: FATAL: lock AccessShareLock on object 0/1260/0 is already
This seems to be bug month for my client. Now they are seeing periods where all new connections fail immediately with the error: FATAL: lock AccessShareLock on object 0/1260/0 is already held This happens on PostgreSQL 8.4.7 on a large (512GB, 32 core) system that has been up for months. It started happening sporadically a few days ago. It will do this for a period of several minutes to an hour and then go back to normal for hours or days. One complete failing session out of several hundred around that time:

2011-08-09 00:01:04.446 8823 [unknown] [unknown] LOG: connection received: host=op05.xxx port=34067
2011-08-09 00:01:04.446 8823 c77 apps LOG: connection authorized: user=apps database=c77
2011-08-09 00:01:04.449 8823 c77 apps FATAL: lock AccessShareLock on object 0/1260/0 is already held

What can I do to help track this down? -dg -- David Gould da...@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PL/Perl Returned Array
Hackers, Given this script on 9.1beta3:

BEGIN;
CREATE EXTENSION plperl;
CREATE OR REPLACE FUNCTION wtf() RETURNS TEXT[] LANGUAGE plperl AS $$ return []; $$;
SELECT wtf() = '{}'::TEXT[];
ROLLBACK;

The output is:

BEGIN
CREATE EXTENSION
CREATE FUNCTION
 ?column?
----------
 f
(1 row)

ROLLBACK

Why is that? If I save the values to TEXT[] columns, they are still not the same. But if I pg_dump them and then load them up again, they *are* the same. The dump looks like this:

CREATE TABLE gtf (
    have text[],
    want text[]
);

COPY gtf (have, want) FROM stdin;
{}	{}
\.

Is PL/Perl doing something funky under the hood to array values it returns on 9.1? Thanks, David
Re: [HACKERS] PL/Perl Returned Array
On Fri, Aug 12, 2011 at 18:00, David E. Wheeler da...@kineticode.com wrote: Hackers, Given this script on 9.1beta3: BEGIN; CREATE EXTENSION plperl; CREATE OR REPLACE FUNCTION wtf() RETURNS TEXT[] LANGUAGE plperl AS $$ return []; $$; SELECT wtf() = '{}'::TEXT[]; Why is that? If I save the values to TEXT[] columns, they are still not the same. But if I pg_dump them and then load them up again, they *are* the same. The dump looks like this: Eek. Is PL/Perl doing something funky under the hood to array values it returns on 9.1? Yeah, it's not handling empty arrays very well (it's calling accumArrayResult, which increments the number of elements even when we have a zero-length array). This was rewritten in 9.1 as part of the array overhaul. Previously, returned arrays were converted into a string with the perl sub encode_array_literal(), and Postgres would then convert that string into the appropriate data type. Not only was that a tad inefficient, but it failed to handle composite types nicely. In 9.1 you can pass in arrays with composite types and the like, and we figured it would be awfully nice if you could return them as well :-) Anyway, the attached patch fixes it for me. That is, when we don't have an array state, just return an empty array. (Also adds some additional comments.) I thought about doing this right after we set dims[0] in plperl_array_to_datum(), but that would fail to handle nested empty multidim arrays like [[]]. As long as you can do that via ARRAY[ARRAY[]]... I figure plperl should support it too. Thanks for the report!

*** a/src/pl/plperl/plperl.c
--- b/src/pl/plperl/plperl.c
*************** _array_to_datum(AV *av, int *ndims, int *dims, int cur_depth,
*** 1078,1091 ****
  	int			i = 0;
  	int			len = av_len(av) + 1;

- 	if (len == 0)
- 		astate = accumArrayResult(astate, (Datum) 0, true, atypid, NULL);
-
  	for (i = 0; i < len; i++)
  	{
  		SV		  **svp = av_fetch(av, i, FALSE);
  		SV		   *sav = svp ? get_perl_array_ref(*svp) : NULL;
  		if (sav)
  		{
  			AV		   *nav = (AV *) SvRV(sav);
--- 1078,1092 ----
  	int			i = 0;
  	int			len = av_len(av) + 1;

  	for (i = 0; i < len; i++)
  	{
+ 		/* fetch the array element */
  		SV		  **svp = av_fetch(av, i, FALSE);
+
+ 		/* see if this element is an array, if so get that */
  		SV		   *sav = svp ? get_perl_array_ref(*svp) : NULL;
+
+ 		/* multi-dimensional array? */
  		if (sav)
  		{
  			AV		   *nav = (AV *) SvRV(sav);
*************** plperl_array_to_datum(SV *src, Oid typid)
*** 1149,1154 ****
--- 1150,1158 ----
  	astate = _array_to_datum((AV *) SvRV(src), &ndims, dims, 1,
  							 astate, typid, atypid);

+ 	if (!astate)
+ 		return PointerGetDatum(construct_empty_array(atypid));
+
  	for (i = 0; i < ndims; i++)
  		lbs[i] = 1;
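The shape of the fix can be sketched outside the backend: an accumulation state that stays NULL until the first element is appended, with the caller constructing an empty result when the state never materialized. This is a hypothetical stand-alone model, not the real ArrayBuildState/construct_empty_array() code:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for PostgreSQL's ArrayBuildState: it is created lazily,
 * so after walking an empty (or nested-empty) Perl array it is still NULL. */
typedef struct AccumState
{
	int			nelems;
} AccumState;

/* Modeled on accumArrayResult(): allocate the state on first use. */
static AccumState *
accum_append(AccumState *state)
{
	if (state == NULL)
		state = calloc(1, sizeof(AccumState));
	state->nelems++;
	return state;
}

/* Mirrors the patch: a NULL state means "no elements were ever appended",
 * so produce an empty array (construct_empty_array() in the real code)
 * rather than pretending a single NULL element was accumulated. */
static int
finalize_nelems(const AccumState *state)
{
	return state ? state->nelems : 0;
}
```

The same NULL check covers nested empties like [[]], because the recursion never appends an element in that case either.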
Re: [HACKERS] USECS_* constants undefined with float8 timestamps?
Johann 'Myrkraverk' Oskarsson wrote: Bruce Momjian br...@momjian.us writes: Robert Haas wrote: On Fri, Jul 29, 2011 at 11:18 AM, Johann 'Myrkraverk' Oskarsson joh...@2ndquadrant.com wrote: I just noticed that the USECS_* constants are not defined when the server is compiled without integer dates and timestamps. [snip] I don't see any particular reason not to define them unconditionally. Well, they are only used by integer dates, so why expand their visibility? The define does make it clear how they are used. I suppose if someone wanted to use them outside that case, we could open them up. It is true that with integer dates now the default, we might find that someone introduces compile problems by using them outside the integer-dates scope. I found a use for them in PL/Java, which detects at run time whether the server is using floating-point or integer dates. The simplest way was just to use magic numbers instead, on the off chance it's compiled with a server using float dates. OK, that is a good reason. Done for PG 9.2. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
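For reference, a self-contained sketch of what the unconditional definitions buy client-side code such as PL/Java: the constant values below match PostgreSQL's datetime.h, but the helper function is illustrative, not part of any real API:

```c
#include <assert.h>
#include <stdint.h>

/* Values as in src/include/utils/datetime.h; with integer timestamps a
 * Timestamp is int64 microseconds since 2000-01-01. */
#define USECS_PER_DAY		INT64_C(86400000000)
#define USECS_PER_HOUR		INT64_C(3600000000)
#define USECS_PER_MINUTE	INT64_C(60000000)
#define USECS_PER_SEC		INT64_C(1000000)

/* Hypothetical helper: split a time-of-day value into components -- the
 * kind of arithmetic that otherwise ends up written with magic numbers. */
static void
split_time(int64_t t, int *hour, int *min, int *sec, int *usec)
{
	t %= USECS_PER_DAY;
	*hour = (int) (t / USECS_PER_HOUR);
	t %= USECS_PER_HOUR;
	*min = (int) (t / USECS_PER_MINUTE);
	t %= USECS_PER_MINUTE;
	*sec = (int) (t / USECS_PER_SEC);
	*usec = (int) (t % USECS_PER_SEC);
}
```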
Re: [HACKERS] pgbench internal contention
Robert Haas wrote: Works for me. Just to confirm, that means we'd also change GEQO to use pg_erand48(). BTW, as far as the original plan of using random_r is concerned, how did you manage to not run into this? http://sourceware.org/bugzilla/show_bug.cgi?id=3662 I just wasted half an hour on that stupidity in an unrelated context... Good grief. It's hard to imagine a more user-hostile attitude than the one taken there. I did run into that precise issue, but managed Yes, it is scary to think people are relying on software written by people with such a poor attitude toward quality. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
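The appeal of pg_erand48() here is that each pgbench thread can carry its own 48-bit state, so there is no contention on shared PRNG state and no dependence on glibc's random_r(). A rough sketch using POSIX erand48(), which pg_erand48() reimplements portably (the seeding scheme below is made up for illustration):

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stdlib.h>

/* Per-thread PRNG state: no locks, nothing shared across threads. */
typedef struct ThreadState
{
	unsigned short xseed[3];
} ThreadState;

static void
thread_seed(ThreadState *ts, unsigned short seed)
{
	ts->xseed[0] = 0x330e;			/* conventional rand48 low word */
	ts->xseed[1] = seed;
	ts->xseed[2] = seed ^ 0x5851;	/* arbitrary mixing, illustration only */
}

/* Uniform integer in [min, max]; erand48() returns a double in [0, 1). */
static long
thread_getrand(ThreadState *ts, long min, long max)
{
	return min + (long) ((max - min + 1) * erand48(ts->xseed));
}
```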
Re: [HACKERS] PL/Perl Returned Array
On Aug 12, 2011, at 6:17 PM, Alex Hunsaker wrote: Anyway, the attached patch fixes it for me. That is when we don't have an array state, just return an empty array. (Also adds some additional comments) Fix confirmed, thank you! +1 to getting this committed before the next beta/RC. David
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote: Only 96 of the 14286 buffers in sample_data are in shared_buffers, despite the fact that we have 37,218 *completely unused* buffers lying around. That sucks, because it means that the sample query did a whole lot of buffer eviction that was completely needless. The ring buffer is there to prevent trashing the shared buffer arena, but here it would have been fine to trash the arena: there wasn't anything there we would have minded losing (or, indeed, anything at all). I don't disagree with the general thrust of your point, but I just wanted to point out one case where not using free buffers even though they're available might make sense: if you execute a large batch delete or update, or even just set lots of hint bits, you'll dirty a lot of buffers. The ring buffer forces the query that is actually dirtying all these buffers to also do the I/O to write them out. Otherwise you leave them behind to slow down other queries. This was one of the problems with the old vacuum code, which the ring buffer replaced: it left behind lots of dirtied buffers for other queries to do I/O on. -- greg
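A toy model of the trade-off Greg describes: a BufferAccessStrategy ring recycles a small, fixed set of buffers, and when a recycled buffer is dirty, the scanning backend itself pays for the write. Names and sizes here are illustrative, not the real buffer-manager code:

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8				/* real rings are sized in kB, not slots */

typedef struct Ring
{
	bool		dirty[RING_SIZE];
	int			next;			/* next slot to recycle */
	int			writes;			/* I/O this backend performed itself */
} Ring;

/* Hand out the next ring slot; if it is dirty, the caller (the query that
 * dirtied it) flushes it before reuse, instead of leaving that I/O behind
 * for other queries, as the pre-ring vacuum code did. */
static int
ring_get_buffer(Ring *ring)
{
	int			slot = ring->next;

	ring->next = (ring->next + 1) % RING_SIZE;
	if (ring->dirty[slot])
	{
		ring->writes++;			/* simulate flushing the dirty page */
		ring->dirty[slot] = false;
	}
	return slot;
}
```

However large the scanned table, only RING_SIZE buffers of the arena are ever touched, which is exactly why an otherwise-empty arena stays cold in Robert's example.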
Re: [HACKERS] PL/Perl Returned Array
David E. Wheeler wrote: On Aug 12, 2011, at 6:17 PM, Alex Hunsaker wrote: Anyway, the attached patch fixes it for me. That is when we don't have an array state, just return an empty array. (Also adds some additional comments) Fix confirmed, thank you! +1 to getting this committed before the next beta/RC. Policy question. If this problem hadn't been detected until after 9.1 was declared production, would it have been considered a bug to fix in 9.1.x or would it have been delayed to 9.2 since that fix might be considered an incompatibility? If the latter, then I'm really glad it was found now. -- Darren Duncan
Re: [HACKERS] Transient plans versus the SPI API
Tom Lane wrote: Note that the SPI functions are more or less directly exposed in PL/Perl and PL/Python, and there are a number of existing idioms there that make use of prepared plans. Changing the semantics of those functions might upset a lot of code. Right, but by the same token, if we don't change the default behavior, there is going to be a heck of a lot of code requiring manual adjustment before it can make use of the (hoped-to-be) improvements. To me it makes more sense to change the default and then provide ways for people to lock down the behavior if the heuristic doesn't work for them. Agreed. I think the big sticking point is that without logic on how the replanning will happen, users have to guess how much impact this new default behavior will have. I also agree that this will harm some use cases but improve a larger pool of users. Remember, the people on this email list are probably using this feature in a much more sophisticated way than the average user. Also, there is a TODO idea that the results found by executing the query (e.g. the number of rows returned at each stage) could be fed back to affect the replanning of queries. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com