Re: [PERFORM] Finding bloated indexes?
On Fri, 2007-04-13 at 14:01 -0600, Dan Harris wrote:
> Is there a pg_stat_* table or the like that will show how bloated an
> index is? I am trying to squeeze some disk space and want to track down
> where the worst offenders are before performing a global REINDEX on all
> tables, as the database is roughly 400GB on disk and this takes a very
> long time to run. I have been able to do this with tables, using a
> helpful view posted to this list a few months back, but I'm not sure if
> I can get the same results on indexes.

Use pgstatindex in contrib/pgstattuple.

-- 
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faq
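A minimal sketch of how pgstatindex can be used to spot bloat, assuming the functions have been installed from contrib/pgstattuple; the index name is a placeholder. A low avg_leaf_density (or high leaf_fragmentation) marks an index as a REINDEX candidate:

```sql
-- Load the pgstattuple functions first (installation path varies):
-- \i /usr/share/postgresql/contrib/pgstattuple.sql

-- 'public.some_index' is a placeholder; run per index of interest.
SELECT index_size,
       avg_leaf_density,      -- low density => wasted space
       leaf_fragmentation
FROM pgstatindex('public.some_index');
```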
[PERFORM] Basic Q on superfluous primary keys
Consider these two very similar schemas:

Schema 1:

CREATE TABLE foo (
    id serial PRIMARY KEY,
    frobnitz character varying(100) NOT NULL UNIQUE
);

CREATE TABLE bar (
    id serial PRIMARY KEY,
    foo_id int REFERENCES foo(id)
);

Schema 2:

CREATE TABLE foo (
    frobnitz character varying(100) PRIMARY KEY
);

CREATE TABLE bar (
    id serial PRIMARY KEY,
    frobnitz character varying(100) REFERENCES foo(frobnitz)
);

The two situations are semantically identical: each record in table bar refers to a record in table foo. The difference is that in the first schema the referencing is done through an artificial serial-integer primary key, while in the second schema the reference is done through a data field that happens to be unique and not null, and so can serve as the primary key.

I find Schema 1 awkward and unnatural; more specifically, foo.id seems unnecessary in light of the non-null uniqueness of foo.frobnitz. But I remember once reading that long fields like foo.frobnitz do not make good primary keys.

Is the field foo.id in Schema 1 superfluous? For example, wouldn't the referencing from bar to foo really be done behind the scenes through some hidden field (oid?) instead of through the frobnitz text field? Which of the two schemas would give better performance?

Thanks!

kj
Re: [PERFORM] Basic Q on superfluous primary keys
In response to Kynn Jones [EMAIL PROTECTED]:

> Consider these two very similar schemas:
>
> Schema 1:
>
> CREATE TABLE foo (
>     id serial PRIMARY KEY,
>     frobnitz character varying(100) NOT NULL UNIQUE
> );
>
> CREATE TABLE bar (
>     id serial PRIMARY KEY,
>     foo_id int REFERENCES foo(id)
> );
>
> Schema 2:
>
> CREATE TABLE foo (
>     frobnitz character varying(100) PRIMARY KEY
> );
>
> CREATE TABLE bar (
>     id serial PRIMARY KEY,
>     frobnitz character varying(100) REFERENCES foo(frobnitz)
> );
>
> The two situations are semantically identical: each record in table bar
> refers to a record in table foo. The difference is that in the first
> schema the referencing is done through an artificial serial-integer
> primary key, while in the second schema the reference is done through a
> data field that happens to be unique and not null, so it can serve as
> primary key.

The first case is called a surrogate key. A little googling on that term
will turn up a wealth of discussion -- both for and against.

> I find Schema 1 awkward and unnatural; more specifically, foo.id seems
> unnecessary in light of the non-null uniqueness of foo.frobnitz. But I
> remember once reading that long fields like foo.frobnitz do not make
> good primary keys.

I had a discussion about this recently on the Drupal mailing lists, at
the end of which I promised to do some benchmarking to determine whether
or not text keys really do hurt performance of indexes. Unfortunately, I
still haven't followed through on that promise -- maybe I'll get to it
tomorrow.

> Is the field foo.id in Schema 1 superfluous? For example, wouldn't the
> referencing from bar to foo really be done behind the scenes through
> some hidden field (oid?) instead of through the frobnitz text field?
> Which of the two schemas would give better performance?

-- 
Bill Moran
Collaborative Fusion Inc.
[EMAIL PROTECTED]
Phone: 412-422-3463x4023
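In the spirit of the benchmark promised above, a rough sketch of how one might compare the on-disk cost of the two key choices; the index names are what PostgreSQL would generate for the schemas in the original post, and pg_relation_size assumes 8.1 or later:

```sql
-- After loading the same rows into both layouts, compare index sizes.
-- An index on a wide varchar key is generally larger than one on int.
SELECT c.relname, pg_relation_size(c.oid) AS bytes
FROM pg_class c
WHERE c.relname IN ('foo_pkey', 'foo_frobnitz_key', 'bar_pkey')
ORDER BY bytes DESC;

-- And compare join plans through each kind of key:
EXPLAIN ANALYZE SELECT * FROM bar b JOIN foo f ON f.id = b.foo_id;
```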
Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)
Dann Corbit [EMAIL PROTECTED] writes:
> Instead of sorting, I suggest the quickselect() algorithm, which is
> O(n).

What for? Common cases have less than half a dozen entries. That is not
the place we need to be spending engineering effort --- what we need to
worry about is what's the choice algorithm, not implementation details.

			regards, tom lane
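For context, the quickselect() Dann refers to finds the k-th smallest element in expected linear time without fully sorting the input. A minimal sketch (purely illustrative, not the code under discussion):

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element of xs (0-based), expected O(n)."""
    pivot = random.choice(xs)
    lows = [x for x in xs if x < pivot]      # strictly below the pivot
    pivots = [x for x in xs if x == pivot]   # equal to the pivot
    highs = [x for x in xs if x > pivot]     # strictly above the pivot
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))
```

As Tom notes, with half a dozen entries the asymptotic win is irrelevant; a plain sort is simpler and just as fast at that scale.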
Re: [PERFORM] Question about memory allocations
Ron [EMAIL PROTECTED] writes:
> One of the reasons for the wide variance in suggested values for pg
> memory use is that pg 7.x and pg 8.x are =very= different beasts. If
> you break the advice into pg 7.x and pg 8.x categories, you find that
> there is far less variation in the suggestions. Bottom line: pg 7.x
> could not take advantage of larger sums of memory anywhere near as well
> as pg 8.x can.

Actually I think it was 8.1 that really broke the barrier in terms of
scalability of shared_buffers. Pre-8.1, the buffer manager just didn't
scale well enough to make it useful to use more than a few hundred meg.
(In fact, we never even bothered to fix the shared-memory-sizing
calculations to be able to deal with >2GB shared memory until 8.1; if
you try it in 8.0 it'll probably just crash.)

			regards, tom lane
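As an illustration only (the values here are assumptions, not tuning advice), the setting under discussion lives in postgresql.conf; note that 8.1 takes a raw count of 8kB buffers rather than a memory-unit string:

```
# postgresql.conf -- hypothetical 8.1 box with several GB of RAM
shared_buffers = 65536        # 65536 x 8kB buffers = 512MB
# Pre-8.1, values this large bought little; a few thousand buffers
# (tens of MB) was the practical ceiling.
```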
Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)
Alvaro Herrera [EMAIL PROTECTED] writes:
> I think the concern about condition redundancy should be attacked
> separately. How about just comparing whether they have common prefixes
> of conditions? I admit I don't understand what would happen with
> indexes defined like (lower(A), B, C) versus (A, B) for example.

I understand that issue a bit better than I did when I wrote the comment
(so I suppose I'd better rewrite it). The $64 reason for rejecting
AND-combinations of indexes that are using some of the same
WHERE-conditions is that if we don't, we effectively double-count the
selectivity of those conditions, causing us to prefer useless
AND-combinations. An example is WHERE A > 42 AND B < 100 where we have
an index on A and one on (A,B). The selectivity calculation will blindly
assume that the selectivities of the two indexes are independent and
thus prefer to BitmapAnd them, when obviously there is no point in using
both. Ideally we should improve the selectivity calculation to not get
fooled like this, but that seems hard and probably slow. So for the
moment we have the heuristic that no WHERE-clause should be used twice
in any AND-combination.

Given that we are using that heuristic, it becomes important that we
visit the indexes in the right order --- as the code stands, in the (A)
vs (A,B) case it will consider only the first index it arrives at in the
selectivity sort order, because the second will be rejected on the basis
of re-use of the WHERE A > 42 condition. Sorting by selectivity tends to
ensure that we pick the index that can make use of as many WHERE-clauses
as possible.

The idea of considering each index alone fixes the order dependency for
cases where a single index is the best answer, but it doesn't help much
for cases where you really do want a BitmapAnd, only not one using the
index with the individually best selectivity.

We really need a heuristic here --- exhaustive search will be
impractical in cases where there are many indexes, because you'd be
looking at 2^N combinations. (In Steve's example there are actually 38
relevant indexes, which is bad database design if you ask me, but in any
case we cannot afford to search through 2^38 possibilities.) But the one
we're using now is too fragile. Maybe we should use a cutoff similar to
the GEQO one: do exhaustive search if there are fewer than N relevant
indexes, for some N. But that's not going to help Steve; we still need a
smarter heuristic for what to look for above the cutoff.

			regards, tom lane
Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)
Alvaro Herrera [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> Has anyone got any thoughts about the best way to do this?
>
> How about doing both: sort the index by index scan cost; then pick the
> first index on the list and start adding indexes when they lower the
> cost. When adding each index, consider it by itself against the already
> stacked indexes. If the cost is lower, put this index at the top of the
> list, and restart the algorithm (after the sorting step of course).

The restart part of that bothers me --- it's not entirely clear that you
couldn't get into an infinite loop. (Imagine that A+B is better than A
alone, so we adopt it, but worse than C alone, so we take C as the new
leader and start over. Then perhaps C+B is better than C alone but worse
than A alone, so we take A as the leader and start over. Maybe this is
impossible, but I'm unsure.)

I looked back at the gdb results I'd gotten from Steve's example and
noticed that for his 38 indexes there were only three distinct index
selectivity values, and the sort step grouped indexes by cost within
those groups. In hindsight of course this is obvious: the selectivity
depends on the set of WHERE-clauses used, so with two WHERE-clauses
there are three possible selectivities (to within roundoff error anyway)
depending on whether the index uses one or both clauses. So the existing
algorithm gets one thing right: for any two indexes that make use of
just the same set of WHERE-clauses, it will always take the one with
cheaper scan cost.

Thinking more about this leads me to the following proposal:

1. Explicitly group the indexes according to the subset of
WHERE-conditions (and partial index conditions, if any) they use. Within
each such group, discard all but the cheapest-scan-cost one.

2. Sort the remaining indexes according to scan cost.

3. For each index in order, consider it as a standalone scan, and also
consider adding it on to the AND-group led by each preceding index,
using the same logic as now: reject using any WHERE-condition twice in a
group, and then add on only if the total cost of the AND-group scan is
reduced.

This would be approximately O(N^2) in the number of relevant indexes,
whereas the current scheme is roughly linear (handwaving a bit here
because the number of WHERE-clauses is a factor too). But that seems
unlikely to be a problem for sane numbers of indexes, unlike the O(2^N)
behavior of an exhaustive search. It would get rid of (most of) the
order-sensitivity problem, since we would definitely consider the
AND-combination of every pair of combinable indexes. I can imagine cases
where it wouldn't notice useful three-or-more-way combinations because
the preceding two-way combination didn't win, but they seem pretty
remote possibilities.

Comments?

			regards, tom lane
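The three-step proposal above can be sketched in ordinary code. This is a toy model only: the index descriptors, the flat HEAP_COST constant, and the multiplicative selectivity model are all invented for illustration and are nothing like the planner's real cost machinery.

```python
HEAP_COST = 1000.0  # invented: cost of heap visits at selectivity 1.0

def group_cost(members):
    """Toy cost model: summed scan costs plus heap cost scaled by the
    product of member selectivities (assumes independence)."""
    sel = 1.0
    scan = 0.0
    for ix in members:
        sel *= ix['selectivity']
        scan += ix['cost']
    return scan + HEAP_COST * sel

def choose_bitmap_and(indexes):
    # Step 1: group by the subset of WHERE-conditions used; keep only the
    # cheapest-scan-cost index in each group.
    cheapest = {}
    for ix in indexes:
        key = ix['clauses']
        if key not in cheapest or ix['cost'] < cheapest[key]['cost']:
            cheapest[key] = ix
    # Step 2: sort the survivors by scan cost.
    ordered = sorted(cheapest.values(), key=lambda ix: ix['cost'])
    # Step 3: consider each index standalone, and as an addition to every
    # preceding AND-group; reject any WHERE-condition re-use, and accept
    # the addition only if the group's total cost drops.
    groups = []
    for ix in ordered:
        for g in groups:
            if g['clauses'] & ix['clauses']:
                continue  # would use a WHERE-condition twice
            candidate = g['members'] + [ix]
            cost = group_cost(candidate)
            if cost < g['cost']:
                g['members'] = candidate
                g['clauses'] = g['clauses'] | ix['clauses']
                g['cost'] = cost
        groups.append({'members': [ix],
                       'clauses': frozenset(ix['clauses']),
                       'cost': group_cost([ix])})
    return min(groups, key=lambda g: g['cost'])
```

With an index on A, one on B, and one on (A,B), the sketch refuses to AND the (A,B) index with either single-column index (clause re-use) and ends up preferring the two-column index alone, mirroring the (A) vs (A,B) example earlier in the thread.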
[PERFORM] FK triggers misused?
I have a performance problem with the following simple update query:

UPDATE posts SET num_views = num_views + 1
WHERE post_id IN (2526,5254,2572,4671,25);

The table posts is a large table with a number of foreign keys (FK). It
seems that the FK triggers for the table are evaluated even though none
of the FK columns are altered. In fact, these FK triggers seem to
constitute a considerable part of the total execution time. See the
EXPLAIN ANALYZE below. Why are these FK triggers evaluated at all, and
why do they take so much time?

EXPLAIN ANALYZE update posts set num_views = num_views + 1
  where post_id in (2526,5254,2572,4671,25);

                          QUERY PLAN
--------------------------------------------------------------
 Bitmap Heap Scan on posts  (cost=10.02..29.81 rows=5 width=1230) (actual time=0.146..0.253 rows=5 loops=1)
   Recheck Cond: ((post_id = 2526) OR (post_id = 5254) OR (post_id = 2572) OR (post_id = 4671) OR (post_id = 25))
   ->  BitmapOr  (cost=10.02..10.02 rows=5 width=0) (actual time=0.105..0.105 rows=0 loops=1)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0) (actual time=0.053..0.053 rows=2 loops=1)
               Index Cond: (post_id = 2526)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0) (actual time=0.012..0.012 rows=2 loops=1)
               Index Cond: (post_id = 5254)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0) (actual time=0.008..0.008 rows=2 loops=1)
               Index Cond: (post_id = 2572)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0) (actual time=0.010..0.010 rows=2 loops=1)
               Index Cond: (post_id = 4671)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0) (actual time=0.011..0.011 rows=2 loops=1)
               Index Cond: (post_id = 25)
 Trigger for constraint posts_question_id_fkey: time=50.031 calls=5
 Trigger for constraint posts_author_id_fkey: time=22.330 calls=5
 Trigger for constraint posts_language_id_fkey: time=1.282 calls=5
 Trigger posts_tsvectorupdate: time=61.659 calls=5
 Total runtime: 174.230 ms
(18 rows)
Re: [PERFORM] FK triggers misused?
cluster [EMAIL PROTECTED] writes:
> It seems that the FK triggers for the table are evaluated even though
> none of the FK columns are altered.

Hm, they're not supposed to be, at least not in reasonably modern PG
releases (and one that breaks out trigger runtime in EXPLAIN ANALYZE
should be modern enough IIRC). Exactly which PG release are you running?
Can you provide a self-contained test case?

			regards, tom lane
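A self-contained test case along the lines Tom asks for might look like this; the table and column names here are invented to mirror the original report, not taken from the poster's actual schema:

```sql
CREATE TABLE authors (author_id int PRIMARY KEY);
CREATE TABLE posts (
    post_id   int PRIMARY KEY,
    author_id int NOT NULL REFERENCES authors(author_id),
    num_views int NOT NULL DEFAULT 0
);
INSERT INTO authors VALUES (1);
INSERT INTO posts VALUES (1, 1, 0);
INSERT INTO posts VALUES (2, 1, 0);

-- The update touches no FK column, so on a release that skips the
-- unchanged-key check, the "Trigger for constraint ..._fkey" line
-- should show no significant time here:
EXPLAIN ANALYZE UPDATE posts SET num_views = num_views + 1
WHERE post_id IN (1, 2);
```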
Re: [PERFORM] FK triggers misused?
On 2007-04-15, Tom Lane [EMAIL PROTECTED] wrote:
> cluster [EMAIL PROTECTED] writes:
>> It seems that the FK triggers for the table are evaluated even though
>> none of the FK columns are altered.
>
> Hm, they're not supposed to be, at least not in reasonably modern PG
> releases (and one that breaks out trigger runtime in EXPLAIN ANALYZE
> should be modern enough IIRC). Exactly which PG release are you
> running? Can you provide a self-contained test case?

Looking at current CVS code, the RI check seems to be skipped on update
of the _referred to_ table if the old and new values match, but not on
update of the _referring_ table.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services