Re: [PERFORM] Finding bloated indexes?

2007-04-14 Thread Simon Riggs
On Fri, 2007-04-13 at 14:01 -0600, Dan Harris wrote:
> Is there a pg_stat_* table or the like that will show how bloated an
> index is?  I am trying to squeeze some disk space and want to track
> down where the worst offenders are before performing a global REINDEX
> on all tables, as the database is roughly 400GB on disk and this takes
> a very long time to run.
>
> I have been able to do this with tables, using a helpful view posted
> to this list a few months back, but I'm not sure if I can get the same
> results on indexes.

Use pgstatindex in contrib/pgstattuple
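
A minimal sketch of how that might look (assuming contrib/pgstattuple has
been installed into the database, e.g. by running its pgstattuple.sql
script; column names as per the pgstattuple documentation):

  -- Inspect a single btree index; low avg_leaf_density and high
  -- leaf_fragmentation are the usual signs of bloat:
  SELECT * FROM pgstatindex('public.some_index');

  -- Rank all btree indexes in a schema by leaf density (note that this
  -- reads every index, so it will be slow on a 400GB database):
  SELECT c.relname AS index_name,
         (pgstatindex(n.nspname || '.' || c.relname)).avg_leaf_density,
         (pgstatindex(n.nspname || '.' || c.relname)).leaf_fragmentation
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  JOIN pg_index i ON i.indexrelid = c.oid
  JOIN pg_am a ON a.oid = c.relam
  WHERE a.amname = 'btree'
    AND n.nspname = 'public'
  ORDER BY 2;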

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





[PERFORM] Basic Q on superfluous primary keys

2007-04-14 Thread Kynn Jones

Consider these two very similar schemas:

Schema 1:


CREATE TABLE foo (
 id serial PRIMARY KEY,
 frobnitz character varying(100) NOT NULL UNIQUE
);


CREATE TABLE bar (
 id serial PRIMARY KEY,
 foo_id int REFERENCES foo(id)
);


Schema 2:


CREATE TABLE foo (
 frobnitz character varying(100) PRIMARY KEY
);


CREATE TABLE bar (
 id serial PRIMARY KEY,
 frobnitz character varying(100) REFERENCES foo(frobnitz)
);




The two situations are semantically identical: each record in table bar
refers to a record in table foo.  The difference is that in the first
schema, this referencing is done through an artificial serial-integer
primary key, while in the second schema this reference is done through a
data field that happens to be unique and not null, so it can serve as
primary key.


I find Schema 1 awkward and unnatural; more specifically, foo.id seems
unnecessary in light of the non-null uniqueness of foo.frobnitz.  But I
remember once reading that long fields like foo.frobnitz did not make good
primary keys.


Is the field foo.id in Schema 1 superfluous?  For example, wouldn't the
referencing from bar to foo really be done behind the scenes through some
hidden field (oid?) instead of through the frobnitz text field?  Which of
the two schemas would give better performance?


Thanks!


kj


Re: [PERFORM] Basic Q on superfluous primary keys

2007-04-14 Thread Bill Moran
In response to Kynn Jones [EMAIL PROTECTED]:

> Consider these two very similar schemas:
>
> Schema 1:
>
> CREATE TABLE foo (
>   id serial PRIMARY KEY,
>   frobnitz character varying(100) NOT NULL UNIQUE
> );
>
> CREATE TABLE bar (
>   id serial PRIMARY KEY,
>   foo_id int REFERENCES foo(id)
> );
>
> Schema 2:
>
> CREATE TABLE foo (
>   frobnitz character varying(100) PRIMARY KEY
> );
>
> CREATE TABLE bar (
>   id serial PRIMARY KEY,
>   frobnitz character varying(100) REFERENCES foo(frobnitz)
> );
>
> The two situations are semantically identical: each record in table bar
> refers to a record in table foo.  The difference is that in the first
> schema, this referencing is done through an artificial serial-integer
> primary key, while in the second schema this reference is done through a
> data field that happens to be unique and not null, so it can serve as
> primary key.

The first case is called a surrogate key.  A little googling on that term
will turn up a wealth of discussion -- both for and against.

> I find Schema 1 awkward and unnatural; more specifically, foo.id seems
> unnecessary in light of the non-null uniqueness of foo.frobnitz.  But I
> remember once reading that long fields like foo.frobnitz did not make good
> primary keys.

I had a discussion about this recently on the Drupal mailing lists, at the
end of which I promised to do some benchmarking to determine whether or
not text keys really do hurt performance of indexes.  Unfortunately, I
still haven't followed through on that promise -- maybe I'll get to it
tomorrow.
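
A rough sketch of the kind of comparison meant here (made-up table and key
names, synthetic data), for anyone who wants to try it themselves:

  -- Integer surrogate key vs. text natural key holding the same values:
  CREATE TABLE foo_int  (id serial PRIMARY KEY,
                         frobnitz varchar(100) NOT NULL UNIQUE);
  CREATE TABLE foo_text (frobnitz varchar(100) PRIMARY KEY);

  INSERT INTO foo_int (frobnitz)
    SELECT 'some-fairly-long-key-' || g FROM generate_series(1, 1000000) g;
  INSERT INTO foo_text SELECT frobnitz FROM foo_int;

  -- Compare on-disk index sizes ...
  SELECT relname, pg_relation_size(oid) AS bytes
  FROM pg_class
  WHERE relname IN ('foo_int_pkey', 'foo_int_frobnitz_key', 'foo_text_pkey');

  -- ... and lookup/join costs with EXPLAIN ANALYZE on each variant.
  EXPLAIN ANALYZE
    SELECT * FROM foo_text WHERE frobnitz = 'some-fairly-long-key-500000';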

> Is the field foo.id in Schema 1 superfluous?  For example, wouldn't the
> referencing from bar to foo really be done behind the scenes through some
> hidden field (oid?) instead of through the frobnitz text field?  Which of
> the two schemas would give better performance?

-- 
Bill Moran
Collaborative Fusion Inc.

[EMAIL PROTECTED]
Phone: 412-422-3463x4023






Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)

2007-04-14 Thread Tom Lane
Dann Corbit [EMAIL PROTECTED] writes:
> Instead of sorting, I suggest the quickselect() algorithm, which is
> O(n).

What for?  Common cases have less than half a dozen entries.  That is
not the place we need to be spending engineering effort --- what we
need to worry about is what's the choice algorithm, not implementation
details.

regards, tom lane



Re: [PERFORM] Question about memory allocations

2007-04-14 Thread Tom Lane
Ron [EMAIL PROTECTED] writes:
> One of the reasons for the wide variance in suggested values for pg
> memory use is that pg 7.x and pg 8.x are =very= different beasts.
> If you break the advice into pg 7.x and pg 8.x categories, you find
> that there is far less variation in the suggestions.
> Bottom line: pg 7.x could not take advantage of larger sums of memory
> anywhere near as well as pg 8.x can.

Actually I think it was 8.1 that really broke the barrier in terms of
scalability of shared_buffers.  Pre-8.1, the buffer manager just didn't
scale well enough to make it useful to use more than a few hundred meg.
(In fact, we never even bothered to fix the shared-memory-sizing
calculations to be able to deal with >2GB shared memory until 8.1;
if you try it in 8.0 it'll probably just crash.)
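
For reference, a quick way to see what a server is actually configured with
(a sketch; on 8.1 the value is a raw count of 8kB buffer pages, since the
memory-unit syntax such as '512MB' only appeared in later releases):

  SHOW shared_buffers;
  -- The setting itself lives in postgresql.conf and takes effect only
  -- after a restart, e.g.:
  --   shared_buffers = 65536    # about 512MB at 8kB per buffer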

regards, tom lane



Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)

2007-04-14 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
> I think the concern about condition redundancy should be attacked
> separately.  How about just comparing whether they have common prefixes
> of conditions?  I admit I don't understand what would happen with
> indexes defined like (lower(A), B, C) versus (A, B) for example.

I understand that issue a bit better than I did when I wrote the comment
(so I suppose I better rewrite it).  The $64 reason for rejecting
AND-combinations of indexes that are using some of the same
WHERE-conditions is that if we don't, we effectively double-count the
selectivity of those conditions, causing us to prefer useless
AND-combinations.  An example is WHERE A > 42 AND B > 100 where we
have an index on A and one on (A,B).  The selectivity calculation
will blindly assume that the selectivities of the two indexes are
independent and thus prefer to BitmapAnd them, when obviously there
is no point in using both.  Ideally we should improve the selectivity
calculation to not get fooled like this, but that seems hard and
probably slow.  So for the moment we have the heuristic that no
WHERE-clause should be used twice in any AND-combination.

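For concreteness, a small setup that reproduces the situation described
above (made-up table and index names):

  CREATE TABLE t (a int, b int, filler text);
  CREATE INDEX t_a_idx  ON t (a);
  CREATE INDEX t_ab_idx ON t (a, b);

  -- Both indexes can satisfy "a > 42", and t_ab_idx can satisfy both
  -- clauses; without the no-reuse heuristic the planner would be tempted
  -- to BitmapAnd them, double-counting the selectivity of "a > 42":
  EXPLAIN SELECT * FROM t WHERE a > 42 AND b > 100;
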
Given that we are using that heuristic, it becomes important that
we visit the indexes in the right order --- as the code stands,
in the (A) vs (A,B) case it will consider only the first index it
arrives at in the selectivity sort order, because the second will
be rejected on the basis of re-use of the WHERE A > 42 condition.
Sorting by selectivity tends to ensure that we pick the index that
can make use of as many WHERE-clauses as possible.

The idea of considering each index alone fixes the order dependency
for cases where a single index is the best answer, but it doesn't
help much for cases where you really do want a BitmapAnd, only not
one using the index with the individually best selectivity.

We really need a heuristic here --- exhaustive search will be
impractical in cases where there are many indexes, because you'd
be looking at 2^N combinations.  (In Steve's example there are
actually 38 relevant indexes, which is bad database design if
you ask me, but in any case we cannot afford to search through
2^38 possibilities.)  But the one we're using now is too fragile.

Maybe we should use a cutoff similar to the GEQO one: do exhaustive
search if there are less than N relevant indexes, for some N.
But that's not going to help Steve; we still need a smarter heuristic
for what to look for above the cutoff.

regards, tom lane



Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)

2007-04-14 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> Has anyone got any thoughts about the best way to do this?
>
> How about doing both: sort the index by index scan cost; then pick the
> first index on the list and start adding indexes when they lower the
> cost.  When adding each index, consider it by itself against the
> already stacked indexes.  If the cost is lower, put this index at the
> top of the list, and restart the algorithm (after the sorting step of
> course).

The restart part of that bothers me --- it's not entirely clear that
you couldn't get into an infinite loop.  (Imagine that A+B is better
than A alone, so we adopt it, but worse than C alone, so we take C as
the new leader and start over.  Then perhaps C+B is better than C alone
but worse than A alone, so we take A as the leader and start over.
Maybe this is impossible but I'm unsure.)

I looked back at the gdb results I'd gotten from Steve's example and
noticed that for his 38 indexes there were only three distinct index
selectivity values, and the sort step grouped indexes by cost within
those groups.  In hindsight of course this is obvious: the selectivity
depends on the set of WHERE-clauses used, so with two WHERE-clauses
there are three possible selectivities (to within roundoff error anyway)
depending on whether the index uses one or both clauses.  So the
existing algorithm gets one thing right: for any two indexes that make
use of just the same set of WHERE-clauses, it will always take the one
with cheaper scan cost.

Thinking more about this leads me to the following proposal:

1. Explicitly group the indexes according to the subset of
WHERE-conditions (and partial index conditions, if any) they use.
Within each such group, discard all but the cheapest-scan-cost one.

2. Sort the remaining indexes according to scan cost.

3. For each index in order, consider it as a standalone scan, and also
consider adding it on to the AND-group led by each preceding index,
using the same logic as now: reject using any WHERE-condition twice
in a group, and then add on only if the total cost of the AND-group
scan is reduced.

This would be approximately O(N^2) in the number of relevant indexes,
whereas the current scheme is roughly linear (handwaving a bit here
because the number of WHERE-clauses is a factor too).  But that seems
unlikely to be a problem for sane numbers of indexes, unlike the O(2^N)
behavior of an exhaustive search.  It would get rid of (most of) the
order-sensitivity problem, since we would definitely consider the
AND-combination of every pair of combinable indexes.  I can imagine
cases where it wouldn't notice useful three-or-more-way combinations
because the preceding two-way combination didn't win, but they seem
pretty remote possibilities.

Comments?

regards, tom lane



[PERFORM] FK triggers misused?

2007-04-14 Thread cluster

I have performance problem with the following simple update query:

  UPDATE posts
  SET num_views = num_views + 1
  WHERE post_id IN (2526,5254,2572,4671,25);

The table posts is a large table with a number of foreign keys (FK).

It seems that the FK triggers for the table are evaluated even though
none of the FK columns are altered. In fact, these FK triggers seem to
constitute a considerable part of the total execution time. See the
below EXPLAIN ANALYZE.

Why are these FK triggers evaluated at all and why do they take so much
time?

--
=> EXPLAIN ANALYZE update posts set num_views = num_views + 1
   where post_id in (2526,5254,2572,4671,25);
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on posts  (cost=10.02..29.81 rows=5 width=1230)
   (actual time=0.146..0.253 rows=5 loops=1)
   Recheck Cond: ((post_id = 2526) OR (post_id = 5254) OR (post_id = 2572)
     OR (post_id = 4671) OR (post_id = 25))
   ->  BitmapOr  (cost=10.02..10.02 rows=5 width=0)
       (actual time=0.105..0.105 rows=0 loops=1)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0)
             (actual time=0.053..0.053 rows=2 loops=1)
               Index Cond: (post_id = 2526)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0)
             (actual time=0.012..0.012 rows=2 loops=1)
               Index Cond: (post_id = 5254)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0)
             (actual time=0.008..0.008 rows=2 loops=1)
               Index Cond: (post_id = 2572)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0)
             (actual time=0.010..0.010 rows=2 loops=1)
               Index Cond: (post_id = 4671)
         ->  Bitmap Index Scan on posts_pkey  (cost=0.00..2.00 rows=1 width=0)
             (actual time=0.011..0.011 rows=2 loops=1)
               Index Cond: (post_id = 25)
 Trigger for constraint posts_question_id_fkey: time=50.031 calls=5
 Trigger for constraint posts_author_id_fkey: time=22.330 calls=5
 Trigger for constraint posts_language_id_fkey: time=1.282 calls=5
 Trigger posts_tsvectorupdate: time=61.659 calls=5
 Total runtime: 174.230 ms
(18 rows)



Re: [PERFORM] FK triggers misused?

2007-04-14 Thread Tom Lane
cluster [EMAIL PROTECTED] writes:
> It seems that the FK triggers for the table are evaluated even though
> none of the FK columns are altered.

Hm, they're not supposed to be, at least not in reasonably modern
PG releases (and one that breaks out trigger runtime in EXPLAIN ANALYZE
should be modern enough IIRC).  Exactly which PG release are you
running?  Can you provide a self-contained test case?

regards, tom lane



Re: [PERFORM] FK triggers misused?

2007-04-14 Thread Andrew - Supernews
On 2007-04-15, Tom Lane [EMAIL PROTECTED] wrote:
> cluster [EMAIL PROTECTED] writes:
>> It seems that the FK triggers for the table are evaluated even though
>> none of the FK columns are altered.
>
> Hm, they're not supposed to be, at least not in reasonably modern
> PG releases (and one that breaks out trigger runtime in EXPLAIN ANALYZE
> should be modern enough IIRC).  Exactly which PG release are you
> running?  Can you provide a self-contained test case?

Looking at current CVS code the RI check seems to be skipped on update of
the _referred to_ table if the old and new values match, but not on update
of the _referring_ table.
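
For concreteness, a self-contained pair of updates exercising both paths
might look like this (a sketch with made-up table names, not taken from the
original report):

  CREATE TABLE parent (id int PRIMARY KEY, label text);
  CREATE TABLE child  (id int PRIMARY KEY,
                       parent_id int REFERENCES parent(id),
                       num_views int DEFAULT 0);
  INSERT INTO parent VALUES (1, 'p1');
  INSERT INTO child  VALUES (1, 1, 0);

  -- Update of the referring table that leaves the FK column untouched;
  -- per the observation above, the RI trigger still fires here:
  EXPLAIN ANALYZE UPDATE child SET num_views = num_views + 1 WHERE id = 1;

  -- Update of the referred-to table that leaves the key unchanged;
  -- here the check is skipped when old and new values match:
  EXPLAIN ANALYZE UPDATE parent SET label = 'p1-renamed' WHERE id = 1;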

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
