Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-30 Thread PFC


SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) =  
UPPER( t.id )


What about:

SELECT * FROM huge_table h WHERE UPPER(id) IN (SELECT upper(id) FROM  
tiny_table t)


	Or, try opening a cursor on your original query and using FETCH. It might  
result in a different plan.
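
	For instance, a minimal sketch (the cursor name and fetch size here are
arbitrary, not from the original post):

	BEGIN;
	DECLARE join_cur CURSOR FOR
	    SELECT * FROM huge_table h, tiny_table t
	    WHERE UPPER( h.id ) = UPPER( t.id );
	FETCH 1000 FROM join_cur;   -- repeat until no more rows are returned
	CLOSE join_cur;
	COMMIT;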

Or lower random_page_cost.
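
	For example, for the current session only (the value 2 is purely
illustrative; the default is 4):

	SET random_page_cost = 2;
	EXPLAIN SELECT * FROM huge_table h, tiny_table t
	        WHERE UPPER( h.id ) = UPPER( t.id );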

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-30 Thread Kynn Jones

On 5/24/06, Tom Lane <[EMAIL PROTECTED]> wrote: 
<[EMAIL PROTECTED]> writes:
>  Limit  (cost=19676.75..21327.99 rows=6000 width=84)
>    ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
>          Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
>          ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
>          ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
>                ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)

Um, if huge_table is so much bigger than tiny_table, why are the cost
estimates for seqscanning them only about 2.5x different?  There's
something wacko about your statistics, methinks.
 
 
You mean there's a bug in explain?  I agree that it makes no sense that
the costs don't differ as much as one would expect, but you can see
right there the numbers of rows for the two tables, which are exactly
as I described.  At any rate, how would one go about finding an
explanation for these strange stats?

More bewildering still (and infuriating as hell--because it means that
all of my work for yesterday has been wasted) is that I can no longer
reproduce the best query plan, even though the tables have not changed
at all.  (Hence I can't post the explain analyze for the best query
plan.)  No matter what value I use for LIMIT, the query planner now
insists on sequentially scanning huge_table and ignoring the available
index.  (If I turn off enable_seqscan, I get the second worst query
plan I posted yesterday.)

Anyway, I take it that there is no way to bypass the optimizer and
instruct PostgreSQL exactly how one wants the search performed?

Thanks!

kj
 


[PERFORM] Optimizing a huge_table/tiny_table join

2006-05-30 Thread kynn




[ I had a problem with my mailer when I first sent this.  My apologies
  for any repeats. ]


I want to optimize this simple join:

SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id )

huge_table has about 2.5 million records, can be assumed as fixed, and
has the following index:

CREATE INDEX huge_table_index ON huge_table( UPPER( id ) );

...while tiny_table changes with each user request, and typically will
contain on the order of 100-1000 records.  For this analysis, I put
300 records in tiny_table, resulting in 505 records in the join.

I tried several approaches.  In order of increasing speed of
execution:

1. executed as shown above, with enable_seqscan on: about 100 s.

2. executed as shown above, with enable_seqscan off: about 10 s.

3. executed with a LIMIT 6000 clause added to the SELECT statement, and
   enable_seqscan on: about 5 s.

4. executed with a LIMIT 600 clause added to the SELECT statement, and
   enable_seqscan on: less than 1 s.
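
Concretely, variants 2-4 above were run along these lines (a sketch
only; the SETs are session-level):

   -- variant 2: disable sequential scans for this session
   SET enable_seqscan = off;
   SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id );
   SET enable_seqscan = on;

   -- variants 3 and 4: add a LIMIT to the same query
   SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id ) LIMIT 6000;
   SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id ) LIMIT 600;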



Clearly, using LIMIT is the way to go.  Unfortunately I *do* want all
the records that would have been produced without the LIMIT clause,
and I don't have a formula for the limit that will guarantee this.  I
could use a very large value (e.g. 20x the size of tiny_table, as in
approach 3 above) which would make the probability of hitting the
limit very small, but unfortunately, the query plan in this case is
different from the query plan when the limit is just above the
expected number of results (approach 4 above).

The query plan for the fastest approach is this:

                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..2338.75 rows=600 width=84)
   ->  Nested Loop  (cost=0.01..14766453.89 rows=3788315 width=84)
         ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)
         ->  Index Scan using huge_table_index on huge_table h  (cost=0.01..48871.80 rows=12628 width=46)
               Index Cond: (upper(("outer".id)::text) = upper((h.id)::text))



How can I *force* this query plan even with a higher limit value?

I found, by dumb trial and error, that in this case the switch happens
at LIMIT 5432, which, FWIW, is about 0.2% of the size of huge_table.
Is there a simpler way to determine this limit (hopefully
programmatically)?


Alternatively, I could compute the value for LIMIT as 2x the number of
records in tiny_table, and if the number of records found is *exactly*
this number, I would know that (most likely) some records were left
out.  In this case, I could use the fact that, according to the query
plan above, the scan of tiny_table is sequential to infer which
records in tiny_table were disregarded when the limit was reached, and
then repeat the query with only these left over records in tiny_table.
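
In outline (a sketch only, using the 300-row example above, so the
limit is 2 x 300 = 600; in practice the client code would compute the
value):

   SELECT * FROM huge_table h, tiny_table t
    WHERE UPPER( h.id ) = UPPER( t.id )
    LIMIT 600;

   -- If exactly 600 rows come back, the LIMIT may have cut the result
   -- short.  Since the fast plan scans tiny_table sequentially, note the
   -- last t.id that appeared, reload tiny_table with only the ids not
   -- yet processed, and re-run the same query for the leftovers.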

What's your opinion of this strategy?  Is there a good way to improve
it?

Many thanks in advance!

kj

PS:  FWIW, the query plan for the query with LIMIT 6000 is this:

                                      QUERY PLAN
----------------------------------------------------------------------------------------
 Limit  (cost=19676.75..21327.99 rows=6000 width=84)
   ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
         Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
         ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
         ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
               ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq



Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-25 Thread Mark Kirkwood

Tom Lane wrote:

<[EMAIL PROTECTED]> writes:

 Limit  (cost=19676.75..21327.99 rows=6000 width=84)
   ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
         Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
         ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
         ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
               ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)


Um, if huge_table is so much bigger than tiny_table, why are the cost
estimates for seqscanning them only about 2.5x different?  There's
something wacko about your statistics, methinks.



This suggests that tiny_table is very wide (i.e. it has a lot of columns
compared to huge_table), or else has thousands of dead tuples.
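
Rough arithmetic with the default cost constants (1.0 per sequential
page, 0.01 per tuple) supports that: for tiny_table, 19676 ~= relpages +
0.01 * 300 puts it at roughly 19,673 pages (~150 MB for 300 live rows!),
while for huge_table, 51292.43 ~= relpages + 0.01 * 2,525,543 works out
to about 26,037 pages.  A quick way to see what the planner thinks:

   SELECT relname, relpages, reltuples
     FROM pg_class
    WHERE relname IN ('huge_table', 'tiny_table');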


Do you want to post the descriptions for these tables?

If you are running 8.1.x, then the output of 'ANALYZE VERBOSE 
tiny_table' is of interest too.


If you are running a pre-8.1 release, then let's see 'VACUUM VERBOSE 
tiny_table'.


Note that after either of these, your plans may be altered (as ANALYZE 
will recompute your stats for tiny_table, and VACUUM may truncate pages 
full of dead tuples at the end of it)!



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-25 Thread Jim Nasby

On May 25, 2006, at 12:07 PM, Dawid Kuroczko wrote:

On 5/25/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Well, they're not my statistics; they're explain's.  You mean there's
a bug in explain?  I agree that it makes no sense that the costs don't
differ as much as one would expect, but you can see right there the
numbers of rows for the two tables.  At any rate, how would one go
about finding an explanation for these strange stats?


Well, the query planner uses statistics to deduce the best plan
possible.  Explain includes this statistical data in its output.
See:
http://www.postgresql.org/docs/8.1/interactive/planner-stats.html
...for information about what it is all about.

The idea is that your statistics are probably not detailed enough
to help the planner.  See ALTER TABLE SET STATISTICS to change
that.


http://www.pervasive-postgres.com/lp/newsletters/2006/Insights_postgres_Mar.asp#4
might also be worth your time to read.



Hmm, there is a chance (though bad statistics are the more likely
cause) that you're using some older version of PostgreSQL and you're
hitting the same problem as I did:

http://archives.postgresql.org/pgsql-performance/2005-07/msg00345.php

Back then Tom provided a patch which fixed it:

http://archives.postgresql.org/pgsql-performance/2005-07/msg00352.php

...but I don't remember when it made it into a release.


According to cvs, it's been in since 8.1 and 8.0.4:

Revision 1.111.4.2
Fri Jul 22 19:12:33 2005 UTC (10 months ago) by tgl
Branches: REL8_0_STABLE
CVS tags: REL8_0_8, REL8_0_7, REL8_0_6, REL8_0_5, REL8_0_4
Changes since revision 1.111.4.1: +18 -37 lines

Fix compare_fuzzy_path_costs() to behave a bit more sanely.  The original
coding would ignore startup cost differences of less than 1% of the
estimated total cost; which was OK for normal planning but highly not OK
if a very small LIMIT was applied afterwards, so that startup cost becomes
the name of the game.  Instead, compare startup and total costs fuzzily
but independently.  This changes the plan selected for two queries in the
regression tests; adjust expected-output files for resulting changes in
row order.  Per reports from Dawid Kuroczko and Sam Mason.

Revision 1.124
Fri Jul 22 19:12:01 2005 UTC (10 months ago) by tgl
Branches: MAIN
CVS tags: REL8_1_0BETA3, REL8_1_0BETA2, REL8_1_0BETA1
Changes since revision 1.123: +18 -37 lines

Fix compare_fuzzy_path_costs() to behave a bit more sanely.  The original
coding would ignore startup cost differences of less than 1% of the
estimated total cost; which was OK for normal planning but highly not OK
if a very small LIMIT was applied afterwards, so that startup cost becomes
the name of the game.  Instead, compare startup and total costs fuzzily
but independently.  This changes the plan selected for two queries in the
regression tests; adjust expected-output files for resulting changes in
row order.  Per reports from Dawid Kuroczko and Sam Mason.

--
Jim C. Nasby, Database Architect        [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-25 Thread Dawid Kuroczko

On 5/25/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Well, they're not my statistics; they're explain's.  You mean there's
a bug in explain?  I agree that it makes no sense that the costs don't
differ as much as one would expect, but you can see right there the
numbers of rows for the two tables.  At any rate, how would one go
about finding an explanation for these strange stats?


Well, the query planner uses statistics to deduce the best plan
possible.  Explain includes this statistical data in its output.
See:
http://www.postgresql.org/docs/8.1/interactive/planner-stats.html
...for information about what it is all about.

The idea is that your statistics are probably not detailed enough
to help the planner.  See ALTER TABLE SET STATISTICS to change
that.
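
For example (a sketch only -- the column name is taken from your query,
and 100 is just an illustrative target; the default in 8.1 is 10):

   ALTER TABLE tiny_table ALTER COLUMN id SET STATISTICS 100;
   ALTER TABLE huge_table ALTER COLUMN id SET STATISTICS 100;
   ANALYZE tiny_table;
   ANALYZE huge_table;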


More bewildering still (and infuriating as hell--because it means that
all of my work for yesterday has been wasted) is that I can no longer
reproduce the best query plan I posted earlier, even though the tables
have not changed at all.  (Hence I can't post the explain analyze for
the best query plan, which Josh Drake asked for.)  No matter what
value I use for LIMIT, the query planner now insists on sequentially
scanning huge_table and ignoring the available index.  (If I turn off
enable_seqscan, I get the second worst query plan I posted yesterday.)

Anyway, I take it that there is no way to bypass the optimizer and
instruct PostgreSQL exactly how one wants the search performed?


There is no way to bypass it, but there are many ways to tune it.



Hmm, there is a chance (though bad statistics are the more likely
cause) that you're using some older version of PostgreSQL and you're
hitting the same problem as I did:

http://archives.postgresql.org/pgsql-performance/2005-07/msg00345.php

Back then Tom provided a patch which fixed it:

http://archives.postgresql.org/pgsql-performance/2005-07/msg00352.php

...but I don't remember when it made it into a release.

   Regards,

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-25 Thread Andrew Sullivan
On Thu, May 25, 2006 at 12:31:04PM -0400, [EMAIL PROTECTED] wrote:
> Well, they're not my statistics; they're explain's.  You mean there's

Explain doesn't get them from nowhere.  How often is the table being
ANALYSEd?

> More bewildering still (and infuriating as hell--because it means that
> all of my work for yesterday has been wasted) is that I can no longer
> reproduce the best query plan I posted earlier, even though the tables
> have not changed at all.  (Hence I can't post the explain analyze for

I find that very hard to believe.  Didn't change _at all_?  Are you
sure no VACUUMs or anything are happening automatically?

> Anyway, I take it that there is no way to bypass the optimizer and
> instruct PostgreSQL exactly how one wants the search performed?

No, there isn't.  

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-25 Thread kynn

On 5/24/06, Tom Lane <[EMAIL PROTECTED]> wrote:
>
> <[EMAIL PROTECTED]> writes:
> >  Limit  (cost=19676.75..21327.99 rows=6000 width=84)
> >    ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
> >          Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
> >          ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
> >          ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
> >                ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)
>
> Um, if huge_table is so much bigger than tiny_table, why are the cost
> estimates for seqscanning them only about 2.5x different?  There's
> something wacko about your statistics, methinks.



Well, they're not my statistics; they're explain's.  You mean there's
a bug in explain?  I agree that it makes no sense that the costs don't
differ as much as one would expect, but you can see right there the
numbers of rows for the two tables.  At any rate, how would one go
about finding an explanation for these strange stats?

More bewildering still (and infuriating as hell--because it means that
all of my work for yesterday has been wasted) is that I can no longer
reproduce the best query plan I posted earlier, even though the tables
have not changed at all.  (Hence I can't post the explain analyze for
the best query plan, which Josh Drake asked for.)  No matter what
value I use for LIMIT, the query planner now insists on sequentially
scanning huge_table and ignoring the available index.  (If I turn off
enable_seqscan, I get the second worst query plan I posted yesterday.)

Anyway, I take it that there is no way to bypass the optimizer and
instruct PostgreSQL exactly how one wants the search performed?

Thanks!

kj

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-24 Thread Tom Lane
<[EMAIL PROTECTED]> writes:
>  Limit  (cost=19676.75..21327.99 rows=6000 width=84)
>    ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
>          Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
>          ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
>          ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
>                ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)

Um, if huge_table is so much bigger than tiny_table, why are the cost
estimates for seqscanning them only about 2.5x different?  There's
something wacko about your statistics, methinks.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Optimizing a huge_table/tiny_table join

2006-05-24 Thread Joshua D. Drake



kj

PS:  FWIW, the query plan for the query with LIMIT 6000 is this:


What is the explain analyze?



                                      QUERY PLAN
----------------------------------------------------------------------------------------
 Limit  (cost=19676.75..21327.99 rows=6000 width=84)
   ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
         Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
         ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
         ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
               ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq




--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[PERFORM] Optimizing a huge_table/tiny_table join

2006-05-24 Thread kynn




I want to optimize this simple join:

SELECT * FROM huge_table h, tiny_table t WHERE UPPER( h.id ) = UPPER( t.id )

huge_table has about 2.5 million records, can be assumed as fixed, and
has the following index:

CREATE INDEX huge_table_index ON huge_table( UPPER( id ) );

...while tiny_table changes with each user request, and typically will
contain on the order of 100-1000 records.  For this analysis, I put
300 records in tiny_table, resulting in 505 records in the join.

I tried several approaches.  In order of increasing speed of
execution:

1. executed as shown above, with enable_seqscan on: about 100 s.

2. executed as shown above, with enable_seqscan off: about 10 s.

3. executed with a LIMIT 6000 clause added to the SELECT statement, and
   enable_seqscan on: about 5 s.

4. executed with a LIMIT 600 clause added to the SELECT statement, and
   enable_seqscan on: less than 1 s.



Clearly, using LIMIT is the way to go.  Unfortunately I *do* want all
the records that would have been produced without the LIMIT clause,
and I don't have a formula for the limit that will guarantee this.  I
could use a very large value (e.g. 20x the size of tiny_table, as in
approach 3 above) which would make the probability of hitting the
limit very small, but unfortunately, the query plan in this case is
different from the query plan when the limit is just above the
expected number of results (approach 4 above).

The query plan for the fastest approach is this:

                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Limit  (cost=0.01..2338.75 rows=600 width=84)
   ->  Nested Loop  (cost=0.01..14766453.89 rows=3788315 width=84)
         ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)
         ->  Index Scan using huge_table_index on huge_table h  (cost=0.01..48871.80 rows=12628 width=46)
               Index Cond: (upper(("outer".id)::text) = upper((h.id)::text))



How can I *force* this query plan even with a higher limit value?

I found, by dumb trial and error, that in this case the switch happens
at LIMIT 5432, which, FWIW, is about 0.2% of the size of huge_table.
Is there a simpler way to determine this limit (hopefully
programmatically)?


Alternatively, I could compute the value for LIMIT as 2x the number of
records in tiny_table, and if the number of records found is *exactly*
this number, I would know that (most likely) some records were left
out.  In this case, I could use the fact that, according to the query
plan above, the scan of tiny_table is sequential to infer which
records in tiny_table were disregarded when the limit was reached, and
then repeat the query with only these left over records in tiny_table.

What's your opinion of this strategy?  Is there a good way to improve
it?

Many thanks in advance!

kj

PS:  FWIW, the query plan for the query with LIMIT 6000 is this:

                                      QUERY PLAN
----------------------------------------------------------------------------------------
 Limit  (cost=19676.75..21327.99 rows=6000 width=84)
   ->  Hash Join  (cost=19676.75..1062244.81 rows=3788315 width=84)
         Hash Cond: (upper(("outer".id)::text) = upper(("inner".id)::text))
         ->  Seq Scan on huge_table h  (cost=0.00..51292.43 rows=2525543 width=46)
         ->  Hash  (cost=19676.00..19676.00 rows=300 width=38)
               ->  Seq Scan on tiny_table t  (cost=0.00..19676.00 rows=300 width=38)



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq