Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-15 Thread Robert Haas
On Sat, Feb 13, 2010 at 2:58 AM, Bryce Nesbitt  wrote:
> So as the op, back to the original posting
>
> In the real world, what should I do?  Does it make sense to pull the "AND
> articles.indexed" clause into an outer query?  Will that query simply
> perform poorly on other arbitrary combinations of words?

It's really hard to say whether a query that performs better is going
to always perform better on every combination of words.  My suggestion
is - try it and see.  It's my experience that rewriting queries is a
pretty effective way of speeding them up, but I can't vouch for what
will happen in your particular case.  It's depressing to see that
increasing the statistics target didn't help much; but that makes me
think that the problem is that your join selectivity estimates are
off, as Tom and Jorge said upthread.  Rewriting the query or trying an
upgrade are probably your only options.

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-13 Thread Pierre Frédéric Caillau d


Your Query :

SELECT contexts.context_key FROM contexts
JOIN articles ON (articles.context_key=contexts.context_key)
JOIN matview_82034 ON (contexts.context_key=matview_82034.context_key)
WHERE contexts.context_key IN
(SELECT context_key FROM article_words JOIN words using (word_key)  
WHERE word = 'insider'

 INTERSECT
 SELECT context_key FROM article_words JOIN words using (word_key)  
WHERE word = 'trading')

AND contexts.context_key IN
 (SELECT a.context_key FROM virtual_ancestors a JOIN bp_categories  
ON (a.ancestor_key = bp_categories.context_key)

 WHERE lower(bp_categories.category) = 'law') AND articles.indexed;


I guess this is some form of keyword search, like :
- search for article
- with keywords "insider" and "trading"
- and belongs to a subcategory of "law"

The way you do it is exactly the same as the way phpBB forum implements  
it, in the case you use a database that doesn't support full text search.  
It is a fallback mechanism only meant for small forums on old versions of  
MySQL, because it is extremely slow.


Even your faster timing (7500 ms) is extremely slow.

Option 1 :

a) Instead of building your own keywords table, use Postgres' fulltext  
search, which is a lot smarter about combining keywords than using  
INTERSECT.
You can either index the entire article, or use a separate keyword field,  
or both.


b) If an article belongs to only one category, use an integer field. If,  
as is most often the case, an article can belong to several categories,  
use gist. When an article belongs to categories 1,2,3, set a column  
article_categories to the integer array {1,2,3}::INTEGER[]. Then, use a  
gist index on it.


You can then do a SELECT from articles (only one table) using an AND on  
the intersection of article_categories with an array of the required  
categories, and using Postgres' full text search on keywords.


This will most likely result in a Bitmap Scan, which will do the ANDing  
much faster than any other solution.


Alternately, you can also use keywords like category_1234, stuff  
everything in your keywords column, and use only Fulltext search.


You should this solution first, it works really well. When the data set  
becomes quite a bit larger than your RAM, it can get slow, though.


Option 2 :

Postgres' full text search is perfectly integrated and has benefits :  
fast, high write concurrency, etc. However full text search can be made  
much faster with some compromises.


For instance, I have tried Xapian : it is a lot faster than Postgres for  
full text search (and more powerful too), but the price you pay is

- a bit of work to integrate it
	- I suggest using triggers and a Python indexer script running in the  
background to update the index

- You can't SQL query it, so you need some interfacing
- updates are not concurrent (single-writer).

So, if you don't make lots of updates, Xapian may work for you. Its  
performance is unbelievable, even on huge datasets.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-12 Thread Bryce Nesbitt




So as the op, back to the original posting

In the real world, what should I do?  Does it make sense to pull the
"AND articles.indexed" clause into an outer query?  Will that query
simply perform poorly on other arbitrary combinations of words?


I'm happy to test any given query against the
same set of servers. If it involves a persistent change
it has to run on a test server).  For example, the Robert Haas method:
# ...
Total runtime: 254207.857 ms

# ALTER TABLE article_words ALTER COLUMN word_key SET STATISTICS 1000;
# ANALYZE VERBOSE article_words
INFO:  analyzing "public.article_words"
INFO:  "article_words": scanned 30 of 1342374 pages, containing 64534899 live rows and 3264839 dead rows; 30 rows in sample, 288766568 estimated total rows
ANALYZE
# ...
Total runtime: 200591.751 ms

# ALTER TABLE article_words ALTER COLUMN word_key SET STATISTICS 50;
# ANALYZE VERBOSE article_words
# ...
Total runtime: 201204.972 ms


Sadly, it made essentially zero difference.  Attached.






preproduction-20091214=# EXPLAIN ANALYZE
SELECT contexts.context_key FROM contexts
JOIN articles ON (articles.context_key=contexts.context_key)
JOIN matview_82034 ON (contexts.context_key=matview_82034.context_key)
WHERE contexts.context_key IN
(SELECT context_key FROM article_words JOIN words using (word_key) WHERE 
word = 'insider'
 INTERSECT
 SELECT context_key FROM article_words JOIN words using (word_key) WHERE 
word = 'trading')
AND contexts.context_key IN
 (SELECT a.context_key FROM virtual_ancestors a JOIN bp_categories ON 
(a.ancestor_key = bp_categories.context_key)
 WHERE lower(bp_categories.category) = 'law') AND articles.indexed;

  QUERY PLAN
   
---
 Nested Loop  (cost=12861.91..62150.21 rows=2 width=4) (actual 
time=136.789..200744.455 rows=546 loops=1)
   ->  Nested Loop IN Join  (cost=12861.91..62148.32 rows=1 width=16) (actual 
time=136.777..200737.004 rows=546 loops=1)
 Join Filter: (a.context_key = articles.context_key)
 ->  Nested Loop  (cost=12861.91..12875.48 rows=2 width=12) (actual 
time=55.674..116.443 rows=1306 loops=1)
   ->  Nested Loop  (cost=12861.91..12871.68 rows=2 width=8) 
(actual time=55.662..97.863 rows=1306 loops=1)
 ->  Subquery Scan "IN_subquery"  (cost=12861.91..12862.18 
rows=5 width=4) (actual time=55.639..75.777 rows=1473 loops=1)
   ->  SetOp Intersect  (cost=12861.91..12862.14 rows=5 
width=4) (actual time=55.638..73.724 rows=1473 loops=1)
 ->  Sort  (cost=12861.91..12862.02 rows=46 
width=4) (actual time=55.631..62.140 rows=17892 loops=1)
   Sort Key: "*SELECT* 1".context_key
   Sort Method:  quicksort  Memory: 1607kB
   ->  Append  (cost=0.00..12860.63 rows=46 
width=4) (actual time=0.040..42.026 rows=17892 loops=1)
 ->  Subquery Scan "*SELECT* 1"  
(cost=0.00..6430.32 rows=23 width=4) (actual time=0.039..6.909 rows=3583 
loops=1)
   ->  Nested Loop  
(cost=0.00..6430.09 rows=23 width=4) (actual time=0.038..5.110 rows=3583 
loops=1)
 ->  Index Scan using 
words_word on words  (cost=0.00..2.22 rows=1 width=4) (actual time=0.019..0.020 
rows=1 loops=1)
   Index Cond: 
((word)::text = 'insider'::text)
 ->  Index Scan using 
article_words_wc on article_words  (cost=0.00..6323.81 rows=8325 width=8) 
(actual time=0.015..3.243 rows=3583 loops=1)
   Index Cond: 
(public.article_words.word_key = public.words.word_key)
 ->  Subquery Scan "*SELECT* 2"  
(cost=0.00..6430.32 rows=23 width=4) (actual time=0.036..27.166 rows=14309 
loops=1)
   ->  Nested Loop  
(cost=0.00..6430.09 rows=23 width=4) (actual time=0.035..20.037 rows=14309 
loops=1)
 ->  Index Scan using 
words_word on words  (cost=0.00..2.22 rows=1 width=4) (actual time=0.014..0.015 
rows=1 loops=1)
   Index Cond: 
((word)::text = 'trading'::text)
 ->  Index Scan using 
article_words_wc on article_words  (cost=0.00..6323.81 rows=8325 width=8) 
(actual time=0.017.

Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-12 Thread Bryce Nesbitt






Kevin Grittner wrote:

  Bryce Nesbitt  wrote:
 
  
  
I've got a very slow query, which I can make faster by doing
something seemingly trivial. 

  
   
Out of curiosity, what kind of performance do you get with?:
 
EXPLAIN ANALYZE
SELECT contexts.context_key
  FROM contexts
  JOIN articles ON (articles.context_key = contexts.context_key)
  JOIN matview_82034 ON (matview_82034.context_key =
 contexts.context_key)
  WHERE EXISTS
(
  SELECT *
FROM article_words
JOIN words using (word_key)
WHERE context_key = contexts.context_key
  AND word = 'insider'
)
AND EXISTS
(
  SELECT *
FROM article_words
JOIN words using (word_key)
WHERE context_key = contexts.context_key
  AND word = 'trading'
)
AND EXISTS
(
  SELECT *
FROM virtual_ancestors a
JOIN bp_categories ON (bp_categories.context_key =
   a.ancestor_key)
WHERE a.context_key = contexts.context_key
  AND lower(bp_categories.category) = 'law'
)
AND articles.indexed
;
  

512,600ms query becomes 225,976ms.  Twice as fast on pos
Definitely not beating the 7500ms version.
PostgreSQL 8.3.4






EXPLAIN ANALYZE
SELECT contexts.context_key
  FROM contexts
  JOIN articles ON (articles.context_key = contexts.context_key)
  JOIN matview_82034 ON (matview_82034.context_key =
 contexts.context_key)
  WHERE EXISTS
(
  SELECT *
FROM article_words
JOIN words using (word_key)
WHERE context_key = contexts.context_key
  AND word = 'insider'
)
AND EXISTS
(
  SELECT *
FROM article_words
JOIN words using (word_key)
WHERE context_key = contexts.context_key
  AND word = 'trading'
)
AND EXISTS
(
  SELECT *
FROM virtual_ancestors a
JOIN bp_categories ON (bp_categories.context_key =
   a.ancestor_key)
WHERE a.context_key = contexts.context_key
  AND lower(bp_categories.category) = 'law'
)
AND articles.indexed
;





 Nested Loop  (cost=13511.95..13662023.05 rows=19282 width=4) (actual 
time=3070.834..225973.159 rows=622 loops=1)
   ->  Hash Join  (cost=13511.95..356938.38 rows=177937 width=8) (actual 
time=1106.242..7520.756 rows=351337 loops=1)
 Hash Cond: (articles.context_key = matview_82034.context_key)
 ->  Seq Scan on articles  (cost=0.00..334502.90 rows=386266 width=4) 
(actual time=0.030..4031.203 rows=355626 loops=1)
   Filter: indexed
 ->  Hash  (cost=6322.20..6322.20 rows=438220 width=4) (actual 
time=1105.663..1105.663 rows=438220 loops=1)
   ->  Seq Scan on matview_82034  (cost=0.00..6322.20 rows=438220 
width=4) (actual time=7.105..544.072 rows=438220 loops=1)
   ->  Index Scan using contexts_pkey on contexts  (cost=0.00..74.76 rows=1 
width=4) (actual time=0.619..0.619 rows=0 loops=351337)
 Index Cond: (contexts.context_key = articles.context_key)
 Filter: ((subplan) AND (subplan) AND (subplan))
 SubPlan
   ->  Nested Loop  (cost=0.00..30.54 rows=1 width=17) (actual 
time=6.119..6.119 rows=1 loops=983)
 ->  Index Scan using words_word on words  (cost=0.00..5.50 
rows=1 width=13) (actual time=0.028..0.029 rows=1 loops=983)
   Index Cond: ((word)::text = 'trading'::text)
 ->  Index Scan using article_words_cw on article_words  
(cost=0.00..25.02 rows=1 width=8) (actual time=6.082..6.082 rows=1 loops=983)
   Index Cond: ((public.article_words.context_key = $0) AND 
(public.article_words.word_key = public.words.word_key))
   ->  Nested Loop  (cost=0.00..30.54 rows=1 width=17) (actual 
time=6.196..6.196 rows=0 loops=26494)
 ->  Index Scan using words_word on words  (cost=0.00..5.50 
rows=1 width=13) (actual time=0.022..0.024 rows=1 loops=26494)
   Index Cond: ((word)::text = 'insider'::text)
 ->  Index Scan using article_words_cw on article_words  
(cost=0.00..25.02 rows=1 width=8) (actual time=6.165..6.165 rows=0 loops=26494)
   Index Cond: ((public.article_words.context_key = $0) AND 
(public.article_words.word_key = public.words.word_key))
   ->  Nested Loop  (cost=0.00..38.38 rows=3 width=29) (actual 
time=0.122..0.122 rows=0 loops=351337)
 ->  Index Scan using virtual_context_key_idx on 
virtual_ancestors a  (cost=0.00..7.35 rows=5 width=10) (actual 
time=0.074..0.085 rows=5 loops=3

Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-12 Thread Tom Lane
Karl Denninger  writes:
> Explain Analyze on the alternative CLAIMS the same query planner time
> (within a few milliseconds) with explain analyze.  But if I replace the
> executing code with one that has the alternative ("not exists") syntax
> in it, the system load goes to crap instantly and the execution times
> "in the wild" go bananas.

Could we see the actual explain analyze output, and not some handwaving?

What I would expect 8.4 to do with the NOT EXISTS version is to convert
it to an antijoin --- probably a hash antijoin given that the subtable
is apparently small.  That should be a significant win compared to
repeated seqscans as you have now.  The only way I could see for it to
be a loss is that that join would probably be performed after the other
subplan tests instead of before.  However, the rowcounts for your
original query suggest that all the subplans get executed the same
number of times; so at least on the test values you used here, all
those conditions succeed.  Maybe your test values were not
representative of "in the wild" cases, and in the real usage it's
important to make this test before the others.

If that's what it is, you might see what happens when all of the
sub-selects are converted to exists/not exists style, instead of
having a mishmash...

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-12 Thread Kevin Grittner
Karl Denninger  wrote:
> Kevin Grittner wrote:
 
>> Have you seen such a difference under 8.4?  Can you provide a
>> self-contained example?
 
> Yes:
> 
> [query and EXPLAIN ANALYZE of fast query]
 
> The alternative:
> 
> [query with no other information]
>  
> goes nuts.
 
Which means what?  Could you post an EXPLAIN ANALYZE, or at least an
EXPLAIN, of the slow version?  Can you post the table structure,
including indexes?
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-12 Thread Karl Denninger
Yes:

select forum, * from post where
marked is not true
and toppost = 1
and (select login from ignore_thread where login='xxx' and
number=post.number) is null
and (replied > now() - '30 days'::interval)
and (replied > (select lastview from forumlog where login='xxx' and
forum=post.forum and number is null)) is not false
and (replied > (select lastview from forumlog where login='xxx' and
forum=post.forum and number=post.number)) is not false
and ((post.forum = (select who from excludenew where who='xxx' and
forum_name = post.forum)) or (select who from excludenew where who='xxx'
and forum_name = post.forum) is null)
   order by pinned desc, replied desc offset 0 limit 100

Returns the following query plan:
  QUERY
PLAN
   

-
 Limit  (cost=5301575.63..5301575.63 rows=1 width=433) (actual
time=771.000..771.455 rows=100 loops=1)
   ->  Sort  (cost=5301575.63..5301575.63 rows=1 width=433) (actual
time=770.996..771.141 rows=100 loops=1)
 Sort Key: post.pinned, post.replied
 Sort Method:  top-N heapsort  Memory: 120kB
 ->  Index Scan using post_top on post  (cost=0.00..5301575.62
rows=1 width=433) (actual time=0.664..752.994 rows=3905 loops=1)
   Index Cond: (toppost = 1)
   Filter: ((marked IS NOT TRUE) AND (replied > (now() - '30
days'::interval)) AND ((SubPlan 1) IS NULL) AND ((replied > (SubPlan 2))
IS NOT FALSE) AND ((replied > (SubPlan 3)) IS NOT FALSE) AND ((forum =
(SubPlan 4)) OR ((SubPlan 5) IS NULL)))
   SubPlan 1
 ->  Seq Scan on ignore_thread  (cost=0.00..5.45 rows=1
width=7) (actual time=0.037..0.037 rows=0 loops=3905)
   Filter: ((login = 'xxx'::text) AND (number = $0))
   SubPlan 2
 ->  Index Scan using forumlog_composite on forumlog 
(cost=0.00..9.50 rows=1 width=8) (actual time=0.008..0.008 rows=0
loops=3905)
   Index Cond: ((login = 'xxx'::text) AND (forum =
$1) AND (number IS NULL))
   SubPlan 3
 ->  Index Scan using forumlog_composite on forumlog 
(cost=0.00..9.50 rows=1 width=8) (actual time=0.006..0.006 rows=0
loops=3905)
   Index Cond: ((login = 'xxx'::text) AND (forum =
$1) AND (number = $0))
   SubPlan 4
 ->  Index Scan using excludenew_pkey on excludenew 
(cost=0.00..8.27 rows=1 width=9) (actual time=0.004..0.004 rows=0
loops=3905)
   Index Cond: ((who = 'xxx'::text) AND (forum_name
= $1))
   SubPlan 5
 ->  Index Scan using excludenew_pkey on excludenew 
(cost=0.00..8.27 rows=1 width=9) (actual time=0.004..0.004 rows=0
loops=3905)
   Index Cond: ((who = 'xxx'::text) AND (forum_name
= $1))
 Total runtime: 771.907 ms
(23 rows)

The alternative:

select forum, * from post where
marked is not true
and toppost = 1
and not exists (select login from ignore_thread where login='xxx'
and number=post.number)
and (replied > now() - '30 days'::interval)
and (replied > (select lastview from forumlog where login='xxx' and
forum=post.forum and number is null)) is not false
and (replied > (select lastview from forumlog where login='xxx' and
forum=post.forum and number=post.number)) is not false
and ((post.forum = (select who from excludenew where who='xxx' and
forum_name = post.forum)) or (select who from excludenew where who='xxx'
and forum_name = post.forum) is null)
   order by pinned desc, replied desc offset 0 limit 100

goes nuts.

(Yes, I know, most of those others which are "not false" could be
"Exists" too)

Explain Analyze on the alternative CLAIMS the same query planner time
(within a few milliseconds) with explain analyze.  But if I replace the
executing code with one that has the alternative ("not exists") syntax
in it, the system load goes to crap instantly and the execution times
"in the wild" go bananas.

I don't know why it does - I just know THAT it does.  When I first added
that top clause in there (to allow people an individual "ignore thread"
list) the system load went bananas immediately and forced me to back it
out.   When I re-wrote the query as above the performance was (and
remains) fine.

I'm running 8.4.2.

I agree (in advance) it shouldn't trash performance - all I know is that
it does and forced me to re-write the query.


Kevin Grittner wrote:
> Karl Denninger  wrote:
> Kevin Grittner wrote:
>  
>   
>>> I suspect that the above might do pretty well in 8.4.
>>>   
>  
>   
>> "Exists" can be quite slow

Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-12 Thread Kevin Grittner
Karl Denninger  wrote:
Kevin Grittner wrote:
 
>> I suspect that the above might do pretty well in 8.4.
 
> "Exists" can be quite slow.  So can "not exists"
> 
> See if you can re-write it using a sub-select - just replace the
> "exists " with "(select ...) is not null"
> 
> Surprisingly this often results in a MUCH better query plan under
> Postgresql.  Why the planner evaluates it "better" eludes me (it
> shouldn't) but the differences are often STRIKING - I've seen
> factor-of-10 differences in execution performance.
 
Have you seen such a difference under 8.4?  Can you provide a
self-contained example?
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-12 Thread Karl Denninger
"Exists" can be quite slow.  So can "not exists"

See if you can re-write it using a sub-select - just replace the "exists
" with "(select ...) is not null"

Surprisingly this often results in a MUCH better query plan under
Postgresql.  Why the planner evaluates it "better" eludes me (it
shouldn't) but the differences are often STRIKING - I've seen
factor-of-10 differences in execution performance.


Kevin Grittner wrote:
> Bryce Nesbitt  wrote:
>  
>   
>> I've got a very slow query, which I can make faster by doing
>> something seemingly trivial. 
>> 
>  
> Out of curiosity, what kind of performance do you get with?:
>  
> EXPLAIN ANALYZE
> SELECT contexts.context_key
>   FROM contexts
>   JOIN articles ON (articles.context_key = contexts.context_key)
>   JOIN matview_82034 ON (matview_82034.context_key =
>  contexts.context_key)
>   WHERE EXISTS
> (
>   SELECT *
> FROM article_words
> JOIN words using (word_key)
> WHERE context_key = contexts.context_key
>   AND word = 'insider'
> )
> AND EXISTS
> (
>   SELECT *
> FROM article_words
> JOIN words using (word_key)
> WHERE context_key = contexts.context_key
>   AND word = 'trading'
> )
> AND EXISTS
> (
>   SELECT *
> FROM virtual_ancestors a
> JOIN bp_categories ON (bp_categories.context_key =
>a.ancestor_key)
> WHERE a.context_key = contexts.context_key
>   AND lower(bp_categories.category) = 'law'
> )
> AND articles.indexed
> ;
>  
> (You may have to add some table aliases in the subqueries.)
>  
> If you are able to make a copy on 8.4 and test the various forms,
> that would also be interesting.  I suspect that the above might do
> pretty well in 8.4.
>  
> -Kevin
>
>   
<>
-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-12 Thread Kevin Grittner
Bryce Nesbitt  wrote:
 
> I've got a very slow query, which I can make faster by doing
> something seemingly trivial. 
 
Out of curiosity, what kind of performance do you get with?:
 
EXPLAIN ANALYZE
SELECT contexts.context_key
  FROM contexts
  JOIN articles ON (articles.context_key = contexts.context_key)
  JOIN matview_82034 ON (matview_82034.context_key =
 contexts.context_key)
  WHERE EXISTS
(
  SELECT *
FROM article_words
JOIN words using (word_key)
WHERE context_key = contexts.context_key
  AND word = 'insider'
)
AND EXISTS
(
  SELECT *
FROM article_words
JOIN words using (word_key)
WHERE context_key = contexts.context_key
  AND word = 'trading'
)
AND EXISTS
(
  SELECT *
FROM virtual_ancestors a
JOIN bp_categories ON (bp_categories.context_key =
   a.ancestor_key)
WHERE a.context_key = contexts.context_key
  AND lower(bp_categories.category) = 'law'
)
AND articles.indexed
;
 
(You may have to add some table aliases in the subqueries.)
 
If you are able to make a copy on 8.4 and test the various forms,
that would also be interesting.  I suspect that the above might do
pretty well in 8.4.
 
-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?

2010-02-09 Thread Bryce Nesbitt
I've got a very slow query, which I can make faster by doing something
seemingly trivial. 
The query has been trouble for years (always slow, sometimes taking hours):

 512,600ms Original, filter on articles.indexed (622 results)
   7,500ms Remove "AND articles.indexed" (726 results, undesirable).
   7,675ms Extra join for "AND articles.indexed" (622 results, same as
original).

Hardware is Postgres 8.3 on a Sunfire X4240 under Debian Lenny, with a
fresh ANALYZE. What I don't understand is why the improvement? Is the
second way of doing things actually superior, or is this just a query
planner edge case?


Original (512,600ms)
-
EXPLAIN ANALYZE
SELECT contexts.context_key FROM contexts
JOIN articles ON (articles.context_key=contexts.context_key)
JOIN matview_82034 ON (contexts.context_key=matview_82034.context_key)
WHERE contexts.context_key IN
  (SELECT context_key FROM article_words JOIN words using (word_key)
WHERE word = 'insider'
   INTERSECT
   SELECT context_key FROM article_words JOIN words using (word_key)
WHERE word = 'trading')
AND contexts.context_key IN
  (SELECT a.context_key FROM virtual_ancestors a JOIN bp_categories ON
(a.ancestor_key = bp_categories.context_key)
   WHERE lower(bp_categories.category) = 'law')
AND articles.indexed;

Extra join (7,675ms)
-
EXPLAIN ANALYZE
SELECT contexts.context_key FROM contexts JOIN articles using (context_key)
WHERE contexts.context_key IN
(
SELECT contexts.context_key FROM contexts
JOIN matview_82034 ON (contexts.context_key=matview_82034.context_key)
WHERE contexts.context_key IN
(SELECT context_key FROM article_words JOIN words using (word_key)
WHERE word = 'insider'
 INTERSECT
 SELECT context_key FROM article_words JOIN words using (word_key)
WHERE word = 'trading')
AND contexts.context_key IN
(SELECT a.context_key FROM virtual_ancestors a JOIN bp_categories ON
(a.ancestor_key = bp_categories.context_key)
 WHERE lower(bp_categories.category) = 'law')
)
AND articles.indexed;



# select indexed,count(*) from articles group by indexed;
 indexed | count
-+
 t   | 354605
 f   | 513552


QUERY PLAN   

 Hash Join  (cost=131663.39..140005.40 rows=4769 width=4) (actual
time=511893.241..512599.348 rows=622 loops=1)
   Hash Cond: (matview_82034.context_key = articles.context_key)
   ->  Seq Scan on matview_82034  (cost=0.00..6596.00 rows=465600
width=4) (actual time=0.019..463.278 rows=438220 loops=1)
   ->  Hash  (cost=131663.38..131663.38 rows=1 width=16) (actual
time=511659.671..511659.671 rows=622 loops=1)
 ->  Nested Loop IN Join  (cost=46757.70..131663.38 rows=1
width=16) (actual time=1142.789..511656.211 rows=622 loops=1)
   Join Filter: (a.context_key = articles.context_key)
   ->  Nested Loop  (cost=46757.70..46789.06 rows=2
width=12) (actual time=688.057..839.297 rows=1472 loops=1)
 ->  Nested Loop  (cost=46757.70..46780.26 rows=2
width=8) (actual time=688.022..799.945 rows=1472 loops=1)
   ->  Subquery Scan "IN_subquery" 
(cost=46757.70..46757.97 rows=5 width=4) (actual time=687.963..743.587
rows=1652 loops=1)
 ->  SetOp Intersect 
(cost=46757.70..46757.93 rows=5 width=4) (actual time=687.961..738.955
rows=1652 loops=1)
   ->  Sort 
(cost=46757.70..46757.81 rows=46 width=4) (actual time=687.943..709.972
rows=19527 loops=1)
 Sort Key: "*SELECT*
1".context_key
 Sort Method:  quicksort 
Memory: 1684kB
 ->  Append 
(cost=0.00..46756.43 rows=46 width=4) (actual time=8.385..657.839
rows=19527 loops=1)
   ->  Subquery Scan
"*SELECT* 1"  (cost=0.00..23378.21 rows=23 width=4) (actual
time=8.383..215.613 rows=4002 loops=1)
 ->  Nested
Loop  (cost=0.00..23377.98 rows=23 width=4) (actual time=8.380..207.499
rows=4002 loops=1)
   ->  Index
Scan using words_word on words  (cost=0.00..5.47 rows=1 width=4) (actual
time=0.102..0.105 rows=1 loops=1)

Index Cond: ((word)::text = 'insider'::text)
   ->  Index
Scan using article_words_wc on article_words  (cost=0.00..23219.17
rows=12268 width=8) (actual time=8.272..199.224 rows=4002 loops=1)

Index Cond: (public.article_words.word_key