[SQL] Seq scan on join, not on subselect? analyze this

2008-11-02 Thread Bryce Nesbitt
I'm a bit confused why the query planner is not restricting my join, and
not using the index.  Two explain analyze statements follow.
Why is the second so much better?

lyell5=> select version();
PostgreSQL 8.3.4 on x86_64-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)

lyell5=> explain analyze select * from article_words join words using
(word_key) where context_key=535462;
++
|  
QUERY
PLAN   |
++
| Hash Join  (cost=192092.90..276920.93 rows=45327 width=17) (actual
time=6020.932..60084.817 rows=777
loops=1)  |
|   Hash Cond: (article_words.word_key =
words.word_key) 
   
|
|   ->  Index Scan using article_word_idx on article_words 
(cost=0.00..55960.50 rows=45327 width=8) (actual time=0.031..0.547
rows=777 loops=1) |
| Index Cond: (context_key =
535462) 
   
|
|   ->  Hash  (cost=93819.62..93819.62 rows=5653462 width=13) (actual
time=6020.605..6020.605 rows=5651551 loops=1)  |
| ->  Seq Scan on words  (cost=0.00..93819.62 rows=5653462
width=13) (actual time=0.006..2010.962 rows=5651551
loops=1)  |
| Total runtime: 60085.616
ms  
 
|
++


lyell5=> explain analyze select * from words where word_key in (select
word_key from article_words where context_key=535462);
+--+
| 
QUERY
PLAN  |
+--+
| Nested Loop  (cost=56073.81..56091.41 rows=2 width=13) (actual
time=0.808..4.723 rows=777
loops=1)   |
|   ->  HashAggregate  (cost=56073.81..56073.83 rows=2 width=4) (actual
time=0.795..1.072 rows=777
loops=1)|
| ->  Index Scan using article_word_idx on article_words 
(cost=0.00..55960.50 rows=45327 width=4) (actual time=0.030..0.344
rows=777 loops=1) |
|   Index Cond: (context_key =
535462) 
   
|
|   ->  Index Scan using words_pkey on words  (cost=0.00..8.78 rows=1
width=13) (actual time=0.003..0.004 rows=1
loops=777)|
| Index Cond: (words.word_key =
article_words.word_key) 
  
|
| Total runtime: 4.936
ms  
   
|
+--+


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] Seq scan on join, not on subselect? analyze this

2008-11-02 Thread Bryce Nesbitt
Dear Postgres Folks,

I'm a bit confused why the query planner is not restricting my join, and
not using the index.
Two explain analyze statements follow. Why is the second so much better?


lyell5=> vacuum analyze;
lyell5=> select version();
PostgreSQL 8.3.4 on x86_64-pc-linux-gnu, compiled by GCC cc (GCC) 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)

lyell5=> explain analyze select * from words where word_key in (select
word_key from article_words where context_key=535462);
+--+
| 
QUERY
PLAN  |
+--+
| Nested Loop  (cost=56073.81..56091.41 rows=2 width=13) (actual
time=0.808..4.723 rows=777
loops=1)   |
|   ->  HashAggregate  (cost=56073.81..56073.83 rows=2 width=4) (actual
time=0.795..1.072 rows=777
loops=1)|
| ->  Index Scan using article_word_idx on article_words 
(cost=0.00..55960.50 rows=45327 width=4) (actual time=0.030..0.344
rows=777 loops=1) |
|   Index Cond: (context_key =
535462) 
   
|
|   ->  Index Scan using words_pkey on words  (cost=0.00..8.78 rows=1
width=13) (actual time=0.003..0.004 rows=1
loops=777)|
| Index Cond: (words.word_key =
article_words.word_key) 
  
|
| Total runtime: 4.936
ms  
   
|
+--+

lyell5=> explain analyze select words.* from article_words join words
using (word_key) where context_key=535462;
++
|  
QUERY
PLAN   |
++
| Hash Join  (cost=192092.90..276920.93 rows=45327 width=17) (actual
time=6020.932..60084.817 rows=777
loops=1)  |
|   Hash Cond: (article_words.word_key =
words.word_key) 
   
|
|   ->  Index Scan using article_word_idx on article_words 
(cost=0.00..55960.50 rows=45327 width=8) (actual time=0.031..0.547
rows=777 loops=1) |
| Index Cond: (context_key =
535462) 
   
|
|   ->  Hash  (cost=93819.62..93819.62 rows=5653462 width=13) (actual
time=6020.605..6020.605 rows=5651551 loops=1)  |
| ->  Seq Scan on words  (cost=0.00..93819.62 rows=5653462
width=13) (actual time=0.006..2010.962 rows=5651551
loops=1)  |
| Total runtime: 60085.616
ms  
 
|
++


lyell5=> \d article_words;
Table "public.article_words"
+-+-+---+
|   Column|  Type   | Modifiers |
+-+-+---+
| word_key| integer |   |
| context_key | integer |   |
+-+-+---+
Indexes:
"article_word_idx" btree (context_key)
"article_word_key_idx" btree (word_key) CLUSTER
Foreign-key constraints:
"article_words_context_key_constraint" FOREIGN KEY (context_key)
REFERENCES contexts(context_key) ON DELETE CASCADE DEFERRABLE INITIALLY
DEFERRED
"article_words_word_key_constraint" FOREIGN KEY (word_key)
REFERENCES words(word_key) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED

lyell5=> \d words;
   Table "public.words"
+--+++
|  Column  |  Type  |
Modifiers  |
+--+++
| word_key | 

Re: [SQL] Seq scan on join, not on subselect? analyze this

2008-11-02 Thread Tom Lane
Bryce Nesbitt <[EMAIL PROTECTED]> writes:
> I'm a bit confused why the query planner is not restricting my join, and
> not using the index.  Two explain analyze statements follow.

Why are the rowcount estimates so far off?  Maybe you need to increase
the statistics target for this table.

regards, tom lane

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Seq scan on join, not on subselect? analyze this

2008-11-02 Thread Bryce Nesbitt




Tom Lane wrote:

  Why are the rowcount estimates so far off?  Maybe you need to increase
the statistics target for this table.
			regards, tom lane
  

Tom,
How does one tell the rowcount is off in a query plan?  I've never
found a great reference on interpreting the query analyze output!

Upping the stats target made little difference in the estimated
row count, but a huge difference in the query.  It also mattered
only on the context_key column, not the other columns.  Why would that
be?


lyell5=> \pset format wrapped
lyell5=> select * from pg_stats where tablename='article_words';
++-+--+---+---+++---+-+-+
| schemaname |  tablename  | attname  | null_frac | avg_width |
n_distinct |  most_common_vals  | most_common_freqs | 
histogram_bounds   | correlation |
++-+--+---+---+++---+-+-+
| public | article_wor | word_key | 0 | 4 |  
6361 | {382,226,1155,2130 | {0.003,0.0023 | {3,623,1376,2074,31 |  
0.0292482 |
|    ;
ds  ;
,2246,3328,8999,73 ; ,0.0023,0.002 ;
54,4601,7269,12770,   |
| 
; 3,1760,1796}   ; 3,0.0023, ;
30017,212496,582068   |
|  
; 0.0023,0.0023 ; 0}    |
|  
; ,0.002,0.002, |
|  
; 0.002}    |
| public | article_wor | context_ | 0 | 4 |  
4671 | {639302,113013,133 | {0.0023,0.001 | {28,42838,92697,140
|    0.867505 |
|    ; ds  ;
key   ; 052,211978,508496, ;
7,0.0017, ; 684,202950,248442,3   |
| 
; 545123,590829,5985 ; 0.0017,0.0016 ;
38954,403025,498952   |
| 
; 95,649645,37658}   ; 6667,0.0017,0 ;
,584048,654070}   |
|  
; .0017,0.00166 |
|  
; 667,0.0017,0. |
|  
; 0013} |
++-+--+---+---+++---+-+-+
lyell5=> analyze verbose article_words;
INFO:  "article_words": scanned 3000 of 1125489 pages, containing
555000 live rows
and 0 dead rows; 3000 rows in sample, 208,215,465 estimated total rows

lyell5=> set default_statistics_target to 500;
lyell5=> analyze verbose article_words;
INFO:  "article_words": scanned 15 of 1125489 pages, containing
27749820 live rows
and 180 dead rows; 15 rows in sample, 208,214,114 estimated total
rows



But it makes a huge impact on this query:



  

  stats target
  
  sampled
  
  query duration after analyze
  


  10
  
  3000
  
  80600ms
  


  15
  
  4500
  
  64000ms


  20
  
  6000
  
  4.2ms
  


  30
  
  9000
  
  4.2ms
  


  250
  
  75000
  
  4.2ms
  


  500
  
  ?
  
  4.2ms
  


  1000
  
  30
  
  4.2ms
  

  


Is there any good reason not to set stats target to 1000 always?




lyell5=> alter table article_words alter word_key set statistics
30;analyze verbose article_words;
explain analyze select words.* from article_words join words using
(word_key) where context_key=535462;
INFO:  "article_words": scanned 9000 of 1125489 pages, containing
1665000 live rows and 0 dead rows; 9000 rows in sample, 208215465
estimated total rows
++
|  
QUERY
PLAN   |
+---