Re: [PERFORM] Encouraging multi-table join order

2006-04-11 Thread Dan Harris

Tom Lane wrote:

[snip]
So it's estimating 5775 cost units per probe into eventactivity, which
is pretty high --- it must think that a lot of rows will be retrieved by
the index (way more than the 20 or so it thinks will get past the filter
condition).  What does the pg_stats entry for eventactivity.incidentid
contain?
select * from pg_stats where tablename = 'eventactivity' and 
attname='incidentid';
-[ RECORD 1 ]-----+------------------------------------------------------------
schemaname        | public
tablename         | eventactivity
attname           | incidentid
null_frac         | 0
avg_width         | 14
n_distinct        | 8157
most_common_vals  | {P043190299,P051560740,P052581036,P052830218,P053100679,P053190889,P060370845,P042070391,P042690319,P043290117}
most_common_freqs | {0.0017,0.0017,0.0017,0.0017,0.0017,0.0017,0.0017,0.0013,0.0013,0.0013}
histogram_bounds  | {P022140319,P030471058,P033090308,P041961082,P042910689,P050311006,P051350254,P052261148,P053270945,P060240316,P061000287}
correlation       | 0.241737



  It might be worth increasing the statistics target for that
column to try to get a better estimate.
  
How high should I set this?  I read the default is 10, but I'm not sure
whether doubling that would make a difference or whether I should use a much
larger number.  There are approximately 45 million rows in the table, if that matters.



Thanks again,
Dan

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Encouraging multi-table join order

2006-04-11 Thread Tom Lane
Dan Harris [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 What does the pg_stats entry for eventactivity.incidentid
 contain?

 {P043190299,P051560740,P052581036,P052830218,P053100679,P053190889,P060370845,P042070391,P042690319,P043290117} |
 {0.0017,0.0017,0.0017,0.0017,0.0017,0.0017,0.0017,0.0013,0.0013,0.0013}

 How high should I set this?  I read the default is 10, but I'm not sure 
 if doubling this would make a difference or if I should be doing a much 
 larger number. There's approx 45 million rows in the table, if that matters.

What the stats entry is saying is that the most common entries occur
about 75000 times apiece (0.0017 * 45e6), which is what's scaring
the planner here ;-).  I think those frequencies are artificially high
though.  The default statistics sample size is 3000 rows (300 *
statistics target, actually), so those numbers correspond to 5 or 4
rows in the sample, which is probably just random chance.
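The arithmetic above can be sanity-checked directly; the numbers below are just the figures from this thread:

```sql
-- Table-wide estimate for a "most common" value: frequency * row count
SELECT 0.0017 * 45000000 AS est_rows_per_common_value;   -- 76500, i.e. about 75000

-- Default sample size: 300 * statistics target (default target is 10)
SELECT 300 * 10 AS sample_rows;                          -- 3000

-- Sample rows behind each reported frequency
SELECT 0.0017 * 3000 AS rows_at_freq_0_0017,             -- ~5
       0.0013 * 3000 AS rows_at_freq_0_0013;             -- ~4
```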

Try increasing the stats targets for this table to 100, then re-ANALYZE
and see what you get.  The most_common_freqs entries might drop as much
as a factor of 10.
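Concretely, the per-column form looks like this (a sketch; 100 is the value suggested above):

```sql
-- Raise the statistics target for the join column, then resample.
-- ANALYZE will now sample 300 * 100 = 30000 rows for this column.
ALTER TABLE eventactivity ALTER COLUMN incidentid SET STATISTICS 100;
ANALYZE eventactivity;

-- Check whether the most_common_freqs entries dropped:
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'eventactivity' AND attname = 'incidentid';
```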

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Encouraging multi-table join order

2006-04-10 Thread Tom Lane
Dan Harris [EMAIL PROTECTED] writes:
 I have a query that is intended to select from multiple small tables 
 to get a limited subset of incidentid and then join with a very 
 large table.  One of the operations will require a sequential scan, but 
 the planner is doing the scan on the very large table before joining the 
 small ones, resulting in a huge amount of disk I/O.  How would I make 
 this query join the large table only after narrowing down the possible 
 selections from the smaller tables?  This is running on version 8.0.3.

That's very strange --- the estimated cost of the seqscan is high enough
that the planner should have chosen a nestloop with inner indexscan on
the big table.  I'm not sure about the join-order point, but the hash
plan for the first join seems wrong in any case.

Um, you do have an index on eventactivity.incidentid, right?  What's the
datatype(s) of the incidentid columns?  What happens to the plan if you
turn off enable_hashjoin and enable_mergejoin?
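For reference, that experiment is session-local and reversible; a sketch (the query itself is elided here):

```sql
-- Disable the join methods the planner currently prefers, for this session only:
SET enable_hashjoin = off;
SET enable_mergejoin = off;

EXPLAIN ANALYZE
SELECT ...;   -- the query under discussion

-- Restore the defaults afterwards:
RESET enable_hashjoin;
RESET enable_mergejoin;
```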

regards, tom lane



Re: [PERFORM] Encouraging multi-table join order

2006-04-10 Thread Dan Harris

Tom Lane wrote:

That's very strange --- the estimated cost of the seqscan is high enough
that the planner should have chosen a nestloop with inner indexscan on
the big table.  I'm not sure about the join-order point, but the hash
plan for the first join seems wrong in any case.

Um, you do have an index on eventactivity.incidentid, right?  What's the
datatype(s) of the incidentid columns?  What happens to the plan if you
turn off enable_hashjoin and enable_mergejoin?

regards, tom lane
  
Yes, eventactivity.incidentid is indexed.  The datatype is varchar(40).  
Although, by checking this, I noticed that k_h.incidentid was 
varchar(100).  Perhaps the difference in length between the keys caused 
the planner to not use the fastest method?  I have no defense as to why
those aren't the same... I will make them match and check.
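For what it's worth, aligning the declared types is a one-liner in 8.0 (a sketch; note this rewrites the column and fails if any existing value exceeds 40 characters):

```sql
-- Match k_h.incidentid to eventactivity.incidentid's declared type:
ALTER TABLE k_h ALTER COLUMN incidentid TYPE varchar(40);
```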


Here's the EXPLAIN analyze with enable_hashjoin = off and 
enable_mergejoin = off :


Limit  (cost=4226535.73..4226544.46 rows=698 width=82) (actual time=74339.016..74356.521 rows=888 loops=1)
  ->  Unique  (cost=4226535.73..4226544.46 rows=698 width=82) (actual time=74339.011..74354.073 rows=888 loops=1)
        ->  Sort  (cost=4226535.73..4226537.48 rows=698 width=82) (actual time=74339.003..74344.031 rows=3599 loops=1)
              Sort Key: eventmain.entrydate, eventmain.incidentid, eventgeo.eventlocation, eventactivity.recordtext
              ->  Nested Loop  (cost=0.00..4226502.76 rows=698 width=82) (actual time=921.325..74314.959 rows=3599 loops=1)
                    ->  Nested Loop  (cost=0.00..4935.61 rows=731 width=72) (actual time=166.354..14638.308 rows=1162 loops=1)
                          ->  Nested Loop  (cost=0.00..2482.47 rows=741 width=50) (actual time=150.396..7348.013 rows=1162 loops=1)
                                ->  Index Scan using k_h_id_idx on k_h  (cost=0.00..217.55 rows=741 width=14) (actual time=129.540..1022.243 rows=1162 loops=1)
                                      Index Cond: (id = 33396)
                                      Filter: ((entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))
                                ->  Index Scan using eventgeo_incidentid_idx on eventgeo  (cost=0.00..3.04 rows=1 width=36) (actual time=5.260..5.429 rows=1 loops=1162)
                                      Index Cond: ((eventgeo.incidentid)::text = ("outer".incidentid)::text)
                          ->  Index Scan using eventmain_incidentid_idx on eventmain  (cost=0.00..3.30 rows=1 width=22) (actual time=5.976..6.259 rows=1 loops=1162)
                                Index Cond: ((eventmain.incidentid)::text = ("outer".incidentid)::text)
                    ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..5774.81 rows=20 width=52) (actual time=29.768..51.334 rows=3 loops=1162)
                          Index Cond: (("outer".incidentid)::text = (eventactivity.incidentid)::text)
                          Filter: ((((' '::text || (recordtext)::text) || ' '::text) ~~ '%HAL%'::text) AND (entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Encouraging multi-table join order

2006-04-10 Thread Tom Lane
Dan Harris [EMAIL PROTECTED] writes:
 Yes, eventactivity.incidentid is indexed.  The datatype is varchar(40).  
 Although, by checking this, I noticed that k_h.incidentid was 
 varchar(100).  Perhaps the difference in length between the keys caused 
 the planner to not use the fastest method?

No, the planner wouldn't care about that.

 Here's the EXPLAIN analyze with enable_hashjoin = off and 
 enable_mergejoin = off :

OK, so it does consider the right plan, but it's estimating it'll take
longer than the other one.  One thing that's very strange is that the
estimated number of rows out has changed ... did you re-ANALYZE since
the previous message?

 ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..5774.81 rows=20 width=52) (actual time=29.768..51.334 rows=3 loops=1162)
       Index Cond: (("outer".incidentid)::text = (eventactivity.incidentid)::text)
       Filter: ((((' '::text || (recordtext)::text) || ' '::text) ~~ '%HAL%'::text) AND (entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))

So it's estimating 5775 cost units per probe into eventactivity, which
is pretty high --- it must think that a lot of rows will be retrieved by
the index (way more than the 20 or so it thinks will get past the filter
condition).  What does the pg_stats entry for eventactivity.incidentid
contain?  It might be worth increasing the statistics target for that
column to try to get a better estimate.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match