Oddball data distribution giving me headaches.

We have a single 'customer' table with customer_id, customer_type, and name/demographic 
information.
Assume roughly 1 million rows in the customer table.
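
For reference, a rough sketch of what the customer table looks like (column names and 
types are approximations, not the real DDL):

CREATE TABLE customer
(
  customer_id    integer PRIMARY KEY,
  customer_type  character(16) NOT NULL,   -- e.g. 'DISTRIBUTOR'; shows up as bpchar in the plan below
  name           text
  -- ...plus assorted demographic columns
);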

We then have a customer 'relationship' table which simply contains 2 
columns…designating parent and child relationships…but allowing complex 
hierarchies between customers.


CREATE TABLE customer_rel
(
  parent_customer integer NOT NULL,
  child_customer integer NOT NULL
);

There are 8 million rows in this table, with an oddball distribution. Some 8 levels of 
hierarchy (by customer type) are represented in it. Every customer gets a row where 
parent and child are itself, plus one row for each customer in its 'upline'. At the 
highest level we have 'distributors', which have all other customer types underneath 
them. With roughly 68 distributors, the rows where THEY are the parent_customer account 
for nearly a million of the 8 million rows.
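
To see the skew, a query along these lines (against the tables as described above) 
breaks the relationship rows down by the parent's customer type:

select c.customer_type, count(*) as rel_rows
from customer_rel r
join customer c on c.customer_id = r.parent_customer
group by c.customer_type
order by rel_rows desc;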

I have extracted a simple case from a larger query that was generating an 
off-beat plan because of the unexpected row counts the planner was producing for a 
low-level subquery.

explain analyze
select * from customer_rel
where parent_customer in (select customer_id
                          from customer
                          where customer_type = 'DISTRIBUTOR');


"  Nested Loop  (cost=25429.44..29626.39 rows=931 width=0) (actual 
time=216.325..1238.091 rows=1025401 loops=1)"
"        ->  HashAggregate  (cost=25429.44..25430.80 rows=136 width=4) (actual 
time=216.304..216.339 rows=68 loops=1)"
"              ->  Seq Scan on customer  (cost=0.00..25429.10 rows=136 width=4) 
(actual time=0.018..216.226 rows=68 loops=1)"
"                    Filter: (customer_type = 'DISTRIBUTOR'::bpchar)"
"        ->  Index Scan using rel_parent on customer_rel  (cost=0.00..30.76 
rows=7 width=4) (actual time=0.006..8.190 rows=15079 loops=68)"
"              Index Cond: (parent_customer = customer.customer_id)"
"Total runtime: 1514.810 ms"    

The killer here is that the top-level nested loop THINKS it will only be returning 931 
rows, when it actually returns over a million: off by three orders of magnitude. The 
results of this query are used as part of a bigger query, and the badly off estimates 
are causing all sorts of havoc further up the plan.
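
For what it's worth, the equivalent join form, which I'd expect to run into the same 
estimation problem, would be something like:

explain analyze
select r.*
from customer_rel r
join customer c on c.customer_id = r.parent_customer
where c.customer_type = 'DISTRIBUTOR';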

I'm experimenting on PostgreSQL 9.1.0. I have set the statistics target to 1000 (and 
10000) on both columns (parent_customer and child_customer) to little effect, and have 
also hardcoded n_distinct on the parent_customer column to 1,000,000 with no effect 
(running ANALYZE on the table after each change).
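
Roughly, the statements I've been trying look like this (with the statistics target 
varied between 1000 and 10000):

ALTER TABLE customer_rel ALTER COLUMN parent_customer SET STATISTICS 1000;
ALTER TABLE customer_rel ALTER COLUMN child_customer SET STATISTICS 1000;
ALTER TABLE customer_rel ALTER COLUMN parent_customer SET (n_distinct = 1000000);
ANALYZE customer_rel;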

Does this oddball data distribution doom me to poor planning forever?
Any other thoughts?

