[PERFORM] Multiple Uniques

2004-09-02 Thread Markus Schaber
Hello,

Today, we stumbled about the following query plan on PostGreSQL 7.4.1:

logigis=# explain select count(id) from (select distinct id from (select distinct 
ref_in_id as id from streets union select distinct nref_in_id as id from streets) as 
blubb) as blabb; 
  QUERY PLAN   
   
--
 Aggregate  (cost=33246086.32..33246086.32 rows=1 width=8)
   ->  Subquery Scan blabb  (cost=32970061.50..33246085.82 rows=200 width=8)
 ->  Unique  (cost=32970061.50..33246083.82 rows=200 width=8)
   ->  Sort  (cost=32970061.50..33108072.66 rows=55204464 width=8)
 Sort Key: id
 ->  Subquery Scan blubb  (cost=23697404.79..24525471.75 
rows=55204464 width=8)
   ->  Unique  (cost=23697404.79..23973427.11 rows=55204464 
width=8)
 ->  Sort  (cost=23697404.79..23835415.95 
rows=55204464 width=8)
   Sort Key: id
   ->  Append  (cost=7212374.04..15252815.03 
rows=55204464 width=8)
 ->  Subquery Scan "*SELECT* 1"  
(cost=7212374.04..7626407.52 rows=27602232 width=8)
   ->  Unique  
(cost=7212374.04..7350385.20 rows=27602232 width=8)
 ->  Sort  
(cost=7212374.04..7281379.62 rows=27602232 width=8)
   Sort Key: ref_in_id
   ->  Seq Scan on streets 
 (cost=0.00..3129090.32 rows=27602232 width=8)
 ->  Subquery Scan "*SELECT* 2"  
(cost=7212374.04..7626407.52 rows=27602232 width=8)
   ->  Unique  
(cost=7212374.04..7350385.20 rows=27602232 width=8)
 ->  Sort  
(cost=7212374.04..7281379.62 rows=27602232 width=8)
   Sort Key: nref_in_id
   ->  Seq Scan on streets 
 (cost=0.00..3129090.32 rows=27602232 width=8)
(20 rows)

I might have to add that this query is not an every-day query, it's
merely part of some statistics that a colleague needs for some
estimations as he has to create a tool that works on this data.

Both ref_in_id and nref_in_id are non-indexed columns with type int8.

I was somehow irritated by the fact that the Query Plan contains 4 Uniques.

Then, I tried the following query:

logigis=# explain select count(id) from (select distinct id from (select  ref_in_id as 
id from streets union select  nref_in_id as id from streets) as blubb) as blabb;
QUERY PLAN 

---
 Aggregate  (cost=24803496.57..24803496.57 rows=1 width=8)
   ->  Subquery Scan blabb  (cost=24527471.75..24803496.07 rows=200 width=8)
 ->  Unique  (cost=24527471.75..24803494.07 rows=200 width=8)
   ->  Sort  (cost=24527471.75..24665482.91 rows=55204464 width=8)
 Sort Key: id
 ->  Subquery Scan blubb  (cost=15254815.03..16082881.99 
rows=55204464 width=8)
   ->  Unique  (cost=15254815.03..15530837.35 rows=55204464 
width=8)
 ->  Sort  (cost=15254815.03..15392826.19 
rows=55204464 width=8)
   Sort Key: id
   ->  Append  (cost=0.00..6810225.28 
rows=55204464 width=8)
 ->  Subquery Scan "*SELECT* 1"  
(cost=0.00..3405112.64 rows=27602232 width=8)
   ->  Seq Scan on streets  
(cost=0.00..3129090.32 rows=27602232 width=8)
 ->  Subquery Scan "*SELECT* 2"  
(cost=0.00..3405112.64 rows=27602232 width=8)
   ->  Seq Scan on streets  
(cost=0.00..3129090.32 rows=27602232 width=8)
(14 rows)

And after re-parsing the documentation about UNION, the following one:

logigis=# explain select count(id) from (select  ref_in_id as id from streets union 
select  nref_in_id as id from streets) as blubb;
   QUERY PLAN  
  
-
 Aggregate  (cost=16220893.1

Re: [PERFORM] Multiple Uniques

2004-09-02 Thread Tom Lane
Markus Schaber <[EMAIL PROTECTED]> writes:
> Today, we stumbled about the following query plan on PostGreSQL 7.4.1:

> logigis=# explain select count(id) from (select distinct id from (select distinct 
> ref_in_id as id from streets union select distinct nref_in_id as id from streets) as 
> blubb) as blabb; 

> I was somehow irritated by the fact that the Query Plan contains 4 Uniques.

Well, if you write a silly query, you can get a silly plan ...

As you appear to have realized later, given the definition of UNION,
all three of the explicit DISTINCTs are redundant.

> So, now my question is, why does the query optimizer not recognize that
> it can throw away those "non-unique" Sort/Unique passes?

Because the issue doesn't come up often enough to justify expending
cycles to check for it.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Multiple Uniques

2004-09-02 Thread Greg Stark

Markus Schaber <[EMAIL PROTECTED]> writes:

> logigis=# explain select count(id) from (select  ref_in_id as id from streets union 
> select  nref_in_id as id from streets) as blubb;
>QUERY PLAN
> 
> -
>  Aggregate  (cost=16220893.16..16220893.16 rows=1 width=8)
>->  Subquery Scan blubb  (cost=15254815.03..16082881.99 rows=55204464 width=8)
>  ->  Unique  (cost=15254815.03..15530837.35 rows=55204464 width=8)
>->  Sort  (cost=15254815.03..15392826.19 rows=55204464 width=8)
>  Sort Key: id
>  ->  Append  (cost=0.00..6810225.28 rows=55204464 width=8)
>->  Subquery Scan "*SELECT* 1"  (cost=0.00..3405112.64 
> rows=27602232 width=8)
>  ->  Seq Scan on streets  (cost=0.00..3129090.32 
> rows=27602232 width=8)
>->  Subquery Scan "*SELECT* 2"  (cost=0.00..3405112.64 
> rows=27602232 width=8)
>  ->  Seq Scan on streets  (cost=0.00..3129090.32 
> rows=27602232 width=8)

You can actually go one step further and do:

 select count(distinct id) from (select ... union all select ...) as blubb;

I'm not sure why this is any faster since it still has to do all the same
work, but it's a different code path and it seems to be about 20% faster for
me.

-- 
greg


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])