[SQL] How slow is distinct - 2nd

Michael Contzen Fri, 04 Oct 2002 05:25:14 -0700

Hello,

I posted some observations to the performance of postgres some weeks
ago.
The problem with the poor performance of "select distinct" still exists, 
but I tried to worked out some reproducable results in a less complicated 
way than in my first postings.


'select distinct' preforms on Oracle  about 30 times faster than in
Postgres.

Well, only 'select distinct' is slow. All other parts of our application
using Postgres 
perform quite well on Postgres (with really big amounts of data). 


(All tests on Postgres 7.3b1 and Oracle 9.0.1.3.0.
Hardware: 4 * 1800 MHz Xeon, 3 * 100GB IDE Software-Raid0)
Suse-Linux, 2.4.18 SMP-kernel)


Here the table:

mc=# \d egal
     Table "public.egal"
 Column |  Type   | Modifiers 
--------+---------+-----------
 i      | integer | 

mc=# select count(*) from egal;
  count  
---------
 7227744
(1 row)

mc=# select count(distinct i) from egal;
 count 
-------
    67
(1 row)

(The integers have values between 0 and 99 are an extract of our
warehouse and nearly distributed randomly)

The last query - count(distinct) - takes 1m30s while oracle needs 0m7s
for the same query and data. Getting the distinct rows and measuring the
time results to

time echo "select distinct i from egal;"|psql >/dev/null

real    4m52.108s
user    0m0.000s
sys     0m0.010s

Here I don't understand the difference in performance between "select
distinct i" and "select count(distinct i)" - Oracle takes constantly 7s
for each query:

time echo "select distinct i from egal;"|sqlplus xxxx/yyyy >/dev/null

real    0m6.979s
user    0m0.020s
sys     0m0.030s

In the first case (count(distinct)) postgres produces a temporary file
of 86 MB in pgsql_tmp, in the second case (select distinct) the
temp-file increases to 260 MB. (this is even larger than the table size
of egal which is 232 MB)

I think the planner has not many choices for this query so it results to 
mc=# explain select distinct (i) from egal;
                                QUERY
PLAN                                 
---------------------------------------------------------------------------
 Unique  (cost=1292118.29..1328257.01 rows=722774 width=4)
   ->  Sort  (cost=1292118.29..1310187.65 rows=7227744 width=4)
         Sort Key: i
         ->  Seq Scan on egal  (cost=0.00..104118.44 rows=7227744
width=4)
(4 rows)


Which looks similar to oracle's plan:
QPLAN          
---------------
  SORT UNIQUE  
    TABLE ACCESS FULL EGAL

Our problem is that things getting worse when the table size increases,
as described in my first mails. (Especially when the temp-file is split
into parts) Reorganizing/normalizing the data is no solultion, because
our application is just for analyzing the data. 

I tried out the trigger-idea: inserting a value into a table with an
unique key only if still does't exists, it works, but .. select distinct
is a much faster solution.

An other way I tried is an PL/TCL function which stores all different
values into a assoziative array in memory, by far the fastest solution -
but returning the different values makes problems to me (I shouldn't
say, but I put it into a file and "copy" it back). And it's not very
nice - not as nice as "select distinct" :)

Kind regards,

Michael Contzen

<<attachment: winmail.dat>>


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

[SQL] How slow is distinct - 2nd

Reply via email to