[SQL] How slow is distinct - 2nd

2002-10-04 Thread Michael Contzen

Hello,

I posted some observations to the performance of postgres some weeks
ago.
The problem with the poor performance of "select distinct" still exists,
but I have tried to work out some reproducible results in a less complicated
way than in my first postings.

'select distinct' performs about 30 times faster on Oracle than on
Postgres.

Well, only 'select distinct' is slow. All other parts of our application
perform quite well on Postgres (with really big amounts of data).


(All tests on Postgres 7.3b1 and Oracle 9.0.1.3.0.
Hardware: 4 * 1800 MHz Xeon, 3 * 100 GB IDE software RAID-0,
SuSE Linux, 2.4.18 SMP kernel.)


Here is the table:

mc=# \d egal
     Table "public.egal"
 Column |  Type   | Modifiers 
--------+---------+-----------
 i      | integer | 

mc=# select count(*) from egal;
  count  
---------
 7227744
(1 row)

mc=# select count(distinct i) from egal;
 count 
-------
    67
(1 row)

(The integers have values between 0 and 99, are an extract of our
warehouse, and are distributed nearly randomly.)
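
For anyone who wants to reproduce the setup: a similar table could be
generated roughly like this (a sketch only, not our real warehouse load;
generate_series() does not exist in 7.3, so there the data would have to
be produced externally and loaded with COPY):

  -- sketch: ~7.2 million integers in the range 0..99, roughly random
  CREATE TABLE egal (i integer);
  INSERT INTO egal
      SELECT (random() * 99)::integer
      FROM generate_series(1, 7227744);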

The last query - count(distinct) - takes 1m30s, while Oracle needs 0m7s
for the same query and data. Getting the distinct rows and measuring the
time results in

time echo "select distinct i from egal;"|psql >/dev/null

real    4m52.108s
user    0m0.000s
sys     0m0.010s

Here I don't understand the difference in performance between "select
distinct i" and "select count(distinct i)" - Oracle consistently takes 7s
for either query:

time echo "select distinct i from egal;"|sqlplus / >/dev/null

real    0m6.979s
user    0m0.020s
sys     0m0.030s

In the first case (count(distinct)) Postgres produces a temporary file
of 86 MB in pgsql_tmp; in the second case (select distinct) the
temp file grows to 260 MB. (This is even larger than the table size
of egal, which is 232 MB.)

I think the planner does not have many choices for this query, so it results in:
mc=# explain select distinct (i) from egal;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Unique  (cost=1292118.29..1328257.01 rows=722774 width=4)
   ->  Sort  (cost=1292118.29..1310187.65 rows=7227744 width=4)
         Sort Key: i
         ->  Seq Scan on egal  (cost=0.00..104118.44 rows=7227744 width=4)
(4 rows)


This looks similar to Oracle's plan:

QPLAN
------------------------------------------------------------
 SORT UNIQUE
   TABLE ACCESS FULL EGAL

Our problem is that things get worse when the table size increases,
as described in my first mails (especially when the temp file is split
into parts). Reorganizing/normalizing the data is no solution, because
our application is just for analyzing the data.

I tried out the trigger idea: inserting a value into a table with a
unique key only if it does not exist yet. It works, but .. select distinct
is a much faster solution.
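
The idea, roughly sketched (not my exact function, just the approach; the
table and function names are made up here, and plpgsql is assumed to be
installed, e.g. via createlang):

  -- collect the distinct values into a side table with a unique key
  CREATE TABLE egal_distinct (i integer PRIMARY KEY);

  CREATE OR REPLACE FUNCTION egal_collect_distinct() RETURNS trigger AS '
  BEGIN
      -- insert the value only if it is not already known
      -- (ignoring concurrency for this sketch)
      IF NOT EXISTS (SELECT 1 FROM egal_distinct WHERE i = NEW.i) THEN
          INSERT INTO egal_distinct (i) VALUES (NEW.i);
      END IF;
      RETURN NEW;
  END;
  ' LANGUAGE plpgsql;

  CREATE TRIGGER egal_distinct_trig
      BEFORE INSERT ON egal
      FOR EACH ROW EXECUTE PROCEDURE egal_collect_distinct();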

Another way I tried is a PL/Tcl function which stores all distinct
values in an associative array in memory - by far the fastest solution -
but returning the distinct values caused problems for me (I shouldn't
say it, but I write them into a file and "copy" them back). And it's not
very nice - not as nice as "select distinct" :)

Kind regards,

Michael Contzen







Re: [SQL] How slow is distinct - 2nd

2002-10-08 Thread Michael Contzen

Bruno Wolff III wrote:
> 
> On Tue, Oct 01, 2002 at 14:18:50 +0200,
>   Michael Contzen <[EMAIL PROTECTED]> wrote:
> > Here the table:
> >
> > mc=# \d egal
> >      Table "public.egal"
> >  Column |  Type   | Modifiers
> > --------+---------+-----------
> >  i      | integer |
> >
> > mc=# select count(*) from egal;
> >   count
> > ---------
> >  7227744
> > (1 row)
> >
> > mc=# select count(distinct i) from egal;
> >  count
> > -------
> >     67
> > (1 row)
> 
> This suggests that the best way to do this is with a hash instead of a sort.
> 
> If you have lots of memory you might try increasing the sort memory size.

Hello,

OK, sort_mem was still set to the default (1024). I've increased it to
sort_mem=10240, which results in the following (same machine, same data,
etc.):

time echo "select distinct i from egal;"|psql mc >/dev/null
 

real2m30.667s
user0m0.000s
sys 0m0.010s

If I set sort_mem=1024000:

time echo "select distinct i from egal;"|psql mc >/dev/null
 
real0m52.274s
user0m0.020s
sys 0m0.000s

Wow, in comparison to nearly 5 minutes before, this is quite a good
speedup.
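
For reference, the setting can also be changed per session instead of in
postgresql.conf; sort_mem is given in kilobytes:

  -- affects only the current session; the value is in kB
  SET sort_mem = 1024000;
  SELECT DISTINCT i FROM egal;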

But: 

All the work can be done in memory, as the processor load shows (top
showed the following output the whole time):


  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 8310 postgres  17   0  528M 528M  2712 R    99.9 13.5   0:11 postmaster

Even though it performs nearly 5 times faster than before, Postgres is
still 8 times slower than Oracle. Further increasing sort_mem to 4096000
doesn't reduce the time, as the CPU load cannot be increased any more :-)

But increasing the memory in that way is not really a solution: normally
not all the data fits into memory. In our application I guess about 10%
does.

Oracle uses even less memory:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 8466 oracle    14   0 11052  10M 10440 R    99.9  0.2   0:01 oracle

(That is 10 MB of session memory, plus a 32 MB shared memory pool not shown here.)

This suggests to me that Oracle uses a quite different algorithm for this
task. Maybe it uses some hashing-like algorithm first, without sorting at
all. I don't know Oracle well enough; perhaps that is the "sort unique"
step in its plan output.
I think Postgres first sorts all the data, which results in temporary
data of the same size as the table and which needs to be written to disk
at least once, and only after that does the unique operation, right?
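
If the planner could use a hash instead of a sort, as Bruno suggests, the
GROUP BY formulation of the same query might be a place where such a plan
could kick in. A sketch of the equivalent query (it returns the same rows
as the distinct):

  -- same result set as SELECT DISTINCT i FROM egal;
  -- with only 67 distinct values a hash table would easily fit in memory
  SELECT i FROM egal GROUP BY i;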

If I can do any more tests on Oracle or Postgres, let me know.

Kind regards,

Michael Contzen
