count(*)

Marc Mamin Mon, 26 Jan 2009 06:49:06 -0800

Hello,

I have a query to aggregate data wich is too slow :-)


Here a simplified example:

create table test 
(
time         int8, --store the time as epoch
a_group      varchar,
category     varchar
)


For each group, I need the first/last times and categories , the number
of distinct categories and the number of records.


Here my best solution until now:


SELECT
FIRST.a_group,
FIRST.time     as first_time,
FIRST.category as first_category,
LAST.time      as last_time,
LAST.category  as last_category,
AGG.c_count,
AGG.c_all
FROM 

        (
         select distinct on (a_group) 
         a_group,
         time,
         category
         from test
         order by a_group, time
        ) FIRST,

        (
         select distinct on (a_group) 
         a_group,
         time,
         category
         from test
         order by a_group, time DESC
        ) LAST,

        (
         select a_group, 
         count(distinct category) as c_count,
         count(*) as c_all
         from test
         group by a_group
         order by a_group 
         ) AGG
 
 where FIRST.a_group = LAST.a_group
 and LAST.a_group=AGG.a_group



each sub query is quite fast -- thanks for the DISTINCT ON feature :-) ,
but the whole is really slow as Postgres start to swap due to the large
amount of data to merge.


I guess there must be a better solution as the three sub queries return
exactly one row for each  'a_group'
and are moreover already sorted  (The table does not contain any NULL
value). 
But in the query plan below, most of the cost comes form the merges.


I imagine there must be a way using custom aggregation functions, but 
I'm not confident with those:

Is it possible to define aggregate in order to retrieve the first/last
values of an ordered result set? This would allow to make a single scan
of the table.
something like 

select a_group,
first(category) as first_category,
last(category)  as last_category,
...
from test 
order by a_group,time


Many thanks for any hints.

Marc Mamin





Here are some dummy values if you'd like to play with this issue:

insert into test select s,'G'||s , 'C1' from(select
generate_series(1,10000)as s)s;
insert into test select s+10,'G'||s , 'C2' from(select
generate_series(1,10000)as s)s;
insert into test select s+13,'G'||s , 'C3' from(select
generate_series(1,10000)as s)s;
insert into test select s+1,'G'||s , 'C2' from(select
generate_series(1,10000,5)as s)s;
insert into test select s,'G'||s%10 , 'C3' from(select
generate_series(1,10000,5)as s)s;
insert into test select s+1,'G'||s%5 , 'C2' from(select
generate_series(1,10000,5)as s)s;
insert into test select s+1,'G'||s , 'C1' from(select
generate_series(1,1000000)as s)s; --10^6 !!

create index test_i on test(a_group);
analyze test;

=>

Merge Join  (cost=259000.31..34904377039.75 rows=1550421099181
width=128)
  Merge Cond: ((test.a_group)::text = (last.a_group)::text)
  ->  Merge Join  (cost=129500.16..17814340.14 rows=783387153 width=120)
        Merge Cond: ((test.a_group)::text = (test.a_group)::text)
        ->  GroupAggregate  (cost=0.00..53681.23 rows=395825 width=10)
              ->  Index Scan using test_i on test  (cost=0.00..39973.53
rows=1036043 width=10)
        ->  Materialize  (cost=129500.16..133458.41 rows=395825
width=72)
              ->  Unique  (cost=119965.87..125146.08 rows=395825
width=18)
                    ->  Sort  (cost=119965.87..122555.97 rows=1036043
width=18)
                          Sort Key: test.a_group, test."time"
                          ->  Seq Scan on test  (cost=0.00..16451.43
rows=1036043 width=18)
  ->  Materialize  (cost=129500.16..133458.41 rows=395825 width=72)
        ->  Subquery Scan last  (cost=119965.87..129104.33 rows=395825
width=72)
              ->  Unique  (cost=119965.87..125146.08 rows=395825
width=18)
                    ->  Sort  (cost=119965.87..122555.97 rows=1036043
width=18)
                          Sort Key: test.a_group, test."time"
                          ->  Seq Scan on test  (cost=0.00..16451.43
rows=1036043 width=18)









-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

[SQL] aggregation problem: first/last/count(*)

Reply via email to