[SQL] Invitation to connect on LinkedIn

2009-01-26 Thread Suha Onay
LinkedIn


Suha Onay requested to add you as a connection on LinkedIn:
--

Marcin,

I'd like to add you to my professional network on LinkedIn.

- Suha

View invitation from Suha Onay
http://www.linkedin.com/e/XCcEdFwZmSzO1q19PFca3OmGtDdI_FirNbrhE9/blk/962530550_2/cBYMdjkMcPkOdzALqnpPbOYWrSlI/svi/
 

--


 
--
(c) 2009, LinkedIn Corporation



[SQL] aggregation problem: first/last/count(*)

2009-01-26 Thread Marc Mamin
Hello,

I have a query to aggregate data which is too slow :-)

Here is a simplified example:

create table test
(
  time     int8,    -- store the time as epoch
  a_group  varchar,
  category varchar
);


For each group, I need the first/last times and categories, the number
of distinct categories, and the number of records.


Here is my best solution so far:


SELECT
  FIRST.a_group,
  FIRST.time     as first_time,
  FIRST.category as first_category,
  LAST.time      as last_time,
  LAST.category  as last_category,
  AGG.c_count,
  AGG.c_all
FROM
(
  select distinct on (a_group)
         a_group, time, category
  from test
  order by a_group, time
) FIRST,
(
  select distinct on (a_group)
         a_group, time, category
  from test
  order by a_group, time DESC
) LAST,
(
  select a_group,
         count(distinct category) as c_count,
         count(*) as c_all
  from test
  group by a_group
  order by a_group
) AGG
where FIRST.a_group = LAST.a_group
  and LAST.a_group = AGG.a_group;



Each sub-query is quite fast -- thanks to the DISTINCT ON feature :-) --
but the whole query is really slow, as Postgres starts to swap due to
the large amount of data to merge.


I guess there must be a better solution, as the three sub-queries each
return exactly one row per 'a_group' and are moreover already sorted
(the table does not contain any NULL values).
But in the query plan below, most of the cost comes from the merges.


I imagine there must be a way using custom aggregate functions, but
I'm not confident with those:

Is it possible to define aggregates that retrieve the first/last
values of an ordered result set? That would allow a single scan
of the table, something like

select a_group,
first(category) as first_category,
last(category)  as last_category,
...
from test 
order by a_group,time
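
Such first/last aggregates can indeed be defined by hand. A minimal sketch
(the function and aggregate names here are illustrative, not built-ins; this
is the commonly circulated recipe, untested against this exact schema):

```sql
-- State function for "first": once the state is set, keep it.
-- STRICT makes the first non-NULL input become the initial state.
CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT
AS 'SELECT $1';

-- State function for "last": always replace the state with the newest input.
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT
AS 'SELECT $2';

CREATE AGGREGATE first (anyelement) (SFUNC = first_agg, STYPE = anyelement);
CREATE AGGREGATE last  (anyelement) (SFUNC = last_agg,  STYPE = anyelement);

-- Usage: feed the aggregates pre-sorted rows via a subquery.
SELECT a_group,
       first(time)     AS first_time,
       first(category) AS first_category,
       last(time)      AS last_time,
       last(category)  AS last_category,
       count(DISTINCT category) AS c_count,
       count(*)                 AS c_all
FROM (SELECT * FROM test ORDER BY a_group, time) t
GROUP BY a_group;
```

Caveat: relying on the subquery's ORDER BY to control aggregate input order is
not guaranteed by the SQL standard, though it works in practice on PostgreSQL.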


Many thanks for any hints.

Marc Mamin





Here are some dummy values if you'd like to play with this issue:

insert into test select s,'G'||s , 'C1' from(select
generate_series(1,1)as s)s;
insert into test select s+10,'G'||s , 'C2' from(select
generate_series(1,1)as s)s;
insert into test select s+13,'G'||s , 'C3' from(select
generate_series(1,1)as s)s;
insert into test select s+1,'G'||s , 'C2' from(select
generate_series(1,1,5)as s)s;
insert into test select s,'G'||s%10 , 'C3' from(select
generate_series(1,1,5)as s)s;
insert into test select s+1,'G'||s%5 , 'C2' from(select
generate_series(1,1,5)as s)s;
insert into test select s+1,'G'||s , 'C1' from(select
generate_series(1,100)as s)s; --10^6 !!

create index test_i on test(a_group);
analyze test;

=>

Merge Join  (cost=259000.31..34904377039.75 rows=1550421099181 width=128)
  Merge Cond: ((test.a_group)::text = (last.a_group)::text)
  ->  Merge Join  (cost=129500.16..17814340.14 rows=783387153 width=120)
        Merge Cond: ((test.a_group)::text = (test.a_group)::text)
        ->  GroupAggregate  (cost=0.00..53681.23 rows=395825 width=10)
              ->  Index Scan using test_i on test  (cost=0.00..39973.53 rows=1036043 width=10)
        ->  Materialize  (cost=129500.16..133458.41 rows=395825 width=72)
              ->  Unique  (cost=119965.87..125146.08 rows=395825 width=18)
                    ->  Sort  (cost=119965.87..122555.97 rows=1036043 width=18)
                          Sort Key: test.a_group, test."time"
                          ->  Seq Scan on test  (cost=0.00..16451.43 rows=1036043 width=18)
  ->  Materialize  (cost=129500.16..133458.41 rows=395825 width=72)
        ->  Subquery Scan last  (cost=119965.87..129104.33 rows=395825 width=72)
              ->  Unique  (cost=119965.87..125146.08 rows=395825 width=18)
                    ->  Sort  (cost=119965.87..122555.97 rows=1036043 width=18)
                          Sort Key: test.a_group, test."time"
                          ->  Seq Scan on test  (cost=0.00..16451.43 rows=1036043 width=18)









-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] aggregation problem: first/last/count(*)

2009-01-26 Thread Volkan YAZICI
On Mon, 26 Jan 2009, "Marc Mamin"  writes:
> create table test 
> (
> time int8, --store the time as epoch
> a_group  varchar,
> category varchar
> )
>
> ...
>
> SELECT
> FIRST.a_group,
> FIRST.time as first_time,
> FIRST.category as first_category,
> LAST.time  as last_time,
> LAST.category  as last_category,
> AGG.c_count,
> AGG.c_all
> FROM
> ...

I think the problem here is that you want to collect the first and
last values in the same row. Instead, splitting them into two sequential
rows would suit your schema better, and you can rebuild the data
structure as you want in the application tier later. For instance,
consider the example below:

test=# SELECT ts, grp, val FROM foo;
 ts | grp | val
----+-----+-----
  1 |   1 |   1
  2 |   1 |   2
  3 |   1 |   3
  4 |   2 |   1
  4 |   2 |   2
  5 |   3 |   1
(6 rows)

test=# SELECT foo.ts, foo.grp, foo.val
         FROM (SELECT grp, MAX(ts) AS max_ts, MIN(ts) AS min_ts
                 FROM foo
                GROUP BY grp) AS bar
        INNER JOIN foo
           ON foo.grp = bar.grp
          AND (foo.ts = bar.min_ts OR foo.ts = bar.max_ts);
 ts | grp | val
----+-----+-----
  1 |   1 |   1
  3 |   1 |   3
  4 |   2 |   1
  4 |   2 |   2
  5 |   3 |   1
(5 rows)

After receiving the above output, you can traverse the returned rows one
by one in the application layer and produce the desired results.
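
(On PostgreSQL 8.4 and later, window functions offer yet another way to pull
both endpoint values in one scan, entirely in SQL; a hedged sketch against the
same foo table, untested here:)

```sql
-- One row per grp with first and last val by ts.
-- The explicit frame makes last_value() see the whole partition,
-- so both window results are constant per group and DISTINCT collapses them.
SELECT DISTINCT
       grp,
       first_value(val) OVER w AS first_val,
       last_value(val)  OVER w AS last_val
FROM foo
WINDOW w AS (PARTITION BY grp ORDER BY ts
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
```

Note that when two rows tie on ts (as grp = 2 does above), which one counts as
"first" or "last" is unspecified unless a tie-breaking column is added to the
ORDER BY.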


Regards.



Re: [SQL] aggregation problem: first/last/count(*)

2009-01-26 Thread Marc Mamin
 

> I think the problem here is that you want to collect the first and
> last values in the same row.


Your idea is ok, but it just postpones the problem. And I need the result
within the DB for further calculations/aggregations.

What I need is really something like:

test=# SELECT foo.ts, foo.grp, foo.val, foo2.val
         FROM (SELECT grp, MAX(ts) AS max_ts, MIN(ts) AS min_ts
                 FROM foo
                GROUP BY grp) AS bar
        INNER JOIN foo
           ON foo.grp = bar.grp
          AND foo.ts = bar.min_ts
        INNER JOIN foo AS foo2
           ON foo2.grp = bar.grp
          AND foo2.ts = bar.max_ts;

I've tested different solutions, and the DISTINCT ON clause was the best.
(I guess the best solution depends on the distribution of grp and val.)


I've also just found aggregate functions for first/last:
http://www.postgresonline.com/journal/index.php?/archives/68-More-Aggregate-Fun-Whos-on-First-and-Whos-on-Last.html


But it is slightly slower than my solution.

I'll still run a test with more data, as I guess that swapping will grow
faster with my query than with the first/last aggregate functions.
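
(For completeness: on PostgreSQL 9.0 and later, ORDER BY inside an aggregate
call makes it possible to get everything in one pass over the table without
any custom aggregates; a sketch against the original test table, with version
caveat as noted:)

```sql
-- One sequential scan plus one aggregation: endpoints, distinct count, total.
-- (array_agg(... ORDER BY ...))[1] picks the category of the earliest/latest
-- row per group.
SELECT a_group,
       min(time) AS first_time,
       max(time) AS last_time,
       (array_agg(category ORDER BY time))[1]      AS first_category,
       (array_agg(category ORDER BY time DESC))[1] AS last_category,
       count(DISTINCT category) AS c_count,
       count(*)                 AS c_all
FROM test
GROUP BY a_group;
```

The trade-off is memory: each group's categories are materialized into arrays,
which may matter for very large groups.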


cheers,

Marc Mamin








