[HACKERS] proposal: add window function to 8.4

2008-06-09 Thread H . Harada
This topic has been discussed on this list and many user expect that
PostgreSQL implements it.
I'd like to work on this feature and hope that we can include it on 8.4.

Former discussions are here:

http://archives.postgresql.org/pgsql-hackers/2004-11/msg01093.php

http://archives.postgresql.org/pgsql-hackers/2007-01/msg00861.php


How it works and examples:
 SELECT dept, empno,
 RANK() OVER(PARTITION BY dept ORDER BY age) as age_rank,
 RANK() OVER(PARTITION BY dept ORDER BY salary) as salary_rank,
 SUM(salary) OVER(PARTITION BY dept ORDER BY age) as run_total
 FROM employees ORDER BY 1, 3, 4;

 dept empno age_rank salary_rank run_total
 ENG  2 12   4
 ENG  1 21   9
 QA   3 12   3
 QA   4 21   65000
 (ref.: http://www.gavinsherry.org/talks/window_talk.pdf)


My current idea and concept:
- add window function and treat it specially such like aggregate
function and setof function.
- some features may be dropped at the first release, considering to
support them later.
- to formulate and to let it work properly are primary, performance
optimization is secondary.


From my survey around web and list archive, the points are:
- what is window function rank(), rank_dense(), lead() and others?
  First of all, we should define the window function such like
aggregate function. In my opinion, there are two types of functions in
OVER () call context. One of them is aggregate, and the other is
window (ranking) function. Sum() in a query like

  SELECT empno, sum(salary) OVER (PARTITION BY depno) FROM empsalary;

 is obviously aggregate function. This type of function can be used as
it is now. Only executer will change its behavior.
 Current pgsql feature sets lack window function like rank(). This
type of function must 1) have a context such as SETOF functions, 2)
return values until executor says DONE, rather than function itself
says DONE as in SETOF function, and 3) know about other tuples
(mainly ORDER BY clause), as rank() doesn't take any arguments but
knows the ranking boundary.  I suppose that pgsql have new function
system for these types called window function.
 Once we can define window function, users have extensibility to this
type of function.

- do we really need FRAME clause?
 From my survey, Oracle and DB2 have FRAME clause

  SELECT empno, sum(salary) OVER (ORDER BY empno ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW) FROM empsalary;

 while MS SQL Server doesn't (note that the literal from ROWS to
CURRENT ROW is called FRAME clause). To implement FRAME clause is
much more complicated than only PARTITION and ORDER clause support
because we need random access to result tuples. Though we will be
expected to support FAME clause in long-term, for the first release it
might be able to be put off. Even though rank() doesn't support FRAME
clause (only PARTITION and ORDER) it is so useful, more than now at
least.

- execution order
 In SQL:2003, the execution order is defined as

 where - group by - having - (windowing) * N - order by (outer,
currently existing one)
 where windowing is
 partition by - order by - (framing) * N

 But Oracle seems that it has another order

 (windowing) * N - where - group by ... and so on.

 which is better for us? With Oracle's one you can say
  SELECT empno, rank() OVER (PARTITION BY depno ORDER BY saraly) AS
topsalary FROM empsalary
  WHERE topsaraly  3
 to get the top 3 people taking heighest salary. In the SQL standard,
you should the nest query.
 I insist the first (standard) one is better because we may want use
the result of normal aggregate in OVER clause.

- plan and node
 Currently in my mind the execution strategy could be:

 1. Where  GroupBy  Having
 |
 2. SortBy partitionClause, orderByClause
 |
 3. Window
   foreach partition:
 if not there_is_frame():
   aggvalue = null
   foreach row in partition:
 aggvalue = agg_trans_func(aggvalue)
   aggvalue = agg_final_func(aggvalue)

 foreach row in partition:
   if has frame clause:
 aggvalue = null
 frame = make_frame()
 foreach row_in_frame:
   aggvalue = aggregate_trans_func(aggvalue)
 aggvalue = aggregate_final_func(aggvalue)

   set aggvalue to row
   val = window_func()
   set val to row
   goto 2. if another window remained
 |
 4. SortBy ORDER BY clause (outer)  Limit
 |
 5. Output

 This pseudo code is quite simple and stupid. We may optimize it by
splitting tasks with MergeJoin, etc. or think about process 2. that
collect same PARTITION clauses to reduce sort operation. Or to use
Hash Agg to create PARTITION. But let's be stupid at first.
Optimization is secondary.


References:
 description by Stefan DeBloch
http://wwwdvs.informatik.uni-kl.de/courses/NEDM/WS0607/Vorlesungsunterlagen/NEDM.Chapter.06.Windows_and_Query_Functions_in_SQL.pdf
 via Wikipedia[Select (SQL)] http://en.wikipedia.org/wiki/Select_(SQL)


 BTW, what about Bizgres 

Re: [HACKERS] proposal: add window function to 8.4

2008-06-09 Thread Decibel!

On Jun 9, 2008, at 7:32 AM, H.Harada wrote:

This topic has been discussed on this list and many user expect that
PostgreSQL implements it.
I'd like to work on this feature and hope that we can include it on  
8.4.



I can't really comment on the technical aspects of your proposal, but  
yes, please, windowing functions would be great to have even if not  
fully implemented.

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828




smime.p7s
Description: S/MIME cryptographic signature