Re: [HACKERS] Window functions patch v04 for the September commit fest

Heikki Linnakangas Mon, 01 Sep 2008 11:01:46 -0700

Hitoshi Harada wrote:

2008/8/30 Hitoshi Harada <[EMAIL PROTECTED]>:

Here's the latest window functions patch against HEAD. It seems to be
ready for the September commit fest, as added documents, WINDOW clause
feature and misc tests. I guess this would be the window functions
feature freeze for 8.4. The remaining feature will be implemented for
the later release.


This patch consists of:
- windowed aggregates
- cooperate with GROUP BY aggregates
- some optimizations with multiple windows
- ranking functions
- WINDOW clause
- windowing SQL regression tests
- sgml documents

Looking up the total road map, the dropped features are:

- sliding window (window framing)
- lead(), lag(), etc. that reach for random rows
- user defined window functions

Ok, I'm starting to read up on SQL2003 window functions, and to reviewthis patch. Here's a long brain-dump of my first thoughts on the design.


Let's list a couple of requirements first:

1. It's important that what gets committed now can be extended to handleall of the window function stuff in SQL2003 in the future, as well asuser-defined-window-functions in the spirit of PostgreSQL extensibility.Even if we don't implement all of it in this release.

2. Performance needs to be reasonable, in the face a large number ofinput rows. These are typically used in OLAP queries that processgazillions of rows. Some window functions need access to all rows in thewindow frame or partition, so you can't reasonably expect greatperformance with those if the working set is large, but those thatdon't, should work with finite memory requirements, and reasonably fast.

Because of 1., let's focus on the interface foruser-defined-window-functions first. Ideally, the interface is such thatall the window functions and aggregates defined in SQL2003, and otherswe might want to have in core or as pgfoundry projects, can beimplemented using it.

I think there's still some confusion in the design document about what aframe is. The terminology section is great, it helped me a lot to getstarted, so thank you for that. However, in the "Actual design in thepatch" you describe how a window function works, and you talk about awindow frame, but elsewhere in the doc you note that window frames arenot implemented yet.

As already discussed, there's two different kind of window expressions:window aggregates, and ranking aggregates (you call them windowfunctions in your doc). It seems that they are different enough that wecan't bang them together. For example, you can't use a window frameclause with ranking aggregates, and ranking aggregates need to see thewhole row, whereas window aggregates only see the expressions passed asargument.


API for window functions (aka. ranking aggregate functions)
------------------------

Borrowing ideas from the many approaches listed in the design doc, Ipropose an interface using a set-returning function. Similar to Simon'sproposal, but taking into account that a ranking function might need tosee some or all input tuples before returning anything.

For each input tuple, the window function can return any number oftuples, including 0. After processing all input values, though, thetotal number of output rows returned must match the total number ofinput values fed into it. The function must eventually output one valuefor each input value, in the same order.

This is a flexible design that allows for different kind ofimplementations; the function can return one output tuple for each inputtuple (ROW_NUMBER(), RANK() etc.), or it can swallow all input tuplesfirst, and only then return all output tuples (e.g. CUME_DIST).

Here's a couple of step-by-step examples of how some functions wouldwork. Imagine input values 1, 3, 4, 4, 5:


RANK
----

Invocation      Input   Value returned  Internal state (after)
1               1       1               (last value 1, count 1, rank=1)
2               3       2               (last value 2, count 1, rank=2)
3               4       3               (last value 4, count 1, rank=3)
4               4       3               (last value 4, count 2, rank=3)
5               5       5               (last value 5, count 1, rank=5)

So RANK returns one tuple after each input tuple. Just like in yourcurrent patch, keep the last value returned, a row count, and thecurrent rank as state.


LAG
---

Internal state is a FIFO queue of values. Each input value is pushed tothe FIFO, and the tuple that falls out of the queue is returned.


For example, LAG(<col>,2,0):

Invocation      Input row       Value returned  Internal state (after)
1               1               0               (1)
2               3               0               (3, 1)
3               4               1               (4, 3)
4               4               3               (4, 4)
5               5               4               (5, 4)

LEAD
----

Return nothing for first <offset> input tuples. Then return the currentinput tuple for the rest of the input tuples, and after the last inputtuple, return <offset> number of default values.


For example, LEAD(<col>,2,0):

Invocation      Input row       Value returned  Internal state (after)
1               1               <none>            (offsetbegin = 1, pad=2)
2               3               <none>            (offsetbegin = 0, pad=2)
3               4               4               (offsetbegin = 0, pad=2)
4               4               4               (offsetbegin = 0, pad=2)
5               5               5               (offsetbegin = 0, pad=2)
6               <none>            0               (offsetbegin = 0, pad=1)
7               <none>            0               (offsetbegin = 0, pad=0)

For functions like LEAD or CUME_DIST that don't immediately startreturning values, the executor will need a tuplestore to buffer inputrows until it gets their values from the window function.



The syntax for creating a ranking aggregate might look something like this:

CREATE RANKING AGGREGATE name (
    SFUNC = sfunc,
    STYPE = state_data_type,
    OUTFUNC = outfunc
)

This is very similar to Simon's proposal, and to current CREATEAGGREGATE, but tweaked a bit to fix the problems Hitoshi pointed out.

sfunc is called once for each input row. Unlike with normal aggregates,sfunc is passed the whole input row, so that e.g RANK can compare itagainst the previous row, or LEAD can buffer it.

outfunc is a set-returning function, and it is called until it returnsno more rows, after each sfunc invocation.


Window aggregates
-----------------

Like normal aggregates, window aggregates process N input values, andreturn one output value. In normal GROUP BY expressions, the aggregatefunction is called once each group, but in a window aggregate expressionwith a window frame (aka sliding window), there is as many "groups" asthere is rows.

In fact, any normal aggregate function can also be used as a windowaggregate and vice versa. The trivial implementation is to have a bufferto hold all values currently in the window frame, and for each inputrow, invoke the aggregate function over all the rows currently in theframe. I think we'll need support that, to allow using any built-in oruser-defined aggregate as a window aggregate.

However, we also want to provide optimized versions of the commonaggregates. Aggregates like COUNT, SUM, and AVG can easily beimplemented more efficiently, without rescanning all the tuples in thegroup.

I propose an extension to the current CREATE AGGREGATE syntax to allowfor optimized versions. Currently, the syntax is like:


CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
    SFUNC = sfunc,
    STYPE = state_data_type
    [ , FINALFUNC = ffunc ]
    [ , INITCOND = initial_condition ]
    [ , SORTOP = sort_operator ]
)

Let's add a function to allow removing tuples from the current window frame:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
    SFUNC = sfunc,
    STYPE = state_data_type
    [ , FINALFUNC = ffunc ]
    [ , INITCOND = initial_condition ]
    [ , SORTOP = sort_operator ]
    [ , SUBTRACTFUNC = subfunc,
)

subfunc is like sfunc, except that the input value is "subtracted" fromthe current value. On each input row, the executor calls sfunc for allnew tuples that enter the window frame, and subfunc for all tuples thatexit it. Another difference is that finalfunc can be called many timesduring the lifecycle of an aggregate invocation. For example, thesubfunc for COUNT would decrement the row count by one, and for SUM,subfunc would subtract the removed value from the current sum. Perhapswe should also support an "opportunistic subfunc", that could eitherperform the subtraction, or return a special value indicating that itcan't be done, and we need to calculate the aggregate from scratch as ifsubfunc wasn't defined. That would be useful for MIN/MAX, which can beimplemented by just keeping track of the current MIN/MAX value, and"subtracting" any other value than the current MIN/MAX is a no-op, but"subtracting" the current MIN/MAX value requires scanning all the restof the tuples from scratch.


PS. Have you looked at the "hypothetical set functions" in SQL2003?

PPS. I was just about to send this, when I noticed that LEAD/LAG are infact not in the SQL2003 draft I have at hand. In fact, none of thewindow functions defined there, RANK, DENSE_RANK, PERCENT_RANK,CUME_DIST and ROW_NUMBER, require random access to the tuples. Stillseems like a good idea to be have the flexibility for them.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Window functions patch v04 for the September commit fest

Reply via email to