Hitoshi Harada wrote:
2008/8/30 Hitoshi Harada <[EMAIL PROTECTED]>:
Here's the latest window functions patch against HEAD. It seems to be
ready for the September commit fest, as added documents, WINDOW clause
feature and misc tests. I guess this would be the window functions
feature freeze for 8.4. The remaining feature will be implemented for
the later release.

This patch consists of:
- windowed aggregates
- cooperate with GROUP BY aggregates
- some optimizations with multiple windows
- ranking functions
- WINDOW clause
- windowing SQL regression tests
- sgml documents

Looking up the total road map, the dropped features are:

- sliding window (window framing)
- lead(), lag(), etc. that reach for random rows
- user defined window functions

Ok, I'm starting to read up on SQL2003 window functions, and to review this patch. Here's a long brain-dump of my first thoughts on the design.

Let's list a couple of requirements first:

1. It's important that what gets committed now can be extended to handle all of the window function stuff in SQL2003 in the future, as well as user-defined-window-functions in the spirit of PostgreSQL extensibility. Even if we don't implement all of it in this release.

2. Performance needs to be reasonable, in the face a large number of input rows. These are typically used in OLAP queries that process gazillions of rows. Some window functions need access to all rows in the window frame or partition, so you can't reasonably expect great performance with those if the working set is large, but those that don't, should work with finite memory requirements, and reasonably fast.

Because of 1., let's focus on the interface for user-defined-window-functions first. Ideally, the interface is such that all the window functions and aggregates defined in SQL2003, and others we might want to have in core or as pgfoundry projects, can be implemented using it.

I think there's still some confusion in the design document about what a frame is. The terminology section is great, it helped me a lot to get started, so thank you for that. However, in the "Actual design in the patch" you describe how a window function works, and you talk about a window frame, but elsewhere in the doc you note that window frames are not implemented yet.

As already discussed, there's two different kind of window expressions: window aggregates, and ranking aggregates (you call them window functions in your doc). It seems that they are different enough that we can't bang them together. For example, you can't use a window frame clause with ranking aggregates, and ranking aggregates need to see the whole row, whereas window aggregates only see the expressions passed as argument.

API for window functions (aka. ranking aggregate functions)
------------------------

Borrowing ideas from the many approaches listed in the design doc, I propose an interface using a set-returning function. Similar to Simon's proposal, but taking into account that a ranking function might need to see some or all input tuples before returning anything.

For each input tuple, the window function can return any number of tuples, including 0. After processing all input values, though, the total number of output rows returned must match the total number of input values fed into it. The function must eventually output one value for each input value, in the same order.

This is a flexible design that allows for different kind of implementations; the function can return one output tuple for each input tuple (ROW_NUMBER(), RANK() etc.), or it can swallow all input tuples first, and only then return all output tuples (e.g. CUME_DIST).

Here's a couple of step-by-step examples of how some functions would work. Imagine input values 1, 3, 4, 4, 5:

RANK
----

Invocation      Input   Value returned  Internal state (after)
1               1       1               (last value 1, count 1, rank=1)
2               3       2               (last value 2, count 1, rank=2)
3               4       3               (last value 4, count 1, rank=3)
4               4       3               (last value 4, count 2, rank=3)
5               5       5               (last value 5, count 1, rank=5)

So RANK returns one tuple after each input tuple. Just like in your current patch, keep the last value returned, a row count, and the current rank as state.

LAG
---

Internal state is a FIFO queue of values. Each input value is pushed to the FIFO, and the tuple that falls out of the queue is returned.

For example, LAG(<col>,2,0):

Invocation      Input row       Value returned  Internal state (after)
1               1               0               (1)
2               3               0               (3, 1)
3               4               1               (4, 3)
4               4               3               (4, 4)
5               5               4               (5, 4)

LEAD
----

Return nothing for first <offset> input tuples. Then return the current input tuple for the rest of the input tuples, and after the last input tuple, return <offset> number of default values.

For example, LEAD(<col>,2,0):

Invocation      Input row       Value returned  Internal state (after)
1               1               <none>            (offsetbegin = 1, pad=2)
2               3               <none>            (offsetbegin = 0, pad=2)
3               4               4               (offsetbegin = 0, pad=2)
4               4               4               (offsetbegin = 0, pad=2)
5               5               5               (offsetbegin = 0, pad=2)
6               <none>            0               (offsetbegin = 0, pad=1)
7               <none>            0               (offsetbegin = 0, pad=0)


For functions like LEAD or CUME_DIST that don't immediately start returning values, the executor will need a tuplestore to buffer input rows until it gets their values from the window function.


The syntax for creating a ranking aggregate might look something like this:

CREATE RANKING AGGREGATE name (
    SFUNC = sfunc,
    STYPE = state_data_type,
    OUTFUNC = outfunc
)

This is very similar to Simon's proposal, and to current CREATE AGGREGATE, but tweaked a bit to fix the problems Hitoshi pointed out.

sfunc is called once for each input row. Unlike with normal aggregates, sfunc is passed the whole input row, so that e.g RANK can compare it against the previous row, or LEAD can buffer it.

outfunc is a set-returning function, and it is called until it returns no more rows, after each sfunc invocation.

Window aggregates
-----------------

Like normal aggregates, window aggregates process N input values, and return one output value. In normal GROUP BY expressions, the aggregate function is called once each group, but in a window aggregate expression with a window frame (aka sliding window), there is as many "groups" as there is rows.

In fact, any normal aggregate function can also be used as a window aggregate and vice versa. The trivial implementation is to have a buffer to hold all values currently in the window frame, and for each input row, invoke the aggregate function over all the rows currently in the frame. I think we'll need support that, to allow using any built-in or user-defined aggregate as a window aggregate.

However, we also want to provide optimized versions of the common aggregates. Aggregates like COUNT, SUM, and AVG can easily be implemented more efficiently, without rescanning all the tuples in the group.

I propose an extension to the current CREATE AGGREGATE syntax to allow for optimized versions. Currently, the syntax is like:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
    SFUNC = sfunc,
    STYPE = state_data_type
    [ , FINALFUNC = ffunc ]
    [ , INITCOND = initial_condition ]
    [ , SORTOP = sort_operator ]
)

Let's add a function to allow removing tuples from the current window frame:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
    SFUNC = sfunc,
    STYPE = state_data_type
    [ , FINALFUNC = ffunc ]
    [ , INITCOND = initial_condition ]
    [ , SORTOP = sort_operator ]
    [ , SUBTRACTFUNC = subfunc,
)

subfunc is like sfunc, except that the input value is "subtracted" from the current value. On each input row, the executor calls sfunc for all new tuples that enter the window frame, and subfunc for all tuples that exit it. Another difference is that finalfunc can be called many times during the lifecycle of an aggregate invocation. For example, the subfunc for COUNT would decrement the row count by one, and for SUM, subfunc would subtract the removed value from the current sum. Perhaps we should also support an "opportunistic subfunc", that could either perform the subtraction, or return a special value indicating that it can't be done, and we need to calculate the aggregate from scratch as if subfunc wasn't defined. That would be useful for MIN/MAX, which can be implemented by just keeping track of the current MIN/MAX value, and "subtracting" any other value than the current MIN/MAX is a no-op, but "subtracting" the current MIN/MAX value requires scanning all the rest of the tuples from scratch.

PS. Have you looked at the "hypothetical set functions" in SQL2003?

PPS. I was just about to send this, when I noticed that LEAD/LAG are in fact not in the SQL2003 draft I have at hand. In fact, none of the window functions defined there, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST and ROW_NUMBER, require random access to the tuples. Still seems like a good idea to be have the flexibility for them.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to