Why are window functions considered agg function?

2024-01-10 Thread Will Noble
I have a question about the way Calcite handles window functions.

Here are my assumptions:

   - The purpose of aggregation is to merge rows of the input relation.
   Therefore, an Aggregate
   

   layer can only exist in a relational expression if there is a
corresponding GROUP
   BY clause in the corresponding SQL expression. If there is no explicit GROUP
   BY, then GROUP BY () is assumed implicitly, but logically there is a
   1-to-1 correspondence between agg layers and (possibly implicit) GROUP BY
   clauses.
   - An agg function
   

   is only meant to appear in an aggregate layer; i.e. It should never appear
   in a projection layer as a rex function would.
   - Calcite generally treats a window function as an agg function with
   requiresOver=true. That's the purpose of the requiresOver field, which
   has existed since at least as early as 2015.
   - Window functions cannot be used with GROUP BY clauses. Invoking them
   never causes the rows of the input relation to merge as is characteristic
   of an aggregation. Therefore, they are not agg functions. It would probably
   make more sense for us to think of them as rex functions that should appear
   in a projection, even though they cannot be computed row-wise
   independently. Perhaps a new type of relational operator is needed besides
   aggregate layers and projection layers; a hypothetical "windowing layer".

Which of my assumptions is wrong?

Consider this BigQuery example using SUM as an agg function. This query
would be invalid without the GROUP BY clause. It will return as many rows
as there are unique names, and total_score is per-name.

SELECT name, SUM(score) as total_score
GROUP BY name
FROM games

Now consider this BQ example with SUM as a window function. It would be
invalid to include any GROUP BY clause here. It will return as many rows as
there are in the input table, and total_score is global (it will have the
same value in every single row).

SELECT name, SUM(score) OVER () as total_score
FROM games

Does anybody actually use window functions in Calcite as they're currently
implemented? How can it possibly make sense to consider them as agg
functions, when they can never be used in the same context as a "true" agg
function (which requires grouping)? Seems to me like these are actually two
completely different functions with zero overlap in terms of where they can
appear in a relational expression; they just happen to share the name SUM
and involve similar math.

Thanks for any clarification / guidance.


Re: Why are window functions considered agg function?

2024-01-10 Thread Mihai Budiu
I am not 100% sure I understand your question, but we do implement window 
functions in our compiler using the Calcite IR.

In our optimizer we use a Calcite rule which rewrites RexOver expressions into 
LogicalWindow operations:
CoreRules.PROJECT_TO_LOGICAL_PROJECT_AND_WINDOW

I find that the meaning of LogicalWindow as an IR representation is quite 
clean. Each group in a window has a list of aggregate calls, which work just 
like aggregate calls in a standard group-by setting. Our code handles both 
kinds of aggregates in the same way.

Mihai


From: Will Noble 
Sent: Wednesday, January 10, 2024 4:13 PM
To: dev@calcite.apache.org 
Subject: Why are window functions considered agg function?

I have a question about the way Calcite handles window functions.

Here are my assumptions:

   - The purpose of aggregation is to merge rows of the input relation.
   Therefore, an Aggregate
   
<https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java>
   layer can only exist in a relational expression if there is a
corresponding GROUP
   BY clause in the corresponding SQL expression. If there is no explicit GROUP
   BY, then GROUP BY () is assumed implicitly, but logically there is a
   1-to-1 correspondence between agg layers and (possibly implicit) GROUP BY
   clauses.
   - An agg function
   
<https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql/SqlAggFunction.java>
   is only meant to appear in an aggregate layer; i.e. It should never appear
   in a projection layer as a rex function would.
   - Calcite generally treats a window function as an agg function with
   requiresOver=true. That's the purpose of the requiresOver field, which
   has existed since at least as early as 2015.
   - Window functions cannot be used with GROUP BY clauses. Invoking them
   never causes the rows of the input relation to merge as is characteristic
   of an aggregation. Therefore, they are not agg functions. It would probably
   make more sense for us to think of them as rex functions that should appear
   in a projection, even though they cannot be computed row-wise
   independently. Perhaps a new type of relational operator is needed besides
   aggregate layers and projection layers; a hypothetical "windowing layer".

Which of my assumptions is wrong?

Consider this BigQuery example using SUM as an agg function. This query
would be invalid without the GROUP BY clause. It will return as many rows
as there are unique names, and total_score is per-name.

SELECT name, SUM(score) as total_score
GROUP BY name
FROM games

Now consider this BQ example with SUM as a window function. It would be
invalid to include any GROUP BY clause here. It will return as many rows as
there are in the input table, and total_score is global (it will have the
same value in every single row).

SELECT name, SUM(score) OVER () as total_score
FROM games

Does anybody actually use window functions in Calcite as they're currently
implemented? How can it possibly make sense to consider them as agg
functions, when they can never be used in the same context as a "true" agg
function (which requires grouping)? Seems to me like these are actually two
completely different functions with zero overlap in terms of where they can
appear in a relational expression; they just happen to share the name SUM
and involve similar math.

Thanks for any clarification / guidance.


Re: Why are window functions considered agg function?

2024-01-10 Thread Will Noble
Ok, I perhaps should have looked a little deeper. Looks like
org.apache.calcite.rel.core.Window is the "hypothetical" windowing layer I
described in assumption #4.

I suppose it makes sense to call window functions "window aggregates" on a
conceptual level. It's just a little confusing that both AggregateCall
<https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/core/AggregateCall.java>
 and Window.RexWinAggCall
<https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/core/Window.java>
define
their functions using SqlAggFunction. It feels to me like that's conflating
two different concepts, but perhaps that's not very important.

Thanks

On Wed, Jan 10, 2024 at 5:16 PM Mihai Budiu  wrote:

> I am not 100% sure I understand your question, but we do implement window
> functions in our compiler using the Calcite IR.
>
> In our optimizer we use a Calcite rule which rewrites RexOver expressions
> into LogicalWindow operations:
> CoreRules.PROJECT_TO_LOGICAL_PROJECT_AND_WINDOW
>
> I find that the meaning of LogicalWindow as an IR representation is quite
> clean. Each group in a window has a list of aggregate calls, which work
> just like aggregate calls in a standard group-by setting. Our code handles
> both kinds of aggregates in the same way.
>
> Mihai
>
> 
> From: Will Noble 
> Sent: Wednesday, January 10, 2024 4:13 PM
> To: dev@calcite.apache.org 
> Subject: Why are window functions considered agg function?
>
> I have a question about the way Calcite handles window functions.
>
> Here are my assumptions:
>
>- The purpose of aggregation is to merge rows of the input relation.
>Therefore, an Aggregate
><
> https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java
> >
>layer can only exist in a relational expression if there is a
> corresponding GROUP
>BY clause in the corresponding SQL expression. If there is no explicit
> GROUP
>BY, then GROUP BY () is assumed implicitly, but logically there is a
>1-to-1 correspondence between agg layers and (possibly implicit) GROUP
> BY
>clauses.
>- An agg function
><
> https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql/SqlAggFunction.java
> >
>is only meant to appear in an aggregate layer; i.e. It should never
> appear
>in a projection layer as a rex function would.
>- Calcite generally treats a window function as an agg function with
>requiresOver=true. That's the purpose of the requiresOver field, which
>has existed since at least as early as 2015.
>- Window functions cannot be used with GROUP BY clauses. Invoking them
>never causes the rows of the input relation to merge as is
> characteristic
>of an aggregation. Therefore, they are not agg functions. It would
> probably
>make more sense for us to think of them as rex functions that should
> appear
>in a projection, even though they cannot be computed row-wise
>independently. Perhaps a new type of relational operator is needed
> besides
>aggregate layers and projection layers; a hypothetical "windowing
> layer".
>
> Which of my assumptions is wrong?
>
> Consider this BigQuery example using SUM as an agg function. This query
> would be invalid without the GROUP BY clause. It will return as many rows
> as there are unique names, and total_score is per-name.
>
> SELECT name, SUM(score) as total_score
> GROUP BY name
> FROM games
>
> Now consider this BQ example with SUM as a window function. It would be
> invalid to include any GROUP BY clause here. It will return as many rows as
> there are in the input table, and total_score is global (it will have the
> same value in every single row).
>
> SELECT name, SUM(score) OVER () as total_score
> FROM games
>
> Does anybody actually use window functions in Calcite as they're currently
> implemented? How can it possibly make sense to consider them as agg
> functions, when they can never be used in the same context as a "true" agg
> function (which requires grouping)? Seems to me like these are actually two
> completely different functions with zero overlap in terms of where they can
> appear in a relational expression; they just happen to share the name SUM
> and involve similar math.
>
> Thanks for any clarification / guidance.
>


Re: Why are window functions considered agg function?

2024-01-12 Thread Julian Hyde
Simple answer is that an aggregate function is one that takes a set of values. 
For example, if you apply the SUM function to values [1, 2, 3] it returns 6.

When I execute a GROUP BY statement, the values of a particular column within a 
particular group (e.g. the group of rows for which deptno = 10) forms a set. 
And therefore I can apply an aggregate function to it.

When I use an OVER clause, to ‘OVER (PARTITION BY deptno ORDER BY hiredate ROWS 
3 PRECEDING)’ - the preceding 3 hires in the same department in as this 
employees - that also creates a set of values that I can apply the SUM function 
to.

In the two cases, how the sets are formed is different, but the SUM function 
does the same thing.

> On Jan 10, 2024, at 5:30 PM, Will Noble  wrote:
> 
> Ok, I perhaps should have looked a little deeper. Looks like
> org.apache.calcite.rel.core.Window is the "hypothetical" windowing layer I
> described in assumption #4.
> 
> I suppose it makes sense to call window functions "window aggregates" on a
> conceptual level. It's just a little confusing that both AggregateCall
> <https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/core/AggregateCall.java>
> and Window.RexWinAggCall
> <https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/core/Window.java>
> define
> their functions using SqlAggFunction. It feels to me like that's conflating
> two different concepts, but perhaps that's not very important.
> 
> Thanks
> 
> On Wed, Jan 10, 2024 at 5:16 PM Mihai Budiu  wrote:
> 
>> I am not 100% sure I understand your question, but we do implement window
>> functions in our compiler using the Calcite IR.
>> 
>> In our optimizer we use a Calcite rule which rewrites RexOver expressions
>> into LogicalWindow operations:
>> CoreRules.PROJECT_TO_LOGICAL_PROJECT_AND_WINDOW
>> 
>> I find that the meaning of LogicalWindow as an IR representation is quite
>> clean. Each group in a window has a list of aggregate calls, which work
>> just like aggregate calls in a standard group-by setting. Our code handles
>> both kinds of aggregates in the same way.
>> 
>> Mihai
>> 
>> 
>> From: Will Noble 
>> Sent: Wednesday, January 10, 2024 4:13 PM
>> To: dev@calcite.apache.org 
>> Subject: Why are window functions considered agg function?
>> 
>> I have a question about the way Calcite handles window functions.
>> 
>> Here are my assumptions:
>> 
>>   - The purpose of aggregation is to merge rows of the input relation.
>>   Therefore, an Aggregate
>>   <
>> https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/core/Aggregate.java
>>> 
>>   layer can only exist in a relational expression if there is a
>> corresponding GROUP
>>   BY clause in the corresponding SQL expression. If there is no explicit
>> GROUP
>>   BY, then GROUP BY () is assumed implicitly, but logically there is a
>>   1-to-1 correspondence between agg layers and (possibly implicit) GROUP
>> BY
>>   clauses.
>>   - An agg function
>>   <
>> https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql/SqlAggFunction.java
>>> 
>>   is only meant to appear in an aggregate layer; i.e. It should never
>> appear
>>   in a projection layer as a rex function would.
>>   - Calcite generally treats a window function as an agg function with
>>   requiresOver=true. That's the purpose of the requiresOver field, which
>>   has existed since at least as early as 2015.
>>   - Window functions cannot be used with GROUP BY clauses. Invoking them
>>   never causes the rows of the input relation to merge as is
>> characteristic
>>   of an aggregation. Therefore, they are not agg functions. It would
>> probably
>>   make more sense for us to think of them as rex functions that should
>> appear
>>   in a projection, even though they cannot be computed row-wise
>>   independently. Perhaps a new type of relational operator is needed
>> besides
>>   aggregate layers and projection layers; a hypothetical "windowing
>> layer".
>> 
>> Which of my assumptions is wrong?
>> 
>> Consider this BigQuery example using SUM as an agg function. This query
>> would be invalid without the GROUP BY clause. It will return as many rows
>> as there are unique names, and total_score is per-name.
>> 
>> SELECT name, SUM(score) as total_score
>> GROUP BY name
>> FROM games
>> 
>> Now consider this BQ example with