[jira] [Comment Edited] (CALCITE-4483) WITHIN DISTINCT clause for aggregate functions (experimental)

Julian Hyde (Jira) Thu, 11 Feb 2021 16:09:29 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283432#comment-17283432
 ]


Julian Hyde edited comment on CALCITE-4483 at 2/11/21, 11:58 PM:
-----------------------------------------------------------------

Based on [~amaliujia]'s feedback, I changed the behavior to throw if the 
operand value is not functionally dependent on the {{WITHIN DISTINCT}} key 
within the current group. (The non-throwing behavior is available via another 
variant of the planner rule.)

I implemented {{UNIQUE_VALUE}}, but it was not sufficient to solve the 
problems. When evaluated in grouping sets \(x) and \(x, y), we needed the 
result from the second grouping set (where the value was indeed unique) but 
were not interested in the result from the first grouping set (where the value 
is not unique). But it was still calculated in the first grouping set, and 
therefore threw an exception that aborted the query. CALCITE-4484 remains open 
(and has an implementation that is almost complete) but we will not use it to 
implement {{WITHIN DISTINCT}}.

So, the implementation is based on something like {{FILTER (WHERE MIN(v) IS NOT 
DISTINCT FROM MAX(v))}}, which works fine as long as the aggregate function 
does not want to receive NULL values (see further note below).


was (Author: julianhyde):
Based on [~amaliujia]'s feedback, I changed the behavior to throw if the 
operand value is not functionally dependent on the {{WITHIN DISTINCT}} key 
within the current group. (The non-throwing behavior is available via another 
variant of the planner rule.)

I implemented {{UNIQUE_VALUE}}, but it was not sufficient to solve the 
problems. When evaluated in grouping sets \(x) and \(x, y), we needed the 
result from the second grouping set (where the value was indeed unique) but 
were not interested in the result from the first grouping set (where the value 
is not unique). But it was still calculated in the first grouping set, and 
therefore threw an exception that aborted the query. CALCITE-4484 remains open 
(and has an implementation that is almost complete) but we will not use it to 
implement {{WITHIN DISTINCT}}.

So, the implementation is based on something like {{FILTER (WHERE MAX(v) IS NOT 
DISTINCT FROM MIN(x))}}, which works fine as long as the aggregate function 
does not want to receive NULL values (see further note below).

> WITHIN DISTINCT clause for aggregate functions (experimental)
> -------------------------------------------------------------
>
>                 Key: CALCITE-4483
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4483
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.27.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add a {{WITHIN DISTINCT}} clause to aggregate functions, allowing duplicate 
> rows to be eliminated before entering the function.
> This feature is non-standard, and in fact does not exist in any database I am 
> aware of. Therefore this feature is experimental, and may evolve or be 
> removed without notice.
> It is related to {{DISTINCT}}, and is in fact a generalization of it. 
> {{DISTINCT}} can always be rewritten in terms of {{WITHIN DISTINCT}}. For 
> example, {{SUM(DISTINCT sal)}} is equivalent to {{SUM(sal) WITHIN DISTINCT 
> (sal)}}.
> Consider the query
> {noformat}
> SELECT SUM(age),
>    SUM(DISTINCT age),
>   SUM(age) WITHIN DISTINCT (name)
> FROM Friends{noformat}
> where {{Friends}} has the rows
> {noformat}
> name   age job
> ====== === ==========
> Julian  16 Programmer
> Dick    15
> Anne    13 Car wash
> George  15 Lifeguard
> George  15 Dog walker
> Timmy    4
> {noformat}
> Note that there are two rows for George, because she has two jobs.
> The values of the columns are as follows:
>  * {{SUM(age)}} has the value (16 + 15 + 13 + 15 + 15 + 4) = 78;
>  * {{SUM(DISTINCT age)}} has the value (16 + 15 + 13 + 4) = 48;
>  * {{SUM(age) WITHIN DISTINCT (name)}} has the value (16 + 15 + 13 + 15 + 4) 
> = 63.
> {{WITHIN DISTINCT}} has treated the two 15 values for George as one value, 
> but has still counted the 15 for Dick separately.
> The {{WITHIN DISTINCT}} clause can be useful to prevent double-counting when 
> duplicates have been added via a many-to-one join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (CALCITE-4483) WITHIN DISTINCT clause for aggregate functions (experimental)

Reply via email to