[ https://issues.apache.org/jira/browse/CALCITE-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julian Hyde resolved CALCITE-4483. ---------------------------------- Fix Version/s: 1.27.0 Resolution: Fixed Fixed in [f1da6550|https://github.com/apache/calcite/commit/f1da65504e598928cf77aa6a7244552692ae2529]. As noted above, this feature is experimental and subject to change or removal without notice. Further work is required to complete the feature: * support FILTER clause in combination with WITHIN DISTINCT; * support WITHIN DISTINCT clause in GROUPING SETS queries; * duplicate detection fails for aggregate functions that respect nulls. Regarding the last point, aggregate functions that nulls. Luckily most aggregate functions ignore nulls by default, but exceptions include {{ARRAY_AGG}}. If the input to {{ARRAY_AGG}} for a given value of the {{WITHIN DISTINCT}} key is [null, 1, null], {{MIN}} and {{MAX}} will both be 1, so we cannot determine that all of the values were not the same. > WITHIN DISTINCT clause for aggregate functions (experimental) > ------------------------------------------------------------- > > Key: CALCITE-4483 > URL: https://issues.apache.org/jira/browse/CALCITE-4483 > Project: Calcite > Issue Type: Bug > Reporter: Julian Hyde > Assignee: Julian Hyde > Priority: Major > Labels: pull-request-available > Fix For: 1.27.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Add a {{WITHIN DISTINCT}} clause to aggregate functions, allowing duplicate > rows to be eliminated before entering the function. > This feature is non-standard, and in fact does not exist in any database I am > aware of. Therefore this feature is experimental, and may evolve or be > removed without notice. > It is related to {{DISTINCT}}, and is in fact a generalization of it. > {{DISTINCT}} can always be rewritten in terms of {{WITHIN DISTINCT}}. For > example, {{SUM(DISTINCT sal)}} is equivalent to {{SUM(sal) WITHIN DISTINCT > (sal)}}. > Consider the query > {noformat} > SELECT SUM(age), > SUM(DISTINCT age), > SUM(age) WITHIN DISTINCT (name) > FROM Friends{noformat} > where {{Friends}} has the rows > {noformat} > name age job > ====== === ========== > Julian 16 Programmer > Dick 15 > Anne 13 Car wash > George 15 Lifeguard > George 15 Dog walker > Timmy 4 > {noformat} > Note that there are two rows for George, because she has two jobs. > The values of the columns are as follows: > * {{SUM(age)}} has the value (16 + 15 + 13 + 15 + 15 + 4) = 78; > * {{SUM(DISTINCT age)}} has the value (16 + 15 + 13 + 4) = 48; > * {{SUM(age) WITHIN DISTINCT (name)}} has the value (16 + 15 + 13 + 15 + 4) > = 63. > {{WITHIN DISTINCT}} has treated the two 15 values for George as one value, > but has still counted the 15 for Dick separately. > The {{WITHIN DISTINCT}} clause can be useful to prevent double-counting when > duplicates have been added via a many-to-one join. -- This message was sent by Atlassian Jira (v8.3.4#803005)