[ https://issues.apache.org/jira/browse/DRILL-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365749#comment-17365749 ]

ASF GitHub Bot commented on DRILL-7931:
---------------------------------------

paul-rogers edited a comment on pull request #2259:
URL: https://github.com/apache/drill/pull/2259#issuecomment-864223771


   Thanks for finding and fixing this bug. It is a very difficult one. This 
area of code is complex. It *might* be that the current fix is correct, but I 
do suspect that there is a deeper bug than this fix would suggest, so we should 
dig in a bit more.
   
   The description suggests that the code is attempting to combine the sum operation for `stddev()` with the sum operation for `sum()`. I believe that doing so is a bug unless there was some specific attempt to share common subexpressions as an optimization. Normally the planner would perform that kind of optimization; the trick, for Drill, is that the planner does not know the data type, so the runtime must handle such details. I rather doubt that the original developers got as far as code-generation optimizations such as sharing subexpressions.
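   
   For reference, the reduce-aggregates rewrite conceptually expands `stddev()` into plain sums and a count. Roughly like the sketch below (the exact expression, casts, and rounding that `DrillReduceAggregatesRule` emits may well differ):
   
   ```sql
   -- Sketch only: stddev(col2) expressed via sum, sum-of-squares, and count,
   -- using the data from the bug report. The cast is illustrative.
   SELECT col1,
          sqrt( (sum(col2 * col2) - CAST(sum(col2) AS DOUBLE) * sum(col2) / count(col2))
                / (count(col2) - 1) )  AS stddev_col2,
          sum(col2)                    AS sum_col2
   FROM (VALUES ('UA', 3), ('USA', 2), ('UA', 3),
                ('USA', 5), ('USA', 1), ('UA', 9)) x(col1, col2)
   GROUP BY col1;
   ```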
   
   As a result, it is a bug if the `sum()` operation for one expression is combined with that for another column. We can check what happens with:
   
   ```sql
   SELECT sum(c) AS a, sum(c) AS b FROM ...
   ```
   
   Even though the same expression appears twice, and a "normal" planner might combine them, Drill probably will not. The runtime should compute two sums in this case, not somehow try to combine them into one. The same is true for your case:
   
   ```sql
   SELECT sum(c) AS a, stddev(c) AS b FROM ...
   ```
   
   Then there is the issue of the nullable `bigint` data type. According to the 
[SQL Server 
docs](https://docs.microsoft.com/en-us/sql/t-sql/functions/stdev-transact-sql?view=sql-server-ver15),
 `stddev` will ignore `NULL` values and will return `NULL` only if no 
non-`NULL` rows are present. Probably other engines work the same way. The 
[Drill docs](http://drill.apache.org/docs/aggregate-and-aggregate-statistical/) 
do not say how `NULL`s are handled, but the original developers mostly followed 
[Postgres](https://www.postgresql.org/docs/9.1/functions-aggregate.html), among 
others. Postgres also uses the "ignore NULL" behavior.
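   
   To make that behavior concrete (the results shown are the standard semantics described above; I have not re-verified them against Drill):
   
   ```sql
   -- NULL inputs are ignored by the aggregate:
   SELECT sum(c) FROM (VALUES (1), (CAST(NULL AS INT)), (2)) t(c);
   -- expected: 3
   
   -- Only when there are no non-NULL inputs at all does the aggregate return NULL:
   SELECT sum(c), stddev(c) FROM (VALUES (CAST(NULL AS INT))) t(c);
   -- expected: NULL, NULL
   ```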
   
   Given this, there should *never* be a nullable anything in the accumulators, even if the input type *is* nullable. Instead, to implement the "`NULL` if no values" behavior, the "finalize" step (the one that computes the final output value) should return a nullable value if the count is zero. Note that Postgres returns `NULL` for `sum()` when there are no rows, instead of 0. That is odd; I'm not sure whether that is what Drill does. (All these details should be in the Drill docs, but are not. If we figure out how this works in Drill, we should file a Doc. JIRA ticket to get the information added.)
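   
   In SQL terms, the finalize idea looks something like the sketch below: the running total itself is never nullable, and `NULL` is produced only when no non-`NULL` rows were seen. Here `coalesce(sum(c), 0)` merely stands in for a dedicated non-nullable internal accumulator (I believe Calcite has an internal `$SUM0` aggregate of exactly this flavor); the real generated code would of course not go through SQL.
   
   ```sql
   -- Sketch of "NULL only at finalize": a non-nullable running total plus a
   -- count of non-NULL inputs; nullability is introduced by the CASE at the end.
   SELECT CASE WHEN count(c) = 0 THEN CAST(NULL AS BIGINT)
               ELSE coalesce(sum(c), 0)
          END AS sum_with_null_finalize
   FROM (VALUES (CAST(NULL AS INT))) t(c);
   -- all inputs are NULL, so the expected result is NULL
   ```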
   
   Next, consider how the accumulators are created. Remember that Drill does 
not know the data type until the first row arrives. At that point, Drill has a 
data type and can set up the needed accumulators and generate the code for the 
operations. My understanding is that the accumulators and code will remain in 
place for the rest of the query -- unless the operation receives an 
`OK_NEW_SCHEMA` (schema change) from the upstream operator. At that point, 
Drill is supposed to handle the change. (Say a `sum(c)` first saw `c` as `INT`, but later sees it as `DOUBLE`.) I doubt that this actually works: we'd have to
somehow compute the common type of the old and new types, convert the 
accumulators, etc. Maybe it works, but I'd not bet on it. (Most schema change 
events are not handled in most operators -- they are an intractable problem for 
which there is no good *a priori* solution. It is an ongoing embarrassment that 
Drill promises to do things which it cannot do, even in principle.)
   
   Let's look at the error message to see what that tells us. Query:
   
   ```sql
   select col1, stddev(col2), SUM(col2)
   FROM (values ('UA', 3), ('USA', 2), ('UA', 3), ('USA', 5), ('USA', 1), ('UA', 9)) x(col1, col2)
   GROUP BY col1;
   ```
   
   Since the data is from a `VALUES` clause, the schema does not change, so we can rule that out as a possible cause. Instead, the problem is probably a bug within Drill itself.
   
   ```text
   rowtype of new rel:
   RecordType(CHAR(3) NOT NULL col1, BIGINT $f1, BIGINT $f2, BIGINT NOT NULL $f3, BIGINT $f4) NOT NULL
   rowtype of set:
   RecordType(CHAR(3) NOT NULL col1, BIGINT $f1, BIGINT $f2, BIGINT NOT NULL $f3, BIGINT NOT NULL $f4) NOT NULL
   ```
   
   We see that `$f1` is not a very descriptive name; it probably means "function 1". The pull request description suggests that:
   
   * `$f1` is `STDDEV.sum(col2)`
   * `$f2` is `STDDEV.sum(col2*col2)`
   * `$f3` is `STDDEV.count(col2)`
   * `$f4` is `SUM(col2)`
   
   Here `STDDEV.sum` is something I just made up to mean "the `sum()` function that implements `STDDEV()`"; a sketch of the full expansion follows the list below. We can immediately notice some issues:
   
   * *None* of the functions should be nullable; all should be `NOT NULL` because of the way SQL handles `NULL`s: they are ignored, so we only accumulate non-null values. Nullability is added only to the final output column.
   * So the "new rel" is less correct than the "rowtype of set", even though both are wrong.
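   
   Putting the pieces together, the expansion presumably looks something like the following (again only a sketch; the actual plan layout may differ). Written this way, it is obvious that `$f1` and `$f4` are the same `sum(col2)` and should get the same type and nullability wherever they appear:
   
   ```sql
   -- Hypothetical expansion of the failing query after the reduce-aggregates rewrite:
   --   f1 = sum(col2)          (for stddev)
   --   f2 = sum(col2 * col2)   (for stddev)
   --   f3 = count(col2)        (for stddev)
   --   f4 = sum(col2)          (the user's SUM(col2))
   SELECT col1,
          sum(col2)        AS f1,
          sum(col2 * col2) AS f2,
          count(col2)      AS f3,
          sum(col2)        AS f4
   FROM (VALUES ('UA', 3), ('USA', 2), ('UA', 3),
                ('USA', 5), ('USA', 1), ('UA', 9)) x(col1, col2)
   GROUP BY col1;
   ```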
   
   So, where does this leave us? It suggests that we should check two things:
   
   * Why did code generation decide to create a nullable accumulator? According 
to the rules of SQL, discussed above, the accumulators should all be 
non-nullable.
   * How does the code find the accumulator? Each should have a unique name derived, somehow, from the input expression. The `$f1` symbols are internal names; they clearly do not identify the function in any detail. If we use them as a key, we have to ensure that we perform the same mapping from expression to `$fn` every time. Otherwise, the names should be more distinctive, such as (making something up) `$STDDEV$SUM$C`. The question is: are the names getting confused, or is the problem elsewhere?
   
   Again, thanks for looking into this; this is a very complex part of the 
code, so we have to work carefully.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Rowtype mismatch in DrillReduceAggregatesRule
> ---------------------------------------------
>
>                 Key: DRILL-7931
>                 URL: https://issues.apache.org/jira/browse/DRILL-7931
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.18.0
>            Reporter: wtf
>            Assignee: wtf
>            Priority: Major
>
> It works correctly, for example, when there is only one aggregation:
> select col1, stddev(col2) FROM(values ('UA', 3), ('USA', 2), ('UA', 3), 
> ('USA', 5), ('USA', 1), ('UA', 9)) x(col1, col2) GROUP BY col1 LIMIT 6;
> Or when it comes after other aggregations:
> select col1, SUM(col2), stddev(col2) FROM (values ('UA', 3), ('USA', 2), 
> ('UA', 3), ('USA', 5), ('USA', 1), ('UA', 9)) x(col1, col2) GROUP BY col1 
> LIMIT 6;
> But when we try to put it before an aggregation, like SUM:
> select col1, stddev(col2), SUM(col2) FROM (values ('UA', 3), ('USA', 2), 
> ('UA', 3), ('USA', 5), ('USA', 1), ('UA', 9)) x(col1, col2) GROUP BY col1 
> LIMIT 6;
> It fails with the error:
> SYSTEM ERROR: AssertionError: Type mismatch:
> rowtype of new rel:
> RecordType(CHAR(3) NOT NULL col1, BIGINT $f1, BIGINT $f2, BIGINT NOT NULL $f3, BIGINT $f4) NOT NULL
> rowtype of set:
> RecordType(CHAR(3) NOT NULL col1, BIGINT $f1, BIGINT $f2, BIGINT NOT NULL $f3, BIGINT NOT NULL $f4) NOT NULL
> The same happens for stddev_samp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
