okumin commented on PR #5091: URL: https://github.com/apache/hive/pull/5091#issuecomment-2095457650
# Digest of our discussions Let me clarify the requirements of HIVE-28082. - Any SQLs have to return the same results regardless of HiveAggregateReduceFunctionsRule - $SUM0/AVG/STDDEV_(POP|SAMP)/VAR_(POP/SAMP) have to accept non numeric types We would like to satisfy the following conditions. We may need to compromise some of them if they conflict with each other. - (A) We would like to justify the CBO path if it is working differently from the non-CBO mode - (B) `avg(x)` should be equal to `sum(x) / count(x)` - (C) `sum(c_string)` should be equal to `sum(cast(c_string as double))` # Strange behaviors we've observed As we see in the above table, we observe inconsistent results in the following cases. | | CBO | Non-CBO | Note | |-|-|-|-| | AVG(c_non_numeric) | 0.0 | NULL | Inconsistent | | AVG(c_mix) | 24.125 | 32.167 | Inconsistent | | STDDEV_POP(c_mix) | 48.401 | 53.524 | Inconsistent | | STDDEV_SAMP(c_mix) | 51.742 | 58.632 | Inconsistent | | VAR_POP(c_mix) | 2342.609 | 2864.806 | Inconsistent | | VAR_SAMP(c_mix) | 2677.268 | 3437.767 | Inconsistent | # My new findings Checking the 6 test cases, The root cause is likely located in HiveAggregateReduceFunctionsRule. All UDAF should skip `null` values, including `SUM` or `COUNT`, they actually behave like that. The problem with non-numeric values happens because we don't take care of the arg type of `COUNT`. `SUM(non_numeric_text)` mostly[1] behaves in the same way as `SUM(CAST(non_numeric_text AS DOUBLE))`, but `COUNT(non_numeric_text)` never behaves so since `non_numeric_text` is totally valid for `COUNT`. In the case of `AVG`, the numerator, `SUM`, doesn't count illegal numbers but the denominator side, `COUNT` does[2]. I'm currently trying to explicitly cast the arg of `COUNT` so that `SUM` and `COUNT` share the same null-check semantics. It doesn't satisfy (A) but I believe it is reasonable because the current `AVG(str)` doesn't meet the semantics of average. @kasakrisz @amansinha100 Please feel free to give your opinions to me if we have. - [1] I believe `SUM` should return `NULL` if all rows are evaluated as `NULL`, but it returns zero - [2] We can simply reproduce the problem with the following queries ``` > WITH a AS (SELECT 1.0 AS i UNION ALL SELECT NULL AS i UNION ALL SELECT 2.0 AS i) SELECT AVG(i), SUM(i), COUNT(i) FROM a; +---------+------+------+ | _c0 | _c1 | _c2 | +---------+------+------+ | 1.5000 | 3 | 2 | +---------+------+------+ > WITH a AS (SELECT '1' AS i UNION ALL SELECT 'null' AS i UNION ALL SELECT '2' AS i) SELECT AVG(i), SUM(i), COUNT(i) FROM a; +------+------+------+ | _c0 | _c1 | _c2 | +------+------+------+ | 1.0 | 3.0 | 3 | +------+------+------+ > WITH a AS (SELECT '1' AS i UNION ALL SELECT 'null' AS i UNION ALL SELECT '2' AS i) SELECT AVG(CAST(i AS DOUBLE)), SUM(CAST(i AS DOUBLE)), COUNT(CAST(i AS DOUBLE)) FROM a; +------+------+------+ | _c0 | _c1 | _c2 | +------+------+------+ | 1.5 | 3.0 | 2 | +------+------+------+ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
