[
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841516#comment-13841516
]
Xuefu Zhang commented on HIVE-5878:
-----------------------------------
[~ehans] Thank you for your concern. However, I respectfully disagree that the
behavior WAS and IS reasonable for several reasons. First, AVG was probably
introduced before decimal, so there was no better choice than double. Hive has
the concept of exact types (int, long, decimal, etc.) vs approximate types
(double, float, etc), and Arithmetic operations (plus, divide, etc) on exact
types generates exact type for accuracy. If average is defined mathematically
as sum/count, then sum(int)/count should result an exact type. Otherwise, avg()
and sum()/count give different result. Another inconsistency exists when
avg(decimal) results a decimal. All those cause inconsistency in Hive's
mathematical concept and function behavior, and can create confusions among
users as well.
I understand vectorized current implementation chooses double for sum and uses
sum/count to get another double for average. While this extends the scope of
the changes, to me, however, vectorization is just implementation, which should
not dictate high-level concept and consistency.
> Hive standard avg UDAF returns double as the return type for some exact input
> types
> -----------------------------------------------------------------------------------
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
> Issue Type: Bug
> Components: Types, UDF
> Affects Versions: 0.12.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;
> ...
> Reduce Operator Tree:
> Group By Operator
> aggregations:
> expr: avg(VALUE._col0)
> bucketGroup: false
> mode: mergepartial
> outputColumnNames: _col0
> Select Operator
> expressions:
> expr: _col0
> type: double
> {code}
> However, exact types including integers and decimal should yield exact type.
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +-------+--------------+------+-----+---------+-------+
> | Field | Type | Null | Key | Default | Extra |
> +-------+--------------+------+-----+---------+-------+
> | i | int(11) | YES | | NULL | |
> | b | tinyint(1) | YES | | NULL | |
> | d | double | YES | | NULL | |
> | s | varchar(5) | YES | | NULL | |
> | dd | decimal(5,2) | YES | | NULL | |
> +-------+--------------+------+-----+---------+-------+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +-------+---------------+------+-----+---------+-------+
> | Field | Type | Null | Key | Default | Extra |
> +-------+---------------+------+-----+---------+-------+
> | avg(i) | decimal(14,4) | YES | | NULL | |
> +-------+---------------+------+-----+---------+-------+
> 1 row in set (0.00 sec)
> {code}
--
This message was sent by Atlassian JIRA
(v6.1#6144)