[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-11-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830776#comment-13830776
 ] 

Hive QA commented on HIVE-5878:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12615467/HIVE-5878.patch

{color:red}ERROR:{color} -1 due to 16 failed/errored test(s), 4684 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_genericudaf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_precision
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_udf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_in
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_in_having
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_notin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udf3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_0
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_not
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_pushdown
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_short_regress
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_mapjoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_shufflejoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_windowing
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_windowing_udaf
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/419/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/419/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests failed with: TestsFailedException: 16 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12615467

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | sum_t | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841158#comment-13841158
 ] 

Hive QA commented on HIVE-5878:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12617339/HIVE-5878.1.patch

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 4458 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_0
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_not
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_pushdown
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_short_regress
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_mapjoin
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/549/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/549/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12617339

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-06 Thread Eric Hanson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841442#comment-13841442
 ] 

Eric Hanson commented on HIVE-5878:
---

I'm not comfortable with this change. If the data type of expression and 
aggregate results in Hive was reasonable before, I think it would be best to 
leave the type the same as it was, for backward compatibility, so we don't 
break people's applications. 

Also, the vectorized implementation of aggregates was built to return the same 
data types as the row-at-a-time implementation, for compatibility. It is 
important that any changes in semantics or types be implemented in both the 
row-at-a-time, and vectorized execution paths.

Different database systems make different choices about expression and 
aggregate result types. For example, in SQL Server, avg applied to an int is an 
int:

{code}
create table test(i int);
select avg(i) avg_i into res2 from test;
{code}

gives table res2 with this schema:

{code}
CREATE TABLE [dbo].[res2](
[avg_i] [int] NULL
) 
{code}



> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-06 Thread Eric Hanson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841449#comment-13841449
 ] 

Eric Hanson commented on HIVE-5878:
---

To be clear, I think the existing Hive behavior, where the result of 
avg() is a double, is reasonable behavior and should not be 
changed.

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-06 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841516#comment-13841516
 ] 

Xuefu Zhang commented on HIVE-5878:
---

[~ehans] Thank you for your concern. However, I respectfully disagree that the 
behavior WAS and IS reasonable for several reasons. First, AVG was probably 
introduced before decimal, so there was no better choice than double. Hive has 
the concept of exact types (int, long, decimal, etc.) vs approximate types 
(double, float, etc), and Arithmetic operations (plus, divide, etc) on exact 
types generates exact type for accuracy. If average is defined mathematically 
as sum/count, then sum(int)/count should result an exact type. Otherwise, avg() 
and sum()/count give different result. Another inconsistency exists when 
avg(decimal) results a decimal. All those cause inconsistency in Hive's 
mathematical concept and function behavior, and can create confusions among 
users as well.

I understand vectorized current implementation chooses double for sum and uses 
sum/count to get another double for average. While this extends the scope of 
the changes, to me, however, vectorization is just implementation, which should 
not dictate high-level concept and consistency.

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-06 Thread Eric Hanson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841534#comment-13841534
 ] 

Eric Hanson commented on HIVE-5878:
---

To me, the issue is really about the importance of obeying the SQL standard 
(which would argue for using exact types for results in the situations the 
standard describes) vs. maintaining backward compatibility.

I don't think that accuracy is the issue. E.g. avg(int) yields int in SQL 
Server which is less accurate than yielding double, but it does meet the SQL 
standard's requirements of being an exact type. I would not favor making Hive 
yield int for avg(int) to meet the SQL standard because that loses information 
compared to the previous Hive behavior (yielding double) and that would be 
perceived even more as a breaking change in existing applications than 
producing decimal.

I agree that vectorization is an implementation detail. Vectorization can be 
extended to handle decimal.

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-06 Thread Eric Hanson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841961#comment-13841961
 ] 

Eric Hanson commented on HIVE-5878:
---

Does anybody else want to voice an opinion here?

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843348#comment-13843348
 ] 

Xuefu Zhang commented on HIVE-5878:
---

Backward compatibility issue is a valid concern. However, before we decide to 
enforce or break that, we need to ask ourselves whether the new behavior is the 
right thing to do and how bad the old behavior is. We had questionings about 
why we need to keep backward compatible to a bug.

As an example, Hive returned double for int/int previously, but now returns 
decimal. This is a bigger backward compatibility problem, but obviously it's a 
right thing to do. The issue in the discussion seems following the same line.

Also, I think hive is still in its early days, and it doesn't even have 1.0 
release yet. Thus, personally, BC issue is less a concern than doing right 
thing.

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-09 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843441#comment-13843441
 ] 

Jason Dere commented on HIVE-5878:
--

Just adding a couple of bits of info to this discussion.  There are some 
significant effects for users if we decide to go with the change to returning 
decimal type (both for avg and for int / int).

1) As I mentioned in HIVE-5356, there is a significant performance hit to 
switching to decimal arithmetic.  I suppose for avg() there could be some ways 
to improve this for int type, such as keeping the sum/count portions as long, 
and only doing decimal math when computing sum / count.

2) Decimal arithmetic can return NULL for some non-null inputs, if Hive is not 
able to represent the entire result exactly.  While most common cases (such as 
the example in HIVE-5022) should not longer occur, in the general case it is 
still possible to get a NULL result if the expression contains enough 
operations.  I know, there has been some discussion on this already, and some 
of the reasoning behind why those semantics were ok for decimal arithmetic was 
that if the user had made the choice to use decimal columns/types, then they 
knew what they were getting into with the decimal arithmetic.  But in this case 
users are going to start seeing the effects of these decimal semantics all over 
the place because they will start getting applied on expressions involving 
integer columns.  I don't think that users will find this NULL behavior 
desirable with their integer expressions, so I think something will need to be 
done one way or another here - either don't return decimal type in these 
situations, or possibly address this null behavior in some way.

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843882#comment-13843882
 ] 

Xuefu Zhang commented on HIVE-5878:
---

For information, the follow is the text from SQL-92[1] standard w.r.t AVG 
function:

{quote}
c) If AVG is specified and DT is exact numeric, then the data
  type of the result is exact numeric with implementation-
  defined precision not less than the precision of DT and
  implementation-defined scale not less than the scale of DT.
{quote}

Clearly, currently Hive deviates from this. Both MySQL and SQL server are in 
line with this.

[1] http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

2013-12-13 Thread Eric Hanson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13847818#comment-13847818
 ] 

Eric Hanson commented on HIVE-5878:
---

HIVE-5996 has a pretty useful, in-depth, related discussion about ANSI 
compatibility vs. backward compatibility etc. that is useful context.

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> ---
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
>  Issue Type: Bug
>  Components: Types, UDF
>Affects Versions: 0.12.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>   Reduce Operator Tree:
> Group By Operator
>   aggregations:
> expr: avg(VALUE._col0)
>   bucketGroup: false
>   mode: mergepartial
>   outputColumnNames: _col0
>   Select Operator
> expressions:
>   expr: _col0
>   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +---+--+--+-+-+---+
> | Field | Type | Null | Key | Default | Extra |
> +---+--+--+-+-+---+
> | i | int(11)  | YES  | | NULL|   |
> | b | tinyint(1)   | YES  | | NULL|   |
> | d | double   | YES  | | NULL|   |
> | s | varchar(5)   | YES  | | NULL|   |
> | dd| decimal(5,2) | YES  | | NULL|   |
> +---+--+--+-+-+---+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +---+---+--+-+-+---+
> | Field | Type  | Null | Key | Default | Extra |
> +---+---+--+-+-+---+
> | avg(i) | decimal(14,4) | YES  | | NULL|   |
> +---+---+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)