[ 
https://issues.apache.org/jira/browse/HIVE-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846553#comment-13846553
 ] 

Eric Hanson commented on HIVE-5996:
-----------------------------------

Xuefu,

I'm all for new, useful functionality and better performance for Hive. And I'm 
all for getting correct results. I appreciate your contributions and your 
passion.

But I strongly believe changing behavior from one reasonable alternative to 
another in a way that breaks backward compatibility is not the way to go. I 
have a lot of experience with evolving a database (SQL Server) over a decade, 
and have talked to many people who've been evolving the product longer than 
that. From this experience, I can say that breaking backward compatibility (for 
either functionality or performance, but especially functionality), even in 
subtle ways, can anger customers/users. 

Changes to semantics like this should first of all be avoided, and if they 
can't be avoided, they need to be rolled out carefully, with a switch to enable 
backward compatibility. SQL Server has compatibility levels and "SET options" 
as switches, and a defined deprecation schedule. This is process-heavy 
in engineering effort and also causes an explosion of the test matrix, so I 
am not necessarily recommending that Hive go there, though maybe we need to 
have that discussion. I think we're better off being strict about not breaking 
backward compatibility unless it's really needed.

So, I ask that you please close this JIRA without making a patch.

There are a couple of other areas where there is an issue of ANSI SQL 
compatibility (the result type of int/int and avg(int)). We could have a further 
discussion on those, though you know my preference would be to leave their 
semantics as-is, since I think backward compatibility trumps ANSI SQL 
compatibility in those cases. If there is no issue of ANSI compatibility, and the 
current Hive behavior is reasonable, I'd like us to leave things as they are. I 
don't think there is a need to be across-the-board compatible with another 
system (MySQL or anything else).

Best regards,
Eric

P.S. As for your specific argument that you can overflow a bigint sum: while 
technically accurate, I don't think it is a significant user issue. I've never 
heard a complaint about it with SQL Server, or PDW, our scale-out data 
warehouse appliance. Really big numbers, like the national debt in pennies, fit 
in a bigint, just to put that in perspective. Users can cast the input to 
decimal or double if they need more magnitude.
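For reference, the wrap-around the report describes is ordinary two's-complement 64-bit arithmetic, and the cast-to-decimal workaround mentioned above widens the sum past that limit. A minimal Java sketch (Hive's implementation language; the class name is illustrative, not from Hive's source):

```java
import java.math.BigDecimal;

// Illustrative sketch of the 64-bit wrap-around reported in HIVE-5996.
public class BigintOverflowDemo {
    public static void main(String[] args) {
        long a = 6666666666666666666L;
        long b = 5555555555555555555L;

        // Plain long addition silently wraps past Long.MAX_VALUE
        // (9223372036854775807), producing the negative value from the report.
        System.out.println(a + b); // prints -6224521851487329395

        // Widening to BigDecimal -- conceptually what casting the column to
        // decimal does -- yields the exact sum.
        BigDecimal exact = BigDecimal.valueOf(a).add(BigDecimal.valueOf(b));
        System.out.println(exact); // prints 12222222222222222221

        // Math.addExact(a, b) would instead fail fast, throwing
        // ArithmeticException on overflow rather than wrapping.
    }
}
```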

> Query for sum of a long column of a table with only two rows produces wrong 
> result
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-5996
>                 URL: https://issues.apache.org/jira/browse/HIVE-5996
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5996.patch
>
>
> {code}
> hive> desc test2;
> OK
> l                     bigint                  None                
> hive> select * from test2;                                 
> OK
> 6666666666666666666
> 5555555555555555555
> hive> select sum(l) from test2;
> OK
> -6224521851487329395
> {code}
> It's believed that a wrap-around error occurred. It's surprising that it 
> happens only with two rows. Same query in MySQL returns:
> {code}
> mysql> select sum(l) from test;
> +----------------------+
> | sum(l)               |
> +----------------------+
> | 12222222222222222221 |
> +----------------------+
> 1 row in set (0.00 sec)
> {code}
> Hive should accommodate a large number of rows. Overflowing with only two 
> rows makes the function very hard to use.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
