[ https://issues.apache.org/jira/browse/HIVE-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846553#comment-13846553 ]
Eric Hanson commented on HIVE-5996:
-----------------------------------

Xuefu, I'm all for new, useful functionality and better performance for Hive. And I'm all for getting correct results. I appreciate your contributions and your passion. But I strongly believe changing behavior from one reasonable alternative to another in a way that breaks backward compatibility is not the way to go.

I have a lot of experience with evolving a database (SQL Server) over a decade, and have talked to many people who've been evolving the product longer than that. From this experience, I can say that breaking backward compatibility (for either functionality or performance, but especially functionality) even in subtle ways can anger customers/users. Any changes to semantics like this should first of all be avoided, and if they can't be avoided, they need to be rolled out carefully, with a switch to enable backward compatibility. SQL Server has compatibility levels and "SET options" as switches, and a defined deprecation schedule. This is process-heavy in the engineering effort, and it also causes an explosion of the test matrix. So I am not necessarily recommending that Hive go there, though maybe we need to have that discussion. I think we're better off being strict about not breaking backward compatibility unless it's really needed.

So, I ask that you please close this JIRA without making a patch. There are a couple of other areas where there is an issue of ANSI SQL compatibility (the result type of int/int and avg(int)). We could have a further discussion on those, though you know my preference would be to leave the semantics as-is on those, since I think backward compatibility trumps ANSI SQL compatibility in those cases. If there is no issue of ANSI compatibility, and the current Hive behavior is reasonable, I'd like us to leave things as they are. I don't think there is a need to be across-the-board compatible with another system (MySQL or anything else).

Best regards,
Eric

P.S.
Your specific argument that you can overflow a bigint sum, while technically accurate, is not, I think, a significant user issue. I've never heard a complaint about it with SQL Server, or with PDW, our scale-out data warehouse appliance. Really big numbers, like the national debt in pennies, fit in a bigint, just to put it in perspective. Users can cast the input to decimal or double if they need more magnitude.

> Query for sum of a long column of a table with only two rows produces wrong
> result
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-5996
>                 URL: https://issues.apache.org/jira/browse/HIVE-5996
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5996.patch
>
> {code}
> hive> desc test2;
> OK
> l	bigint	None
> hive> select * from test2;
> OK
> 6666666666666666666
> 5555555555555555555
> hive> select sum(l) from test2;
> OK
> -6224521851487329395
> {code}
> It's believed that a wrap-around error occurred. It's surprising that it
> happens with only two rows. The same query in MySQL returns:
> {code}
> mysql> select sum(l) from test;
> +----------------------+
> | sum(l)               |
> +----------------------+
> | 12222222222222222221 |
> +----------------------+
> 1 row in set (0.00 sec)
> {code}
> Hive should accommodate a large number of rows. Overflowing with only two
> rows is very unusable.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
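The negative value in the quoted report is consistent with plain 64-bit two's-complement wrap-around, as in Java long addition. A minimal Python sketch, using the two values from the report, that checks this arithmetic (the modular-reduction helper is for illustration only, not Hive's actual code path):

```python
# The two bigint values from the bug report.
a = 6666666666666666666
b = 5555555555555555555

# Python integers are arbitrary precision, so this is the exact sum.
exact = a + b  # 12222222222222222221, which exceeds 2**63 - 1

# Simulate Java's signed 64-bit long addition: reduce modulo 2**64
# into the two's-complement range [-2**63, 2**63 - 1].
wrapped = (exact + 2**63) % 2**64 - 2**63

print(exact)    # 12222222222222222221 (the MySQL result)
print(wrapped)  # -6224521851487329395 (the Hive result in the report)
```

This confirms the reporter's diagnosis: the exact sum matches MySQL's answer, and the wrapped sum matches Hive's, so a single addition of two large longs is enough to overflow.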