[
https://issues.apache.org/jira/browse/HIVE-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846705#comment-13846705
]
Xuefu Zhang commented on HIVE-5996:
-----------------------------------
[~ehans] Thanks for sharing your thoughts and your inquiry. For your
information, I'm not trying to make MySQL the model. My first consideration is
the SQL standard. Where there is no SQL standard for a piece of functionality,
Hive doesn't have to invent everything itself, so I do reference MySQL for
ideas, mostly because MySQL and its technical documentation are readily
available. However, this doesn't preclude me from following other DBs'
practice. For instance, precision/scale determination for arithmetic operations
in Hive follows SQL Server's formulas. I'm neither anti- nor pro-MySQL, and the
same goes for SQL Server, but I strongly believe that following
well-established practices benefits Hive more than doing something in a unique,
unfortunate way. An example would be int/int in Hive.
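To make the int/int point concrete (a rough sketch from memory, not a spec
citation; dual1 stands for any one-row table, since older Hive needs a FROM
clause):
{code}
-- Hive returns a DOUBLE for integer division:
SELECT 3 / 2 FROM dual1;   -- 1.5 in Hive
-- For comparison: SQL Server truncates int/int (3 / 2 -> 1),
-- while MySQL returns an exact decimal (3 / 2 -> 1.5000).
{code}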
However, a lot of existing functionality in Hive was put in place when Hive
was positioned as a tool rather than a DB, and before all the necessary data
types were introduced. Take int/int again as an example: the early developers
probably didn't even think about SQL compliance, and even if they did, there
was no decimal data type to consider. As Hive shifts toward being positioned as
a DB on big data, I believe we should start thinking from a perspective other
than performance or backward compatibility. If we restrict ourselves based on
unconscious decisions made in the past, we may lose a lot of opportunities to
do the right thing.
As I worked on decimal precision/scale support, I found a lot of problems in
Hive around data types and their conversions and promotions. In many cases,
Hive is not even consistent with itself. Let me ask you a question to see if
you know the answer: what's the return type of 35 + '3.14', where 35 comes from
an int column and '3.14' from a string column? Before I made the changes, you
probably would have said: wait, let me read the code first. And your answer
might have been different if my question were 35 / '3.14'. Now I can answer the
same questions right away, and I have a theory that explains why. In short, a
lot of my work on decimal, from the very beginning, has gone into cleaning up
the mess and inconsistency in Hive. If we now use either performance or
backward compatibility to shut down what we have achieved, I don't see how Hive
shifts from a tool to a DB, or how it can be adopted as an enterprise-grade
product.
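For anyone who wants to check rather than guess, here is one quick way to see
what Hive actually infers (the table and column names are made up for
illustration):
{code}
CREATE TABLE t (i INT, s STRING);
-- CTAS makes the inferred result types visible in the new table's schema
CREATE TABLE probe AS SELECT i + s AS plus_result, i / s AS div_result FROM t;
DESCRIBE probe;
{code}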
Hive is still evolving, and that's why I think we have a certain luxury of
breaking backward compatibility in order to do the right thing. As Ashutosh
once mentioned, we don't want to be backward compatible with a bug. Once Hive
is stabilized, it becomes much harder to make backward-incompatible changes, as
you know from your experience with SQL Server.
I understand your concern about backward compatibility, and especially your
possible frustration over vectorization breaking or needing to be redone. On
the other hand, I think we are all here to help Hive become more useful, and a
blunt rejection without much consideration or communication isn't as helpful
and constructive as it could be.
> Query for sum of a long column of a table with only two rows produces wrong
> result
> ----------------------------------------------------------------------------------
>
> Key: HIVE-5996
> URL: https://issues.apache.org/jira/browse/HIVE-5996
> Project: Hive
> Issue Type: Bug
> Components: UDF
> Affects Versions: 0.12.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Attachments: HIVE-5996.patch
>
>
> {code}
> hive> desc test2;
> OK
> l bigint None
> hive> select * from test2;
> OK
> 6666666666666666666
> 5555555555555555555
> hive> select sum(l) from test2;
> OK
> -6224521851487329395
> {code}
> It's believed that a wrap-around error occurred; it's surprising that it
> happens with only two rows. The same query in MySQL returns:
> {code}
> mysql> select sum(l) from test;
> +----------------------+
> | sum(l) |
> +----------------------+
> | 12222222222222222221 |
> +----------------------+
> 1 row in set (0.00 sec)
> {code}
> Hive should accommodate a large number of rows. Overflowing with only two
> rows makes it hardly usable.
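> For reference, the observed value is consistent with a signed 64-bit (long)
> accumulator wrapping around: 6666666666666666666 + 5555555555555555555 =
> 12222222222222222221, which exceeds 2^63 - 1 = 9223372036854775807, and
> 12222222222222222221 - 2^64 = -6224521851487329395.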
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)