Re: Hive double-precision question

Johnny Zhang Mon, 17 Dec 2012 16:14:35 -0800

Hi, Periya:
Can you take a look at the patch of
https://issues.apache.org/jira/browse/HIVE-3715 and see if you can apply
the similar change to make sinc/cons more accurate for your use case? Feel
free to comments on the jira as well. Thanks.


Johnny


On Sat, Dec 8, 2012 at 11:23 AM, Periya.Data <[email protected]> wrote:

> Hi Lauren and Zhang,
>     The book "Programming Hive" suggests to use Double (instead of float)
> and also to cast any literal value to double. I am already using double for
> all my computations (both in hive table schema as well as in my UDF).
> Furthermore, I am not comparing two floats/doubles. I am doing some
> computations involving doubles...and those minor differences are adding up.
>
> It looks like what Mark Grover was telling - mapping between Java
> datatypes to Hive data-types. I am yet to look at that portion of the
> source-code.
>
> Thanks and will keep you posted,
> /PD
>
>
>
>  On Fri, Dec 7, 2012 at 2:12 PM, Lauren Yang <[email protected]>wrote:
>
>>  This sounds like https://issues.apache.org/jira/browse/HIVE-2586 ,
>> where comparing float/doubles will not work because of the way floating
>> point numbers are represented.****
>>
>> ** **
>>
>> Perhaps there is a comparison between a  float and double type because of
>> some internal representation in the Java library, or the UDF.****
>>
>> ** **
>>
>> Ed Capriolo’s book has a good section about workarounds and caveats for
>> working with floats/doubles in hive.****
>>
>> ** **
>>
>> Thanks,****
>>
>> Lauren****
>>
>> *From:* Periya.Data [mailto:[email protected]]
>> *Sent:* Friday, December 07, 2012 1:28 PM
>> *To:* [email protected]; [email protected]
>> *Subject:* Hive double-precision question****
>>
>> ** **
>>
>> Hi Hive Users,
>>     I recently noticed an interesting behavior with Hive and I am unable
>> to find the reason for it. Your insights into this is much appreciated.
>>
>> I am trying to compute the distance between two zip codes. I have the
>> distances computed in various 'platforms' - SAS, R, Linux+Java, Hive UDF
>> and using Hive's built-in functions. There are some discrepancies from the
>> 3rd decimal place when I see the output got from using Hive UDF and Hive's
>> built-in functions. Here is an example:
>>
>> zip1          zip 2          Hadoop Built-in function
>> SAS                      R                                       Linux +
>> Java****
>>
>> 00501  ****
>>
>> 11720  ****
>>
>> 4.49493083698542000****
>>
>> 4.49508858****
>>
>> 4.49508858054005****
>>
>> 4.49508857976933000****
>>
>>
>> The formula used to compute distance is this (UDF):
>>
>>         double long1 = Math.atan(1)/45 * ux;
>>         double lat1 = Math.atan(1)/45 * uy;
>>         double long2 = Math.atan(1)/45 * mx;
>>         double lat2 = Math.atan(1)/45 * my;
>>
>>         double X1 = long1;
>>         double Y1 = lat1;
>>         double X2 = long2;
>>         double Y2 = lat2;
>>
>>         double distance = 3949.99 * Math.acos(Math.sin(Y1) *
>>                 Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1
>> - X2));
>>
>>
>> The one used using built-in functions (same as above):
>> 3949.99*acos(  sin(u_y_coord * (atan(1)/45 )) *
>>         sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
>>         cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
>>         (atan(1)/45) - m_x_coord * (atan(1)/45)) )
>>
>>
>>
>>
>> - The Hive's built-in functions used are acos, sin, cos and atan.
>> - for another try, I used Hive UDF, with Java's math library (Math.acos,
>> Math.atan etc)
>> - All variables used are double.
>>
>> I expected the value from Hadoop UDF (and Built-in functions) to be
>> identical with that got from plain Java code in Linux. But they are not.
>> The built-in function (as well as UDF) gives 49493083698542000 whereas
>> simple Java program running in Linux gives 49508857976933000. The linux
>> machine is similar to the Hadoop cluster machines.
>>
>> Linux version - Red Hat 5.5
>> Java - latest.
>> Hive - 0.7.1
>> Hadoop - 0.20.2
>>
>> This discrepancy is very consistent across thousands of zip-code
>> distances. It is not a one-off occurrence. In some cases, I see the
>> difference from the 4th decimal place. Some more examples:
>>
>> zip1          zip 2          Hadoop Built-in function
>> SAS                      R                                       Linux +
>> Java****
>>
>> 00602  ****
>>
>> 00617  ****
>>
>> 42.79095253903410000****
>>
>> 42.79072812****
>>
>> 42.79072812185650****
>>
>> 42.79072812185640000****
>>
>> 00603  ****
>>
>> 00617  ****
>>
>> 40.24044016655180000****
>>
>> 40.2402289****
>>
>> 40.24022889740920****
>>
>> 40.24022889740910000****
>>
>> 00605  ****
>>
>> 00617  ****
>>
>> 40.19191761288380000****
>>
>> 40.19186416****
>>
>> 40.19186415807060****
>>
>> 40.19186415807060000****
>>
>>
>> I have not tested the individual sin, cos, atan function returns. That
>> will be my next test. But, at the very least, why is there a difference in
>> the values between Hadoop's UDF/built-ins and that from Linux + Java?  I am
>> assuming that Hive's built-in mathematical functions are nothing but the
>> underlying Java functions.
>>
>> Thanks,
>> PD.****
>>
>
>  --
>
>
>
>

Re: Hive double-precision question

Reply via email to