Hi Lauren and Zhang, The book "Programming Hive" suggests to use Double (instead of float) and also to cast any literal value to double. I am already using double for all my computations (both in hive table schema as well as in my UDF). Furthermore, I am not comparing two floats/doubles. I am doing some computations involving doubles...and those minor differences are adding up.
It looks like what Mark Grover was telling - mapping between Java datatypes to Hive data-types. I am yet to look at that portion of the source-code. Thanks and will keep you posted, /PD On Fri, Dec 7, 2012 at 2:12 PM, Lauren Yang <lauren.y...@microsoft.com>wrote: > This sounds like https://issues.apache.org/jira/browse/HIVE-2586 , where > comparing float/doubles will not work because of the way floating point > numbers are represented.**** > > ** ** > > Perhaps there is a comparison between a float and double type because of > some internal representation in the Java library, or the UDF.**** > > ** ** > > Ed Capriolo’s book has a good section about workarounds and caveats for > working with floats/doubles in hive.**** > > ** ** > > Thanks,**** > > Lauren**** > > *From:* Periya.Data [mailto:periya.d...@gmail.com] > *Sent:* Friday, December 07, 2012 1:28 PM > *To:* user@hive.apache.org; cdh-u...@cloudera.org > *Subject:* Hive double-precision question**** > > ** ** > > Hi Hive Users, > I recently noticed an interesting behavior with Hive and I am unable > to find the reason for it. Your insights into this is much appreciated. > > I am trying to compute the distance between two zip codes. I have the > distances computed in various 'platforms' - SAS, R, Linux+Java, Hive UDF > and using Hive's built-in functions. There are some discrepancies from the > 3rd decimal place when I see the output got from using Hive UDF and Hive's > built-in functions. Here is an example: > > zip1 zip 2 Hadoop Built-in function > SAS R Linux + > Java**** > > 00501 **** > > 11720 **** > > 4.49493083698542000**** > > 4.49508858**** > > 4.49508858054005**** > > 4.49508857976933000**** > > > The formula used to compute distance is this (UDF): > > double long1 = Math.atan(1)/45 * ux; > double lat1 = Math.atan(1)/45 * uy; > double long2 = Math.atan(1)/45 * mx; > double lat2 = Math.atan(1)/45 * my; > > double X1 = long1; > double Y1 = lat1; > double X2 = long2; > double Y2 = lat2; > > double distance = 3949.99 * Math.acos(Math.sin(Y1) * > Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 - > X2)); > > > The one used using built-in functions (same as above): > 3949.99*acos( sin(u_y_coord * (atan(1)/45 )) * > sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))* > cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord * > (atan(1)/45) - m_x_coord * (atan(1)/45)) ) > > > > > - The Hive's built-in functions used are acos, sin, cos and atan. > - for another try, I used Hive UDF, with Java's math library (Math.acos, > Math.atan etc) > - All variables used are double. > > I expected the value from Hadoop UDF (and Built-in functions) to be > identical with that got from plain Java code in Linux. But they are not. > The built-in function (as well as UDF) gives 49493083698542000 whereas > simple Java program running in Linux gives 49508857976933000. The linux > machine is similar to the Hadoop cluster machines. > > Linux version - Red Hat 5.5 > Java - latest. > Hive - 0.7.1 > Hadoop - 0.20.2 > > This discrepancy is very consistent across thousands of zip-code > distances. It is not a one-off occurrence. In some cases, I see the > difference from the 4th decimal place. Some more examples: > > zip1 zip 2 Hadoop Built-in function > SAS R Linux + > Java**** > > 00602 **** > > 00617 **** > > 42.79095253903410000**** > > 42.79072812**** > > 42.79072812185650**** > > 42.79072812185640000**** > > 00603 **** > > 00617 **** > > 40.24044016655180000**** > > 40.2402289**** > > 40.24022889740920**** > > 40.24022889740910000**** > > 00605 **** > > 00617 **** > > 40.19191761288380000**** > > 40.19186416**** > > 40.19186415807060**** > > 40.19186415807060000**** > > > I have not tested the individual sin, cos, atan function returns. That > will be my next test. But, at the very least, why is there a difference in > the values between Hadoop's UDF/built-ins and that from Linux + Java? I am > assuming that Hive's built-in mathematical functions are nothing but the > underlying Java functions. > > Thanks, > PD.**** >