This sounds like https://issues.apache.org/jira/browse/HIVE-2586 , where comparing float/doubles will not work because of the way floating point numbers are represented.
Perhaps there is a comparison between a float and double type because of some internal representation in the Java library, or the UDF. Ed Capriolo's book has a good section about workarounds and caveats for working with floats/doubles in hive. Thanks, Lauren From: Periya.Data [mailto:periya.d...@gmail.com] Sent: Friday, December 07, 2012 1:28 PM To: user@hive.apache.org; cdh-u...@cloudera.org Subject: Hive double-precision question Hi Hive Users, I recently noticed an interesting behavior with Hive and I am unable to find the reason for it. Your insights into this is much appreciated. I am trying to compute the distance between two zip codes. I have the distances computed in various 'platforms' - SAS, R, Linux+Java, Hive UDF and using Hive's built-in functions. There are some discrepancies from the 3rd decimal place when I see the output got from using Hive UDF and Hive's built-in functions. Here is an example: zip1 zip 2 Hadoop Built-in function SAS R Linux + Java 00501 11720 4.49493083698542000 4.49508858 4.49508858054005 4.49508857976933000 The formula used to compute distance is this (UDF): double long1 = Math.atan(1)/45 * ux; double lat1 = Math.atan(1)/45 * uy; double long2 = Math.atan(1)/45 * mx; double lat2 = Math.atan(1)/45 * my; double X1 = long1; double Y1 = lat1; double X2 = long2; double Y2 = lat2; double distance = 3949.99 * Math.acos(Math.sin(Y1) * Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 - X2)); The one used using built-in functions (same as above): 3949.99*acos( sin(u_y_coord * (atan(1)/45 )) * sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))* cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord * (atan(1)/45) - m_x_coord * (atan(1)/45)) ) - The Hive's built-in functions used are acos, sin, cos and atan. - for another try, I used Hive UDF, with Java's math library (Math.acos, Math.atan etc) - All variables used are double. I expected the value from Hadoop UDF (and Built-in functions) to be identical with that got from plain Java code in Linux. But they are not. The built-in function (as well as UDF) gives 49493083698542000 whereas simple Java program running in Linux gives 49508857976933000. The linux machine is similar to the Hadoop cluster machines. Linux version - Red Hat 5.5 Java - latest. Hive - 0.7.1 Hadoop - 0.20.2 This discrepancy is very consistent across thousands of zip-code distances. It is not a one-off occurrence. In some cases, I see the difference from the 4th decimal place. Some more examples: zip1 zip 2 Hadoop Built-in function SAS R Linux + Java 00602 00617 42.79095253903410000 42.79072812 42.79072812185650 42.79072812185640000 00603 00617 40.24044016655180000 40.2402289 40.24022889740920 40.24022889740910000 00605 00617 40.19191761288380000 40.19186416 40.19186415807060 40.19186415807060000 I have not tested the individual sin, cos, atan function returns. That will be my next test. But, at the very least, why is there a difference in the values between Hadoop's UDF/built-ins and that from Linux + Java? I am assuming that Hive's built-in mathematical functions are nothing but the underlying Java functions. Thanks, PD.