Re: Hive parallel execution deadlocks, need restart of yarn-nodemanager
Ah I see, I had missed the fact that each MR job has an ApplicationMaster that takes a container, so there were none free to run mappers (my jobs usually have only one mapper due to small input data). I understood that thanks to your explanations and by using more nodes with a greater concurrency; just as before, all containers were running an ApplicationMaster! Thank you very much!

-- Alexandre Fouche
Lead operations engineer, cloud architect
http://www.cleverscale.com | @cleverscale

On Thursday 6 December 2012 at 21:08, Vinod Kumar Vavilapalli wrote:

You mentioned you only have one NodeManager. So, is Hive generating 3 MapReduce jobs? And how many map and reduce tasks for each job? What is your yarn.nodemanager.resource.memory-mb? That determines the maximum number of containers you can run. You are running into an issue where all the jobs run in parallel, and because each job now has one ApplicationMaster which also occupies a container, the jobs are getting into a scheduling livelock. On a single node you will not have enough capacity to run many jobs in parallel.

Thanks,
+Vinod

On Dec 6, 2012, at 5:24 AM, Alexandre Fouche wrote:

Is there a known deadlock issue or bug when using Hive parallel execution with more parallel Hive threads than there are computing NodeManagers? On my test cluster, I have set Hive parallel execution to 2 or 3 threads, and have only 1 computing NodeManager with 5 CPU cores. When I run a Hive query with a lot of unions that decomposes into a lot of jobs to be executed in parallel, after a few jobs are done it always ends up deadlocked at 0% map for all parallel jobs (from HiveServer2 logs). If I restart hadoop-yarn-nodemanager on the NodeManager server, Hive gets out of its deadlock and continues, until getting deadlocked again a bit later.

Alex
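The container arithmetic behind the livelock above can be sketched quickly. The numbers here (5120 MB of NodeManager memory, 1024 MB per container) are illustrative assumptions, not values from the thread:

```java
// Illustrative sketch of the YARN scheduling livelock described above.
// Assumed (not from the thread): yarn.nodemanager.resource.memory-mb = 5120
// and 1024 MB per container, giving 5 containers on the single NodeManager.
public class ContainerMath {
    // Maximum number of concurrent containers on one NodeManager.
    static int maxContainers(int nodeMemoryMb, int containerMb) {
        return nodeMemoryMb / containerMb;
    }

    // Each parallel job holds one container for its ApplicationMaster;
    // progress requires at least one container left over for a map task.
    static boolean canMakeProgress(int containers, int parallelJobs) {
        return containers - parallelJobs >= 1;
    }

    public static void main(String[] args) {
        int total = maxContainers(5120, 1024);          // 5 containers
        System.out.println(canMakeProgress(total, 3));  // 2 containers left for mappers
        System.out.println(canMakeProgress(total, 5));  // every container held by an AM: livelock
    }
}
```

With enough parallel jobs that every container is an ApplicationMaster, no mapper can ever be scheduled, which matches the "stuck at 0% map until NodeManager restart" symptom.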
Set the number of mapper in hive
Hi everyone,

My cluster also runs HBase for real-time processing. A Hive query (on a big table) occupies all the map tasks, so the other service cannot run properly. Does anyone know how to limit the number of running map tasks in Hive? I see mapred.reduce.tasks in the configuration properties, but I don't see a corresponding configuration for mappers.

Thank you.

Regards,
Philips
Re: How to set an empty value to hive.querylog.location to disable the creation of hive history file
Do you mean Hive does NOT support disabling the creation of hive history files, or does NOT support using an empty string to achieve this? If Hive doesn't support disabling the creation of query logs, do you know the reason?

Thanks,
- Bing

2012/12/6 Hezhiqiang (Ransom) ransom.hezhiqi...@huawei.com

It's not supported now. I think you should raise it in JIRA.

Regards
Ransom

From: Bing Li [mailto:sarah.lib...@gmail.com]
Sent: Thursday, December 06, 2012 5:06 PM
To: user@hive.apache.org
Subject: Re: How to set an empty value to hive.querylog.location to disable the creation of hive history file

It will exit with an error like "FAILED: Failed to open Query Log: /dev/null/hive_job_log_xxx.txt", pointing out that the path is not a directory.

2012/12/6 Jithendranath Joijoide pixelma...@gmail.com

How about setting it to /dev/null? Not sure if that would help in your case. Just a hack.

Regards.

On Thu, Dec 6, 2012 at 2:14 PM, Bing Li sarah.lib...@gmail.com wrote:

Hi, all

Referring to https://cwiki.apache.org/Hive/adminmanual-configuration.html, if I set hive.querylog.location to an empty string, it should not create the structured log. I edited hive-site.xml in HIVE_HOME/conf and added the following setting:

  <property>
    <name>hive.querylog.location</name>
    <value></value>
  </property>

BUT it didn't work: when I launch HIVE_HOME/bin/hive, it creates a history file in /tmp/<user.name>, which is the default directory for this property. Do you know how to set an EMPTY value in hive-site.xml?

Thanks,
- Bing
Re: Set the number of mapper in hive
Ways to handle this:

1) Create separate job queues for Hive and HBase users on the JobTracker and allocate resources according to your needs.
2) You cannot actually limit how many maps are launched, as that is decided at run time by looking at the split size. If you want fewer maps to be launched, increase the split size. Also keep in mind you will then need more memory per map.

On Dec 7, 2012 2:32 PM, Philips Kokoh Prasetyo philipsko...@gmail.com wrote:
[quoted message snipped; see above]
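The second option (fewer, larger splits) can be sketched as session-level settings. A minimal sketch, assuming the 0.20/Hive-0.9-era property names from the thread; the byte values are illustrative, not recommendations:

```sql
-- Illustrative only: raise the split-size bounds so fewer, larger splits
-- (and hence fewer map tasks) are created for the Hive query.
SET mapred.min.split.size=536870912;   -- 512 MB
SET mapred.max.split.size=1073741824;  -- 1 GB
-- CombineHiveInputFormat lets Hive merge many small files into one split:
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```

The exact effect depends on the input format and file layout; with unsplittable or very small files the mapper count may not drop as far as the arithmetic suggests.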
Re: Set the number of mapper in hive
Hi Nitin,

Thanks for the reply. Do you mean using the fair scheduler to separate the job queues?
http://hadoop.apache.org/docs/r0.20.2/fair_scheduler.html

Regards,
Philips

On Fri, Dec 7, 2012 at 5:15 PM, Nitin Pawar nitinpawar...@gmail.com wrote:
[quoted message snipped; see above]
Re: Set the number of mapper in hive
Yes.

On Dec 7, 2012 3:09 PM, Philips Kokoh Prasetyo philipsko...@gmail.com wrote:
[quoted message snipped; see above]
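For concreteness, a fair-scheduler setup separates capacity through an allocations file referenced by mapred.fairscheduler.allocation.file. This fragment is hypothetical: the pool names and numbers are made up for illustration and are not from the thread:

```xml
<?xml version="1.0"?>
<!-- Hypothetical fair-scheduler allocation file for Hadoop 0.20.
     Pool names and numbers are illustrative assumptions. -->
<allocations>
  <pool name="hbase">
    <minMaps>5</minMaps>            <!-- guaranteed map slots for HBase jobs -->
    <minReduces>2</minReduces>
    <weight>2.0</weight>
  </pool>
  <pool name="hive">
    <weight>1.0</weight>
    <maxRunningJobs>3</maxRunningJobs>  <!-- cap concurrent Hive jobs -->
  </pool>
</allocations>
```

Jobs are then routed to pools (by user, group, or a configured property), so HBase-side MapReduce work keeps a guaranteed minimum of slots even while a large Hive query is running.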
Map side join
Hello everybody,

I have a question; I didn't come across any post that says anything about this. I have two tables, let's say A and B, and I want to join A and B in Hive. I am currently using Hive 0.9. The join would be on a few columns, like:

  ON (A.id1 = B.id1) AND (A.id2 = B.id2) AND (A.id3 = B.id3)

Can I ask Hive to use a map-side join in this scenario? Should I give a hint to Hive by saying /*+ MAPJOIN(B) */?

Get back to me if you want any more information in this regard.

Thanks and regards,
Souvik.
Re: Map side join
Hi Souvik

In earlier versions of Hive you had to give the map join hint, but in later versions you can just set hive.auto.convert.join = true; Hive then automatically selects the smaller table. It is better to give the smaller table as the first one in the join. You can use a map join if you are joining a small table with a large one, in terms of data size. By small, it is better for the smaller table's size to be in the range of MBs.

Regards
Bejoy KS

Sent from remote device, Please excuse typos

-----Original Message-----
From: Souvik Banerjee souvikbaner...@gmail.com
Date: Fri, 7 Dec 2012 13:58:25
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Map side join

[quoted message snipped; see above]
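Both styles described above can be sketched against the multi-column join from the question (table and column names are Souvik's; the settings are standard Hive options):

```sql
-- Newer Hive: let the planner convert the join to a map join automatically
-- when one side is small enough to be loaded into memory.
SET hive.auto.convert.join=true;

-- Older Hive: give the hint explicitly, naming the small table.
SELECT /*+ MAPJOIN(B) */ A.*
FROM A JOIN B
  ON (A.id1 = B.id1) AND (A.id2 = B.id2) AND (A.id3 = B.id3);
```

A multi-column equi-join like this is still a single join key as far as the map join is concerned, so the hint works the same way as for a one-column join.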
Re: Hive double-precision question
Hi, Periya:

This is a problem for me also. I filed https://issues.apache.org/jira/browse/HIVE-3715 and have a patch working locally. I am doing more tests right now and will post it soon.

Thanks,
Johnny

On Fri, Dec 7, 2012 at 1:27 PM, Periya.Data periya.d...@gmail.com wrote:

Hi Hive Users,

I recently noticed an interesting behavior with Hive and I am unable to find the reason for it. Your insights into this are much appreciated.

I am trying to compute the distance between two zip codes. I have the distances computed on various 'platforms': SAS, R, Linux+Java, a Hive UDF, and Hive's built-in functions. There are discrepancies from the 3rd decimal place between the output from the Hive UDF / built-in functions and the others. Here is an example:

  zip1   zip2   Hadoop built-in function  SAS         R                 Linux + Java
  00501  11720  4.49493083698542000       4.49508858  4.49508858054005  4.49508857976933000

The formula used to compute distance is this (UDF):

  double long1 = Math.atan(1)/45 * ux;
  double lat1  = Math.atan(1)/45 * uy;
  double long2 = Math.atan(1)/45 * mx;
  double lat2  = Math.atan(1)/45 * my;
  double X1 = long1;
  double Y1 = lat1;
  double X2 = long2;
  double Y2 = lat2;
  double distance = 3949.99 * Math.acos(Math.sin(Y1) * Math.sin(Y2)
      + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 - X2));

The one using built-in functions (same as above):

  3949.99*acos( sin(u_y_coord * (atan(1)/45)) * sin(m_y_coord * (atan(1)/45))
      + cos(u_y_coord * (atan(1)/45)) * cos(m_y_coord * (atan(1)/45))
      * cos(u_x_coord * (atan(1)/45) - m_x_coord * (atan(1)/45)) )

- The Hive built-in functions used are acos, sin, cos and atan.
- For another try, I used a Hive UDF with Java's math library (Math.acos, Math.atan, etc.).
- All variables used are double.

I expected the value from the Hadoop UDF (and built-in functions) to be identical to that from plain Java code on Linux. But they are not. The built-in function (as well as the UDF) gives 4.49493083698542000 whereas a simple Java program running on Linux gives 4.49508857976933000. The Linux machine is similar to the Hadoop cluster machines.

  Linux version - Red Hat 5.5
  Java - latest
  Hive - 0.7.1
  Hadoop - 0.20.2

This discrepancy is very consistent across thousands of zip-code distances; it is not a one-off occurrence. In some cases, I see the difference from the 4th decimal place. Some more examples:

  zip1   zip2   Hadoop built-in function  SAS          R                  Linux + Java
  00602  00617  42.7909525390341          42.79072812  42.79072812185650  42.7907281218564
  00603  00617  40.2404401665518          40.2402289   40.24022889740920  40.2402288974091
  00605  00617  40.1919176128838          40.19186416  40.19186415807060  40.1918641580706

I have not tested the individual sin, cos, atan function returns; that will be my next test. But, at the very least, why is there a difference between the values from Hadoop's UDF/built-ins and those from Linux + Java? I am assuming that Hive's built-in mathematical functions are nothing but the underlying Java functions.

Thanks,
PD.
Re: Hive double-precision question
Periya:

If you want to see what the built-in Hive UDFs are doing, the code is here:
https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic
and
https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf

You can find out which UDF name maps to which class by looking at:
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

If my memory serves me right, there was some interesting stuff Hive does when mapping Java types to Hive datatypes. I am not sure how relevant it is to this discussion, but I will have to look further to comment more. In the meanwhile, take a look at the UDF code and see if your personal Java code on Linux is equivalent to the Hive UDF code.

Keep us posted!
Mark

On Fri, Dec 7, 2012 at 1:27 PM, Periya.Data periya.d...@gmail.com wrote:
[quoted message snipped; see above]
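One plausible (unconfirmed) source of cross-machine divergence in this thread is that java.lang.Math is allowed to use platform- and JIT-specific intrinsics, while StrictMath follows fdlibm and is bit-reproducible across JVMs. A hedged sketch comparing the two on the thread's formula; the coordinates are made up for illustration, not real zip-code centroids:

```java
// Hedged sketch: the thread's distance formula computed with java.lang.Math
// versus StrictMath. Math results may differ by a few ulps across platforms;
// StrictMath results are reproducible everywhere. This is an assumption
// about the cause of the discrepancy, not a confirmed diagnosis.
public class StrictMathCheck {
    static double distance(double y1, double y2, double x1, double x2) {
        return 3949.99 * Math.acos(Math.sin(y1) * Math.sin(y2)
                + Math.cos(y1) * Math.cos(y2) * Math.cos(x1 - x2));
    }

    static double distanceStrict(double y1, double y2, double x1, double x2) {
        return 3949.99 * StrictMath.acos(StrictMath.sin(y1) * StrictMath.sin(y2)
                + StrictMath.cos(y1) * StrictMath.cos(y2) * StrictMath.cos(x1 - x2));
    }

    public static void main(String[] args) {
        double d2r = Math.atan(1) / 45;  // degrees to radians, as in the thread
        // Made-up coordinates for illustration:
        double lat1 = 40.81 * d2r, lon1 = -73.04 * d2r;
        double lat2 = 40.87 * d2r, lon2 = -73.08 * d2r;
        System.out.println("Math:       " + distance(lat1, lat2, lon1, lon2));
        System.out.println("StrictMath: " + distanceStrict(lat1, lat2, lon1, lon2));
    }
}
```

If the UDF and the standalone Linux program both used StrictMath and still disagreed, the intrinsics explanation could be ruled out.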
RE: Hive double-precision question
This sounds like https://issues.apache.org/jira/browse/HIVE-2586, where comparing floats/doubles will not work because of the way floating-point numbers are represented. Perhaps there is a comparison between a float and a double type because of some internal representation in the Java library, or in the UDF.

Ed Capriolo's book has a good section about workarounds and caveats for working with floats/doubles in Hive.

Thanks,
Lauren

From: Periya.Data [mailto:periya.d...@gmail.com]
Sent: Friday, December 07, 2012 1:28 PM
To: user@hive.apache.org; cdh-u...@cloudera.org
Subject: Hive double-precision question

[quoted message snipped; see above]
Re: Hive double-precision question
Thanks Lauren, Mark Grover and Zhang. I will have to look at the Hive source code to see what is happening and whether I can make the results consistent. Interested to see Zhang's patch; I shall watch that JIRA.

-PD

On Fri, Dec 7, 2012 at 2:12 PM, Lauren Yang lauren.y...@microsoft.com wrote:
[quoted message snipped; see above]
Re: Hive double-precision question
Hi Mark,

Thanks for the pointers. I looked at the code, and it looks like my Java code and the Hive code are similar (I am a basic-level Java guy). The UDF below uses Math.sin, which is what I used to test the Linux + Java result. I have to see what this DoubleWritable and serde2 are all about...

  package org.apache.hadoop.hive.ql.udf;

  import org.apache.hadoop.hive.ql.exec.Description;
  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.hive.serde2.io.DoubleWritable;

  /**
   * UDFSin.
   */
  @Description(name = "sin",
      value = "_FUNC_(x) - returns the sine of x (x is in radians)",
      extended = "Example:\n  > SELECT _FUNC_(0) FROM src LIMIT 1;\n  0")
  public class UDFSin extends UDF {
    private DoubleWritable result = new DoubleWritable();

    public UDFSin() {
    }

    public DoubleWritable evaluate(DoubleWritable a) {
      if (a == null) {
        return null;
      } else {
        result.set(Math.sin(a.get()));
        return result;
      }
    }
  }

On Fri, Dec 7, 2012 at 2:02 PM, Mark Grover grover.markgro...@gmail.com wrote:
[quoted message snipped; see above]
Load data in (external table) from symbolic link
Hi,

I am trying to create an external table in Hive by pointing it to a file that has symbolic links in its path reference. Hive complains with the following error, indicating that it treats the symbolic link as a regular file:

  java.io.IOException: Open failed for file: /dir1/dir2/dir3_symlink, error: Invalid argument (22)
      at com.mapr.fs.MapRClient.open(MapRClient.java:190)
      at com.mapr.fs.MapRFileSystem.open(MapRFileSystem.java:327)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:460)
      at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:93)
      at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:54)
      at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:237)
      at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:336)
      at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1109)
      at org.apache.hadoop.mapred.Child.main(Child.java:264)

Any help would be appreciated.
PK violation during Hive add partition
Hello,

We are running into intermittent errors while running the query below. Some background: the table (tbl_someTable) that we're altering is an external table, and the query is run concurrently by multiple Oozie workflows.

  ALTER TABLE tbl_someTable ADD IF NOT EXISTS
  PARTITION (cluster_address = '${CLUSTERADDRESS}',
             upload_date = '${PREVIOUSDATE}',
             upload_hour = '${PREVIOUSHOUR}')
  LOCATION 'asv://${RAWLOGSCONTAINER}/${CLUSTERADDRESS}/someLog/${PREVIOUSDATE}/${PREVIOUSHOUR}';

The errors we're getting are below. Is this a known issue, and is there a workaround for it?

Thanks,
Karlen

stderr logs:

  WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
  Logging initialized using configuration in jar:file:/c:/hdfs/mapred/local/taskTracker/distcache/5662320028645753518_889604055_1925270295/10.175.202.81/user/dssxuser/share/lib/hive/hive-common-0.9.0.jar!/hive-log4j.properties
  Hive history file=/tmp/dssxuser/hive_job_log_dssxuser_201212070113_1149932084.txt
  FAILED: Error in metadata: javax.jdo.JDODataStoreException: Insert of object org.apache.hadoop.hive.metastore.model.MPartition@2a4e50f using statement INSERT INTO PARTITIONS (PART_ID,CREATE_TIME,SD_ID,PART_NAME,LAST_ACCESS_TIME,TBL_ID) VALUES (?,?,?,?,?,?) failed : Violation of PRIMARY KEY constraint 'PK_partitions_PART_ID'. Cannot insert duplicate key in object 'dbo.PARTITIONS'. The duplicate key value is (221).
  NestedThrowables: com.microsoft.sqlserver.jdbc.SQLServerException: Violation of PRIMARY KEY constraint 'PK_partitions_PART_ID'. Cannot insert duplicate key in object 'dbo.PARTITIONS'. The duplicate key value is (221).
  FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
  Intercepting System.exit(9)
  Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [9]

stderr logs (another run):

  WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
  Logging initialized using configuration in jar:file:/c:/hdfs/mapred/local/taskTracker/distcache/2751940372978647467_889604055_1925270295/10.175.202.81/user/dssxuser/share/lib/hive/hive-common-0.9.0.jar!/hive-log4j.properties
  Hive history file=/tmp/dssxuser/hive_job_log_dssxuser_201212071515_173032638.txt
  FAILED: Error in metadata: javax.jdo.JDODataStoreException: Insert of object org.apache.hadoop.hive.metastore.model.MSerDeInfo@31ce40d5 using statement INSERT INTO SERDES (SERDE_ID,SLIB,NAME) VALUES (?,?,?) failed : Violation of PRIMARY KEY constraint 'PK_serdes_SERDE_ID'. Cannot insert duplicate key in object 'dbo.SERDES'. The duplicate key value is (2006).
  NestedThrowables: com.microsoft.sqlserver.jdbc.SQLServerException: Violation of PRIMARY KEY constraint 'PK_serdes_SERDE_ID'. Cannot insert duplicate key in object 'dbo.SERDES'. The duplicate key value is (2006).
  FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
  Intercepting System.exit(9)
  Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [9]