Re: Hive parallel execution deadlocks, need restart of yarn-nodemanager

2012-12-07 Thread Alexandre Fouche
Ah I see, I had missed the fact that each MR job has an ApplicationMaster 
that takes up a container, so there were none free to run mappers (my jobs 
usually have only one mapper due to small input data). I understood that thanks 
to your explanations and by trying more nodes with a greater concurrency: as 
before, all containers were running an ApplicationMaster!

Thank you very much!


--
Alexandre Fouche
Lead operations engineer, cloud architect
http://www.cleverscale.com | @cleverscale
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Thursday 6 December 2012 at 21:08, Vinod Kumar Vavilapalli wrote:

 
 You mentioned you only have one NodeManager.
 
 So, is Hive generating 3 MapReduce jobs? And how many map and reduce tasks 
 for each job?
 
 What is your yarn.nodemanager.resource.memory-mb? That determines the maximum 
 number of containers you can run.
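 
 For example (numbers purely illustrative): with 8 GB of 
 yarn.nodemanager.resource.memory-mb and 1 GB containers, at most 8 containers 
 fit on the node, and 3 parallel jobs consume 3 of those for their 
 ApplicationMasters alone.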
 
 You are running into an issue where all the jobs run in parallel, and 
 because each job now has an 'ApplicationMaster' which also occupies a 
 container, the jobs are getting into a scheduling livelock. On a single node 
 you will not have enough capacity to run many jobs in parallel.
 
 Thanks,
 +Vinod
 
 On Dec 6, 2012, at 5:24 AM, Alexandre Fouche wrote:
  Is there a known deadlock issue or bug when using Hive parallel execution 
  with more parallel Hive threads than there are computing NodeManagers?
  
  On my test cluster, I have set Hive parallel execution to 2 or 3 threads, 
  and have only 1 computing NodeManager with 5 CPU cores.
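  
  For reference, I enabled it with settings along these lines (values 
  illustrative):
  
    set hive.exec.parallel=true;
    set hive.exec.parallel.thread.number=3;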
  
  When I run a Hive query with a lot of unions that decomposes into a lot of 
  jobs to be executed in parallel, then after a few jobs are done it always 
  ends up deadlocked at 0% map for all parallel jobs (from the HiveServer2 
  logs). If I restart hadoop-yarn-nodemanager on the NodeManager server, Hive 
  gets out of its deadlock and continues, until getting deadlocked a bit 
  later again.
  
  Alex 



Set the number of mapper in hive

2012-12-07 Thread Philips Kokoh Prasetyo
Hi everyone,

My cluster also runs HBase for real-time processing. A Hive query (on a big
table) occupies all the map tasks so that the other services cannot run
properly.
Does anyone know how to limit the number of running map tasks in Hive?
I see mapred.reduce.tasks in the configuration properties, but I don't see
an equivalent configuration for mappers.

Thank you.

Regards,
Philips


Re: How to set an empty value to hive.querylog.location to disable the creation of hive history file

2012-12-07 Thread Bing Li
Do you mean that disabling the creation of Hive history files is NOT
supported, or that using an empty string to achieve this is NOT supported?

If Hive doesn't support disabling the creation of query logs, do you know the
reason?

Thanks,
- Bing

2012/12/6 Hezhiqiang (Ransom) ransom.hezhiqi...@huawei.com

  It’s not supported now. 

 I think you can raise it in JIRA.

 Regards

 Ransom

 From: Bing Li [mailto:sarah.lib...@gmail.com]
 Sent: Thursday, December 06, 2012 5:06 PM
 To: user@hive.apache.org
 Subject: Re: How to set an empty value to hive.querylog.location to
 disable the creation of hive history file

 It exits with an error like

 FAILED: Failed to open Query Log: /dev/null/hive_job_log_xxx.txt

 and points out that the path is not a directory.



 

 2012/12/6 Jithendranath Joijoide pixelma...@gmail.com

 How about setting it to /dev/null? Not sure if that would help in your
 case. Just a hack.


 Regards.


 On Thu, Dec 6, 2012 at 2:14 PM, Bing Li sarah.lib...@gmail.com wrote:

 Hi, all
 According to https://cwiki.apache.org/Hive/adminmanual-configuration.html, if
 I set hive.querylog.location to an empty string, it won't create a
 structured log.

 I edited hive-site.xml in HIVE_HOME/conf and added the following setting:

 <property>
   <name>hive.querylog.location</name>
   <value></value>
 </property>

 BUT it didn't work: when launching HIVE_HOME/bin/hive, it still created a
 history file in /tmp/${user.name}, which is the default directory for this
 property.

 Do you know how to set an EMPTY value in hive-site.xml?


 Thanks,
 - Bing




Re: Set the number of mapper in hive

2012-12-07 Thread Nitin Pawar
Ways to handle this:
1) Create separate job queues for Hive and HBase users on the JobTracker and
allocate resources according to your needs.
2) You cannot actually limit how many maps are launched, as that is decided
at run time by looking at the split size. If you want fewer maps to be
launched, increase the split size... also keep in mind that you will then
need larger memory for the maps.
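
As a sketch of both options from the Hive CLI (the queue name and split sizes
are illustrative; the queue must already exist on the JobTracker):

  -- option 1: submit Hive queries to a dedicated queue
  set mapred.job.queue.name=hive_queue;
  -- option 2: bigger splits => fewer, larger map tasks (512 MB here)
  set mapred.max.split.size=536870912;
  set mapred.min.split.size=268435456;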



Re: Set the number of mapper in hive

2012-12-07 Thread Philips Kokoh Prasetyo
Hi Nitin,

Thanks for the reply.
Do you mean using the fair scheduler to create separate job queues?
http://hadoop.apache.org/docs/r0.20.2/fair_scheduler.html


Regards,
Philips






Re: Set the number of mapper in hive

2012-12-07 Thread Nitin Pawar
Yes





Map side join

2012-12-07 Thread Souvik Banerjee
Hello everybody,

I have a question; I haven't come across any post that says something
about this.
I have two tables, let's say A and B.
I want to join A & B in Hive. I am currently using Hive 0.9.
The join would be on a few columns, like ON (A.id1 = B.id1) AND (A.id2 =
B.id2) AND (A.id3 = B.id3).

Can I ask Hive to use a map-side join in this scenario? Should I give a hint
to Hive by saying /*+ MAPJOIN(B) */?

Get back to me if you want any more information in this regard.

Thanks and regards,
Souvik.


Re: Map side join

2012-12-07 Thread bejoy_ks
Hi Souvik

In earlier versions of Hive you had to give the map-join hint, but in later 
versions you can just set hive.auto.convert.join = true;
Hive then automatically selects the smaller table. It is better to give the 
smaller table as the first one in the join.

You can use a map join if you are joining a small table with a large one in 
terms of data size. By small, it is better to have the smaller table's size in 
the range of MBs.
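
For your example, both approaches would look roughly like this (a sketch; B is
assumed to be the smaller table):

  -- later versions: let Hive convert the join automatically
  set hive.auto.convert.join=true;
  SELECT b.*, a.*
  FROM B b JOIN A a
    ON (a.id1 = b.id1) AND (a.id2 = b.id2) AND (a.id3 = b.id3);

  -- earlier versions: give the explicit hint on the small table
  SELECT /*+ MAPJOIN(b) */ b.*, a.*
  FROM B b JOIN A a
    ON (a.id1 = b.id1) AND (a.id2 = b.id2) AND (a.id3 = b.id3);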

Regards 
Bejoy KS

Sent from remote device, Please excuse typos




Re: Hive double-precision question

2012-12-07 Thread Johnny Zhang
Hi, Periya:
This is a problem for me also. I filed
https://issues.apache.org/jira/browse/HIVE-3715

I have a patch working locally. I am doing more tests right now and will post
it soon.

Thanks,
Johnny


On Fri, Dec 7, 2012 at 1:27 PM, Periya.Data periya.d...@gmail.com wrote:

 Hi Hive Users,
 I recently noticed an interesting behavior with Hive and I am unable
 to find the reason for it. Your insights into this are much appreciated.

 I am trying to compute the distance between two zip codes. I have the
 distances computed on various 'platforms' - SAS, R, Linux+Java, a Hive UDF
 and Hive's built-in functions. There are discrepancies from the 3rd decimal
 place onward between the Hive UDF / built-in output and the other platforms.
 Here is an example:

 zip1   zip2    Hadoop Built-in function  SAS         R                 Linux + Java
 00501  11720   4.49493083698542000       4.49508858  4.49508858054005  4.49508857976933000
 The formula used to compute distance is this (UDF):

 double long1 = Math.atan(1)/45 * ux;
 double lat1 = Math.atan(1)/45 * uy;
 double long2 = Math.atan(1)/45 * mx;
 double lat2 = Math.atan(1)/45 * my;

 double X1 = long1;
 double Y1 = lat1;
 double X2 = long2;
 double Y2 = lat2;

 double distance = 3949.99 * Math.acos(Math.sin(Y1) *
 Math.sin(Y2) + Math.cos(Y1) * Math.cos(Y2) * Math.cos(X1 -
 X2));


 The one using built-in functions (same as above):
 3949.99*acos(  sin(u_y_coord * (atan(1)/45 )) *
 sin(m_y_coord * (atan(1)/45 )) + cos(u_y_coord * (atan(1)/45 ))*
 cos(m_y_coord * (atan(1)/45 ))*cos(u_x_coord *
 (atan(1)/45) - m_x_coord * (atan(1)/45)) )




 - The Hive built-in functions used are acos, sin, cos and atan.
 - For another try, I used a Hive UDF with Java's math library (Math.acos,
 Math.atan, etc.).
 - All variables used are double.

 I expected the value from the Hadoop UDF (and built-in functions) to be
 identical to that from the plain Java code on Linux. But they are not.
 The built-in function (as well as the UDF) gives 4.49493083698542000 whereas
 the simple Java program running on Linux gives 4.49508857976933000. The Linux
 machine is similar to the Hadoop cluster machines.

 Linux version - Red Hat 5.5
 Java - latest.
 Hive - 0.7.1
 Hadoop - 0.20.2

 This discrepancy is very consistent across thousands of zip-code
 distances. It is not a one-off occurrence. In some cases, I see the
 difference from the 4th decimal place. Some more examples:

 zip1   zip2    Hadoop Built-in function  SAS          R                  Linux + Java
 00602  00617   42.7909525390341          42.79072812  42.79072812185650  42.7907281218564
 00603  00617   40.2404401665518          40.2402289   40.24022889740920  40.2402288974091
 00605  00617   40.1919176128838          40.19186416  40.19186415807060  40.1918641580706
 I have not tested the individual sin, cos, atan function returns. That
 will be my next test. But, at the very least, why is there a difference in
 the values between Hadoop's UDF/built-ins and those from Linux + Java? I am
 assuming that Hive's built-in mathematical functions are nothing but the
 underlying Java functions.

 Thanks,
 PD.




Re: Hive double-precision question

2012-12-07 Thread Mark Grover
Periya:
If you want to see what the built in Hive UDFs are doing, the code is here:
https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic
and
https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf

You can find out which UDF name maps to what class by looking at
https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

If my memory serves me right, there was some interesting stuff Hive does
when mapping Java types to Hive datatypes. I am not sure how relevant it is
to this discussion but I will have to look further to comment more.

In the meanwhile take a look at the UDF code and see if your personal Java
code on Linux is equivalent to the Hive UDF code.

Keep us posted!
Mark





RE: Hive double-precision question

2012-12-07 Thread Lauren Yang
This sounds like https://issues.apache.org/jira/browse/HIVE-2586, where 
comparing floats/doubles will not work because of the way floating-point 
numbers are represented.

Perhaps there is a comparison between a float and a double type because of 
some internal representation in the Java library, or in the UDF.

Ed Capriolo's book has a good section about workarounds and caveats for working 
with floats/doubles in Hive.
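
One quick way to see the representational gap from the Hive CLI (a sketch;
assumes any one-row table, e.g. src):

  -- a FLOAT keeps only ~7 significant digits; a DOUBLE keeps ~15-16
  SELECT CAST(4.49508857976933 AS FLOAT),
         CAST(4.49508857976933 AS DOUBLE)
  FROM src LIMIT 1;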

Thanks,
Lauren


Re: Hive double-precision question

2012-12-07 Thread Periya.Data
Thanks Lauren, Mark Grover and Zhang. I will have to look at the Hive source
code to see what is happening and whether I can make the results consistent...

Interested to see Zhang's patch. I shall watch that Jira.

-PD




Re: Hive double-precision question

2012-12-07 Thread Periya.Data
Hi Mark,
   Thanks for the pointers. I looked at the code, and it looks like my Java
code and the Hive code are similar... (I am a basic-level Java guy). The UDF
below uses Math.sin, which is what I used to test the Linux + Java result.
I have to see what this DoubleWritable and serde2 stuff is all about...

package org.apache.hadoop.hive.ql.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

/**
 * UDFSin.
 *
 */
@Description(name = "sin",
    value = "_FUNC_(x) - returns the sine of x (x is in radians)",
    extended = "Example:\n"
        + "  SELECT _FUNC_(0) FROM src LIMIT 1;\n" + "  0")
public class UDFSin extends UDF {
  private DoubleWritable result = new DoubleWritable();

  public UDFSin() {
  }

  public DoubleWritable evaluate(DoubleWritable a) {
    if (a == null) {
      return null;
    } else {
      result.set(Math.sin(a.get()));
      return result;
    }
  }
}
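
A quick way to run that next test directly from the Hive CLI (a sketch;
assumes any one-row table, e.g. src):

  -- compare against Math.sin(0.5), Math.cos(0.5), Math.atan(1) on the JVM
  SELECT sin(0.5), cos(0.5), atan(1)
  FROM src LIMIT 1;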








Load data in (external table) from symbolic link

2012-12-07 Thread Hadoop Inquirer
Hi,

I am trying to create an external table in Hive by pointing it to a file
that has symbolic links in its path reference.

Hive seems to complain with the following error indicating that it thinks
the symbolic link is a file:

java.io.IOException: Open failed for file: /dir1/dir2/dir3_symlink, error:
Invalid argument (22)
 at com.mapr.fs.MapRClient.open(MapRClient.java:190)
 at com.mapr.fs.MapRFileSystem.open(MapRFileSystem.java:327)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:460)
 at
org.apache.hadoop.mapred.LineRecordReader.&lt;init&gt;(LineRecordReader.java:93)
 at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:54)
 at
org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:237)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:336)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1109)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)

Any help would be appreciated.


PK violation during Hive add partition

2012-12-07 Thread Karlen Lie
Hello,

We are running into intermittent errors while running the query below. Some 
background: the table we're altering (tbl_someTable) is an external table, and 
the query below is run concurrently by multiple Oozie workflows.

ALTER TABLE tbl_someTable ADD IF NOT EXISTS PARTITION(cluster_address = 
'${CLUSTERADDRESS}', upload_date = '${PREVIOUSDATE}' , upload_hour = 
'${PREVIOUSHOUR}')
LOCATION 
'asv://${RAWLOGSCONTAINER}/${CLUSTERADDRESS}/someLog/${PREVIOUSDATE}/${PREVIOUSHOUR}';

The errors we're getting are below.

Is this a known issue and is there a workaround for it?

Thanks
karlen

stderr logs
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use 
org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in 
jar:file:/c:/hdfs/mapred/local/taskTracker/distcache/5662320028645753518_889604055_1925270295/10.175.202.81/user/dssxuser/share/lib/hive/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history 
file=/tmp/dssxuser/hive_job_log_dssxuser_201212070113_1149932084.txt
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Insert of object 
org.apache.hadoop.hive.metastore.model.MPartition@2a4e50f 
using statement INSERT INTO PARTITIONS 
(PART_ID,CREATE_TIME,SD_ID,PART_NAME,LAST_ACCESS_TIME,TBL_ID) VALUES 
(?,?,?,?,?,?) failed : Violation of PRIMARY KEY constraint 
'PK_partitions_PART_ID'. Cannot insert duplicate key in object 
'dbo.PARTITIONS'. The duplicate key value is (221).
NestedThrowables:
com.microsoft.sqlserver.jdbc.SQLServerException: Violation of PRIMARY KEY 
constraint 'PK_partitions_PART_ID'. Cannot insert duplicate key in 
object 'dbo.PARTITIONS'. The duplicate key value is (221).
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask
Intercepting System.exit(9)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], 
exit code [9]


stderr logs
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use 
org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in 
jar:file:/c:/hdfs/mapred/local/taskTracker/distcache/2751940372978647467_889604055_1925270295/10.175.202.81/user/dssxuser/share/lib/hive/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/dssxuser/hive_job_log_dssxuser_201212071515_173032638.txt
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Insert of object 
org.apache.hadoop.hive.metastore.model.MSerDeInfo@31ce40d5 
using statement INSERT INTO SERDES (SERDE_ID,SLIB,NAME) VALUES (?,?,?) 
failed : Violation of PRIMARY KEY constraint 'PK_serdes_SERDE_ID'. 
Cannot insert duplicate key in object 'dbo.SERDES'. The duplicate key 
value is (2006).
NestedThrowables:
com.microsoft.sqlserver.jdbc.SQLServerException: Violation of PRIMARY KEY 
constraint 'PK_serdes_SERDE_ID'. Cannot insert duplicate key in 
object 'dbo.SERDES'. The duplicate key value is (2006).
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask
Intercepting System.exit(9)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], 
exit code [9]