hive partition pruning on joining on partition column

2013-10-11 Thread java8964 java8964
I have a requirement I am trying to support in Hive, and I am not sure if it is doable.
I have Hadoop 1.1.1 with Hive 0.9.0 (using Derby as the metastore).
I partition my data by a dt column, so my table 'foo' has partitions like 'dt=2013-07-01' through 'dt=2013-07-30'.
Now the user wants to query only the data for Saturdays.
To make this flexible, instead of asking the end user to find out which dates in that month are Saturdays, I added a lookup table (call it 'bar') in Hive with the following columns:
year, month, day, dt_format, week_of_day
So I want to see whether I can join foo with bar and still get partition pruning:
select * from foo join bar on (bar.year=2013 and bar.month=7 and bar.day_of_week=6 and bar.dt_format = foo.dt)
I tried several ways, like switching the table order, joining with a subquery, etc., but none of them makes partition pruning work on table foo in this case.
Is this really achievable in Hive?
Thanks
Yong  

RE: Question about how to add the debug info into the hive core jar

2013-03-20 Thread java8964 java8964

I am not sure the existing logging information is enough for me.
The exception trace is as follows:
Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
    at java.util.ArrayList.rangeCheck(ArrayList.java:604)
    at java.util.ArrayList.get(ArrayList.java:382)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
This is Hive 0.9.0, and I looked into the source code of LazySimpleSerDe.java around line 485:
  List<? extends StructField> fields = soi.getAllStructFieldRefs();
  list = soi.getStructFieldsDataAsList(obj);
  if (list == null) {
    out.write(nullSequence.getBytes(), 0, nullSequence.getLength());
  } else {
    for (int i = 0; i < list.size(); i++) {
      if (i > 0) {
        out.write(separator);
      }
      serialize(out, list.get(i),
          fields.get(i).getFieldObjectInspector(),   // <-- line 485
          separators, level + 1, nullSequence, escaped, escapeChar,
          needsEscape);
    }
  }
For this exception to happen, the soi (which is my StructObjectInspector class) must return collections of different lengths for fields and list. But I already added a logger in my StructObjectInspector, which shows that collections of the same length are returned from both getAllStructFieldRefs() and getStructFieldsDataAsList(Object). So I really don't know how this exception could happen in the Hive code.
I have 2 options right now:
1) Change the above code to add more debug information at runtime, to check what content is in the fields object or the list object and understand why their lengths are not the same. But I am having trouble getting my new jar picked up by Hadoop.
2) Enable remote debugging. There are very few examples on the internet about how to enable remote debugging of the Hive server-side MR jobs; some wiki pages claim it is doable, but give no concrete examples.
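For option 1, the change itself would be small. A sketch of the kind of check one might drop in right before the field loop quoted above (illustrative only, not the released Hive code):

  // Illustrative debug patch: surface the two sizes whose mismatch triggers the
  // IndexOutOfBoundsException, plus the ObjectInspector class involved.
  List<? extends StructField> fields = soi.getAllStructFieldRefs();
  list = soi.getStructFieldsDataAsList(obj);
  if (list != null && fields.size() != list.size()) {
    throw new RuntimeException("Field/data length mismatch in " + soi.getClass().getName()
        + ": getAllStructFieldRefs() returned " + fields.size()
        + " refs but getStructFieldsDataAsList() returned " + list.size()
        + " values; fields=" + fields + ", data=" + list);
  }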
Thanks

From: ashet...@hortonworks.com
Subject: Re: Question about how to add the debug info into the hive core jar
Date: Wed, 20 Mar 2013 17:35:36 -0700
To: user@hive.apache.org

Hi Yong, 
Have you tried running the Hive query in debug mode? The Hive log level can be changed by passing the following conf while the hive client is running:
  # hive -hiveconf hive.root.logger=ALL,console -e "DDL statement";
  # hive -hiveconf hive.root.logger=ALL,console -f ddl.sql;
Hope this helps
 Thanks

On Mar 20, 2013, at 1:45 PM, java8964 java8964 java8...@hotmail.com wrote:

Hi,
I have Hadoop running in pseudo-distributed mode on my Linux box. Right now I am facing a problem with Hive: it throws an exception on a table for some data that uses my custom SerDe and InputFormat classes.
To help me trace the root cause, I need to modify the code of org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe to add more debug logging information to understand why the exception happens.
After I modify the Hive code, I can compile it and generate a new hive-serde.jar file, with the same name as the release version, just a different size.
Now I put my new hive-serde.jar under the $HIVE_HOME/lib folder, replacing the old one, and run the query that failed. But after the failure, if I check $HADOOP_HOME/logs/user_logs/, I see that the exception stacktrace still looks like it was generated by the original hive-serde classes. The reason I say that is that the line numbers shown in the log don't match the new code I changed to add the debug information.
My question is: if I have this newly compiled hive-serde.jar file, besides $HIVE_HOME/lib, where else should I put it?
1) This is a pseudo-distributed environment. Everything (namenode, datanode, jobtracker and tasktracker) runs in one box.
2) After I replaced hive-serde.jar with my new jar, I even stopped all the Hadoop Java processes and restarted them.
3) But when I run the query in the Hive session, I still see the log generated by the old hive-serde.jar classes. Why?
Thanks for any help
Yong
  

RE: difference between add jar in hive session and hive --auxpath

2013-03-08 Thread java8964 java8964

This is in Hive 0.9.0.

hive> list jars;
/nfs_home/common/userlibs/google-collections-1.0.jar
/nfs_home/common/userlibs/elephant-bird-hive-3.0.7.jar
/nfs_home/common/userlibs/protobuf-java-2.3.0.jar
/nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar
file:/usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar
hive> desc table;
java.lang.NoClassDefFoundError: com/twitter/elephantbird/mapreduce/io/ProtobufConverter
    at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.initialize(ProtobufDeserializer.java:45)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:203)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:260)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:253)
    at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:490)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:162)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:930)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:844)
    at org.apache.hadoop.hive.ql.exec.DDLTask.describeTable(DDLTask.java:2545)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:309)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1331)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1117)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:950)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:744)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassNotFoundException: com.twitter.elephantbird.mapreduce.io.ProtobufConverter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    ... 25 more
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> exit;
[y130zhan@daca2 userlibs]$ jar tvf /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar | grep ProtobufConverter
  4825 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter.class
   732 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter$1.class

From: vkavul...@outlook.com
To: user@hive.apache.org
Subject: RE: difference between add jar in hive session and hive --auxpath
Date: Thu, 7 Mar 2013 16:44:41 -0800




If properly done, "add jar <jar-file>" should work the same as passing the jar with --auxpath. Can you run the "list jars;" command from the CLI or Hue and check whether you see the jar file?

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: difference between add jar in hive session and hive --auxpath
Date: Thu, 7 Mar 2013 17:47:26 -0500





Hi, 
I have a Hive table which uses the jar files provided by elephant-bird, which is a framework that integrates lzo and Google protobuf data with hadoop/hive.
If I start the hive command like this:
hive --auxpath path_to_jars
it works fine to query my table, but if I use "add jar" after I start the hive session, I get a ClassNotFoundException at query runtime for the classes in those jars.
My questions are:
1) What is the difference between hive --auxpath and "add jar" in the hive session?
2) This problem makes it hard to access my table in HUE, as it only supports "add jar" but not the --auxpath option. Any suggestions?

Thanks
Yong
  

A bug belongs to Hive or Elephant-bird

2013-03-08 Thread java8964 java8964

Hi, 
Hive 0.9.0 + Elephant-Bird 3.0.7
I ran into a problem using elephant-bird with Hive. I know what probably causes the problem, but I don't know which side this bug belongs to. Let me explain what the problem is.
If we define a Google protobuf file with a field name like 'dateString' (the field contains an uppercase 'S'), then when I query the table like this:
select dateString from table;

I will get the following exception trace:
Caused by: java.lang.RuntimeException: cannot find field datestring from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@49aacd5f ...
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
    at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)

Here is the code for the method that throws this error:

  public static StructField getStandardStructFieldRef(String fieldName,
      List<? extends StructField> fields) {
    fieldName = fieldName.toLowerCase();
    for (int i = 0; i < fields.size(); i++) {
      if (fields.get(i).getFieldName().equals(fieldName)) {
        return fields.get(i);
      }
    }
    // For backward compatibility: fieldNames can also be integer Strings.
    try {
      int i = Integer.parseInt(fieldName);
      if (i >= 0 && i < fields.size()) {
        return fields.get(i);
      }
    } catch (NumberFormatException e) {
      // ignore
    }
    throw new RuntimeException("cannot find field " + fieldName + " from " + fields);
    // return null;
  }
I understand the problem happens because, at this point, the fieldName is "datestring" (all lowercase characters), but in the List<? extends StructField> fields the fieldName of that field is "dateString", and that is why the RuntimeException happens.
But I don't know which side this bug belongs to, and I want to know more detail about the Hive implementation contract.
From this link:
https://cwiki.apache.org/Hive/user-faq.html#UserFAQ-AreHiveQLidentifiers%2528e.g.tablenames%252Ccolumnnames%252Cetc%2529casesensitive%253F
I know that in Hive, table names and column names should be case insensitive. So even though my query used "select dateString", the fieldName is changed to "datestring" in the code, while the StructField of the ObjectInspector from elephant-bird returns EXACTLY the field name defined in the proto file, "dateString" in this case. Of course, I can change my proto file to use only lowercase field names to bypass this bug, but my questions are:
1) If I implement my own ObjectInspector, should I pay attention to the field names? Do they need to be lowercase?
2) I would consider this a bug in Hive, right? If this line:
fieldName = fieldName.toLowerCase();
lowercases the lookup key, then the comparison should also lowercase the other side, changing
if (fields.get(i).getFieldName().equals(fieldName))
to
if (fields.get(i).getFieldName().toLowerCase().equals(fieldName))
right?
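For reference, this is what the lookup loop would look like with that change applied (a sketch of the proposed fix, not committed Hive code):

  fieldName = fieldName.toLowerCase();
  for (int i = 0; i < fields.size(); i++) {
    // lowercase both sides so an ObjectInspector that reports mixed-case field
    // names (e.g. "dateString") still matches the lowercased lookup key
    if (fields.get(i).getFieldName().toLowerCase().equals(fieldName)) {
      return fields.get(i);
    }
  }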
Thanks
Yong
  

difference between add jar in hive session and hive --auxpath

2013-03-07 Thread java8964 java8964

Hi, 
I have a Hive table which uses the jar files provided by elephant-bird, which is a framework that integrates lzo and Google protobuf data with hadoop/hive.
If I start the hive command like this:
hive --auxpath path_to_jars
it works fine to query my table, but if I use "add jar" after I start the hive session, I get a ClassNotFoundException at query runtime for the classes in those jars.
My questions are:
1) What is the difference between hive --auxpath and "add jar" in the hive session?
2) This problem makes it hard to access my table in HUE, as it only supports "add jar" but not the --auxpath option. Any suggestions?

Thanks
Yong  

RE: reg : getting table values in inputFormat in serde

2012-12-21 Thread java8964 java8964

Actually I am backing this question up. In addition, I wonder whether we can access the table properties from a UDF too.
I also have XML data, but with namespaces in it. The XPATH UDF that comes with Hive doesn't support namespaces. Supporting namespaces in XML is simple; you just need a NamespaceContext implementation. The tricky part is having a generic solution: we want some properties to be passed to the underlying InputFormat or to an xpath_ns UDF, for example the prefix/URI mappings, or the name of the real class that implements the NamespaceContext interface. But right now it is not easy to pass this kind of information, even if we define it in the table properties.
Right now I have to more or less hard-code it into the UDF or InputFormat, instead of providing a generic solution.
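For what it's worth, the NamespaceContext implementation itself is only a few lines. A minimal sketch (the class name and the way mappings are registered are illustrative assumptions; getting the mappings into the UDF or InputFormat is the open problem described above):

import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

public class NamespaceContextImpl implements NamespaceContext {
    private final Map<String, String> prefixToUri = new HashMap<String, String>();

    // e.g. new NamespaceContextImpl().add("ns1", "http://example.com/ns1")
    public NamespaceContextImpl add(String prefix, String uri) {
        prefixToUri.put(prefix, uri);
        return this;
    }

    public String getNamespaceURI(String prefix) {
        String uri = prefixToUri.get(prefix);
        return uri != null ? uri : XMLConstants.NULL_NS_URI;
    }

    public String getPrefix(String namespaceURI) {
        for (Map.Entry<String, String> e : prefixToUri.entrySet()) {
            if (e.getValue().equals(namespaceURI)) {
                return e.getKey();
            }
        }
        return null;
    }

    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null
            ? Collections.<String>emptyList().iterator()
            : Collections.singletonList(prefix).iterator();
    }
}

A javax.xml.xpath.XPath instance would then pick it up via xpath.setNamespaceContext(...) before evaluating any namespaced expressions.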
Yong

From: mohit_chaudhar...@infosys.com
To: user@hive.apache.org
Subject: RE: reg : getting table values in inputFormat in serde
Date: Fri, 21 Dec 2012 11:39:54 +









Actually I am storing data from an XML file into a Hive table using a SerDe.
Now I want the table properties that I provided during table creation to be available in the InputFormat.
 
 
From: Nitin Pawar [mailto:nitinpawar...@gmail.com]


Sent: Friday, December 21, 2012 3:16 PM

To: user@hive.apache.org

Subject: Re: reg : getting table values in inputFormat in serde
 

What kind of table properties do you want to access? And what do you want to achieve with them in the SerDe?


 

On Fri, Dec 21, 2012 at 11:04 AM, Mohit Chaudhary01 
mohit_chaudhar...@infosys.com wrote:


Hi

In a SerDe program, I need the table properties in the InputFormat file.

So can anyone please tell me how I can do this?
 
 
thanks
















 

-- 

Nitin Pawar

  

xpath UDF in hive support namespace?

2012-12-19 Thread java8964 java8964

Hi, I have a question related to the XPATH UDF currently in Hive.
From the original Jira story about this UDF, https://issues.apache.org/jira/browse/HIVE-1027, it looks like the UDF doesn't support namespaces in the XML. Is that true?
Does any later Hive version support namespaces? If so, which version? And if this UDF does support namespaces, does anyone have examples of how to use namespaces with the xpath UDF in Hive? All the examples on the wiki page are without namespaces.

Thanks
Yong  

RE: xpath UDF in hive support namespace?

2012-12-19 Thread java8964 java8964

Hi, Mark:
Thanks for your response. I am thinking about how to change it to support namespaces. The easy way is to allow the end user to pass in properties with the prefix-to-URI mappings, but in a Hive UDF it is not easy to let end users pass in arbitrary properties the way a SerDe can. Maybe that's the reason it wasn't supported initially.
Yong

 Date: Wed, 19 Dec 2012 14:04:20 -0800
 Subject: Re: xpath UDF in hive support namespace?
 From: grover.markgro...@gmail.com
 To: user@hive.apache.org
 
 Hi, the source code for the xpath UDFs is at
 https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/xml
 
 I checked and saw that there were no changes made to these UDFs after
 their initial creation by HIVE-1027. So, that's all we have. Sorry.
 However, if you would like make additions to those UDFs, you are
 welcome to do so by creating a JIRA and posting a patch. UDFs are an
 easy and excellent way to contribute back to the Hive community.
 
 Thanks!
 
 Mark
 
 On Wed, Dec 19, 2012 at 8:52 AM, java8964 java8964 java8...@hotmail.com 
 wrote:
  Hi, I have a question related to the XPATH UDF currently in HIVE.
 
  From the original Jira story about this UDF:
  https://issues.apache.org/jira/browse/HIVE-1027, It looks like the UDF won't
  support namespace in the XML, is that true?
 
  Any later HIVE version does support namespace, if so, what is the version?
  And also if this UDF does support namespace, does anyone have some examples
  how to do it in xpath UDF of hive? All the examples in the wiki page are
  without name space.
 
 
  Thanks
 
  Yong
  

RE: Array index support non-constant expresssion

2012-12-12 Thread java8964 java8964

OK.
I followed the Hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple.
It works fine as I expected for the simple case, but when I try to run it in a more complex query, the Hive MR jobs fail with some strange errors. What I mean is that it fails inside the Hive code base; from the stack trace, I cannot see that this failure has anything to do with my custom code.
I would appreciate some help if someone can tell me what went wrong.
For example, I created this UDF called darray, standing for dynamic array, which supports a non-constant value as the index location of the array.
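A minimal sketch of what such a GenericUDF can look like, modeled on GenericUDFArrayContains (names and details here are illustrative, not the exact code I wrote):

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

public class GenericUDFDArray extends GenericUDF {
    private ListObjectInspector listOI;
    private PrimitiveObjectInspector indexOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 2) {
            throw new UDFArgumentException("darray(array, index) takes exactly two arguments");
        }
        listOI = (ListObjectInspector) args[0];
        indexOI = (PrimitiveObjectInspector) args[1];
        // the return type is simply the element type of the array argument
        return listOI.getListElementObjectInspector();
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        Object list = args[0].get();
        Object index = args[1].get();
        if (list == null || index == null) {
            return null;
        }
        int i = PrimitiveObjectInspectorUtils.getInt(index, indexOI);
        if (i < 0 || i >= listOI.getListLength(list)) {
            return null;
        }
        return listOI.getListElement(list, i);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "darray(" + children[0] + ", " + children[1] + ")";
    }
}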
The following query works fine as I expected:
hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;
POI
ADDRESS   some address
POI
POI
ADDRESSS  some address
Of course, in this case I only want rows with provider_str = 'POI' returned, filtering out any rows with provider_str != 'POI'. That sounds simple, so I changed the query to the following:
hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%,  reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
I am only adding a WHERE condition, but to my surprise the MR jobs generated by Hive failed. I am testing this on my local standalone cluster, which is running the CDH3U3 release. When I checked the hadoop userlogs, here is what I got:
2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 
SELECT 
struct_col0:bigint,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:boolean,_col7:boolean,_col8:boolean,_col9:boolean,_col10:boolean,_col11:boolean,_col12:string,_col13:string,_col14:structlat:double,lon:double,query_text_raw:string,query_text_normalized:string,query_string:string,llcountry:string,ipcountry:string,request_cnt:int,address:structcountry:string,state:string,zip:string,city:string,street:string,house:string,categories_id:arrayint,categories_name:arraystring,lang_raw:string,lang_rose:string,lang:string,viewport:structtop_lat:double,left_lon:double,bottom_lat:double,right_lon:double,_col15:structversions:int,physical_host:string,nose_request_id:string,client_type:string,ip:int,time_taken:int,user_agent:string,http_host:string,http_referrer:string,http_status:smallint,http_size:int,accept_language:string,md5:string,datacenter:string,tlv_map_data_version:string,tlv_devide_software_version:string,csid:int,rid:string,xncrid:string,cbfn:string,sources:arraystructtm:bigint,tm_date:string,tm_time:string,md5:string,time_taken:int,_col16:arraystructprovider_str:string,name:string,lat:double,lon:double,dyn:boolean,authoritative:boolean,search_center:boolean,_col17:arraystructrank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean,_col18:string,_col19:structrank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean2012-12-12
11:40:22,440 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

RE: Array index support non-constant expresssion

2012-12-12 Thread java8964 java8964

Hi,
I played with my query further, and found it very puzzling to explain the following behaviors:
1) The following query works:
select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a
I get all the results from the above query without any problem.
2) The following query does NOT work:
select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI'
As soon as I add the where criteria on provider_str, or even add another level of subquery like the following:
select ps, name from (select c_poi.provider_str as ps, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a) b where ps = 'POI'
with any kind of criteria I try to add on provider_str, the Hive MR jobs fail with the same error shown below.
Any idea why this happens? Is it related to the data? But provider_str is just a simple String type.
Thanks
Yong
From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Array index support non-constant expresssion
Date: Wed, 12 Dec 2012 12:15:27 -0500





OK. 
I followed the hive source code of 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the 
UDF. It is quite simple. 
It works fine as I expected for simple case, but when I try to run it under 
some complex query, the hive MR jobs failed with some strange errors. What I 
mean is that it failed in HIVE code base, from stuck trace, I can not see this 
failure has anything to do with my custom code.
I would like some help if some one can tell me what went wrong.
For example, I created this UDF called darray, stand for dynamic array, which 
supports the non-constant value as the index location of the array.
The following query works fine as I expected:
hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;
POI
ADDRESS   some address
POI
POI
ADDRESSS  some address
Of course, in this case, I only want the provider_str = 'POI' returned, and 
filter out any rows with provider_str != 'POI', so it sounds simple, I changed 
the query to the following:
hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%,  reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
I am only add a Where limitation, but to my surprise, the MR jobs generated by 
HIVE failed. I am testing this in my local standalone cluster, which is running 
CDH3U3 release. When I check the hadoop userlog, here is what I got:
2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 
SELECT 

RE: Array index support non-constant expresssion

2012-12-12 Thread java8964 java8964
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
   at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
   at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
   at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
   at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
   ... 22 more
2012-12-12 20:36:21,365 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task


 Date: Thu, 13 Dec 2012 09:06:25 +0900
 Subject: Re: Array index support non-constant expresssion
 From: navis@nexr.com
 To: user@hive.apache.org
 
 Could you try it with CP/PPD disabled?
 
 set hive.optimize.cp=false;
 set hive.optimize.ppd=false;
 
 2012/12/13 java8964 java8964 java8...@hotmail.com:
  Hi,
 
  I played my query further, and found out it is very puzzle to explain the
  following behaviors:
 
  1) The following query works:
 
  select c_poi.provider_str, c_poi.name from (select darray(search_results,
  c.rank) as c_poi from nulf_search lateral view explode(search_clicks)
  clickTable as c) a
 
  I get get all the result from the above query without any problem.
 
  2) The following query NOT works:
 
  select c_poi.provider_str, c_poi.name from (select darray(search_results,
  c.rank) as c_poi from nulf_search lateral view explode(search_clicks)
  clickTable as c) a where c_poi.provider_str = 'POI'
 
  As long as I add the where criteria on provider_str, or even I added another
  level of sub query like following:
 
  select
  ps, name
  from
  (select c_poi.provider_str as ps, c_poi.name as name from (select
  darray(search_results, c.rank) as c_poi from nulf_search lateral view
  explode(search_clicks) clickTable as c) a ) b
  where ps = 'POI'
 
  any kind of criteria I tried to add on provider_str, the hive MR jobs failed
  in the same error I shown below.
 
  Any idea why this happened? Is it related to the data? But provider_str is
  just a simple String type.
 
  Thanks
 
  Yong
 
  
  From: java8...@hotmail.com
  To: user@hive.apache.org
  Subject: RE: Array index support non-constant expresssion
  Date: Wed, 12 Dec 2012 12:15:27 -0500
 
 
  OK.
 
  I followed the hive source code of
  org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the
  UDF. It is quite simple.
 
  It works fine as I expected for simple case, but when I try to run it under
  some complex query, the hive MR jobs failed with some strange errors. What I
  mean is that it failed in HIVE code base, from stuck trace, I can not see
  this failure has anything to do with my custom code.
 
  I would like some help if some one can tell me what went wrong.
 
  For example, I created this UDF called darray, stand for dynamic array,
  which supports the non-constant value as the index location of the array.
 
  The following query works fine as I expected:
 
  hive select c_poi.provider_str as provider_str, c_poi.name as name from
  (select darray(search_results, c.index_loc) as c_poi from search_table
  lateral view explode(search_clicks) clickTable as c) a limit 5;
  POI 
  ADDRESS   some address
  POI
  POI
  ADDRESSS some address
 
  Of course, in this case, I only want the provider_str = 'POI' returned, and
  filter out any rows with provider_str != 'POI', so it sounds simple, I
  changed the query to the following:
 
  hive select c_poi.provider_str as provider_str, c_poi.name as name from
  (select darray(search_results, c.rank) as c_poi from search_table lateral
  view explode(search_clicks) clickTable as c) a where c_poi.provider_str =
  'POI' limit 5;
  Total MapReduce jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks is set to 0 since there's no reduce operator
  Cannot run job locally: Input Size (= 178314025) is larger than
  hive.exec.mode.local.auto.inputbytes.max (= 134217728)
  Starting Job = job_201212031001_0100, Tracking URL =
  http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
  Kill Command = /home/yzhang/hadoop/bin/hadoop job
  -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
  2012-12-12 11:45:24,090 Stage-1 map = 0%,  reduce = 0%
  2012-12-12 11:45:43,173 Stage-1 map = 100%,  reduce = 100%
  Ended Job = job_201212031001_0100 with errors
  FAILED: Execution Error, return code 2 from
  org.apache.hadoop.hive.ql.exec.MapRedTask
 
  I am only add a Where limitation, but to my surprise, the MR jobs generated
  by HIVE failed. I am testing this in my local standalone cluster, which is
  running CDH3U3 release. When I check the hadoop userlog, here is what

Array index support non-constant expresssion

2012-12-11 Thread java8964 java8964

Hi, in our project using Hive on the CDH3U4 release (Hive 0.7.1), I have a Hive table like the following:
Table foo (
  search_results array<struct<id:bigint, ..>>,
  search_clicks array<struct<index_loc:int, ..>>
)
As you can see, the 2nd column, which represents the list of search results that were clicked, contains the index location of the result that was clicked (starting from 0). Now I need to find out the click count of some IDs, but these IDs only exist in the search_results struct.
When I tried to write a query for this, my first attempt was the following:

select search_results[c.index_loc] from foo lateral view explode(search_clicks) clickTable as c;

But it won't work in Hive, as the following error message is returned:
FAILED: Error in semantic analysis: Line 1:7 Non-constant expressions for array indexes not supported rank.
It looks like currently Hive (at least 0.7) does NOT support non-constant expressions as array indexes.
I searched on Google, and found the following Hive JIRA ticket:
https://issues.apache.org/jira/browse/HIVE-1955
It looks like someone has made the same request, but it is not supported yet.
But there is a comment in the above ticket that it can be done in a UDF. My question is, can anyone share some ideas about how to achieve this in a UDF, as it may be the only option for me right now?
Thanks

RE: need help on writing hive query

2012-10-31 Thread java8964 java8964

If you don't need to join current_web_page and previous_web_page, assuming you can just trust the timestamp, as Phil points out, a custom collect_list() aggregate function is the way to go.
You need to implement the collect_list() UDAF yourself; Hive doesn't have one by default. But it should be straightforward. In fact, you can reuse the code of collect_set(), replacing the internal set with a Java ArrayList, then:
select user_id, collect_list(user_current_web_page)
from
  (select user_id, user_current_web_page
   order by user_id asc, user_visiting_time asc) a
group by user_id
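A rough sketch of what such a collect_list UDAF could look like, following the collect_set() pattern but backed by an ArrayList (class names and details are illustrative; check them against your Hive version):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

public class GenericUDAFCollectList extends AbstractGenericUDAFResolver {

    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters) throws SemanticException {
        if (parameters.length != 1) {
            throw new UDFArgumentTypeException(parameters.length - 1, "Exactly one argument is expected.");
        }
        return new CollectListEvaluator();
    }

    public static class CollectListEvaluator extends GenericUDAFEvaluator {
        private ObjectInspector inputOI;      // original rows (PARTIAL1/COMPLETE)
        private ListObjectInspector mergeOI;  // partial lists (PARTIAL2/FINAL)

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                inputOI = parameters[0];
                return ObjectInspectorFactory.getStandardListObjectInspector(
                    ObjectInspectorUtils.getStandardObjectInspector(inputOI));
            }
            mergeOI = (ListObjectInspector) parameters[0];
            return ObjectInspectorFactory.getStandardListObjectInspector(
                ObjectInspectorUtils.getStandardObjectInspector(mergeOI.getListElementObjectInspector()));
        }

        static class ListBuffer implements AggregationBuffer {
            List<Object> items = new ArrayList<Object>();
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            return new ListBuffer();
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            ((ListBuffer) agg).items.clear();
        }

        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
            if (parameters[0] != null) {
                // keep insertion order: this is the only real difference from collect_set()
                ((ListBuffer) agg).items.add(
                    ObjectInspectorUtils.copyToStandardObject(parameters[0], inputOI));
            }
        }

        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            return new ArrayList<Object>(((ListBuffer) agg).items);
        }

        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial == null) {
                return;
            }
            for (Object o : mergeOI.getList(partial)) {
                ((ListBuffer) agg).items.add(ObjectInspectorUtils.copyToStandardObject(
                    o, mergeOI.getListElementObjectInspector()));
            }
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            return new ArrayList<Object>(((ListBuffer) agg).items);
        }
    }
}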
Yong
 Subject: Re: need help on writing hive query
 From: matthewt...@gmail.com
 Date: Wed, 31 Oct 2012 17:53:06 -0400
 To: user@hive.apache.org
 
 I did a similar query a few months ago.  In short, I left-padded the page 
 name with the time stamp, grouped with collect_set, and then used 
 sort_array().  There was some other cleanup work and converting back to 
 string to remove the time stamps, but it remained in order.
 
 If there's an easier way, please let me know.
 
 Matt Tucker
 
 On Oct 31, 2012, at 5:37 PM, Tom Brown tombrow...@gmail.com wrote:
 
  It wouldn't retrieve the user's path in a single string, but you could
  simply select the user id and current page, ordered by the timestamp.
  
  It would require a second step to turn it into the single string path,
  so that might be a deal-breaker.
  
  --Tom
  
  On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans
  philip.j.trom...@gmail.com wrote:
  You could use collect_set() and GROUP BY. That wouldn't preserve order
  though.
  
  Phil.
  
  On Oct 31, 2012 9:18 PM, qiaoresearcher qiaoresearc...@gmail.com wrote:
  
  Hi all,
  
  here is the question. Assume we have a table like:
  
  --
  user_id||  user_visiting_time||  user_current_web_page ||
  user_previous_web_page
  user 1 time (1,1)   page 1
  page 0
  user 1 time (1,2)   page 2
  page 1
  user 1 time (1,3 )  page 3
  page 2
  .  ..
  
  user n time (n,1)   page 1
  page 0
  user n time (n,2)   page 2
  page 1
  user n time (n,3)   page 3
  page 2
  
  that is, in each row, we know the current web page that user is viewing,
  and we know the previous web page the user coming from
  
  now we want to generate a list for each user that recorded the complete
  path the user is taking:
  i.e., how can we use hive to generate output like:
  
  
  user 1 :  page 1   page 2 page 3  page 4  .. (till reach the
  beginning page of user 1)
  user 2:   page 1 page 2 page 3  page 4 page 5  ...  ( till reach
  the beginning page of user 2)
  the web pages viewed by user 1 and user 2 might be different.
  
  can we generate this using hive?
  
  thanks,
  

is it possible to disable running termiatePartial and merge() methods in UDAF

2012-10-01 Thread java8964 java8964

Hi,
I am trying to implement a UDAF for Kurtosis (http://en.wikipedia.org/wiki/Kurtosis) in Hive.
I already found a library to do it, from Apache Commons Math (http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/moment/Kurtosis.html). But it does NOT support merging partial results.
I am not a math guru, so I don't know whether Kurtosis can be computed in parallel or not.
My question is simple: if I use the above library to implement this UDAF in Hive, is there a configuration in Hive (either at runtime or at the UDAF level) to make sure that for the UDAF I implemented, terminatePartial()/merge() will NEVER be invoked?
I have 2 settings I think may do the trick:
1) set hive.exec.reducers.max=1
2) set hive.map.aggr=false
Will either one do what I want? Or anything else? Or is it even possible?
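For illustration, the shape I have in mind would be roughly the following, with terminatePartial()/merge() failing loudly because the library cannot combine partial states (a sketch only, assuming the old-style UDAF API and Commons Math 2.x):

import org.apache.commons.math.stat.descriptive.moment.Kurtosis;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class UDAFKurtosis extends UDAF {

    public static class KurtosisEvaluator implements UDAFEvaluator {
        private final Kurtosis kurtosis = new Kurtosis();

        public void init() {
            kurtosis.clear();
        }

        public boolean iterate(Double value) {
            if (value != null) {
                kurtosis.increment(value);
            }
            return true;
        }

        // The library has no way to combine two partial Kurtosis states, so any
        // attempt at partial aggregation is made an explicit failure. This only
        // works if Hive is configured so these methods are never called.
        public Double terminatePartial() {
            throw new UnsupportedOperationException("partial aggregation not supported");
        }

        public boolean merge(Double partial) {
            throw new UnsupportedOperationException("partial aggregation not supported");
        }

        public Double terminate() {
            return kurtosis.getN() > 0 ? kurtosis.getResult() : null;
        }
    }
}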
Thanks
Yong  

RE: How can I get the constant value from the ObjectInspector in the UDF

2012-09-26 Thread java8964 java8964

I understand your message. But in this situation, I want to do the following:
1) I want to get the value 10 at the initialization stage. I understand your point that the value will only be available at the evaluate stage, but keep in mind that the 10 in my example is a constant value. It won't change for every evaluation. It is the kind of value I should be able to get at the initialization stage, right? The Hive query analyzer should understand that this parameter of the function is in fact a constant value, and should be able to provide it to me during the initialization stage.
2) A further question: can I get more information from the ObjectInspector? For example, when I write the UDF, I want to make sure the first parameter is a numeric type. I can get the type, and I am able to validate based on it. But if I want to raise an error in some case, I want to show the end user the NAME of the parameter in my error message, instead of just its position.
For example, in the UDF call msum(column_name, 10), if I find out that the type of column_name is NOT a numeric type, I want the error message I give to the end user to say that 'column_name' should be a numeric type. But right now, in the API, I cannot get this information. The only thing I can get is the category type information, but I want more.
Is it possible to do that in hive 0.7.1?
Thanks for your help.
Yong

Date: Thu, 27 Sep 2012 02:32:19 +0900
Subject: Re: How can I get the constant value from the ObjectInspector in the 
UDF
From: chen.song...@gmail.com
To: user@hive.apache.org

Hi Yong
The way GenericUDF works is as follows.
ObjectInspector initialize(ObjectInspector[] arguments) is called only once for one GenericUDF instance used in your Hive query. This phase is for the preparation steps of the UDF, such as syntax checking and type inference.









Object evaluate(DeferredObject[] arguments) is called to evaluate against 
actual arguments. This should be where the actual calculation happens and where 
you can get the real values you talked about.

Thanks,
Chen
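To make that concrete, a skeletal msum(value, n) along those lines might look like the sketch below: only type checks in initialize(), and the actual values (including the constant window size) read in evaluate(). Names and details are illustrative assumptions, not code from this thread.

import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

public class GenericUDFMovingSum extends GenericUDF {
    private PrimitiveObjectInspector valueOI;
    private PrimitiveObjectInspector windowOI;
    private Deque<Double> window;   // created lazily, once the window size is known
    private int windowSize = -1;
    private double runningSum = 0.0;

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // only type information is reliably available here
        if (args.length != 2
                || args[0].getCategory() != Category.PRIMITIVE
                || args[1].getCategory() != Category.PRIMITIVE) {
            throw new UDFArgumentException("msum(value, n) expects two primitive arguments");
        }
        valueOI = (PrimitiveObjectInspector) args[0];
        windowOI = (PrimitiveObjectInspector) args[1];
        return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        // the constant window size only becomes visible here, on the first row
        if (window == null) {
            windowSize = PrimitiveObjectInspectorUtils.getInt(args[1].get(), windowOI);
            window = new ArrayDeque<Double>(windowSize);
        }
        Object valueObj = args[0].get();
        if (valueObj == null) {
            return null;
        }
        double v = PrimitiveObjectInspectorUtils.getDouble(valueObj, valueOI);
        window.addLast(v);
        runningSum += v;
        if (window.size() > windowSize) {
            runningSum -= window.removeFirst();
        }
        // note: this relies on rows being processed in order within a single task
        return new DoubleWritable(runningSum);
    }

    @Override
    public String getDisplayString(String[] children) {
        return "msum(" + children[0] + ", " + children[1] + ")";
    }
}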

On Wed, Sep 26, 2012 at 4:17 AM, java8964 java8964 java8...@hotmail.com wrote:





Hi, I am using Cloudera release cdh3u3, which has the hive 0.71 version.
I am trying to write a hive UDF function as to calculate the moving sum. Right 
now, I am having trouble to get the constrant value passed in in the 
initialization stage.

For example, let's assume the function is like the following format:
msum(salary, 10) - salary is a int type column
which means the end user wants to calculate the last 10 rows of salary.

I kind of know how to implement this UDF. But I have one problem right now.
1) This is not a UDAF, as each row will return one data back as the moving 
sum.2) I create an UDF class extends from the GenericUDF.
3) I can get the column type from the ObjectInspector[] passed to me in the 
initialize() method to verify that 'salary' and 10 both needs to be numeric 
type (later one needs to be integer)4) But I also want to get the real value of 
10, in this case, in the initialize() stage, so I can create the corresponding 
data structure based on the value end user specified here.
5) I looks around the javadoc of ObjectInspector class. I know at run time the 
real class of the 2nd parameter is WritableIntObjectInspector. I can get the 
type, but how I can get the real value of it?6) This is kind of 
ConstantsObjectInspector, should be able to give the value to me, as it already 
knows the type is int. What how?
7) I don't want to try to get the value at the evaluate stage. Can I get this 
value at the initialize stage?
Thanks
Yong  


-- 
Chen Song



  

How can I get the constant value from the ObjectInspector in the UDF

2012-09-25 Thread java8964 java8964

Hi, I am using the Cloudera release cdh3u3, which has Hive 0.7.1.
I am trying to write a Hive UDF to calculate a moving sum. Right now, I am having trouble getting the constant value passed in at the initialization stage.
For example, let's assume the function has the following format:
msum(salary, 10) - salary is an int type column
which means the end user wants to calculate the moving sum over the last 10 rows of salary.
I kind of know how to implement this UDF. But I have one problem right now.
1) This is not a UDAF, as each row will return one value back as the moving sum.
2) I created a UDF class that extends GenericUDF.
3) I can get the column types from the ObjectInspector[] passed to me in the initialize() method, to verify that 'salary' and 10 both need to be numeric types (the latter needs to be an integer).
4) But I also want to get the real value of 10, in this case, at the initialize() stage, so I can create the corresponding data structure based on the value the end user specified here.
5) I looked around the javadoc of the ObjectInspector class. I know that at runtime the real class of the 2nd parameter is WritableIntObjectInspector. I can get the type, but how can I get the real value of it?
6) This is kind of a constant ObjectInspector; it should be able to give the value to me, as it already knows the type is int. But how?
7) I don't want to try to get the value at the evaluate stage. Can I get this value at the initialize stage?
Thanks
Yong  

Question about org.apache.hadoop.hive.contrib.serde2.RegexSerDe

2012-04-03 Thread java8964 java8964

Hi,
I have a question about the behavior of the class org.apache.hadoop.hive.contrib.serde2.RegexSerDe. Here is an example I tested using the Cloudera hive-0.7.1-cdh3u3 release. The above class did NOT do what I expected; does anyone know the reason?
user:~/tmp> more Test.java
import java.io.*;
import java.text.*;

class Test {
    public static void main(String[] argv) throws Exception {
        String line = "aaa,\"bbb\",\"cc,c\"";
        String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        int i = 1;
        for (String t : tokens) {
            System.out.println(i + " " + t);
            i++;
        }
    }
}

user:~/tmp> java Test
1 aaa
2 "bbb"
3 "cc,c"
As you can see, the Java regular expression ,(?=([^\"]*\"[^\"]*\")*[^\"]*$) did what I want it to do: it parses the string aaa,"bbb","cc,c" into 3 tokens: (aaa), ("bbb"), and ("cc,c"). So the regular expression works fine.
Now in the hive:

:~> more test.txt
aaa,"bbb","cc,c"
:~> hive
Hive history file=/tmp/user/hive_job_log_user_201204031242_591028210.txt
hive> create table test (
        c1 string,
        c2 string,
        c3 string
      ) row format
        SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
        WITH SERDEPROPERTIES ( "input.regex" = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)" )
      STORED AS TEXTFILE;
OK
Time taken: 0.401 seconds
hive> load data local inpath 'test.txt' overwrite into table test;
Copying data from file:/home/user/test.txt
Copying file: file:/home/user/test.txt
Loading data to table dev.test
Deleted hdfs://host/user/hive/warehouse/dev.db/test
OK
Time taken: 0.282 seconds
hive> select * from test;
OK
NULL    NULL    NULL
When I query this table, I don't get what I expected. I expect the output to be the 3 strings, like this:
aaa    bbb   cc,c
Why does the output give me 3 NULLs?
Thanks for your help.