Hive partition pruning when joining on the partition column
I have a requirement I am trying to support in Hive, and I am not sure if it is doable. I have Hadoop 1.1.1 with Hive 0.9.0 (using Derby as the metastore). I partition my data by a dt column, so my table 'foo' has partitions like 'dt=2013-07-01' through 'dt=2013-07-30'. Now the user wants to query only the data of Saturdays. To make it flexible, instead of asking the end user to find out which dates in that month are Saturdays, I added a lookup table (call it 'bar') in Hive with the following columns: year, month, day, dt_format, day_of_week. So I want to see if I can join foo with bar and still get partition pruning:

select * from foo join bar on (bar.year=2013 and bar.month=7 and bar.day_of_week=6 and bar.dt_format = foo.dt)

I tried several ways, like switching the table order, joining with a subquery, etc., but none of them make partition pruning work on table foo in this case. Is this really achievable in Hive? Thanks Yong
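My understanding (worth verifying on your version) is that Hive's partition pruner only prunes on predicates it can evaluate to constants at compile time, so partition values coming from a joined lookup table will not prune foo, no matter how the join is written. A common workaround is a two-step sketch like the following, using the table and column names from the question; driving the literal substitution from a wrapper script is my assumption:

```sql
-- Step 1: resolve the Saturday dates from the small lookup table.
select dt_format from bar
where year = 2013 and month = 7 and day_of_week = 6;

-- Step 2: substitute the returned dates as literals (e.g. from a wrapper
-- script); constant predicates on the partition column do get pruned.
select * from foo
where dt in ('2013-07-06', '2013-07-13', '2013-07-20', '2013-07-27');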
RE: Question about how to add the debug info into the hive core jar
I am not sure the existing logging information is enough for me. The exception trace is as follows:

Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
    at java.util.ArrayList.rangeCheck(ArrayList.java:604)
    at java.util.ArrayList.get(ArrayList.java:382)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)

It is Hive 0.9.0, and I looked into the source code of LazySimpleSerDe.java around line 485:

  List<? extends StructField> fields = soi.getAllStructFieldRefs();
  list = soi.getStructFieldsDataAsList(obj);
  if (list == null) {
    out.write(nullSequence.getBytes(), 0, nullSequence.getLength());
  } else {
    for (int i = 0; i < list.size(); i++) {
      if (i > 0) {
        out.write(separator);
      }
      serialize(out, list.get(i), fields.get(i).getFieldObjectInspector(), // line 485
          separators, level + 1, nullSequence, escaped, escapeChar, needsEscape);
    }
  }

For this exception to happen, the soi (which is my StructObjectInspector class) must return collections of different lengths for fields and list. But I already added a logger in my StructObjectInspector, which proves that collections of the same length are returned from both getAllStructFieldRefs() and getStructFieldsDataAsList(Object). So I really don't know how this exception could happen in the Hive code. I have 2 options right now:
1) Change the above code to add more debug information at runtime, to check what content is in the fields object and the list object and understand why their lengths are not the same. But I have a problem making my new jar get loaded by Hadoop.
2) Enable remote debugging. There are very limited examples on the internet about how to remote-debug the Hive server side MR jobs; some wiki pages claim it is doable, but give no concrete examples.
Thanks

From: ashet...@hortonworks.com
Subject: Re: Question about how to add the debug info into the hive core jar
Date: Wed, 20 Mar 2013 17:35:36 -0700
To: user@hive.apache.org

Hi Yong, Have you tried running the Hive query in debug mode? The Hive log level can be changed by passing the following conf while the hive client is running:

  hive -hiveconf hive.root.logger=ALL,console -e 'DDL statement';
  hive -hiveconf hive.root.logger=ALL,console -f ddl.sql;

Hope this helps. Thanks

On Mar 20, 2013, at 1:45 PM, java8964 java8964 <java8...@hotmail.com> wrote:

Hi, I have Hadoop running in pseudo-distributed mode on my Linux box. Right now I face a problem with Hive, which throws an exception on a table for some data that uses my custom SerDe and InputFormat classes. To help me trace the root cause, I need to modify the code of org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe to add more debug logging and understand why the exception happens. After I modify the Hive code, I can compile it and generate a new hive-serde.jar file with the same name as the release version, just a different size. Now I put my new hive-serde.jar under the $HIVE_HOME/lib folder, replacing the old one, and run the query that failed. But after the failure, when I check $HADOOP_HOME/logs/user_logs/, I see the exception stacktrace still looks like it was generated by the original hive-serde classes: the line numbers shown in the log don't match the new code I changed to add the debug information. My question is: if I have this newly compiled hive-serde.jar file, where should I put it besides $HIVE_HOME/lib?
1) This is a pseudo-distributed environment. Everything (namenode, datanode, jobtracker and tasktracker) is running in one box.
2) After I replaced hive-serde.jar with my new jar, I even stopped all the Hadoop Java processes and restarted them.
3) But when I run the query in the hive session, I still see the log generated by the old hive-serde.jar classes. Why?

Thanks for any help
Yong
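For option 2 (remote debugging), one approach that should work on classic MapReduce is to pass a JDWP agent to the child task JVMs through mapred.child.java.opts. This is a config/command sketch under my assumptions (standard Hadoop 1.x property names; the script name, host and port are placeholders), not a verified recipe. Note that suspend=y makes every task JVM wait for a debugger, so run it against a small input that produces a single map task:

```shell
# Launch the failing query with JDWP enabled in the MR child JVMs:
hive --hiveconf mapred.child.java.opts="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000" \
     -f failing_query.sql

# Attach from another terminal (or from an IDE) to the node running the task:
jdb -attach <tasktracker-host>:8000
```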
RE: difference between add jar in hive session and hive --auxpath
This is in Hive 0.9.0:

hive> list jars;
/nfs_home/common/userlibs/google-collections-1.0.jar
/nfs_home/common/userlibs/elephant-bird-hive-3.0.7.jar
/nfs_home/common/userlibs/protobuf-java-2.3.0.jar
/nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar
file:/usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar
hive> desc table;
java.lang.NoClassDefFoundError: com/twitter/elephantbird/mapreduce/io/ProtobufConverter
    at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.initialize(ProtobufDeserializer.java:45)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:203)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:260)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:253)
    at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:490)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:162)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:930)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:844)
    at org.apache.hadoop.hive.ql.exec.DDLTask.describeTable(DDLTask.java:2545)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:309)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1331)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1117)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:950)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:744)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassNotFoundException: com.twitter.elephantbird.mapreduce.io.ProtobufConverter
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    ... 25 more
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> exit;

[y130zhan@daca2 userlibs]$ jar tvf /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar | grep ProtobufConverter
  4825 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter.class
   732 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter$1.class

From: vkavul...@outlook.com
To: user@hive.apache.org
Subject: RE: difference between add jar in hive session and hive --auxpath
Date: Thu, 7 Mar 2013 16:44:41 -0800

If properly done, "add jar jar-file" should work the same as passing the jar with --auxpath. Can you run the "list jars;" command from the CLI or Hue and check if you see the jar file?

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: difference between add jar in hive session and hive --auxpath
Date: Thu, 7 Mar 2013 17:47:26 -0500

Hi, I have a hive table which uses the jar files provided by elephant-bird, a framework integrating LZO and Google protobuf data with Hadoop/Hive. If I start hive like this: hive --auxpath path_to_jars, it works fine to query my table, but if I use "add jar" after I start the hive session, I get a ClassNotFoundException at runtime for the classes in those jars when I query the table.
My questions are:
1) What is the difference between hive --auxpath and "add jar" in the hive session?
2) This problem makes it hard to access my table in HUE, as it only supports "add jar" but not the --auxpath option. Any suggestions?
Thanks
Yong
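On the difference itself: as far as I understand it (my reading, not verified against the 0.9 source), jars given via --auxpath are put on the CLI JVM's classpath when the hive script starts and are also recorded in hive.aux.jars.path so they ship to the MR jobs, while "add jar" only registers the jar with the already-running session. Code paths that resolve classes outside the session classloader, such as SerDe initialization triggered by "desc table", may then miss them. As a config fragment for comparison (paths taken from the thread above):

```shell
# Style 1: jars visible from JVM startup (CLI, metastore calls, and MR tasks)
hive --auxpath /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar,/nfs_home/common/userlibs/elephant-bird-hive-3.0.7.jar,/nfs_home/common/userlibs/protobuf-java-2.3.0.jar

# Style 2: jar registered after session start; may be too late for some code paths
hive> add jar /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar;
```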
Does this bug belong to Hive or Elephant-Bird?
Hi, Hive 0.9.0 + Elephant-Bird 3.0.7. I faced a problem using elephant-bird with Hive. I know what may cause this problem, but I don't know which side this bug belongs to. Let me explain what the problem is. If we define a google protobuf file with a field name like 'dateString' (the field name contains an uppercase 'S'), then when I query the table like this: select dateString from table, I get the following exception trace:

Caused by: java.lang.RuntimeException: cannot find field datestring from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@49aacd5f ...]
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
    at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)

Here is the code for the method that throws this error:

  public static StructField getStandardStructFieldRef(String fieldName,
      List<? extends StructField> fields) {
    fieldName = fieldName.toLowerCase();
    for (int i = 0; i < fields.size(); i++) {
      if (fields.get(i).getFieldName().equals(fieldName)) {
        return fields.get(i);
      }
    }
    // For backward compatibility: fieldNames can also be integer Strings.
    try {
      int i = Integer.parseInt(fieldName);
      if (i >= 0 && i < fields.size()) {
        return fields.get(i);
      }
    } catch (NumberFormatException e) {
      // ignore
    }
    throw new RuntimeException("cannot find field " + fieldName + " from " + fields);
    // return null;
  }

I understand the problem happens because at this point the fieldName is datestring (all lowercase characters), but the List<? extends StructField> fields contains the fieldName dateString for that field, and that is why the RuntimeException happens. But I don't know which side this bug belongs to, and I want to know more internal detail about the Hive implementation contract. From this link: https://cwiki.apache.org/Hive/user-faq.html#UserFAQ-AreHiveQLidentifiers%2528e.g.tablenames%252Ccolumnnames%252Cetc%2529casesensitive%253F I know that in Hive, table names and column names should be case insensitive. So even though in my query I used select dateString, the fieldName is changed to datestring in the code, but the StructField of the ObjectInspector from elephant-bird returns the EXACT field name defined in the code, dateString in this case. Of course, I can change my protobuf file to use only lowercase field names to bypass this bug, but my questions are:
1) If I implement my own ObjectInspector, should I pay attention to the field name? Does it need to be lowercase?
2) I would consider this a bug in Hive, right?
If this line: fieldName = fieldName.toLowerCase(); lowercases the incoming name, then the comparison should also lowercase the field name it compares against, by changing

  if (fields.get(i).getFieldName().equals(fieldName)) {

to

  if (fields.get(i).getFieldName().toLowerCase().equals(fieldName)) {

right? Thanks Yong
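To illustrate the proposed fix in isolation, here is a minimal, self-contained sketch (class and method names are mine, not Hive's) of the case-insensitive lookup: both sides of the comparison are lowercased, so a field the ObjectInspector reports as "dateString" is found by Hive's lowercased identifier "datestring":

```java
import java.util.Arrays;
import java.util.List;

public class FieldLookup {

    // Sketch of the proposed fix: lowercase both sides of the comparison.
    // Returns the index of the matching field name, or -1 if not found.
    public static int findField(String fieldName, List<String> fieldNames) {
        String needle = fieldName.toLowerCase();
        for (int i = 0; i < fieldNames.size(); i++) {
            if (fieldNames.get(i).toLowerCase().equals(needle)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The ObjectInspector reports the exact protobuf name "dateString",
        // while Hive passes the lowercased identifier "datestring".
        List<String> fields = Arrays.asList("id", "dateString");
        System.out.println(findField("datestring", fields)); // prints: 1
    }
}
```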
RE: reg : getting table values in inputFormat in serde
Actually I am backing up this question. In addition to that, I wonder if it is possible to access the table properties from a UDF too. I also have XML data, but with namespaces in it. The XPath UDF that comes with Hive doesn't support namespaces. Supporting namespaces in XML is simple; it just needs a NamespaceContext implementation. But the tricky part is having a generic solution: we want some properties to be passed to the underlying InputFormat or to an xpath_ns UDF, for example the prefix/URI mapping, or the real class name that implements the NamespaceContext interface. But right now it is not easy to pass this kind of information, even if we define it in the table properties. So right now I have to more or less hard-code it into the UDF or InputFormat, instead of providing a generic solution. Yong

From: mohit_chaudhar...@infosys.com
To: user@hive.apache.org
Subject: RE: reg : getting table values in inputFormat in serde
Date: Fri, 21 Dec 2012 11:39:54 +0000

Actually I am storing data from an xml file into a hive table using a serde. Now, in the inputformat, I want the table properties that I provided during table creation.

From: Nitin Pawar [mailto:nitinpawar...@gmail.com]
Sent: Friday, December 21, 2012 3:16 PM
To: user@hive.apache.org
Subject: Re: reg : getting table values in inputFormat in serde

What kind of table properties do you want to access? And what do you want to achieve from them in the serde?

On Fri, Dec 21, 2012 at 11:04 AM, Mohit Chaudhary01 <mohit_chaudhar...@infosys.com> wrote:

Hi, In a serde program I need the table properties in the inputformat file. So can anyone please tell me how I can do this? thanks

-- Nitin Pawar
xpath UDF in hive support namespace?
Hi, I have a question about the XPath UDF currently in Hive. From the original JIRA story about this UDF, https://issues.apache.org/jira/browse/HIVE-1027, it looks like the UDF doesn't support namespaces in the XML. Is that true? Does any later Hive version support namespaces, and if so, which version? Also, if this UDF does support namespaces, does anyone have examples of how to use them with the xpath UDF of Hive? All the examples on the wiki page are without namespaces. Thanks Yong
RE: xpath UDF in hive support namespace?
Hi Mark: Thanks for your response. I am thinking about how to change it to support namespaces. The easy way is to allow the end user to pass in a mapping of prefixes to URIs, but with a Hive UDF it is not easy to let end users pass in arbitrary properties the way a SerDe can. Maybe that's the reason why it wasn't supported initially. Yong

Date: Wed, 19 Dec 2012 14:04:20 -0800
Subject: Re: xpath UDF in hive support namespace?
From: grover.markgro...@gmail.com
To: user@hive.apache.org

Hi, the source code for the xpath UDFs is at https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/xml I checked and saw that there were no changes made to these UDFs after their initial creation in HIVE-1027. So that's all we have, sorry. However, if you would like to make additions to those UDFs, you are welcome to do so by creating a JIRA and posting a patch. UDFs are an easy and excellent way to contribute back to the Hive community. Thanks! Mark

On Wed, Dec 19, 2012 at 8:52 AM, java8964 java8964 <java8...@hotmail.com> wrote:

Hi, I have a question about the XPath UDF currently in Hive. From the original JIRA story about this UDF, https://issues.apache.org/jira/browse/HIVE-1027, it looks like the UDF doesn't support namespaces in the XML. Is that true? Does any later Hive version support namespaces, and if so, which version? Also, if this UDF does support namespaces, does anyone have examples of how to use them with the xpath UDF of Hive? All the examples on the wiki page are without namespaces. Thanks Yong
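For reference, the NamespaceContext part really is small. Below is a self-contained sketch using only the JDK's javax.xml.xpath API (not Hive's UDF code); the prefix "ex", the URI, and the sample XML are made up for illustration. A namespace-aware xpath_ns UDF would wire a mapping like this into the evaluation:

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class NsXPath {

    // Minimal NamespaceContext binding the prefix "ex" to a single URI.
    static final class SimpleNsContext implements NamespaceContext {
        private final String uri;
        SimpleNsContext(String uri) { this.uri = uri; }
        public String getNamespaceURI(String prefix) {
            return "ex".equals(prefix) ? uri : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String namespaceURI) { return null; }
        public Iterator<String> getPrefixes(String namespaceURI) {
            return Collections.emptyIterator();
        }
    }

    // Evaluate an XPath expression that uses the "ex" prefix against the XML.
    public static String evalWithNs(String xml, String nsUri, String expr)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new SimpleNsContext(nsUri));
        return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root xmlns=\"http://example.com/ns\"><name>Yong</name></root>";
        System.out.println(
            evalWithNs(xml, "http://example.com/ns", "/ex:root/ex:name/text()"));
        // prints: Yong
    }
}
```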
RE: Array index support non-constant expression
OK. I followed the Hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple. It works fine as I expected for simple cases, but when I try to run it in a more complex query, the Hive MR jobs fail with some strange errors. What I mean is that it fails inside the Hive code base; from the stack trace, I cannot see that this failure has anything to do with my custom code. I would like some help if someone can tell me what went wrong. For example, I created this UDF called darray, standing for dynamic array, which supports a non-constant value as the index location of the array. The following query works fine as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;
POI ADDRESS    some address
POI
POI ADDRESSS   some address

Of course, in this case I only want the rows with provider_str = 'POI' returned, and want to filter out any rows with provider_str != 'POI'. That sounds simple, so I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED:
Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I only added a WHERE limitation, but to my surprise the MR jobs generated by Hive failed. I am testing this on my local standalone cluster, which is running the CDH3U3 release. When I check the hadoop userlogs, here is what I got:

2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct_col0:bigint,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:boolean,_col7:boolean,_col8:boolean,_col9:boolean,_col10:boolean,_col11:boolean,_col12:string,_col13:string,_col14:structlat:double,lon:double,query_text_raw:string,query_text_normalized:string,query_string:string,llcountry:string,ipcountry:string,request_cnt:int,address:structcountry:string,state:string,zip:string,city:string,street:string,house:string,categories_id:arrayint,categories_name:arraystring,lang_raw:string,lang_rose:string,lang:string,viewport:structtop_lat:double,left_lon:double,bottom_lat:double,right_lon:double,_col15:structversions:int,physical_host:string,nose_request_id:string,client_type:string,ip:int,time_taken:int,user_agent:string,http_host:string,http_referrer:string,http_status:smallint,http_size:int,accept_language:string,md5:string,datacenter:string,tlv_map_data_version:string,tlv_devide_software_version:string,csid:int,rid:string,xncrid:string,cbfn:string,sources:arraystructtm:bigint,tm_date:string,tm_time:string,md5:string,time_taken:int,_col16:arraystructprovider_str:string,name:string,lat:double,lon:double,dyn:boolean,authoritative:boolean,search_center:boolean,_col17:arraystructrank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean,_col18:string,_col19:structrank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean
2012-12-12 11:40:22,440 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
RE: Array index support non-constant expression
Hi, I played with my query further, and found it very puzzling to explain the following behaviors:

1) The following query works:

select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a

I get all the results from the above query without any problem.

2) The following query does NOT work:

select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI'

As soon as I add the where criteria on provider_str, or even if I add another level of subquery like the following:

select ps, name
from (select c_poi.provider_str as ps, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a) b
where ps = 'POI'

with any kind of criteria I try to add on provider_str, the Hive MR jobs fail with the same error shown below. Any idea why this happens? Is it related to the data? But provider_str is just a simple String type. Thanks Yong

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Array index support non-constant expression
Date: Wed, 12 Dec 2012 12:15:27 -0500

OK. I followed the Hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple. It works fine as I expected for simple cases, but when I try to run it in a more complex query, the Hive MR jobs fail with some strange errors. What I mean is that it fails inside the Hive code base; from the stack trace, I cannot see that this failure has anything to do with my custom code. I would like some help if someone can tell me what went wrong. For example, I created this UDF called darray, standing for dynamic array, which supports a non-constant value as the index location of the array.
The following query works fine as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;
POI ADDRESS    some address
POI
POI ADDRESSS   some address

Of course, in this case I only want the rows with provider_str = 'POI' returned, and want to filter out any rows with provider_str != 'POI'. That sounds simple, so I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I only added a WHERE limitation, but to my surprise the MR jobs generated by Hive failed. I am testing this on my local standalone cluster, which is running the CDH3U3 release. When I check the hadoop userlog, here is what I got:

2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT
RE: Array index support non-constant expression
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
    ... 22 more
2012-12-12 20:36:21,365 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

Date: Thu, 13 Dec 2012 09:06:25 +0900
Subject: Re: Array index support non-constant expression
From: navis@nexr.com
To: user@hive.apache.org

Could you try it with CP/PPD disabled?

set hive.optimize.cp=false;
set hive.optimize.ppd=false;

2012/12/13 java8964 java8964 <java8...@hotmail.com>:

Hi, I played with my query further, and found it very puzzling to explain the following behaviors:

1) The following query works:

select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a

I get all the results from the above query without any problem.
2) The following query does NOT work:

select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI'

As soon as I add the where criteria on provider_str, or even if I add another level of subquery like the following:

select ps, name
from (select c_poi.provider_str as ps, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a) b
where ps = 'POI'

with any kind of criteria I try to add on provider_str, the Hive MR jobs fail with the same error shown below. Any idea why this happens? Is it related to the data? But provider_str is just a simple String type. Thanks Yong

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Array index support non-constant expression
Date: Wed, 12 Dec 2012 12:15:27 -0500

OK. I followed the Hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple. It works fine as I expected for simple cases, but when I try to run it in a more complex query, the Hive MR jobs fail with some strange errors. What I mean is that it fails inside the Hive code base; from the stack trace, I cannot see that this failure has anything to do with my custom code. I would like some help if someone can tell me what went wrong. For example, I created this UDF called darray, standing for dynamic array, which supports a non-constant value as the index location of the array.
The following query works fine as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;
POI ADDRESS    some address
POI
POI ADDRESSS   some address

Of course, in this case I only want the rows with provider_str = 'POI' returned, and want to filter out any rows with provider_str != 'POI'. That sounds simple, so I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I only added a WHERE limitation, but to my surprise the MR jobs generated by Hive failed. I am testing this on my local standalone cluster, which is running the CDH3U3 release. When I check the hadoop userlog, here is what
Array index support non-constant expression
Hi, In our project using Hive on the CDH3U4 release (Hive 0.7.1), I have a hive table like the following:

Table foo (
    search_results array<struct<id:bigint, ...>>,
    search_clicks array<struct<index_loc:int, ...>>
)

As you can see, the 2nd column, which represents the list of search results clicked, contains the index location of which result was clicked (starting from 0). Now I need to find out the click count of some IDs, but these IDs only exist in the search_results struct. When I tried to write a query for this, I first tried the following:

select search_results[c.index_loc] from foo lateral view explode(search_clicks) clickTable as c;

But it won't work in Hive, as the following error message is returned:

FAILED: Error in semantic analysis: Line 1:7 Non-constant expressions for array indexes not supported rank

It looks like Hive (at least 0.7) does NOT support non-constant expressions as array indexes. I searched on Google and found the following Hive JIRA ticket: https://issues.apache.org/jira/browse/HIVE-1955 It looks like someone had the same request to support it, but it is not done yet. However, there is a comment in the above ticket that it can be done in a UDF. My question is: can anyone share some ideas about how to achieve this in a UDF, as it may be the only option for me right now? Thanks
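The core of such a UDF is just bounds-checked indexing with a runtime index. Below is a self-contained sketch of that logic only (plain Java, without the Hive GenericUDF plumbing of ObjectInspectors; the class and method names are mine). Returning null on an out-of-range index mirrors what one would want instead of an exception:

```java
import java.util.Arrays;
import java.util.List;

public class DArray {

    // Core logic of a hypothetical "darray" UDF: index into an array with a
    // runtime (non-constant) index, returning null for nulls or an index
    // that is out of bounds instead of throwing.
    public static <T> T darray(List<T> arr, Integer idx) {
        if (arr == null || idx == null || idx < 0 || idx >= arr.size()) {
            return null;
        }
        return arr.get(idx);
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("POI", "ADDRESS", "POI");
        System.out.println(darray(results, 1)); // prints: ADDRESS
        System.out.println(darray(results, 5)); // prints: null
    }
}
```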
RE: need help on writing hive query
If you don't need to join current_web_page and previous_web_page, assuming you can just trust the time stamp, then as Phil points out, a custom collect_list() UDAF is the way to go. You need to implement the collect_list() UDAF yourself; hive doesn't have one by default. But it should be straightforward. In fact, you can reuse the code of collect_set() and replace the internal set with a Java ArrayList, then:

select user_id, collect_list(user_current_web_page)
from
  (select user_id, user_current_web_page
   order by user_id asc, user_visiting_time asc) a
group by user_id

Yong

Subject: Re: need help on writing hive query From: matthewt...@gmail.com Date: Wed, 31 Oct 2012 17:53:06 -0400 To: user@hive.apache.org

I did a similar query a few months ago. In short, I left-padded the page name with the time stamp, grouped with collect_set, and then used sort_array(). There was some other cleanup work and converting back to string to remove the time stamps, but it remained in order. If there's an easier way, please let me know. Matt Tucker

On Oct 31, 2012, at 5:37 PM, Tom Brown tombrow...@gmail.com wrote: It wouldn't retrieve the user's path in a single string, but you could simply select the user id and current page, ordered by the timestamp. It would require a second step to turn it into the single string path, so that might be a deal-breaker. --Tom

On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans philip.j.trom...@gmail.com wrote: You could use collect_set() and GROUP BY. That wouldn't preserve order though. Phil.

On Oct 31, 2012 9:18 PM, qiaoresearcher qiaoresearc...@gmail.com wrote: Hi all, here is the question. Assume we have a table like:

user_id || user_visiting_time || user_current_web_page || user_previous_web_page
user 1  || time (1,1)         || page 1                || page 0
user 1  || time (1,2)         || page 2                || page 1
user 1  || time (1,3)         || page 3                || page 2
...
user n  || time (n,1)         || page 1                || page 0
user n  || time (n,2)         || page 2                || page 1
user n  || time (n,3)         || page 3                || page 2

That is, in each row we know the current web page the user is viewing, and we know the previous web page the user came from. Now we want to generate a list for each user that records the complete path the user took, i.e., how can we use hive to generate output like:

user 1: page 1 page 2 page 3 page 4 ... (till we reach the beginning page of user 1)
user 2: page 1 page 2 page 3 page 4 page 5 ... (till we reach the beginning page of user 2)

The web pages viewed by user 1 and user 2 might be different. Can we generate this using hive? thanks,
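The ArrayList-based buffer described above can be sketched in plain Java. This is not Hive's UDAF API — the real thing would live inside a UDAF evaluator, and the iterate/terminate names below only echo that convention — but it shows why swapping collect_set's internal Set for a List preserves the visit order:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an order-preserving collect_list buffer: unlike collect_set's
// internal Set, an ArrayList keeps duplicates and preserves insertion order.
public class CollectListSketch {
    private final List<String> buffer = new ArrayList<>();

    // Analogous to the UDAF's iterate() step: rows arrive already sorted
    // by (user_id, user_visiting_time) thanks to the inner ORDER BY.
    public void iterate(String page) {
        if (page != null) {
            buffer.add(page);
        }
    }

    // Analogous to terminate(): emit the full click path for one user.
    public List<String> terminate() {
        return buffer;
    }

    public static void main(String[] args) {
        CollectListSketch agg = new CollectListSketch();
        agg.iterate("page 1");
        agg.iterate("page 2");
        agg.iterate("page 3");
        System.out.println(String.join(" ", agg.terminate())); // page 1 page 2 page 3
    }
}
```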
is it possible to disable running terminatePartial and merge() methods in UDAF
Hi, I am trying to implement a UDAF for Kurtosis (http://en.wikipedia.org/wiki/Kurtosis) in hive. I already found a library to do it, from Apache Commons Math (http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/moment/Kurtosis.html). But it does NOT support merging partial results. I am not a math guru, so I don't know if Kurtosis can be computed in parallel or not. My question is simple: if I use the above library to implement this UDAF in hive, is there a configuration in hive (either at runtime or at the UDAF level) to make sure that for the UDAF I implemented, terminatePartial/merge will NEVER be invoked? I have 2 settings I think may do the trick:

1) set hive.exec.reducers.max=1
2) set hive.map.aggr=false

Does either one do what I want? Or anything else? Or is it even possible? Thanks Yong
RE: How can I get the constant value from the ObjectInspector in the UDF
I understand your message, but in this situation I want to do the following:

1) I want to get the value 10 in the initialization stage. I understand your point that values are only available in the evaluate stage, but keep in mind that the 10 in my example is a constant. It won't change for every evaluation. It is the kind of value I should be able to get in the initialization stage, right? The hive query analyzer should understand that this parameter to the function is in fact a constant, and should be able to provide it to me during the initialization stage.

2) A further question: can I get more information from the object inspector? For example, when I write the UDF, I want to make sure the first parameter is a numeric type. I can get the type, and I can validate based on it. But if I want to raise an error in some case, I want to show the end user the NAME of the parameter in my error message, instead of just its position. For example, with the UDF invoked as msum(column_name, 10), if I find that the type of column_name is NOT numeric, I want my error message to tell the end user that 'column_name' should be a numeric type. But right now I cannot get this information from the API. The only thing I can get is the category type information, but I want more. Is it possible to do that in hive 0.7.1? Thanks for your help. Yong

Date: Thu, 27 Sep 2012 02:32:19 +0900 Subject: Re: How can I get the constant value from the ObjectInspector in the UDF From: chen.song...@gmail.com To: user@hive.apache.org

Hi Yong, The way GenericUDF works is as follows. ObjectInspector initialize(ObjectInspector[] arguments) is called only once per GenericUDF instance used in your Hive query. This phase is for the preparation steps of the UDF, such as syntax checking and type inference. Object evaluate(DeferredObject[] arguments) is called to evaluate against the actual arguments.
This should be where the actual calculation happens and where you can get the real values you talked about. Thanks, Chen

On Wed, Sep 26, 2012 at 4:17 AM, java8964 java8964 java8...@hotmail.com wrote: Hi, I am using the Cloudera release cdh3u3, which has hive 0.7.1. I am trying to write a hive UDF to calculate a moving sum. Right now, I am having trouble getting the constant value passed in during the initialization stage. For example, let's assume the function has the following form: msum(salary, 10) - salary is an int type column, and the end user wants to sum over the last 10 rows of salary. I mostly know how to implement this UDF, but I have one problem right now. 1) This is not a UDAF, as each row returns one value back as the moving sum. 2) I create a UDF class extending GenericUDF. 3) I can get the column types from the ObjectInspector[] passed to me in the initialize() method, to verify that 'salary' and 10 both need to be numeric types (the latter needs to be an integer). 4) But I also want to get the real value of 10 in the initialize() stage, so I can create the corresponding data structure based on the value the end user specified. 5) I looked around the javadoc of the ObjectInspector class. I know that at run time the real class of the 2nd parameter is WritableIntObjectInspector. I can get the type, but how can I get the real value from it? 6) This is a kind of constant ObjectInspector; it should be able to give the value to me, as it already knows the type is int. But how? 7) I don't want to get the value at the evaluate stage. Can I get this value at the initialize stage? Thanks Yong -- Chen Song
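Later Hive versions expose a ConstantObjectInspector (in org.apache.hadoop.hive.serde2.objectinspector) with a getWritableConstantValue() method exactly for this case: when an argument is a literal, initialize() can test for that interface and read the value up front. Whether a given 0.7.x build has it needs checking. The pattern, modeled here with stand-in interfaces rather than the real Hive classes:

```java
// Stand-in interfaces modeling Hive's inspector hierarchy; the real types
// live in org.apache.hadoop.hive.serde2.objectinspector.
interface ObjectInspector { String getTypeName(); }

interface ConstantObjectInspector extends ObjectInspector {
    Object getWritableConstantValue();
}

public class ConstantOISketch {
    // Pattern for initialize(): if the inspector of the second argument is
    // a constant inspector, the literal's value is available before any
    // evaluate() call, so sizing a ring buffer for msum(salary, 10) works.
    static int windowSize(ObjectInspector[] arguments) {
        if (arguments[1] instanceof ConstantObjectInspector) {
            Object v = ((ConstantObjectInspector) arguments[1]).getWritableConstantValue();
            return ((Number) v).intValue();
        }
        throw new IllegalArgumentException(
            "second argument to msum() must be a constant integer");
    }

    public static void main(String[] args) {
        ObjectInspector salary = () -> "int"; // plain (non-constant) inspector
        ConstantObjectInspector ten = new ConstantObjectInspector() {
            public String getTypeName() { return "int"; }
            public Object getWritableConstantValue() { return 10; }
        };
        System.out.println(windowSize(new ObjectInspector[]{salary, ten})); // 10
    }
}
```

In real Hive code the constant value arrives as a Writable (e.g. IntWritable), so the unwrapping step would call get() on it rather than casting to Number.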
How can I get the constant value from the ObjectInspector in the UDF
Hi, I am using the Cloudera release cdh3u3, which has hive 0.7.1. I am trying to write a hive UDF to calculate a moving sum. Right now, I am having trouble getting the constant value passed in during the initialization stage. For example, let's assume the function has the following form: msum(salary, 10) - salary is an int type column, and the end user wants to sum over the last 10 rows of salary. I mostly know how to implement this UDF, but I have one problem right now. 1) This is not a UDAF, as each row returns one value back as the moving sum. 2) I create a UDF class extending GenericUDF. 3) I can get the column types from the ObjectInspector[] passed to me in the initialize() method, to verify that 'salary' and 10 both need to be numeric types (the latter needs to be an integer). 4) But I also want to get the real value of 10 in the initialize() stage, so I can create the corresponding data structure based on the value the end user specified. 5) I looked around the javadoc of the ObjectInspector class. I know that at run time the real class of the 2nd parameter is WritableIntObjectInspector. I can get the type, but how can I get the real value from it? 6) This is a kind of constant ObjectInspector; it should be able to give the value to me, as it already knows the type is int. But how? 7) I don't want to get the value at the evaluate stage. Can I get this value at the initialize stage? Thanks Yong
Question about org.apache.hadoop.hive.contrib.serde2.RegexSerDe
Hi, I have a question about the behavior of the class org.apache.hadoop.hive.contrib.serde2.RegexSerDe. Here is an example I tested using the Cloudera hive-0.7.1-cdh3u3 release. The above class did NOT do what I expected; does anyone know the reason?

user:~/tmp> more Test.java
import java.io.*;
import java.text.*;

class Test {
  public static void main(String[] argv) throws Exception {
    String line = "aaa,\"bbb\",\"cc,c\"";
    String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
    int i = 1;
    for (String t : tokens) {
      System.out.println(i + " " + t);
      i++;
    }
  }
}

user:~/tmp> java Test
1 aaa
2 "bbb"
3 "cc,c"

As you can see, the Java regular expression ,(?=([^"]*"[^"]*")*[^"]*$) did what I want it to do: it parses the string aaa,"bbb","cc,c" into 3 tokens: (aaa), ("bbb"), and ("cc,c"). So the regular expression works fine. Now in hive:

user:~> more test.txt
aaa,"bbb","cc,c"
user:~> hive
Hive history file=/tmp/user/hive_job_log_user_201204031242_591028210.txt
hive> create table test (c1 string, c2 string, c3 string)
      row format SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES ("input.regex" = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
      STORED AS TEXTFILE;
OK
Time taken: 0.401 seconds
hive> load data local inpath 'test.txt' overwrite into table test;
Copying data from file:/home/user/test.txt
Copying file: file:/home/user/test.txt
Loading data to table dev.test
Deleted hdfs://host/user/hive/warehouse/dev.db/test
OK
Time taken: 0.282 seconds
hive> select * from test;
OK
NULL NULL NULL

When I query this table, I don't get what I expected. I expected the output to be the 3 strings: aaa, "bbb", "cc,c". Why does the output give me 3 NULLs? Thanks for your help.
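One detail worth knowing about RegexSerDe here: input.regex is not used as a split delimiter. RegexSerDe matches the WHOLE row against the regex and maps each capturing group to a column; when the row does not match, every column comes back NULL. A split-style lookahead regex matches only at the commas and has no groups covering the column values, which would explain the three NULLs. The difference, shown in plain Java (the matching regex below is an illustrative one for this 3-column row, not taken from the thread):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSerDeDemo {
    public static void main(String[] args) {
        String row = "aaa,\"bbb\",\"cc,c\"";

        // Split-style regex (what the question used): works with
        // String.split(), but captures nothing useful about the columns.
        String[] tokens = row.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        System.out.println(tokens.length); // 3

        // RegexSerDe-style regex: must match the ENTIRE row, with one
        // capturing group per column (groups 1..3 become c1..c3).
        Pattern p = Pattern.compile("([^,]*),\"([^\"]*)\",\"(.*)\"");
        Matcher m = p.matcher(row);
        if (m.matches()) {
            System.out.println(m.group(1)); // aaa
            System.out.println(m.group(2)); // bbb
            System.out.println(m.group(3)); // cc,c
        }
    }
}
```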