RE: Hive Join returns incorrect results on bigint=string
Based on this wiki page: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-TypeSystem the string will be implicitly converted to double, since double is the only common ancestor of bigint and string. So the result is unpredictable once you are comparing doubles.

Yong

Date: Mon, 6 Oct 2014 14:20:57 -0700
Subject: Hive Join returns incorrect results on bigint=string
From: a...@rocketfuelinc.com
To: user@hive.apache.org

Recently, by mistake, I encountered a situation where I ended up doing a join key comparison between a string and a bigint. The returned results are incorrect even though the strings have exactly the same integer values as the bigint values. When I do a join on bigint = cast(string as bigint), the results are correct. Is this the expected behavior, or is Hive supposed to do an automatic cast and compare as strings?

--
Thanks and Regards,
Ashu Pachauri
Rocket Scientist, Rocket Fuel Inc.
1-650-200-5390
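Yong's explanation can be demonstrated outside Hive: once both sides of bigint = string are promoted to double, distinct 64-bit keys above 2^53 can collide. A minimal sketch in plain Python, standing in for Hive's implicit double conversion:

```python
# Hive promotes both sides of "bigint = string" to double; doubles have a
# 53-bit mantissa, so bigint values above 2**53 lose precision on conversion.
a = 9007199254740993            # 2**53 + 1, a valid bigint
b = "9007199254740992"          # a DIFFERENT value, arriving as a string
print(a == int(b))              # False: the keys really are different
print(float(a) == float(b))     # True: after the double cast they collide
```

This is why joining on bigint = cast(string as bigint) fixes the results: the comparison then stays in 64-bit integer space.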
How to put a third-party jar first in the classpath of a Hive UDF
Hi,

Currently our production is using Hive 0.9.0. There is already a complex Hive query running on Hadoop daily, generating an output of millions of records. What I want to do is transfer this result to Cassandra. I tried to do it in a UDF, as then I can send the data at the reducer level, to maximize the transfer speed. Everything worked, until I tested in the cluster. We are using the Netflix/Astyanax Cassandra client driver to write the data, which requires the Google concurrency library Guava 14. I found out that Hive 0.9.0 already includes Guava 9, which leads to the following exception in my Hive CLI:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/ibm/biginsights/hive/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/ibm/biginsights/IHC/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoSuchMethodError: com/google/common/util/concurrent/MoreExecutors.listeningDecorator(Ljava/util/concurrent/ExecutorService;)Lcom/google/common/util/concurrent/ListeningExecutorService;
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.init(ThriftKeyspaceImpl.java:114)
    at com.netflix.astyanax.thrift.ThriftFamilyFactory.createKeyspace(ThriftFamilyFactory.java:41)
    at com.netflix.astyanax.AstyanaxContext$Builder.buildKeyspace(AstyanaxContext.java:146)

Now I know what the problem is, but I cannot find a good solution. In Hadoop, I can use mapreduce.job.user.classpath.first to have my version of Guava picked up first, but setting mapreduce.job.user.classpath.first=true in the Hive CLI didn't work. Then I googled around and found another setting that looks specific to Hive, so I set mapreduce.task.classpath.user.precedence=true; in the Hive CLI, but it still didn't work.
It doesn't look like Hive has a way to put a user's third-party jar files at the front of the classpath for a UDF. Has anyone faced this kind of problem before? What is the best solution?

Thanks

Yong
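When the classpath-ordering switches don't take effect, a common workaround (an assumption here, not something proposed in the thread) is to shade the conflicting dependency: relocate Guava's packages inside the UDF jar so the task JVM never has to choose between the two versions. A sketch with the Maven Shade plugin; the relocated package prefix is illustrative:

```xml
<!-- pom.xml fragment: bundle Guava 14 into the UDF jar under a private
     package name, so Hive's bundled Guava 9 no longer conflicts -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The shaded jar then references Guava 14 under the relocated name, while Hive and Astyanax-free code keep seeing the cluster's Guava 9 untouched.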
RE: python UDF and Avro tables
Are you trying to read the Avro file directly in your UDF? If so, that is not the correct way to do it in a UDF. Hive supports Avro files natively. I don't know your UDF requirements, but here is what I would normally do: create the table in Hive using AvroContainerInputFormat:

create external table foo
row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
stored as
inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
location '/xxx.avro'
tblproperties ('avro.schema.url'='hdfs://.avsc');

In this case, Hive will map the table structure based on the Avro schema file. Then you can register your UDF and start to use it. Remember, in this case, when your Python UDF is invoked, the Avro data will be wrapped as a JSON string and passed to your Python UDF through STDIN. For example, if you do select MYUDF(col1) from foo, then the col1 data from Avro will be passed to your Python script as a JSON string, even if col1 is a nested structure. It is then up to your Python script to handle the JSON string and return whatever output it likes through STDOUT.

Yong

From: kevin.wei...@imc-chicago.com
To: user@hive.apache.org
Subject: python UDF and Avro tables
Date: Thu, 24 Jul 2014 15:52:03 +

Hi All, I hope I'm not duplicating a previous question, but I couldn't find any search functionality for the user list archives. I have written a relatively simple Python script that is meant to take a field from a Hive query and transform it (just some string processing through a dict), given that certain conditions are met. After reading this guide: http://blog.spryinc.com/2013/09/a-guide-to-user-defined-functions-in.html it would appear that the Python script needs to read the native file format (in my case Avro) from STDIN and write to STDOUT. I implemented this functionality using the python fastavro deserializer and cStringIO for the STDIN/STDOUT bit.
I then placed the appropriate Python modules on all the nodes (which I could probably do a bit better by simply storing them in HDFS). Unfortunately, I'm still getting errors while trying to transform my field, which are appended below. I believe the problem is that HDFS can end up splitting the files at arbitrary points, so you could have an Avro file with no schema at the top. Has anyone had any luck running a Python UDF on an Avro table? Cheers!

Traceback (most recent call last):
  File "coltoskip.py", line 33, in <module>
    reader = avro.reader(avrofile)
  File "_reader.py", line 368, in fastavro._reader.iter_avro.__init__ (fastavro/_reader.c:6438)
ValueError: cannot read header - is it an avro file?

org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
    at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:514)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
    at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:514)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
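The streaming contract Yong describes (rows in on STDIN, results out on STDOUT, complex columns serialized as JSON) can be sketched as a plain Python script. Hive's TRANSFORM mechanism sends columns tab-separated, one row per line; the field name "name" and the fallback value here are hypothetical:

```python
import json
import sys

def transform(col1):
    """col1 arrives as a JSON string when the Avro column is a nested
    structure; parse it and return the string that should go to STDOUT."""
    record = json.loads(col1)
    return record.get("name", "unknown")   # hypothetical field lookup

if __name__ == "__main__" and not sys.stdin.isatty():
    for line in sys.stdin:
        cols = line.rstrip("\n").split("\t")   # one row, tab-separated columns
        print(transform(cols[0]))
```

The point is that the script never opens the Avro file itself; Hive's AvroSerDe has already decoded the rows, so HDFS splitting files at arbitrary offsets is no longer the script's problem.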
RE: does the HBase-Hive integration support using HBase index (primary key or secondary index) in the JOIN implementation?
I don't think the HBase-Hive integration is that smart, i.e. able to utilize an index that exists in HBase. But I think it depends on the version you are using. From my experience, there is a lot of room for improvement in the HBase-Hive integration, especially around pushing logic down into the HBase engine.

Yong

From: tedd...@gmail.com
Date: Thu, 24 Jul 2014 14:03:42 -0700
Subject: does the HBase-Hive integration support using HBase index (primary key or secondary index) in the JOIN implementation?
To: user@hive.apache.org

If I do a join of a table based on a text file and a table based on HBase, and say the latter is very large, is Hive smart enough to utilize the HBase table's index to do the join, instead of implementing it as a regular map reduce job, where each table is scanned fully, bucketed on the join keys, and the matching items found in the reducer?

thanks
Yang
RE: Vectorization with UDFs returns incorrect results
When you turn vectorization on, does the following query consistently return 1 in the output?

select ten_thousand() from testTabOrc

Yong

Date: Fri, 30 May 2014 08:24:43 -0400
Subject: Vectorization with UDFs returns incorrect results
From: bbowman...@gmail.com
To: user@hive.apache.org

Hive 0.13, Hadoop 2.4. I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, my query returns the correct results.

Example query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1

My UDFs seem to be working for everything else except this specific circumstance. Is this an issue in the Hive software, or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way? I created a test scenario where I was able to reproduce the problem I am seeing:

TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1):

package com.test;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class tenThousand extends UDF {
  private final LongWritable result = new LongWritable();
  public LongWritable evaluate() {
    result.set(1);
    return result;
  }
}

TEST DATA (test.input):

1|CBCABC|12
2|DBCABC|13
3|EBCABC|14
4|ABCABC|15
5|BBCABC|16
6|CBCABC|17

CREATING ORC TABLE:

0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties ('orc.compress' = 'SNAPPY', 'orc.index' = 'true');

CREATE LOADING TABLE:

0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile;

COPY IN DATA:

[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/.

ORC DATA:

[root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) select * from loadingDir;"

LOAD TEST FUNCTION:

0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar;
0: jdbc:hive2://server:10002/db> create temporary function ten_thousand as 'com.test.tenThousand';

TURN OFF VECTORIZATION:

0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=false;

QUERY (RESULTS AS EXPECTED):

0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995;

+--------+
| first  |
+--------+
| 1      |
| 2      |
| 3      |
+--------+
3 rows selected (15.286 seconds)

TURN ON VECTORIZATION:

0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true;

QUERY AGAIN (WRONG RESULTS):

0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995;

+--------+
| first  |
+--------+
+--------+
No rows selected (17.763 seconds)
RE: Hive Avro union data access
Your alias_host column is an array, per your Avro specification, right? If so, just use [] to access a specific element of the array:

select alias_host[0] from array_tests where alias_host[0] like '%test%'

If you want to query all the elements in the array, google "explode" and "lateral view" for Hive.

Yong

From: sathish.vall...@emc.com
To: user@hive.apache.org
Subject: Hive Avro union data access
Date: Fri, 30 May 2014 06:21:19 +

Hi, I have a Hive table created with a union of 3 different data types (array<string>, string, null) for the alias_host column, as shown:

CREATE EXTERNAL TABLE array_tests
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{"name":"sessions","type":"record","fields":[{"default":null,"name":"alias_host","type":[{"type":"array","items":"string"},"string","null"]}]}')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/arrayTests';

How do I access and query the contents of this table in a where clause? Queries like the one below are possible when the datatype is not a union, but once I set the datatype to a union the following queries fail. E.g.:

select alias_host from array_tests where alias_host like '%test%' limit 1000;
Error: Error while processing statement: FAILED: SemanticException [Error 10016]: Line 1:32 Argument type mismatch 'alias_host': The 1st argument of EQUAL is expected to a primitive type, but union is found (state=42000,code=10016)

Can anyone suggest how to access and query the contents of union data types?

Regards
Sathish Valluri
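For searching every element rather than a fixed index, the lateral-view pattern Yong alludes to looks like this. A sketch only: it assumes the column can be read as array<string>, which the union type in this schema may prevent without first extracting the array branch:

```sql
-- flatten the array so each host becomes its own row, then filter
SELECT host
FROM array_tests
LATERAL VIEW explode(alias_host) hosts AS host
WHERE host LIKE '%test%';
```

Each array element is emitted as a separate row named host, so ordinary string predicates apply to every element rather than only to alias_host[0].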
RE: Vectorization with UDFs returns incorrect results
I downloaded Hive 0.13 and confirmed this problem exists. My suggestion is to change your query to

select first from testTabOrc where first >= ten_thousand()-1 and first <= ten_thousand()-9995;

which works in my environment (Hive 0.13 + Hadoop 2.2). You can create a JIRA for it. It looks like the problem is caused by BETWEEN combined with vectorization and a UDF. BTW, your UDF looks fine to me.

Yong

Date: Fri, 30 May 2014 11:22:18 -0400
Subject: Re: Vectorization with UDFs returns incorrect results
From: bbowman...@gmail.com
To: user@hive.apache.org

Yong, thanks for the reply. Yes, the query select ten_thousand() from testTabOrc does return consistent results regardless of the vectorization setting. I ran that query with vectorization off, then turned it on and ran it 5 more times. Each time it returned correct results. See output below.

Thanks,
Ben

=====

0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+------+
| _c0  |
+------+
| 1    |
| 1    |
| 1    |
| 1    |
| 1    |
| 1    |
+------+
6 rows selected (17.807 seconds)
0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true;
No rows affected (0.002 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+------+
| _c0  |
+------+
| 1    |
| 1    |
| 1    |
| 1    |
| 1    |
| 1    |
+------+
6 rows selected (17.23 seconds)

(the remaining runs returned the same six rows of 1, in 14.267, 18.259 and 18.778 seconds)

On Fri, May 30, 2014 at 10:52 AM, java8964 java8...@hotmail.com wrote:

When you turn vectorization on, does the following query consistently return 1 in the output?
select ten_thousand() from testTabOrc

Yong

[rest of quoted message trimmed; it repeats the original post above verbatim]
RE: LEFT SEMI JOIN
From the Hive manual, there is only LEFT SEMI JOIN; there is no SEMI JOIN or INNER SEMI JOIN. In the database world, it is just the traditional name for this kind of join: LEFT semi join, as a reminder to the reader that the result set comes from the LEFT table ONLY.

Yong

From: lukas.e...@datageekery.com
To: user@hive.apache.org
Subject: LEFT SEMI JOIN
Date: Tue, 13 May 2014 09:30:52 +

Hello, we were approached to add support for Hive SQL in jOOQ [1], which might be useful for a greater community in general. I've gone through the Hive SQL syntax and I've encountered this interesting clause: the LEFT SEMI JOIN clause [2]. Example:

SELECT a.key, a.val FROM a LEFT SEMI JOIN b on (a.key = b.key)

My question is: why LEFT? I find this misleading. Compare this with a LEFT OUTER JOIN, which will always return all tuples from relation A at least once, regardless of whether there are any tuples in relation B matched by the JOIN predicate in ON. So in other words, a LEFT SEMI JOIN is completely useless as named, since it *should* always return ALL tuples from relation A, compared to a more useful SEMI JOIN or INNER SEMI JOIN. What do you think?

Lukas

[1]: http://www.jooq.org
[2]: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

--
Lukas Eder - Head of R&D | lukas.e...@datageekery.com | +41 44 586 82 56
Data Geekery GmbH | Binzstrasse 23 | CH-8045 Zürich | Switzerland
http://www.datageekery.com | Get back in control of your SQL
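The semantics under discussion can be pinned down with the equivalent rewrite: a LEFT SEMI JOIN emits each row of the left table at most once, only when a match exists, i.e. it is the join form of an IN/EXISTS subquery rather than any kind of outer join:

```sql
-- left semi join: only columns of a may be selected, and each matching
-- row of a appears once, no matter how many rows of b match it
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key);

-- equivalent filter formulation
SELECT a.key, a.val
FROM a
WHERE a.key IN (SELECT b.key FROM b);
```

So "LEFT" here marks which relation survives into the result, not an outer-join guarantee that every left row is returned.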
RE: Does hive instantiate new udf object for each record
The reason you saw that is that when you provided the evaluate() method, you didn't specify the type of column it can be applied to. So Hive will just create the test instance again and again for every new row, as it doesn't know how or to which column to apply your UDF. I changed your code as below:

public class test extends UDF {
  private Text t;

  public Text evaluate(String s) {
    if (t == null) {
      t = new Text("initialization");
    } else {
      t = new Text("OK");
    }
    return t;
  }

  public Text evaluate() {
    if (t == null) {
      t = new Text("initialization");
    } else {
      t = new Text("OK");
    }
    return t;
  }
}

Now, if you invoke your UDF like this: select test(colA) from AnyTable; you should see one "initialization" and the rest "OK". Makes sense?

Yong

From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: RE: Does hive instantiate new udf object for each record
Date: Tue, 25 Mar 2014 10:17:46 +0800

I have implemented a simple UDF for testing:

public class test extends UDF {
  private Text t;

  public Text evaluate() {
    if (t == null) {
      t = new Text("initialization");
    } else {
      t = new Text("OK");
    }
    return t;
  }
}

And the test query: select test() from AnyTable; I got

initialization
initialization
initialization
...

I have also implemented a similar GenericUDF, and got a similar result. What's wrong with my code?

Best Regards,
ypg

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Does hive instantiate new udf object for each record
Date: Mon, 24 Mar 2014 16:58:49 -0400

Your UDF object will only be initialized once per map or reduce task. When you say your UDF object is being initialized for each row, why do you think so? Do you have a log that makes you think that way? If so, please provide more information so we can help you, like your example code, logs, etc.

Yong

Date: Tue, 25 Mar 2014 00:30:21 +0800
From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: Does hive instantiate new udf object for each record

Hi all, I'm trying to implement a UDF which makes use of some data structures like a binary tree. However, it seems that Hive instantiates a new UDF object for each row in the table. Then the data structures would also be initialized again and again, for each row. Whereas, in the book Programming Hive, a geoip function is taken as an example showing that a LookupService object is saved in a reference, so it only needs to be initialized once in the lifetime of the map or reduce task that initializes it. The code for this function can be found here (https://github.com/edwardcapriolo/hive-geoip/). Could anyone give me some ideas on how to make the UDF object initialize once in the lifetime of a map or reduce task?

Best Regards,
ypg
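The once-per-task, lazily-initialized pattern referenced from Programming Hive can be sketched in plain Java with the Hive specifics stripped out, so it runs standalone. The class and method names below are hypothetical stand-ins: "evaluate" plays the role of UDF.evaluate() (called once per row) and "ExpensiveLookup" plays the role of a LookupService or binary tree built from a data file:

```java
// Demonstrates that a field guarded by a null check is built exactly once
// per object instance, no matter how many rows flow through evaluate().
public class LazyInitDemo {
    static int initCount = 0;               // counts expensive builds

    static final class ExpensiveLookup {
        ExpensiveLookup() { initCount++; }  // the expensive work happens here
        String lookup(String key) { return "value-for-" + key; }
    }

    private ExpensiveLookup lookup;         // one field per UDF instance

    // stands in for UDF.evaluate(): invoked once per row
    public String evaluate(String key) {
        if (lookup == null) {               // lazily build on the first row only
            lookup = new ExpensiveLookup();
        }
        return lookup.lookup(key);
    }

    public static void main(String[] args) {
        LazyInitDemo udf = new LazyInitDemo();   // one instance per task
        for (int row = 0; row < 1000; row++) {   // many rows, one build
            udf.evaluate("k" + row);
        }
        System.out.println(initCount);
    }
}
```

Because Hive creates the UDF object once per map or reduce task, the expensive structure is built on the first row and reused for the rest, which is exactly the behavior Yong's corrected evaluate(String) version exhibits ("initialization" once, then "OK").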
RE: Does hive instantiate new udf object for each record
Your UDF object will only be initialized once per map or reduce task. When you say your UDF object is being initialized for each row, why do you think so? Do you have a log that makes you think that way? If so, please provide more information so we can help you, like your example code, logs, etc.

Yong

Date: Tue, 25 Mar 2014 00:30:21 +0800
From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: Does hive instantiate new udf object for each record

Hi all, I'm trying to implement a UDF which makes use of some data structures like a binary tree. However, it seems that Hive instantiates a new UDF object for each row in the table. Then the data structures would also be initialized again and again, for each row. Whereas, in the book Programming Hive, a geoip function is taken as an example showing that a LookupService object is saved in a reference, so it only needs to be initialized once in the lifetime of the map or reduce task that initializes it. The code for this function can be found here (https://github.com/edwardcapriolo/hive-geoip/). Could anyone give me some ideas on how to make the UDF object initialize once in the lifetime of a map or reduce task?

Best Regards,
ypg
RE: Joins Failing
It looks like his job failed with an OOM in a mapper task:

Job failed as tasks failed. failedMaps:1 failedReduces:0

So what he needs is to increase the mapper heap size request.

Yong

Date: Mon, 24 Mar 2014 16:16:50 -0400
Subject: Re: Joins Failing
From: divakarredd...@gmail.com
To: user@hive.apache.org
CC: jason.herba...@bateswhite.com

I hope this property will fix your issue:

set mapred.reduce.child.java.opts=-Xmx4096m;

On Mon, Mar 24, 2014 at 3:59 PM, Clay McDonald stuart.mcdon...@bateswhite.com wrote:

I believe I found my issue.

2014-03-24 15:49:38,775 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space

Clay

From: Clay McDonald [mailto:stuart.mcdon...@bateswhite.com]
Sent: Monday, March 24, 2014 2:07 PM
To: 'user@hive.apache.org'
Subject: Joins Failing

My join query is failing. Any suggestions on how I should troubleshoot this?

92651139753_0036_1_conf.xml to hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036_conf.xml_tmp
2014-03-24 13:48:58,244 INFO [Thread-65] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Copied to done location: hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036_conf.xml_tmp
2014-03-24 13:48:58,248 INFO [Thread-65] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to done: hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036.summary_tmp to hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036.summary
2014-03-24 13:48:58,249 INFO [Thread-65] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to done: hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036_conf.xml_tmp to hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036_conf.xml
2014-03-24 13:48:58,251 INFO [Thread-65] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to done: hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036-1395683232406-root-SELECT+COUNT%28A.UPC_ID%29+AS+CNT%0D%0AFR...B.UPC_ID%28Stage-1395683335043-0-0-FAILED-default.jhist_tmp to hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036-1395683232406-root-SELECT+COUNT%28A.UPC_ID%29+AS+CNT%0D%0AFR...B.UPC_ID%28Stage-1395683335043-0-0-FAILED-default.jhist
2014-03-24 13:48:58,251 INFO [Thread-65] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped JobHistoryEventHandler. super.stop()
2014-03-24 13:48:58,252 INFO [Thread-65] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job diagnostics to Task failed task_1392651139753_0036_m_11
Job failed as tasks failed. failedMaps:1 failedReduces:0

Thanks
Clay
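Since the failed task is a map task, the heap knob has to target mappers; the property in the reply above (mapred.reduce.child.java.opts) only affects reducers. A sketch of the session settings; which names apply depends on whether the cluster runs classic MR1 or YARN/MR2 (the YarnChild in the log suggests the latter), and 4 GB is just an example value:

```sql
-- MR1-era property for mapper JVM heap
set mapred.map.child.java.opts=-Xmx4096m;

-- YARN / MR2 equivalents: container size plus the JVM heap inside it
set mapreduce.map.memory.mb=5120;
set mapreduce.map.java.opts=-Xmx4096m;
```

The container size (memory.mb) should stay comfortably above the -Xmx value, since the JVM needs headroom beyond its heap.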
RE: Issue with Querying External Hive Table created on hbase
I am not sure about your question. Do you mean the query runs very fast if you run something like 'select * from hbase_table', but very slowly for 'select * from hbase_table where row_key = ?'? I would think it should be the other way around, right?

Yong

Date: Wed, 19 Mar 2014 11:42:39 -0700
From: sunil_ra...@yahoo.com
Subject: Issue with Querying External Hive Table created on hbase
To: user@hive.apache.org

Hi All, I am trying to query an external Hive table created on HBase (the HBase table is compressed using gzip). I get a quick response if I use select * from hbase_acct_pref_dim_, but the query takes forever if I try to retrieve data based on the row key:

hive> select * from hbase_acct_pref_dim_ where key = 30001;
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-03-19 11:14:04,432 Stage-1 map = 0%, reduce = 0%
2014-03-19 11:15:04,617 Stage-1 map = 0%, reduce = 0%
2014-03-19 11:16:04,792 Stage-1 map = 0%, reduce = 0%
2014-03-19 11:17:04,969 Stage-1 map = 0%, reduce = 0%
2014-03-19 11:18:05,140 Stage-1 map = 0%, reduce = 0%
2014-03-19 11:19:05,315 Stage-1 map = 0%, reduce = 0%
2014-03-19 11:20:05,484 Stage-1 map = 0%, reduce = 0%
2014-03-19 11:21:05,667 Stage-1 map = 0%, reduce = 0%
2014-03-19 11:22:05,835 Stage-1 map = 0%, reduce = 0%

Any help is appreciated.

Thanks,
Sunil S Ranka
Blog :: http://sranka.wordpress.com
"Superior BI is the antidote to Business Failure"
NOTHING IS IMPOSSIBLE, EVEN THE WORD IMPOSSIBLE SAYS, I M POSSIBLE.
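One thing worth checking here (an assumption on my part, not confirmed in the thread): the Hive-HBase storage handler can generally only push a row-key predicate down to an HBase point lookup when the comparison matches the declared key type. If the :key column is mapped as a string in hbase.columns.mapping, comparing it to the bare integer 30001 forces a cast over every row and a full table scan. A sketch of the rewrite to try:

```sql
-- if the row key is declared string, compare against a string literal
-- so the predicate can be pushed down as an HBase key lookup
select * from hbase_acct_pref_dim_ where key = '30001';
```

If the key really is declared as an int, the same reasoning applies in reverse; the point is that the literal's type should match the mapped key type exactly.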
RE: Using an UDF in the WHERE (IN) clause
What version of Hive are you using? It would be good to know if it works in a newer version.

Yong

Date: Tue, 11 Mar 2014 08:33:06 +0100
Subject: Re: Using an UDF in the WHERE (IN) clause
From: petter.von.dolw...@gmail.com
To: user@hive.apache.org

Hi Yong, I must argue that the partition pruning does actually work if I don't use the IN clause. What I wanted to achieve in my original query was to specify a range of partitions in a simple way. The same query can be expressed as

SELECT * FROM mytable WHERE partitionCol >= UDF('2014-03-10') and partitionCol <= UDF('2014-03-11');

This UDF returns an INT (rather than an INT array). Both this UDF and the original one are annotated with @UDFType(deterministic = true) (if that has any impact). This variant works fine and does partition pruning. Note that I don't have another column as input to my UDF but a static value.

Thanks,
Petter

2014-03-11 0:16 GMT+01:00 java8964 java8...@hotmail.com:

I don't know, from a syntax point of view, whether Hive will allow columnA IN UDF(columnB). What I do know is that even if the above worked, it won't do the partition pruning. Partition pruning in Hive is strictly static; any dynamic value provided for a partition column won't enable partition pruning, even though that is a feature I miss too.

Yong

Date: Mon, 10 Mar 2014 16:23:01 +0100
Subject: Using an UDF in the WHERE (IN) clause
From: petter.von.dolw...@gmail.com
To: user@hive.apache.org

Hi, I'm trying to get the following query to work. The parser doesn't like it. Anybody aware of a workaround?

SELECT * FROM mytable WHERE partitionCol IN my_udf('2014-03-10');

partitionCol is my partition column, of type INT, and I want to achieve early pruning. I've tried returning an array of INTs from my_udf and also a plain string in the format (1,2,3). It seems like the parser won't allow me to put a UDF in this place. Any help appreciated.

Thanks,
Petter
RE: Using an UDF in the WHERE (IN) clause
I don't know, from a syntax point of view, whether Hive will allow columnA IN UDF(columnB). What I do know is that even if the above worked, it won't do the partition pruning. Partition pruning in Hive is strictly static; any dynamic value provided for a partition column won't enable partition pruning, even though that is a feature I miss too.

Yong

Date: Mon, 10 Mar 2014 16:23:01 +0100
Subject: Using an UDF in the WHERE (IN) clause
From: petter.von.dolw...@gmail.com
To: user@hive.apache.org

Hi, I'm trying to get the following query to work. The parser doesn't like it. Anybody aware of a workaround?

SELECT * FROM mytable WHERE partitionCol IN my_udf('2014-03-10');

partitionCol is my partition column, of type INT, and I want to achieve early pruning. I've tried returning an array of INTs from my_udf and also a plain string in the format (1,2,3). It seems like the parser won't allow me to put a UDF in this place. Any help appreciated.

Thanks,
Petter
RE: Setting | Verifying | Hive Query Parameters from Java
If you want to set some properties of Hive, just run the commands as-is over your JDBC connection. Any command sent over Hive JDBC goes to the server the same as if you ran set hive.server2.async.exec.threads=50; in a Hive session. Run the command set hive.server2.async.exec.threads=50; as a SQL statement, and it will adjust the value for your JDBC connection. About setting them via connection properties, I am not sure whether that works in Hive JDBC. Hive JDBC is a limited JDBC implementation on top of Hive, so it may not work, but I don't know for sure.

Yong

From: rinku.g...@fisglobal.com
To: user@hive.apache.org
Subject: RE: Setting | Verifying | Hive Query Parameters from Java
Date: Thu, 6 Mar 2014 11:12:52 +

Hi All, can anybody help me on the mail trail below?

Thanks
Rinku Garg

From: Garg, Rinku
Sent: Tuesday, March 04, 2014 5:14 PM
To: user@hive.apache.org
Subject: Setting | Verifying | Hive Query Parameters from Java

Hi All, we have installed CDH4.2.0 and hive-0.10.0-cdh4.2.0. Both are working as desired. We need to set Hive configuration parameters from Java while making a JDBC connection. We have written a Java program to execute queries on the Hive server with some configuration properties set dynamically. We are doing it as below:

CONNECTION_URL = "jdbc:hive://master149:1/default"

Next, we use the following method calls to set properties through Java:

props.setProperty("hive.server2.async.exec.threads", "50");
props.setProperty("hive.server2.thrift.max.worker.threads", "500");
props.setProperty("hive.groupby.orderby.position.alias", "false");

and a Hive connection is made as given below:

hiveConnection = DriverManager.getConnection(connectionURL, props);

With the above steps a Hive connection is made using hive-jdbc, and we are getting Hive query results as desired.

QUERY:
1. Are we doing the right thing to set the Hive properties? If yes, how can we verify that?
2. If the above is not the right way, how can we achieve setting Hive configuration parameters from Java using JDBC?
Thanks, Rinku Garg
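Since any statement over Hive JDBC is forwarded to the server, the properties can be set and checked as ordinary statements once the connection is open. A hedged sketch using the parameter names from the mail above; whether every parameter can be changed this way on hive-0.10.0-cdh4.2.0 is an assumption:

```sql
-- Run over the open JDBC connection (e.g. via Statement.execute());
-- each SET applies to this session only.
SET hive.server2.async.exec.threads=50;
SET hive.groupby.orderby.position.alias=false;

-- A SET with no value echoes the current value, which is one way to
-- verify that the change took effect for this session:
SET hive.server2.async.exec.threads;
```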
RE: Best way to avoid cross join
Hi Wolli: A cross join doesn't mean Hive has to use one reducer. From the query point of view, the following cases will use one reducer: 1) ORDER BY in your query (instead of SORT BY); 2) only one reducer group, which means all the data has to be sent to a single reducer because there is only one group. In your case, the distinct count of id1 will be the reducer group count. Did you explicitly set the reducer count in your Hive session? Yong

Date: Wed, 5 Mar 2014 14:17:24 +0100 Subject: Best way to avoid cross join From: darkwoll...@gmail.com To: user@hive.apache.org

Hey everyone, before I write a lot of text, I'll just post something that is already written: http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx The first posts address a pretty similar problem to mine. Currently my implementation looks like this:

    SELECT id1,
           MAX(CASE WHEN m.keyword IS NULL THEN 0
                    WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
                    ELSE 0 END) AS flag
    FROM (SELECT id1, keyword FROM import1) m
    CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw
    GROUP BY id1;

Since there is a cross join involved, the execution gets pinned down to one reducer only and it takes ages to complete. The thread I posted solves this with some SQL Server-specific tactics, but I was wondering whether anybody has run into this problem in Hive and found a better way to solve it. I'm using Hive 0.11 on a MapR distribution, if that matters. Cheers, Wolli
RE: Best way to avoid cross join
Sorry, my mistake. I didn't pay attention that you are using a cross join. Yes, a cross join will always use one reducer, at least that is my understanding. Yong

Date: Wed, 5 Mar 2014 15:27:48 +0100 Subject: Re: Best way to avoid cross join From: darkwoll...@gmail.com To: user@hive.apache.org

Hey Yong, even without the GROUP BY (a pure cross join) the query only uses one reducer, and specifying more reducers doesn't help:

    set mapred.reduce.tasks=50;
    SELECT id1, m.keyword, prep_kw.keyword
    FROM (SELECT id1, keyword FROM import1) m
    CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw;
    ...
    Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1

What could be set up wrong here? Or can this ugly cross join be avoided entirely? I mean, my original problem is actually something else ;-) Cheers, Wolli
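One way to sidestep the single-reducer cross join, assuming et_keywords is small enough to fit in a mapper's memory: hint Hive to broadcast it as a map-side join, so the Cartesian pairing happens in the mappers rather than funneling through one reducer. A hedged sketch using the thread's table names; the assumption that the keyword table is small is mine:

```sql
-- MAPJOIN(prep_kw) loads the keyword list into each mapper; the cross
-- join is then evaluated map-side, and only the GROUP BY needs reducers.
SELECT /*+ MAPJOIN(prep_kw) */
       id1,
       MAX(CASE WHEN m.keyword IS NULL THEN 0
                WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
                ELSE 0 END) AS flag
FROM (SELECT id1, keyword FROM import1) m
CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw
GROUP BY id1;
```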
RE: Metastore performance on HDFS-backed table with 15000+ partitions
That is good to know. We are using Hive 0.9. Right now our biggest table contains 2 years of data, and we partitioned it by hour because the data volume is big, so it has 2*365*24, around 17,000+ partitions. So far we haven't seen much problem yet, but I do have some concerns about it. We are using IBM BigInsights, which uses Derby as the Hive metastore, not MySQL, which most of my experience is with. Yong

From: norbert.bur...@gmail.com Date: Thu, 27 Feb 2014 07:57:05 -0500 Subject: Re: Metastore performance on HDFS-backed table with 15000+ partitions To: user@hive.apache.org

Thanks everyone for the feedback. Just to follow up in case someone else runs into this: I can confirm that the local client works around the OOMEs, but it's still very slow. It does seem like we were hitting some combination of HIVE-4051 and HIVE-5158. We'll try reducing the partition count first, and then switch to 0.12.0 if that doesn't improve things significantly. FWIW, http://www.slideshare.net/oom65/optimize-hivequeriespptx also has some good rules of thumb. Norbert

On Sat, Feb 22, 2014 at 1:27 PM, Stephen Sprague sprag...@gmail.com wrote: Yeah. That traceback pretty much spells it out: it's metastore related, and that's where the partitions are stored. I'm with the others on this. HiveServer2 is still a little janky on memory management. I bounce mine once a day at midnight just to play it safe (and because I can). Again, for me, I use the local Hive client for production jobs and the remote client for ad hoc stuff. You may wish to confirm the local Hive client has no problem with your query. Other than that, you either increase the heap size on the HS2 process and hope for the best, and/or file a bug report. Bottom line: HiveServer2 isn't production bulletproof just yet, IMHO. Others may disagree. Regards, Stephen.

On Sat, Feb 22, 2014 at 9:50 AM, Norbert Burger norbert.bur...@gmail.com wrote: Thanks all for the quick feedback.
I'm a bit surprised to learn 15k is considered too many, but we can work around it. I guess I'm also curious why the query planner needs to know about all partitions even in the case of simple select/limit queries, where the query might target only a single partition. Here's the client-side OOME with HADOOP_HEAPSIZE=2048: https://gist.githubusercontent.com/nburger/3286d2052060e2efe161/raw/dc30231086803c1d33b9137b5844d2d0e20e350d/gistfile1.txt This was from a CDH4.3.0 client hitting HiveServer2. Any idea what's consuming the heap? Norbert

On Sat, Feb 22, 2014 at 10:32 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Don't make tables with that many partitions. It is an anti-pattern. I have tables with 2000 partitions a day and that is really too many. Hive needs to load that information into memory to plan the query.

On Saturday, February 22, 2014, Terje Marthinussen tmarthinus...@gmail.com wrote: The query optimizer in Hive is awful on memory consumption. 15k partitions sounds a bit early for it to fail, though. What is your heap size? Regards, Terje

On 22 Feb 2014, at 12:05, Norbert Burger norbert.bur...@gmail.com wrote: Hi folks, We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore. In Hive, we have an external table backed by HDFS with a 3-level partitioning scheme that currently has 15,000+ partitions. Within the last day or so, queries against this table have started failing. A simple query that shouldn't take long at all (select * from ... limit 10) fails after several minutes with a client OOME. I get the same outcome on count(*) queries (which I thought wouldn't send any data back to the client). Increasing heap on both client and server JVMs (via HADOOP_HEAPSIZE) doesn't have any impact. We were only able to work around the client OOMEs by reducing the number of partitions in the table. Looking at the MySQL query log, my thought is that the Hive client is quite busy making requests for partitions that don't contribute to the query.
Has anyone else had a similar experience with tables this size? Thanks, Norbert -- Sorry, this was sent from mobile; will do less grammar and spell checking than usual.
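One defensive measure for tables like the one described above: Hive's strict mode refuses queries on a partitioned table that carry no partition predicate, which keeps a stray full-table query from pulling every partition's metadata. A sketch against a hypothetical table with a 3-level scheme (table and column names are made up):

```sql
-- Strict mode rejects partitioned-table queries with no partition filter.
SET hive.mapred.mode=strict;

-- This prunes to a single partition, so the planner only needs the
-- metadata for that one partition rather than all 15,000+:
SELECT *
FROM events
WHERE year = 2014 AND month = 2 AND day = 22
LIMIT 10;
```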
RE: Hive query parser bug resulting in FAILED: NullPointerException null
Can you reproduce this with an empty table? I can't reproduce it. Also, can you paste the stack trace? Yong

From: krishnanj...@gmail.com Date: Thu, 27 Feb 2014 12:44:28 +0000 Subject: Hive query parser bug resulting in FAILED: NullPointerException null To: user@hive.apache.org

Hi all, we've experienced a bug which seems to be caused by having a query constraint involving partitioned columns. The following query results in FAILED: NullPointerException null being returned nearly instantly:

    EXPLAIN SELECT col1
    FROM tbl1
    WHERE (part_col1 = 2014 AND part_col2 = 2)
       OR part_col1 <> 2014;

The exception doesn't happen if any of the conditions are removed. The table is defined like the following:

    CREATE TABLE tbl1 (
      col1  STRING,
      ...
      col12 STRING)
    PARTITIONED BY (part_col1 INT, part_col2 TINYINT, part_col3 TINYINT)
    STORED AS SEQUENCEFILE;

Unfortunately I cannot construct a test case to replicate this. Seeing as it appears to be a query parser bug, I thought the following would replicate it:

    CREATE TABLE tbl2 LIKE tbl1;
    EXPLAIN SELECT col1
    FROM tbl2
    WHERE (part_col1 = 2014 AND part_col2 = 2)
       OR part_col1 <> 2014;

But it does not. Could it somehow be data specific? Does the query parser use partition information? Are there any logs I could look at to investigate this further? Or is this a known bug? We're using hive 0.10.0-cdh4.4.0. Cheers, Krishna
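One difference between the failing table and the clone worth noting: CREATE TABLE ... LIKE copies the schema but none of the partitions, so if the planner does consult partition metadata when analyzing partition predicates (as the poster suspects), the empty clone would never exercise that path. A hedged repro sketch; the partition values are made up:

```sql
CREATE TABLE tbl2 LIKE tbl1;

-- Hypothetical partition values; the point is to give tbl2 some partition
-- metadata for the planner to consult, as tbl1 has, before re-running EXPLAIN.
ALTER TABLE tbl2 ADD PARTITION (part_col1=2014, part_col2=1, part_col3=1);
ALTER TABLE tbl2 ADD PARTITION (part_col1=2013, part_col2=1, part_col3=1);

EXPLAIN SELECT col1
FROM tbl2
WHERE (part_col1 = 2014 AND part_col2 = 2)
   OR part_col1 <> 2014;
```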
Hive trunk unit test failed
Hi, I tried to run all the tests of the current Hive trunk code on my local Linux x64 box. My mvn clean package -DskipTests -Phadoop-2 -Pdist works fine if I skip the tests. The following unit test failed, and then the build stopped. I traced the code down to a native method invoked at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method), which throws InvocationTargetException. My questions are: 1) Does this mean the native code is not available in my environment, causing the above error? 2) If so, since the latest Hive build uses Maven and I can see all the hadoop-2.2.0 jar files downloaded into my local repository, why does this error still happen? 3) Is it possible that my local environment is 64-bit but the default hadoop-2.2.0 artifacts come with 32-bit native code? If so, how do I fix that during the Hive build? Thanks, Yong

    Running org.apache.hadoop.hive.serde2.avro.TestAvroSerdeUtils
    Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.802 sec <<< FAILURE! - in org.apache.hadoop.hive.serde2.avro.TestAvroSerdeUtils
    detemineSchemaTriesToOpenUrl(org.apache.hadoop.hive.serde2.avro.TestAvroSerdeUtils) Time elapsed: 0.377 sec <<< ERROR!
    java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method)
        at org.apache.hadoop.security.JniBasedUnixGroupsMapping.<clinit>(JniBasedUnixGroupsMapping.java:49)
        at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.<init>(JniBasedUnixGroupsMappingWithFallback.java:38)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
        at org.apache.hadoop.security.Groups.<init>(Groups.java:55)
        at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:182)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:235)
        at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:214)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:669)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2590)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2582)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2448)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
        at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:110)
        at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:71)
        at org.apache.hadoop.hive.serde2.avro.TestAvroSerdeUtils.detemineSchemaTriesToOpenUrl(TestAvroSerdeUtils.java:139)
RE: Hive trunk unit test failed
OK. Now I understand that this error is due to the missing Hadoop native library. If I manually add libhadoop.so to java.library.path for this unit test, it passes. So the hadoop 2.2.0 artifacts coming from the Maven repository either include a 32-bit build of the Hadoop native library or omit it entirely. Now the question is: what is the correct way to run the unit tests in the new Maven build? Thanks, Yong

From: java8...@hotmail.com To: user@hive.apache.org Subject: Hive trunk unit test failed Date: Wed, 26 Feb 2014 14:49:41 -0500
RE: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3]
Works for me on 0.10. Yong

Date: Tue, 25 Feb 2014 11:37:32 -0800 From: kumarbuyonl...@yahoo.com Subject: Re: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3] To: user@hive.apache.org

Hi, thanks for looking into it. I am also trying this on Hive 0.11 to see if it works there. If you get a chance to reproduce this problem on Hive 0.10, please let me know. Thanks.

On Monday, February 24, 2014 10:59 PM, java8964 java8...@hotmail.com wrote: My guess is that your UDTF returns an array of structs. I don't have Hive 0.10 handy right now, but I wrote a simple UDTF that returns an array of structs to test on the Hive 0.12 release:

    hive> desc test;
    OK
    id    int     None
    name  string  None
    Time taken: 0.074 seconds, Fetched: 2 row(s)
    hive> select * from test;
    OK
    1  Apples,Bananas,Carrots
    Time taken: 0.08 seconds, Fetched: 1 row(s)

The pair UDTF expands "Apples,Bananas,Carrots" to (Apples, Bananas), (Apples, Carrots), (Bananas, Carrots): an array of 2-element structs.

    hive> select id, name, m1, m2 from test lateral view pair(name) p as m1, m2 where m1 is not null;
    OK
    1  Apples,Bananas,Carrots  Apples   Bananas
    1  Apples,Bananas,Carrots  Apples   Carrots
    1  Apples,Bananas,Carrots  Bananas  Carrots
    Time taken: 7.683 seconds, Fetched: 3 row(s)
    hive> select id, name, m1, m2 from test lateral view pair(name) p as m1, m2 where m1 = 'Apples';
    OK
    1  Apples,Bananas,Carrots  Apples  Bananas
    1  Apples,Bananas,Carrots  Apples  Carrots
    Time taken: 7.726 seconds, Fetched: 2 row(s)
    hive> set hive.optimize.ppd=true;
    hive> select id, name, m1, m2 from test lateral view pair(name) p as m1, m2 where m1 is not null;
    Total MapReduce jobs = 1
    OK
    1  Apples,Bananas,Carrots  Apples   Bananas
    1  Apples,Bananas,Carrots  Apples   Carrots
    1  Apples,Bananas,Carrots  Bananas  Carrots
    Time taken: 7.716 seconds, Fetched: 3 row(s)

I cannot reproduce your error in Hive 0.12, as you can see.
I can test on Hive 0.10 tomorrow when I have time, but can you test your case on Hive 0.12, or review your UDTF again? Yong

Date: Mon, 24 Feb 2014 07:09:44 -0800 From: kumarbuyonline@yahoo.com Subject: Re: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3] To: user@hive.apache.org; kumarbuyonline@yahoo.com

As suggested, I changed the query like this:

    select x.f1, x.f2, x.f3, x.f4
    from (
      select e.f1 as f1, e.f2 as f2, e.f3 as f3, e.f4 as f4
      from mytable
      LATERAL VIEW myfunc(p1,p2,p3,p4) e as f1,f2,f3,f4
      where lang=123
    ) x
    where x.f3 is not null;

And it still doesn't work; I am getting the same error. If anyone has any ideas, please let me know. Thanks.

On Friday, February 21, 2014 11:27 AM, Kumar V kumarbuyonl...@yahoo.com wrote: Line 316 in my UDTF, where the error shows, is the line where I call forward(). The whole trace is:

    Caused by: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col6, 3:_col7, 4:_col8, 5:_col9]
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:346)
        at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:143)
        at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
        at org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:55)
        at org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:55)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:128)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:128)
        at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:128)
        at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:85)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
        at org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator.processOp(LateralViewJoinOperator.java:133)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
        at org.apache.hadoop.hive.ql.exec.UDTFOperator.forwardUDTFOutput(UDTFOperator.java:112)
        at org.apache.hadoop.hive.ql.udf.generic.UDTFCollector.collect(UDTFCollector.java:44)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.forward(GenericUDTF.java:81)
        at pb2CSVReadFile.FlatTableFileUDTFTx.process(FlatTableFileUDTFTx.java:316)
        at org.apache.hadoop.hive.ql.exec.UDTFOperator.processOp(UDTFOperator.java:98)
        at org.apache.hadoop.hive.ql.exec.Operator.process
RE: hive query to calculate percentage
A single flat query won't work, as totalcount is not in the GROUP BY. You have 2 options: 1) use a subquery:

    select a.timestamp_dt, a.totalcount / b.total_sum
    from daily_count_per_kg_domain a
    join (select timestamp_dt, sum(totalcount) as total_sum
          from daily_count_per_kg_domain
          group by timestamp_dt) b
      on (a.timestamp_dt = b.timestamp_dt)

2) If you are using Hive 0.11 or above, use window functions. Yong

Date: Tue, 25 Feb 2014 18:27:34 -0600 Subject: Re: hive query to calculate percentage From: kkrishna...@gmail.com To: user@hive.apache.org

Modify the query to:

    select totalcount / sum(totalcount)
    from daily_count_per_kg_domain
    where timestamp_dt = '20140219'
    group by timestamp_dt;

If you don't specify the where clause, you will get results for all partitions.

On Tue, Feb 25, 2014 at 3:14 PM, Manish maa...@gmail.com wrote: I have a table partitioned on timestamp_dt:

    desc daily_count_per_kg_domain;
    OK
    ddnamesyskg   string
    totalcount    int
    timestamp_dt  string

    hive> select * from daily_count_per_kg_domain;
    OK
    sys_kg_band       224     20140219
    sys_kg_event      3435    20140219
    sys_kg_movies     44987   20140219
    sys_kg_oly        4172    20140219
    sys_kg_sp_countr  5499    20140219
    sys_kg_sports     3954    20140219
    sys_kg_tv         21387   20140219
    sys_kg_venue      152     20140219
    sys_kgathlete     9000    20140219
    sys_kgpeople      300064  20140219

Looking to compute percentages for each row, per day: 100 * totalcount / sum(totalcount). Intuitively I tried:

    select totalcount / sum(totalcount)
    from daily_count_per_kg_domain
    where timestamp_dt = '20140219';

    FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'totalcount'

I am not sure what a group by on totalcount means. Any ideas?
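Option 2 above can be sketched with a window function (Hive 0.11+); the column names come from the table described in the thread:

```sql
-- SUM(...) OVER (PARTITION BY ...) computes the per-day total for every
-- row without collapsing the result set, so no self-join is needed.
SELECT ddnamesyskg,
       timestamp_dt,
       100.0 * totalcount / SUM(totalcount) OVER (PARTITION BY timestamp_dt) AS pct
FROM daily_count_per_kg_domain;
```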
RE: part-m-00000 files and their size - Hive table
Yes, it is good if the file sizes are roughly even, but it is not very important unless some files are very small compared to the block size. The reasons: your files should be splittable to be useful in Hadoop (and in Hive, which is the same thing). If they are splittable, then a 1 GB file will use 8 blocks (assuming a 128 MB block size), and a 256 MB file will take 2 blocks, so these 2 files will generate 10 mapper tasks which run equally across your cluster. From a performance point of view, you have 10 mapper tasks processed equally in the cluster, so one 1 GB file plus one 256 MB file is no big deal. But if one file is very small, say 10 MB, it will still consume one mapper task, and that is bad for performance: starting and stopping a task uses quite some resource while processing only 10 MB of data. The reason you see uneven file sizes in sqoop's output is that it is hard for sqoop to split your source data evenly. For example, if you dump table A from a DB into Hive, sqoop will do the following: 1) identify the primary/unique key of the table; 2) find the min/max values of the key, say (1 to 1,000,000); 3) split that range by the number of your mapper tasks. If you run sqoop with 4 mappers, the data is split into 4 groups: (1, 250,000), (250,001, 500,000), (500,001, 750,000), (750,001, 1,000,000). As you can imagine, your data is most likely not evenly distributed over the primary key across those 4 groups, so you get unevenly sized part-m-xxxxx output files. Keep in mind that you are not required to use the primary or unique key as the split column. You can choose whatever column in your table makes sense; pick whatever makes the split more even.
Yong

Date: Tue, 25 Feb 2014 17:42:20 -0800 From: hadoop...@yahoo.com Subject: part-m-00000 files and their size - Hive table To: user@hive.apache.org

Hi, I am loading data into HDFS files through sqoop and creating a Hive table to point to these files. The mapper files generated by sqoop look like this:

    part-m-00000
    part-m-00001
    part-m-00002

My question is: 1) For Hive query performance, how important or significant is the distribution of the file sizes above?

    part-m-00000  say 1 GB
    part-m-00001  say 3 GB
    part-m-00002  say 0.25 GB

vs.

    part-m-00000  say 1.4 GB
    part-m-00001  say 1.4 GB
    part-m-00002  say 1.45 GB

NOTE: the sizes and number of files are just a sample; the real numbers are far bigger. I am assuming the uniform distribution has a performance benefit. If so, what is the reason, and can I get the technical details?
RE: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3]
What is your stack trace? Can you paste it here? It may be a different bug. If you put the e.f3 filter in an outer query, does that work? Or maybe you have to enhance your UDTF to push that filter into the UDTF itself. It is not perfect, but maybe a solution for you for now. You can create a new JIRA if it is a new bug. Yong

Date: Fri, 21 Feb 2014 07:18:32 -0800 From: kumarbuyonl...@yahoo.com Subject: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3] To: user@hive.apache.org

Hi, I have a UDTF which works fine except when I do a query like the following:

    select e.*
    from mytable
    LATERAL VIEW myfunc(p1,p2,p3,p4) e as f1,f2,f3,f4
    where lang=123 and e.f3 <> null;

The error I see is: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3]. If I remove 'and e.f3 <> null' from the WHERE clause, it works fine. Also, with 'e.f3 <> null' in the WHERE clause, if I add the setting hive.optimize.ppd=false it works fine, but then instead of using 600 mappers it uses about 10,000 mappers and runs for more than 2 hours instead of a few minutes. I am using Hive 0.10. I saw the JIRA HIVE-3226, which says this was fixed in Hive 0.10. Is this the bug I am hitting now? Any other ideas on how to make it work? I am actually on CDH 4.4, which has Hive 0.10. Please let me know. Thanks, Murali.
Hbase + Hive scan performance
Hi, I know this has been asked before. I googled around this topic and tried to understand as much as possible, but I got different answers from different places, so I'd like to describe what I have faced and see if someone can help me on this topic again. I created one table with one column family and 20+ columns in HBase, populated with around 150M records from a 20 GB csv file. What I want to check is how fast a full scan of the HBase table can go in an MR job. It runs on a 10-node Hadoop cluster (Hadoop 1.1.1 + HBase 0.94.3 + Hive 0.9): 8 of the nodes are data + task nodes, one is the NN and HBase master, and another one runs the secondary NN. 4 of the 8 data nodes also run HBase region servers. I used the following code example to get a row count from an MR job: http://hbase.apache.org/book/mapreduce.example.html At first, the mapper tasks ran very slowly, as I had commented out the following 2 lines on purpose:

    scan.setCaching(1000);       // 1 is the default in Scan, which will be bad for MapReduce jobs
    scan.setCacheBlocks(false);  // don't set to true for MR jobs

Then I added the above 2 lines back and got almost 10x the speed of the first run. That's good; it proved to me that the above 2 lines are important for an HBase full scan. Now the question moves to Hive. I had already created a Hive table linked to the HBase table, and I started my Hive session like this:

    hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar,$HIVE_HOME/lib/hbase-0.94.3.jar,$HIVE_HOME/lib/zookeeper-3.4.5.jar,$HIVE_HOME/lib/guava-r09.jar -hiveconf hbase.master=Hbase_master:port

If I run select count(*) from table, I can see the mapper performance is very bad, almost as bad as my first run above.
I searched this mailing list, and it looks like there is a setting in a Hive session to change the scan caching size, the same as the 1st line of the code above, from here: http://mail-archives.apache.org/mod_mbox/hbase-user/201110.mbox/%3CCAGpTDNfn11jZAJ2mfboEqkfudXaU9HGsY4b=2x1spwf4qmu...@mail.gmail.com%3E So I added the following setting in my Hive session:

    set hbase.client.scanner.caching=1000;

To my surprise, after this setting the new MR job generated by the Hive query was still very slow, the same as before. Here is what I have found so far: 1) In my own MR code, both before and after the 2-line code change, job.xml shows hbase.client.scanner.caching=1. So this setting is the same in both runs, yet performance improved greatly after the code change. 2) In the Hive run, I saw hbase.client.scanner.caching change from 1 to 1000 in job.xml, which is what I set in the Hive session, but performance barely changed. So the setting was changed, but it didn't help performance as I expected. My questions: 1) Is there any setting in Hive (0.9) that does the same as the 1st line of the code change? From Google and the HBase documentation it looks like the above configuration is the one, but it didn't help me. 2) Even assuming the above setting is correct, why do we have a Hive JIRA to fix the HBase scan cache that is marked as fixed ONLY in Hive 0.12? The ticket is here: https://issues.apache.org/jira/browse/HIVE-3603 3) Is there any Hive setting that does the same as the 2nd line of the code change above? If so, what is it? I googled around and cannot find one. Thanks, Yong
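To summarize the knob discussed in this thread: the session setting below is the closest Hive-side analogue to scan.setCaching(1000), but as the poster observed on 0.9, getting the HBase handler to actually honor it is what HIVE-3603 (fixed in 0.12) addresses; I know of no Hive 0.9 session setting equivalent to scan.setCacheBlocks(false):

```sql
-- Session-level analogue of scan.setCaching(1000). The value is recorded
-- in job.xml, but per HIVE-3603 the HBase storage handler only applies it
-- to its scans from Hive 0.12 onward.
SET hbase.client.scanner.caching=1000;
```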
RE: HiveMetaStoreClient only sees one of my DBs ?
The best mailing list for this question is the Hive list, but I will offer my guess here anyway. If you only see the 'default' database, most likely you are using a local Hive metastore. To help yourself find the problem, check the following: 1) What kind of Hive metastore are you using? Did you configure anything? The most widely used are Derby and MySQL. 2) If you are using a remote Hive metastore, was the hive conf folder put on the classpath of your Java program? From the following log, it looks like the wrong hive conf files are being picked up: [2013-12-30 11:44:03,261] INFO ugi=myuser_name ip=unknown-ip-addr cmd=get_all_databases Yong

From: tedd...@gmail.com Date: Mon, 30 Dec 2013 11:48:55 -0800 Subject: HiveMetaStoreClient only sees one of my DBs? To: hive-u...@hadoop.apache.org

If I log into my hive shell and do show databases;, I see many DBs:

    Logging initialized using configuration in file:/etc/hive/conf/hive-log4j.properties
    hive> show databases;
    OK
    conf
    confnew
    default
    money
    test
    db
    Time taken: 1.57 seconds, Fetched: 6 row(s)

but somehow if I run the following Java code using the client API, it only shows default:

    public static void main(String args[]) throws MetaException {
        HiveMetaStoreClient cli = new HiveMetaStoreClient(new HiveConf());
        for (String db : cli.getAllDatabases())
            System.out.println(db);
    }

    stence)[2013-12-30 11:44:02,900] INFO Validating 2 unique key(s) for table DBS (DataNucleus.Datastore.Schema)
    [2013-12-30 11:44:02,913] INFO Validating 0 foreign key(s) for table DBS (DataNucleus.Datastore.Schema)
    [2013-12-30 11:44:02,919] INFO Validating 2 index(es) for table DBS (DataNucleus.Datastore.Schema)
    [2013-12-30 11:44:02,926] INFO Validating 1 unique key(s) for table DATABASE_PARAMS (DataNucleus.Datastore.Schema)
    [2013-12-30 11:44:02,937] INFO Validating 1 foreign key(s) for table DATABASE_PARAMS (DataNucleus.Datastore.Schema)
    [2013-12-30 11:44:02,942] INFO Validating 2 index(es) for table DATABASE_PARAMS (DataNucleus.Datastore.Schema)
    [2013-12-30 11:44:03,087] INFO Listener found initialisation for persistable class org.apache.hadoop.hive.metastore.model.MDatabase (DataNucleus.MetaData)
    [2013-12-30 11:44:03,211] INFO 0: get_all_databases (org.apache.hadoop.hive.metastore.HiveMetaStore)
    [2013-12-30 11:44:03,261] INFO ugi=myuser_name ip=unknown-ip-addr cmd=get_all_databases (org.apache.hadoop.hive.metastore.HiveMetaStore.audit)
    default

Why is it showing only 1 db? What settings of default are different from the others that enable it to be shown? Also, I wonder how that HiveConf() is initialized: how does it even know the hive port and config settings? Is it hardcoded to /etc/hive/conf/hive-site.xml? Thanks, Yang
Why from_utc_timestamp works for some bigint, but not others
Hi, I am using Hive 0.9.0, and I am not sure why from_utc_timestamp gives me an error for the following value but works for others. The example below shows 2 bigints, both epoch values at millisecond resolution and only 11 seconds apart. One works fine in Hive 0.9.0 with the from_utc_timestamp UDF; the other is just broken. Why?

yzhang@yzhang-linux:~$ hive
hive> use tracking2;
OK
Time taken: 1.569 seconds
hive> select from_utc_timestamp(1371746476159, 'EST') from track_sent;
OK
2013-06-20 07:41:16.318
2013-06-20 07:41:16.318
Time taken: 8.64 seconds
hive> select from_utc_timestamp(1371746487759, 'EST') from track_sent;
FAILED: Hive Internal Error: java.lang.IllegalArgumentException(Value out of range)
java.lang.IllegalArgumentException: Value out of range
at java.sql.Timestamp.setNanos(Timestamp.java:287)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFFromUtcTimestamp.applyOffset(GenericUDFFromUtcTimestamp.java:101)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFFromUtcTimestamp.evaluate(GenericUDFFromUtcTimestamp.java:94)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:127)
at org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:214)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:767)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:888)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:165)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7755)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:2310)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:2112)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:6165)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:6136)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6762)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7531)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
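For what it's worth, the "Value out of range" message comes from java.sql.Timestamp.setNanos, which rejects any nanosecond value outside [0, 999999999] (the exact message text varies by JDK). One plausible cause, though not confirmed against the Hive 0.9.0 source, is that applyOffset computes a nanos value that falls outside that range for some inputs. The JDK-side behavior is easy to reproduce:

```java
import java.sql.Timestamp;

public class NanosRange {
    public static void main(String[] args) {
        // The failing epoch value from the query above.
        Timestamp ts = new Timestamp(1371746487759L);
        System.out.println(ts.getNanos()); // 759000000 (the .759 fraction as nanos)

        try {
            // setNanos rejects anything outside [0, 999999999], throwing the
            // same IllegalArgumentException type seen in the Hive stack trace.
            ts.setNanos(-1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```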
hive partition pruning when joining on partition column
I have a requirement I am trying to support in Hive, and I am not sure if it is doable. I have Hadoop 1.1.1 with Hive 0.9.0 (using Derby as the metastore). I partition my data by a dt column, so my table 'foo' has partitions like 'dt=2013-07-01' to 'dt=2013-07-30'. Now the user wants to query the data of Saturdays only. To make it flexible, instead of asking the end user to find out which dates in that month are Saturdays, I added a lookup table (call it 'bar') in Hive with the following columns: year, month, day, dt_format, day_of_week. So I want to see if I can join foo and bar and still get partition pruning:

select *
from foo
join bar on (bar.year = 2013 and bar.month = 7 and bar.day_of_week = 6 and bar.dt_format = foo.dt)

I tried several ways, like switching the table order, joining with a subquery etc., but none of them make partition pruning work on table foo in this case. Is this really achievable in Hive? Thanks Yong
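Since the partition pruner in this Hive version appears to only understand constant predicates on the partition column, one workaround is to compute the Saturday dates outside Hive and splice them into the query as a constant IN list. A minimal sketch (the table and column names follow the question; java.time, so JDK 8+):

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class Saturdays {
    // Collect every Saturday of a month as ISO 'YYYY-MM-DD' partition values.
    static List<String> saturdays(int year, int month) {
        List<String> out = new ArrayList<>();
        YearMonth ym = YearMonth.of(year, month);
        for (LocalDate d = ym.atDay(1); !d.isAfter(ym.atEndOfMonth()); d = d.plusDays(1)) {
            if (d.getDayOfWeek() == DayOfWeek.SATURDAY) {
                out.add(d.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Splice the dates into a constant IN list, which the pruner can use.
        String in = saturdays(2013, 7).stream()
                .map(s -> "'" + s + "'")
                .collect(Collectors.joining(", "));
        System.out.println("select * from foo where dt in (" + in + ")");
        // select * from foo where dt in ('2013-07-06', '2013-07-13', '2013-07-20', '2013-07-27')
    }
}
```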
RE: Question about how to add the debug info into the hive core jar
I am not sure the existing logging information is enough for me. The exception trace is as follows:

Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
at java.util.ArrayList.rangeCheck(ArrayList.java:604)
at java.util.ArrayList.get(ArrayList.java:382)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)

It is Hive 0.9.0, and I looked into the source code of LazySimpleSerDe.java around line 485:

List<? extends StructField> fields = soi.getAllStructFieldRefs();
list = soi.getStructFieldsDataAsList(obj);
if (list == null) {
  out.write(nullSequence.getBytes(), 0, nullSequence.getLength());
} else {
  for (int i = 0; i < list.size(); i++) {
    if (i > 0) {
      out.write(separator);
    }
    serialize(out, list.get(i), fields.get(i).getFieldObjectInspector(),  // <-- line 485
        separators, level + 1, nullSequence, escaped, escapeChar, needsEscape);
  }
}

For this exception to happen, the soi (which is my StructObjectInspector class) must be returning collections of different lengths for fields and list. But I already added a logger in my StructObjectInspector, which proves that collections of the same length are returned from both getAllStructFieldRefs() and getStructFieldsDataAsList(Object). So I really don't know how this exception could happen in the Hive code. I have 2 options right now:
1) Change the above code to add more debug information, to see at runtime what content is in either the fields object or the list object and understand why their lengths differ. But I have a problem making my new jar get loaded by hadoop.
2) Enable remote debugging. There are very limited examples on the internet about how to remote-debug the Hive server side MR jobs; some wiki pages claim it is doable, but give no concrete examples.
Thanks From: ashet...@hortonworks.com Subject: Re: Question about how to add the debug info into the hive core jar Date: Wed, 20 Mar 2013 17:35:36 -0700 To: user@hive.apache.org

Hi Yong, Have you tried running the Hive query in debug mode? The Hive log level can be changed by passing the following conf while the hive client is running:

# hive -hiveconf hive.root.logger=ALL,console -e "DDL statement;"
# hive -hiveconf hive.root.logger=ALL,console -f ddl.sql

Hope this helps. Thanks

On Mar 20, 2013, at 1:45 PM, java8964 java8964 java8...@hotmail.com wrote:
Hi, I have hadoop running in pseudo-distributed mode on my linux box. Right now I face a problem with Hive, which throws an exception in a table for some data that uses my custom SerDe and InputFormat classes. To help trace the root cause, I need to modify the code of org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe to add more debug logging to understand why the exception happens. After I modify the hive code, I can compile it and generate a new hive-serde.jar file, with the same name as the release version, just a different size. Now I put my new hive-serde.jar under the $HIVE_HOME/lib folder, replacing the old one, and run the query that failed. But after the failure, when I check $HADOOP_HOME/logs/user_logs/, the exception stack trace still looks like it was generated by the original hive-serde classes: the line numbers shown in the log don't match the new code I changed to add the debug information. My question is: given this newly compiled hive-serde.jar file, where should I put it besides $HIVE_HOME/lib? 1) This is a pseudo-distributed environment. Everything (namenode, datanode, jobtracker and tasktracker) is running in one box. 2) After I replaced hive-serde.jar with my new jar, I even stopped all the hadoop java processes and restarted them. 3) But when I run the query in the hive session, I still see the log generated by the old hive-serde.jar classes. Why? Thanks for any help Yong
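One way to answer the "why is the old class still being used" question is to ask the JVM directly where it loaded a class from, via the class's CodeSource. Run something like this on the same classpath as the failing job (the class name in the comment is just the one from this thread):

```java
public class WhichJar {
    // Returns the jar (or directory) a class was actually loaded from.
    // Classes from the bootstrap classloader report no code source.
    static String locationOf(Class<?> c) {
        java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "(no code source)"
                : src.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(locationOf(String.class));   // (no code source) - bootstrap
        System.out.println(locationOf(WhichJar.class)); // where this class lives
        // For the Hive case, one would pass the class in question, e.g.:
        // locationOf(Class.forName("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))
    }
}
```

If the printed path is not the replaced jar, some other copy earlier on the classpath is winning.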
RE: difference between add jar in hive session and hive --auxpath
This is in Hive 0.9.0:

hive> list jars;
/nfs_home/common/userlibs/google-collections-1.0.jar
/nfs_home/common/userlibs/elephant-bird-hive-3.0.7.jar
/nfs_home/common/userlibs/protobuf-java-2.3.0.jar
/nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar
file:/usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar
hive> desc table;
java.lang.NoClassDefFoundError: com/twitter/elephantbird/mapreduce/io/ProtobufConverter
at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.initialize(ProtobufDeserializer.java:45)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:203)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:260)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:253)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:490)
at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:162)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:930)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:844)
at org.apache.hadoop.hive.ql.exec.DDLTask.describeTable(DDLTask.java:2545)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:309)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1331)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1117)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:950)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:744)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassNotFoundException: com.twitter.elephantbird.mapreduce.io.ProtobufConverter
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 25 more
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> exit;
[y130zhan@daca2 userlibs]$ jar tvf /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar | grep ProtobufConverter
 4825 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter.class
  732 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter$1.class

From: vkavul...@outlook.com To: user@hive.apache.org Subject: RE: difference between add jar in hive session and hive --auxpath Date: Thu, 7 Mar 2013 16:44:41 -0800

If properly done, "add jar <jar-file>" should work the same as passing the jar with --auxpath. Can you run the "list jars;" command from the CLI or Hue and check whether you see the jar file?

From: java8...@hotmail.com To: user@hive.apache.org Subject: difference between add jar in hive session and hive --auxpath Date: Thu, 7 Mar 2013 17:47:26 -0500

Hi, I have a hive table which uses the jar files provided by elephant-bird, a framework that integrates lzo- and google-protobuf-encoded data with hadoop/hive. If I start hive like this: hive --auxpath path_to_jars, querying my table works fine, but if I use "add jar" after I start the hive session, I get a ClassNotFoundException at query runtime for classes in those jars.
My questions are: 1) What is the difference between hive --auxpath and "add jar" in the hive session? 2) This problem makes it hard to access my table in HUE, as it only supports "add jar", not the --auxpath option. Any suggestions? Thanks Yong
A bug that belongs to Hive or Elephant-Bird?
Hi, Hive 0.9.0 + Elephant-Bird 3.0.7. I faced a problem using elephant-bird with hive. I know what may be causing this problem, but I don't know which side this bug belongs to. Let me explain the problem. If we define a google protobuf file with a field name like 'dateString' (the field name contains an uppercase 'S'), then when I query the table like this: select dateString from table, I get the following exception trace:

Caused by: java.lang.RuntimeException: cannot find field datestring from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@49aacd5f ...
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96)
at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)

Here is the code for the method that throws this error:

public static StructField getStandardStructFieldRef(String fieldName,
    List<? extends StructField> fields) {
  fieldName = fieldName.toLowerCase();
  for (int i = 0; i < fields.size(); i++) {
    if (fields.get(i).getFieldName().equals(fieldName)) {
      return fields.get(i);
    }
  }
  // For backward compatibility: fieldNames can also be integer Strings.
  try {
    int i = Integer.parseInt(fieldName);
    if (i >= 0 && i < fields.size()) {
      return fields.get(i);
    }
  } catch (NumberFormatException e) {
    // ignore
  }
  throw new RuntimeException("cannot find field " + fieldName + " from " + fields);
  // return null;
}

I understand the problem happens because at this point the fieldName is datestring (all lowercase characters), but the List<? extends StructField> fields contains the fieldName dateString for that field, and that is why the RuntimeException happens. But I don't know which side this bug belongs to, and I want to know more of the inside details of the Hive implementation contract. From this link: https://cwiki.apache.org/Hive/user-faq.html#UserFAQ-AreHiveQLidentifiers%2528e.g.tablenames%252Ccolumnnames%252Cetc%2529casesensitive%253F I know that in hive, table names and column names should be case insensitive, so even though I used select dateString in my query, the fieldName is changed to datestring in the code; but the StructField of the ObjectInspector from elephant-bird returns the EXACT fieldname defined in the code, dateString in this case. Of course, I can change my proto file to use only lowercase field names to bypass this bug, but my questions are: 1) If I implement my own ObjectInspector, should I pay attention to the field name? Does it need to be lowercase? 2) I would consider this a bug of hive, right?
If this line: fieldName = fieldName.toLowerCase(); lowercases the requested name, then the comparison should also lowercase the other side, by changing if (fields.get(i).getFieldName().equals(fieldName)) to if (fields.get(i).getFieldName().toLowerCase().equals(fieldName)), right? Thanks Yong
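To double-check that reasoning, here is a self-contained sketch of the lookup with the proposed two-sided lowercasing (using a stand-in class instead of Hive's real StructField, since this is only meant to illustrate the comparison):

```java
import java.util.Arrays;
import java.util.List;

public class FieldLookup {
    // Minimal stand-in for Hive's StructField: just a field-name holder.
    static class StructFieldRef {
        final String name;
        StructFieldRef(String name) { this.name = name; }
    }

    // Lookup with the fix applied: lowercase BOTH sides before comparing,
    // so 'datestring' matches a field declared as 'dateString'.
    static StructFieldRef find(String fieldName, List<StructFieldRef> fields) {
        String wanted = fieldName.toLowerCase();
        for (StructFieldRef f : fields) {
            if (f.name.toLowerCase().equals(wanted)) {
                return f;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<StructFieldRef> fields = Arrays.asList(new StructFieldRef("dateString"));
        // The original one-sided comparison fails here...
        System.out.println("dateString".equals("datestring")); // false
        // ...while the two-sided lowercasing finds the field.
        System.out.println(find("datestring", fields) != null); // true
    }
}
```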
RE: reg : getting table values in inputFormat in serde
Actually, I second this question. In addition to that, I wonder if it is possible to access the table properties from a UDF too. I also have XML data, but with namespaces in it. The XPATH UDF that comes with Hive doesn't support namespaces. Supporting namespaces in XML is simple; it just needs a NamespaceContext implementation. But the tricky part in making a generic solution is that we want some properties passed to the underlying InputFormat or an xpath_ns UDF, for example the prefix/URI mapping, or the name of the class that implements the NamespaceContext interface. Right now it is not easy to pass this kind of information, even if we define it in the table properties. So currently I have to hard-code it into the UDF or InputFormat, instead of providing a generic solution. Yong

From: mohit_chaudhar...@infosys.com To: user@hive.apache.org Subject: RE: reg : getting table values in inputFormat in serde Date: Fri, 21 Dec 2012 11:39:54 +

Actually I am storing data from an xml file into a hive table using a serde. Now I want the table properties that I gave during table creation to be available in the inputformat.

From: Nitin Pawar [mailto:nitinpawar...@gmail.com] Sent: Friday, December 21, 2012 3:16 PM To: user@hive.apache.org Subject: Re: reg : getting table values in inputFormat in serde

What kind of table properties do you want to access, and what do you want to achieve from them in the serde?

On Fri, Dec 21, 2012 at 11:04 AM, Mohit Chaudhary01 mohit_chaudhar...@infosys.com wrote: Hi, in a serde program I need the table properties in the inputformat. So can anyone please tell me how I can do this? thanks

-- Nitin Pawar
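On the mechanics: a SerDe does receive the table properties as a java.util.Properties via its initialize(Configuration, Properties) method, so a prefix/URI mapping could at least be parsed there. The property name below ("xml.namespaces") and its format are invented for illustration, not an existing Hive convention:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TblProps {
    // Parse a hypothetical "xml.namespaces" table property of the form
    // "prefix1=uri1,prefix2=uri2" into a map, the way one might inside
    // SerDe.initialize(Configuration conf, Properties tbl).
    static Map<String, String> parseNamespaces(Properties tbl) {
        Map<String, String> m = new HashMap<>();
        String raw = tbl.getProperty("xml.namespaces", "");
        for (String pair : raw.split(",")) {
            if (pair.isEmpty()) continue;
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) m.put(kv[0].trim(), kv[1].trim());
        }
        return m;
    }

    public static void main(String[] args) {
        Properties tbl = new Properties();
        tbl.setProperty("xml.namespaces", "atom=http://www.w3.org/2005/Atom");
        System.out.println(parseNamespaces(tbl)); // {atom=http://www.w3.org/2005/Atom}
    }
}
```

Getting the same properties into an InputFormat or a UDF is the part this thread says Hive does not make easy; the sketch only covers the SerDe side.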
xpath UDF in hive support namespace?
Hi, I have a question related to the XPATH UDF currently in Hive. From the original Jira story about this UDF: https://issues.apache.org/jira/browse/HIVE-1027, it looks like the UDF doesn't support namespaces in the XML; is that true? Does any later Hive version support namespaces, and if so, which version? And if this UDF does support namespaces, does anyone have examples of how to do it with the xpath UDF in hive? All the examples on the wiki page are without namespaces. Thanks Yong
RE: xpath UDF in hive support namespace?
Hi, Mark: Thanks for your response. I am thinking about how to change it to support namespaces. The easy way is to allow the end user to pass in properties mapping prefixes to URIs, but in a Hive UDF it is not easy to let end users pass in arbitrary properties the way a SerDe can. Maybe that's the reason it wasn't supported initially. Yong

Date: Wed, 19 Dec 2012 14:04:20 -0800 Subject: Re: xpath UDF in hive support namespace? From: grover.markgro...@gmail.com To: user@hive.apache.org

Hi, the source code for the xpath UDFs is at https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/xml I checked and saw that there were no changes made to these UDFs after their initial creation in HIVE-1027. So, that's all we have. Sorry. However, if you would like to make additions to those UDFs, you are welcome to do so by creating a JIRA and posting a patch. UDFs are an easy and excellent way to contribute back to the Hive community. Thanks! Mark

On Wed, Dec 19, 2012 at 8:52 AM, java8964 java8964 java8...@hotmail.com wrote: Hi, I have a question related to the XPATH UDF currently in Hive. From the original Jira story about this UDF: https://issues.apache.org/jira/browse/HIVE-1027, it looks like the UDF doesn't support namespaces in the XML; is that true? Does any later Hive version support namespaces, and if so, which version? And if this UDF does support namespaces, does anyone have examples of how to do it with the xpath UDF in hive? All the examples on the wiki page are without namespaces. Thanks Yong
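For reference, the JDK's own XPath engine handles namespaces once it is given a NamespaceContext mapping prefixes to URIs, which is essentially what a namespace-aware xpath UDF would have to wire up. The XML, prefix, and URI below are made up for the example:

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathNs {
    // Minimal prefix -> URI mapping; a UDF would build this from user input.
    static NamespaceContext nsContext() {
        return new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "a".equals(prefix) ? "http://example.com/ns"
                                          : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root xmlns=\"http://example.com/ns\"><name>yong</name></root>";
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(nsContext());
        // Without the context, /root/name would not match these namespaced nodes.
        System.out.println(xpath.evaluate("/a:root/a:name/text()",
                new InputSource(new StringReader(xml)))); // yong
    }
}
```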
RE: Array index support non-constant expression
OK. I followed the hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple. It works fine as I expected for the simple case, but when I try to run it in a more complex query, the hive MR jobs fail with some strange errors. What I mean is that it fails in the HIVE code base; from the stack trace I cannot see how this failure has anything to do with my custom code. I would appreciate some help if someone can tell me what went wrong. For example, I created this UDF called darray, short for dynamic array, which supports a non-constant value as the index location of the array. The following query works fine as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;
POI ADDRESS some address
POI
POI ADDRESSS some address

Of course, in this case I only want the rows with provider_str = 'POI' returned, and to filter out any rows with provider_str != 'POI', so it sounds simple; I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED:
Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I only added a WHERE condition, but to my surprise the MR jobs generated by HIVE failed. I am testing this in my local standalone cluster, which is running the CDH3U3 release. When I check the hadoop userlog, here is what I got:

2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT struct_col0:bigint,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:boolean,_col7:boolean,_col8:boolean,_col9:boolean,_col10:boolean,_col11:boolean,_col12:string,_col13:string,_col14:structlat:double,lon:double,query_text_raw:string,query_text_normalized:string,query_string:string,llcountry:string,ipcountry:string,request_cnt:int,address:structcountry:string,state:string,zip:string,city:string,street:string,house:string,categories_id:arrayint,categories_name:arraystring,lang_raw:string,lang_rose:string,lang:string,viewport:structtop_lat:double,left_lon:double,bottom_lat:double,right_lon:double,_col15:structversions:int,physical_host:string,nose_request_id:string,client_type:string,ip:int,time_taken:int,user_agent:string,http_host:string,http_referrer:string,http_status:smallint,http_size:int,accept_language:string,md5:string,datacenter:string,tlv_map_data_version:string,tlv_devide_software_version:string,csid:int,rid:string,xncrid:string,cbfn:string,sources:arraystructtm:bigint,tm_date:string,tm_time:string,md5:string,time_taken:int,_col16:arraystructprovider_str:string,name:string,lat:double,lon:double,dyn:boolean,authoritative:boolean,search_center:boolean,_col17:arraystructrank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean,_col18:string,_col19:structrank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean
2012-12-12 11:40:22,440 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
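Not the UDF itself (a real GenericUDF needs Hive's ObjectInspector machinery, which is omitted here), but the bounds-checked element access that a darray(array, index) evaluate() boils down to can be sketched and sanity-checked in plain Java; all names here are illustrative:

```java
import java.util.Arrays;
import java.util.List;

public class DArray {
    // Core logic a darray(array, index) UDF would need: bounds-checked,
    // null-safe element access that returns null instead of throwing.
    static <T> T elementAt(List<T> arr, Integer idx) {
        if (arr == null || idx == null || idx < 0 || idx >= arr.size()) {
            return null;
        }
        return arr.get(idx);
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("POI", "ADDRESS");
        System.out.println(elementAt(results, 1)); // ADDRESS
        System.out.println(elementAt(results, 5)); // null (index out of range)
    }
}
```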
RE: Array index support non-constant expression
Hi, I played with my query further, and found it very puzzling to explain the following behavior:

1) The following query works:

select c_poi.provider_str, c_poi.name
from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a

I get all the results from the above query without any problem.

2) The following query does NOT work:

select c_poi.provider_str, c_poi.name
from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a
where c_poi.provider_str = 'POI'

As soon as I add the where criteria on provider_str, or even add another level of subquery like the following:

select ps, name
from (select c_poi.provider_str as ps, c_poi.name as name
      from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a) b
where ps = 'POI'

any kind of criteria I try to add on provider_str makes the hive MR jobs fail with the same error shown below. Any idea why this happens? Is it related to the data? But provider_str is just a simple String type. Thanks Yong

From: java8...@hotmail.com To: user@hive.apache.org Subject: RE: Array index support non-constant expression Date: Wed, 12 Dec 2012 12:15:27 -0500

OK. I followed the hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple. It works fine as I expected for the simple case, but when I try to run it in a more complex query, the hive MR jobs fail with some strange errors. What I mean is that it fails in the HIVE code base; from the stack trace I cannot see how this failure has anything to do with my custom code. I would appreciate some help if someone can tell me what went wrong. For example, I created this UDF called darray, short for dynamic array, which supports a non-constant value as the index location of the array.
The following query works fine as I expected: hive select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;POI ADDRESS some addressPOIPOI ADDRESSS some address Of course, in this case, I only want the provider_str = 'POI' returned, and filter out any rows with provider_str != 'POI', so it sounds simple, I changed the query to the following: hive select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;Total MapReduce jobs = 1Launching Job 1 out of 1Number of reduce tasks is set to 0 since there's no reduce operatorCannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_01002012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%Ended Job = job_201212031001_0100 with errorsFAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask I am only add a Where limitation, but to my surprise, the MR jobs generated by HIVE failed. I am testing this in my local standalone cluster, which is running CDH3U3 release. When I check the hadoop userlog, here is what I got: 2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT
RE: Array index support non-constant expression
) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)
... 22 more
2012-12-12 20:36:21,365 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

Date: Thu, 13 Dec 2012 09:06:25 +0900 Subject: Re: Array index support non-constant expression From: navis@nexr.com To: user@hive.apache.org

Could you try it with CP/PPD disabled?

set hive.optimize.cp=false;
set hive.optimize.ppd=false;

2012/12/13 java8964 java8964 java8...@hotmail.com:

Hi, I played with my query further, and found it very puzzling to explain the following behavior:

1) The following query works:

select c_poi.provider_str, c_poi.name
from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a

I get all the results from the above query without any problem.
2) The following query does NOT work:

select c_poi.provider_str, c_poi.name
from (select darray(search_results, c.rank) as c_poi
      from nulf_search lateral view explode(search_clicks) clickTable as c) a
where c_poi.provider_str = 'POI'

As soon as I add the where criteria on provider_str, or even add another level of sub-query like the following:

select ps, name
from (select c_poi.provider_str as ps, c_poi.name as name
      from (select darray(search_results, c.rank) as c_poi
            from nulf_search lateral view explode(search_clicks) clickTable as c) a) b
where ps = 'POI'

any kind of criteria I try to add on provider_str makes the hive MR jobs fail with the same error shown below. Any idea why this happens? Is it related to the data? But provider_str is just a simple String type.

Thanks
Yong

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Array index support non-constant expression
Date: Wed, 12 Dec 2012 12:15:27 -0500

OK. I followed the hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple. It works fine as I expected for simple cases, but when I run it in a more complex query, the hive MR jobs fail with some strange errors. What I mean is that it fails inside the HIVE code base; from the stack trace, I cannot see that this failure has anything to do with my custom code. I would appreciate some help if someone can tell me what went wrong. For example, I created a UDF called darray, standing for dynamic array, which supports a non-constant value as the index location of the array.
The following query works fine as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name
      from (select darray(search_results, c.index_loc) as c_poi
            from search_table lateral view explode(search_clicks) clickTable as c) a
      limit 5;
POI ADDRESS some address POI POI ADDRESSS some address

Of course, in this case I only want the rows with provider_str = 'POI' returned, and to filter out any rows with provider_str != 'POI'. It sounded simple, so I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name
      from (select darray(search_results, c.rank) as c_poi
            from search_table lateral view explode(search_clicks) clickTable as c) a
      where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%, reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I only added a WHERE condition, but to my surprise, the MR jobs generated by HIVE failed. I am testing this on my local standalone cluster, which is running the CDH3U3 release. When I check the hadoop userlog, here is what I got:

2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: SELECT
Array index support non-constant expression
Hi,

In our project using HIVE on the CDH3U4 release (Hive 0.7.1), I have a hive table like the following:

Table foo (
  search_results array<struct<id:bigint, ...>>,
  search_clicks array<struct<index_loc:int, ...>>
)

As you can see, the 2nd column, which represents the list of search results clicked, contains the index location of which result was clicked (starting from 0). Now I need to find out the click count of some IDs, but these IDs only exist in the search_results struct. When I tried to write a query for this, I first tried the following:

select search_results[c.index_loc] from foo lateral view explode(search_clicks) clickTable as c;

But it won't work in hive, as the following error message is returned:

FAILED: Error in semantic analysis: Line 1:7 Non-constant expressions for array indexes not supported rank

It looks like Hive (at least 0.7) does NOT support non-constant expressions as the array index. I searched on google, and found the following HIVE jira ticket:

https://issues.apache.org/jira/browse/HIVE-1955

It looks like someone has made the same request to support it, but it is not implemented yet. However, there is a comment in the above ticket that it can be done in a UDF. My question is, can anyone share some ideas about how to achieve this in a UDF, as it may be the only option for me right now? Thanks
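For reference, the evaluate-time logic such a "darray" UDF needs is small. Below is a dependency-free sketch of just that logic (the method name and null handling are illustrative; a real Hive UDF would extend GenericUDF and receive its arguments through ObjectInspectors, which is the part this sketch omits):

```java
import java.util.Arrays;
import java.util.List;

public class DArray {
    // Index an array with a non-constant expression, returning null
    // when the index is out of range (mirroring Hive's lenient
    // array-indexing semantics instead of throwing).
    static <T> T darray(List<T> arr, Integer idx) {
        if (arr == null || idx == null || idx < 0 || idx >= arr.size()) {
            return null;
        }
        return arr.get(idx);
    }

    public static void main(String[] args) {
        List<String> searchResults = Arrays.asList("poiA", "poiB", "poiC");
        System.out.println(darray(searchResults, 1)); // poiB
        System.out.println(darray(searchResults, 7)); // null, not an exception
    }
}
```

Inside a GenericUDF, this method body would sit in evaluate(), with initialize() verifying that the first argument is a list and the second an int.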
RE: need help on writing hive query
If you don't need to join current_web_page and previous_web_page, assuming you can just trust the time stamp, as Phil points out, a custom collect_list() UDF is the way to go. You need to implement the collect_list() UDF yourself; hive doesn't have one by default, but it should be straightforward. In fact, you can reuse the code of collect_set() and replace the internal set with a Java ArrayList, then:

select user_id, collect_list(user_current_web_page)
from (
  select user_id, user_current_web_page
  order by user_id asc, user_visiting_time asc
) a
group by user_id

Yong

Subject: Re: need help on writing hive query
From: matthewt...@gmail.com
Date: Wed, 31 Oct 2012 17:53:06 -0400
To: user@hive.apache.org

I did a similar query a few months ago. In short, I left-padded the page name with the time stamp, grouped with collect_set, and then used sort_array(). There was some other cleanup work and converting back to string to remove the time stamps, but it remained in order. If there's an easier way, please let me know.

Matt Tucker

On Oct 31, 2012, at 5:37 PM, Tom Brown tombrow...@gmail.com wrote:

It wouldn't retrieve the user's path in a single string, but you could simply select the user id and current page, ordered by the timestamp. It would require a second step to turn it into the single string path, so that might be a deal-breaker.

--Tom

On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans philip.j.trom...@gmail.com wrote:

You could use collect_set() and GROUP BY. That wouldn't preserve order though.

Phil.

On Oct 31, 2012 9:18 PM, qiaoresearcher qiaoresearc...@gmail.com wrote:

Hi all, here is the question. Assume we have a table like:

user_id || user_visiting_time || user_current_web_page || user_previous_web_page
user 1     time (1,1)             page 1                   page 0
user 1     time (1,2)             page 2                   page 1
user 1     time (1,3)             page 3                   page 2
...
user n     time (n,1)             page 1                   page 0
user n     time (n,2)             page 2                   page 1
user n     time (n,3)             page 3                   page 2

That is, in each row we know the current web page the user is viewing, and the previous web page the user came from. Now we want to generate, for each user, a list recording the complete path the user has taken, i.e., how can we use hive to generate output like:

user 1 : page 1 page 2 page 3 page 4 ... (till we reach the beginning page of user 1)
user 2 : page 1 page 2 page 3 page 4 page 5 ... (till we reach the beginning page of user 2)

The web pages viewed by user 1 and user 2 might be different. Can we generate this using hive? thanks,
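The collect_list()-over-sorted-rows approach suggested in this thread can be sketched without any Hive machinery: once the rows are sorted by (user_id, visit_time), the aggregation is just an order-preserving group-by. A minimal sketch (class and method names are illustrative, and rows are assumed pre-sorted, which is what the ORDER BY in the subquery provides):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClickPath {
    // Mimics: SELECT user_id, collect_list(page) ... GROUP BY user_id
    // where each input row is {userId, visitTime, currentPage} and the
    // rows are already sorted by (userId, visitTime).
    static Map<String, List<String>> collectPaths(List<String[]> rows) {
        Map<String, List<String>> paths = new LinkedHashMap<>();
        for (String[] r : rows) {
            // Append preserves encounter order, unlike collect_set()'s set.
            paths.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[2]);
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"user1", "t1", "page1"},
            new String[]{"user1", "t2", "page2"},
            new String[]{"user2", "t1", "page1"});
        System.out.println(collectPaths(rows));
        // {user1=[page1, page2], user2=[page1]}
    }
}
```

This is also why replacing collect_set()'s internal set with an ArrayList works: the only behavioral change needed is order-preserving append.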
is it possible to disable running terminatePartial and merge() methods in UDAF
Hi,

I am trying to implement a UDAF for Kurtosis (http://en.wikipedia.org/wiki/Kurtosis) in hive. I already found a library for it in Apache commons math (http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/moment/Kurtosis.html), but it does NOT support merging partial results. I am not a Math guru, so I don't know if this Kurtosis can be computed in parallel or not.

My question is simple: if I use the above library to implement this UDAF in hive, is there a configuration in hive (either at runtime or at the UDAF level) to make sure that for the UDAF I implemented, terminatePartial/merge will NEVER be invoked? I have 2 settings I think may do the trick:

1) set hive.exec.reducers.max=1
2) set hive.map.aggr=false

Will either one do what I want? Or anything else? Or is it even possible?

Thanks
Yong
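The parallelism question is not settled in the thread, but for reference: kurtosis is mergeable if the partial state is kept as power sums (n, Σx, Σx², Σx³, Σx⁴), which add component-wise across partitions. A sketch of the idea, computing the population kurtosis m4/m2² (note this is NOT the bias-corrected sample excess kurtosis that the Commons Math Kurtosis class returns; all names here are illustrative):

```java
import java.util.Arrays;

public class KurtosisMerge {
    // Power sums form a mergeable "partial" state - exactly what a UDAF's
    // terminatePartial()/merge() pair would exchange.
    static double[] sums(double[] xs) {
        double[] s = new double[5]; // {n, Σx, Σx², Σx³, Σx⁴}
        for (double x : xs) {
            s[0] += 1; s[1] += x; s[2] += x * x;
            s[3] += x * x * x; s[4] += x * x * x * x;
        }
        return s;
    }

    // Merging two partial states is plain component-wise addition.
    static double[] merge(double[] a, double[] b) {
        double[] m = new double[5];
        for (int i = 0; i < 5; i++) m[i] = a[i] + b[i];
        return m;
    }

    // Population kurtosis m4/m2² recovered from the power sums.
    static double kurtosis(double[] s) {
        double n = s[0], mu = s[1] / n;
        double m2 = s[2] / n - mu * mu;
        double m4 = s[4] / n - 4 * mu * s[3] / n
                  + 6 * mu * mu * s[2] / n - 3 * mu * mu * mu * mu;
        return m4 / (m2 * m2);
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 2, 3, 5, 8, 13};
        double direct = kurtosis(sums(xs));
        double merged = kurtosis(merge(sums(Arrays.copyOfRange(xs, 0, 3)),
                                       sums(Arrays.copyOfRange(xs, 3, 7))));
        // The two paths agree, since merging the sums is exact.
        System.out.println(direct + " == " + merged);
    }
}
```

A bias-corrected estimate can also be derived from the same five sums, so a proper terminatePartial/merge implementation is possible without the Commons Math class.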
RE: How can I get the constant value from the ObjectInspector in the UDF
I understand your message. But in this situation, I want to do the following:

1) I want to get the value 10 at the initialization stage. I understand your point that the value will only be available at the evaluate stage, but keep in mind that the 10 in my example is a constant value. It won't change for every evaluation. It is the kind of value I should be able to get in the initialization stage, right? The hive query analyzer should understand that this parameter of the function is in fact a constant, and should be able to provide it to me during the initialization stage.

2) A further question: can I get more information from the object inspector? For example, when I write the UDF, I want to make sure the first parameter is a numeric type. I can get the type, and I am able to validate based on it. But if I want to raise an error in some case, I want to show the end user the NAME of the parameter in my error message, instead of just its position. For example, for the UDF invocation msum(column_name, 10), if I find out that the type of column_name is NOT numeric, I want the error message I give to the end user to say that 'column_name' should be a numeric type. But right now, I cannot get this information from the API. The only thing I can get is the category type information, but I want more. Is it possible to do that in hive 0.7.1?

Thanks for your help.

Yong

Date: Thu, 27 Sep 2012 02:32:19 +0900
Subject: Re: How can I get the constant value from the ObjectInspector in the UDF
From: chen.song...@gmail.com
To: user@hive.apache.org

Hi Yong

The way GenericUDF works is as follows.

ObjectInspector initialize(ObjectInspector[] arguments) is called only once for one GenericUDF instance used in your Hive query. This phase is for preparation steps of the UDF, such as syntax checks and type inference.

Object evaluate(DeferredObject[] arguments) is called to evaluate against actual arguments.
This should be where the actual calculation happens and where you can get the real values you talked about.

Thanks,
Chen

On Wed, Sep 26, 2012 at 4:17 AM, java8964 java8964 java8...@hotmail.com wrote:

Hi,

I am using the Cloudera release cdh3u3, which has the hive 0.7.1 version. I am trying to write a hive UDF to calculate a moving sum. Right now, I am having trouble getting the constant value passed in at the initialization stage. For example, let's assume the function has the following format:

msum(salary, 10)

salary is an int type column, and the end user wants to calculate the moving sum of the last 10 rows of salary. I roughly know how to implement this UDF, but I have one problem right now:

1) This is not a UDAF, as each row will return one value back as the moving sum.
2) I created a UDF class extending GenericUDF.
3) I can get the column types from the ObjectInspector[] passed to me in the initialize() method to verify that 'salary' and 10 both need to be numeric types (the latter needs to be an integer).
4) But I also want to get the real value of 10 in the initialize() stage, so I can create the corresponding data structure based on the value the end user specified.
5) I looked around the javadoc of the ObjectInspector class. I know at run time the real class of the 2nd parameter is WritableIntObjectInspector. I can get the type, but how can I get the real value of it?
6) This is a kind of ConstantObjectInspector, which should be able to give the value to me, as it already knows the type is int. But how?
7) I don't want to get the value at the evaluate stage. Can I get it at the initialize stage?

Thanks
Yong

-- Chen Song
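Chen's initialize-once/evaluate-per-row lifecycle maps directly onto the msum(salary, 10) example. A dependency-free sketch of that split (all names are illustrative; a real GenericUDF would receive the window size through its ObjectInspector arguments rather than as a plain int):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class MovingSum {
    private Deque<Long> window;
    private int size;
    private long sum;

    // ~ GenericUDF.initialize(): called once per query instance.
    // Validate arguments and build state sized by the constant argument.
    void initialize(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.size = windowSize;
        this.window = new ArrayDeque<>();
        this.sum = 0;
    }

    // ~ GenericUDF.evaluate(): called once per row, returns one value per row
    // (which is why this is a UDF, not a UDAF).
    long evaluate(long value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst();
        }
        return sum;
    }

    public static void main(String[] args) {
        MovingSum msum = new MovingSum();
        msum.initialize(3);
        for (long v : new long[]{1, 2, 3, 4}) {
            System.out.print(msum.evaluate(v) + " ");
        }
        // 1 3 6 9
    }
}
```

The point of the sketch is the separation of concerns: everything that depends only on the constant argument belongs in initialize(), and only per-row work belongs in evaluate().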
How can I get the constant value from the ObjectInspector in the UDF
Hi,

I am using the Cloudera release cdh3u3, which has the hive 0.7.1 version. I am trying to write a hive UDF to calculate a moving sum. Right now, I am having trouble getting the constant value passed in at the initialization stage. For example, let's assume the function has the following format:

msum(salary, 10)

salary is an int type column, and the end user wants to calculate the moving sum of the last 10 rows of salary. I roughly know how to implement this UDF, but I have one problem right now:

1) This is not a UDAF, as each row will return one value back as the moving sum.
2) I created a UDF class extending GenericUDF.
3) I can get the column types from the ObjectInspector[] passed to me in the initialize() method to verify that 'salary' and 10 both need to be numeric types (the latter needs to be an integer).
4) But I also want to get the real value of 10 in the initialize() stage, so I can create the corresponding data structure based on the value the end user specified.
5) I looked around the javadoc of the ObjectInspector class. I know at run time the real class of the 2nd parameter is WritableIntObjectInspector. I can get the type, but how can I get the real value of it?
6) This is a kind of ConstantObjectInspector, which should be able to give the value to me, as it already knows the type is int. But how?
7) I don't want to get the value at the evaluate stage. Can I get it at the initialize stage?

Thanks
Yong
Question about org.apache.hadoop.hive.contrib.serde2.RegexSerDe
Hi,

I have a question about the behavior of the class org.apache.hadoop.hive.contrib.serde2.RegexSerDe. Here is an example I tested using the Cloudera hive-0.7.1-cdh3u3 release. The above class did NOT do what I expected; does anyone know the reason?

user:~/tmp> more Test.java

import java.io.*;
import java.text.*;

class Test {
    public static void main(String[] argv) throws Exception {
        String line = "aaa,\"bbb\",\"cc,c\"";
        String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        int i = 1;
        for (String t : tokens) {
            System.out.println(i + " " + t);
            i++;
        }
    }
}

:~/tmp> java Test
1 aaa
2 "bbb"
3 "cc,c"

As you can see, the Java regular expression ,(?=([^"]*"[^"]*")*[^"]*$) did what I want it to do: it parses the string aaa,"bbb","cc,c" into 3 tokens: (aaa), ("bbb"), and ("cc,c"). So the regular expression works fine. Now in hive:

:~> more test.txt
aaa,"bbb","cc,c"

:~> hive
Hive history file=/tmp/user/hive_job_log_user_201204031242_591028210.txt
hive> create table test (
          c1 string,
          c2 string,
          c3 string
      ) row format SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
          "input.regex" = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
      ) STORED AS TEXTFILE;
OK
Time taken: 0.401 seconds
hive> load data local inpath 'test.txt' overwrite into table test;
Copying data from file:/home/user/test.txt
Copying file: file:/home/user/test.txt
Loading data to table dev.test
Deleted hdfs://host/user/hive/warehouse/dev.db/test
OK
Time taken: 0.282 seconds
hive> select * from test;
OK
NULL NULL NULL

When I query this table, I don't get what I expected. I expect the output to be the 3 strings:

aaa "bbb" "cc,c"

Why does the output give me 3 NULLs? Thanks for your help.
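One likely cause, not stated in the thread: RegexSerDe does not split rows the way String.split() does. It matches input.regex against the entire row and maps capturing group i to column i, so a lookahead-based delimiter regex neither matches the whole line nor yields one group per column, and every column deserializes to NULL. A plain-Java sketch of a group-per-column pattern for this data (the exact pattern is illustrative, written for three fields where the last two are double-quoted):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSerDeStyle {
    public static void main(String[] args) {
        String row = "aaa,\"bbb\",\"cc,c\"";
        // One capturing group per column, matching the WHOLE row -
        // the contract RegexSerDe expects from input.regex.
        Pattern p = Pattern.compile("([^,\"]*),\"([^\"]*)\",\"([^\"]*)\"");
        Matcher m = p.matcher(row);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                System.out.println(i + " " + m.group(i)); // group i -> column i
            }
        } else {
            // This is the situation the split-style regex produces:
            // no full-row match, so RegexSerDe emits all NULLs.
            System.out.println("no match -> all columns NULL");
        }
    }
}
```

With a pattern like this as input.regex (and the quotes escaped for the DDL string literal), each group lands in c1/c2/c3 instead of NULLs.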