Hive Avro union data access
Hi, I have an Hive table created with 3 different union data types for alias_host column name as shown. (arraystring,string, null). CREATE EXTERNAL TABLE array_tests ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{name:sessions,type:record,fields:[{default:null,name:alias_host,type: [{ type : array, items : string },string,null]}]} ') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/arrayTests'; How to access and query the contents of this table in where clause. The queries below like these can be possible if the datatype is not union but when once I set the datatype as union the following queries are failing. Eg: select alias_host from array_tests where alias_host like ‘%test%’ limit 1000; Error: Error while processing statement: FAILED: SemanticException [Error 10016]: Line 1:32 Argument type mismatch 'alias_host': The 1st argument of EQUAL is expected to a primitive type, but union is found (state=42000,code=10016) Can anyone suggest how to access and query the contents of union data types. Regards Sathish Valluri smime.p7s Description: S/MIME cryptographic signature
HDFS Overwriting undo
Hello Everybody, I know that this question may concern Hadoop list but i ve made this mistake when using Hive. I created new database giving the location on HDFS but i found that it removed all other data that exist already. = before creation, the directory on HDFS contains : pns@app11:~$ hadoop fs -ls /user/hive/warehouse Found 25 items drwxr-xr-x - user1 supergroup 0 2013-11-20 13:40 */user/hive/warehouse/*dfy_ans_autres drwxr-xr-x - user1 supergroup 0 2013-11-20 13:40 /user/hive/warehouse/dfy_ans_maillog drwxr-xr-x - user1 supergroup 0 2013-11-20 14:28 /user/hive/warehouse/dfy_cnx drwxr-xr-x - user2 supergroup 0 2014-05-30 06:05 /user/hive/warehouse/pns.db drwxr-xr-x - user2 supergroup 0 2014-02-24 17:00 /user/hive/warehouse/pns_fr_integ drwxr-xr-x - user2 supergroup 0 2014-05-06 15:33 /user/hive/warehouse/pns_logstat.db hive -e CREATE DATABASE my_stats LOCATION 'hdfs://:9000 */user/hive/warehouse/*mystats.db' but now I couldn't see the other directories on HDFS: pns@app11:~/aalshabani$ hls /user/hive/warehouse Found 1 items drwxr-xr-x - user2 supergroup 0 2014-05-30 11:37 */user/hive/warehouse*/mystats.db Is there anyway I could restore the other data?? Best regards. Amjad
Hive Capacity Scheduler Queue in Query Context
Hi We want to share Yarn environment and we are using capacity scheduler. Now we want to share the same hiveserver2 also.How can we do it? Can we mention queue name in query context? Thanks Prabakaran N
Vectorization with UDFs returns incorrect results
Hive 0.13 Hadoop 2.4 I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, my query returns the correct results. Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1 My UDFs seem to be working for everything else except this specific circumstance. Is this a issue in the hive software or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way? I created a test scenario where I was able to reproduce this problem I am seeing: *TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1): * package com.test; import org.apache.hadoop.hive.ql.exec.Description; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import java.lang.String; import java.lang.*; public class tenThousand extends UDF { private final LongWritable result = new LongWritable(); public LongWritable evaluate() { result.set(1); return result; } } *TEST DATA (test.input):* 1|CBCABC|12 2|DBCABC|13 3|EBCABC|14 4|ABCABC|15 5|BBCABC|16 6|CBCABC|17 *CREATING ORC TABLE:* 0: jdbc:hive2://server:10002/db create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties (orc.compress = SNAPPY, orc.index = true); *CREATE LOADING TABLE:* 0: jdbc:hive2://server:10002/db create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile; *COPY IN DATA:* [root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/. *ORC DATA:* [root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e insert into table testTabOrc partition(range) select * from loadingDir; *LOAD TEST FUNCTION:* 0: jdbc:hive2://server:10002/db add jar /opt/hadoop/lib/testFunction.jar 0: jdbc:hive2://server:10002/db create temporary function ten_thousand as 'com.test.tenThousand'; *TURN OFF VECTORIZATION:* 0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=false; *QUERY (RESULTS AS EXPECTED):* 0: jdbc:hive2://server:10002/db select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995; ++ | first | ++ | 1 | | 2 | | 3 | ++ 3 rows selected (15.286 seconds) *TURN ON VECTORIZATION:* 0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=true; *QUERY AGAIN (WRONG RESULTS):* 0: jdbc:hive2://server:10002/db select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995; ++ | first | ++ ++ No rows selected (17.763 seconds)
RE: Vectorization with UDFs returns incorrect results
When you turn vectorized on, does the following query consistently return 1 in the output? select ten_thousand() from testTabOrc Yong Date: Fri, 30 May 2014 08:24:43 -0400 Subject: Vectorization with UDFs returns incorrect results From: bbowman...@gmail.com To: user@hive.apache.org Hive 0.13 Hadoop 2.4 I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, my query returns the correct results. Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1 My UDFs seem to be working for everything else except this specific circumstance. Is this a issue in the hive software or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way? I created a test scenario where I was able to reproduce this problem I am seeing: TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1): package com.test; import org.apache.hadoop.hive.ql.exec.Description;import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.LongWritable;import org.apache.hadoop.io.Text;import java.lang.String; import java.lang.*; public class tenThousand extends UDF { private final LongWritable result = new LongWritable(); public LongWritable evaluate() {result.set(1);return result; }} TEST DATA (test.input):1|CBCABC|12 2|DBCABC|133|EBCABC|144|ABCABC|155|BBCABC|16 6|CBCABC|17 CREATING ORC TABLE: 0: jdbc:hive2://server:10002/db create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties (orc.compress = SNAPPY, orc.index = true); CREATE LOADING TABLE:0: jdbc:hive2://server:10002/db create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile; COPY IN DATA:[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/. ORC DATA:[root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e insert into table testTabOrc partition(range) select * from loadingDir; LOAD TEST FUNCTION:0: jdbc:hive2://server:10002/db add jar /opt/hadoop/lib/testFunction.jar 0: jdbc:hive2://server:10002/db create temporary function ten_thousand as 'com.test.tenThousand'; TURN OFF VECTORIZATION:0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=false; QUERY (RESULTS AS EXPECTED):0: jdbc:hive2://server:10002/db select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995; ++| first |++ | 1 || 2 || 3 |++ 3 rows selected (15.286 seconds) TURN ON VECTORIZATION: 0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=true; QUERY AGAIN (WRONG RESULTS): 0: jdbc:hive2://server:10002/db select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995; ++| first |++++ No rows selected (17.763 seconds)
RE: Hive Avro union data access
Your alias_host column is an array, from your Avro specification, right? If so, just use [] to access the specified element in the array select alias_host[0] from array_tests where aliat_host[0] like '%test%' If you want to query all the elements in the array, google explode lateral view of hive. Yong From: sathish.vall...@emc.com To: user@hive.apache.org Subject: Hive Avro union data access Date: Fri, 30 May 2014 06:21:19 + Hi, I have an Hive table created with 3 different union data types for alias_host column name as shown. (arraystring,string, null). CREATE EXTERNAL TABLE array_tests ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{name:sessions,type:record,fields:[{default:null,name:alias_host,type: [{ type : array, items : string},string,null]}]}') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/arrayTests'; How to access and query the contents of this table in where clause.The queries below like these can be possible if the datatype is not union but when once I set the datatype as union the following queries are failing. Eg: select alias_host from array_tests where alias_host like ‘%test%’ limit 1000;Error: Error while processing statement: FAILED: SemanticException [Error 10016]: Line 1:32 Argument type mismatch 'alias_host': The 1st argument of EQUAL is expected to a primitive type, but union is found (state=42000,code=10016) Can anyone suggest how to access and query the contents of union data types. RegardsSathish Valluri
Re: Vectorization with UDFs returns incorrect results
Yong, Thanks for the reply. Yes the query select ten_thousand() from testTabOrc does return consistent results regardless of the vectorization setting. I ran that query with the vectorization off and then turned it on and ran it 5 more times. Each time it returned correct results. See output below. Thanks, Ben = 0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (17.807 seconds) 0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=true; No rows affected (0.002 seconds) 0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (17.23 seconds) 0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (14.267 seconds) 0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (18.259 seconds) 0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (18.778 seconds) On Fri, May 30, 2014 at 10:52 AM, java8964 java8...@hotmail.com wrote: When you turn vectorized on, does the following query consistently return 1 in the output? select ten_thousand() from testTabOrc Yong -- Date: Fri, 30 May 2014 08:24:43 -0400 Subject: Vectorization with UDFs returns incorrect results From: bbowman...@gmail.com To: user@hive.apache.org Hive 0.13 Hadoop 2.4 I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, my query returns the correct results. Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1 My UDFs seem to be working for everything else except this specific circumstance. Is this a issue in the hive software or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way? I created a test scenario where I was able to reproduce this problem I am seeing: *TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1): * package com.test; import org.apache.hadoop.hive.ql.exec.Description; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import java.lang.String; import java.lang.*; public class tenThousand extends UDF { private final LongWritable result = new LongWritable(); public LongWritable evaluate() { result.set(1); return result; } } *TEST DATA (test.input):* 1|CBCABC|12 2|DBCABC|13 3|EBCABC|14 4|ABCABC|15 5|BBCABC|16 6|CBCABC|17 *CREATING ORC TABLE:* 0: jdbc:hive2://server:10002/db create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties (orc.compress = SNAPPY, orc.index = true); *CREATE LOADING TABLE:* 0: jdbc:hive2://server:10002/db create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile; *COPY IN DATA:* [root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/. *ORC DATA:* [root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e insert into table testTabOrc partition(range) select * from loadingDir; *LOAD TEST FUNCTION:* 0: jdbc:hive2://server:10002/db add jar /opt/hadoop/lib/testFunction.jar 0: jdbc:hive2://server:10002/db create temporary function ten_thousand as 'com.test.tenThousand'; *TURN OFF VECTORIZATION:* 0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=false; *QUERY (RESULTS AS EXPECTED):* 0: jdbc:hive2://server:10002/db select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995; ++ | first | ++ | 1 | | 2 | | 3 | ++ 3 rows selected (15.286 seconds) *TURN ON VECTORIZATION:* 0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=true; *QUERY AGAIN (WRONG RESULTS):* 0: jdbc:hive2://server:10002/db select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995; ++ | first
Help with Query
We have a table with user entered queries, their IP. How could we write a query that will count and order queries by their count having a unique IP count X. For example if we had the same IP enter the same query Y times we wouldnlt want to include this in the final result unless there have been X-Y other IP's that searched for that query. Is this perhaps better suited fro Pig? Thanks
Re: Metastore 0.13 is not starting up.
Have you tried using schematool to upgrade the metastore schema ? https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool On Wed, May 21, 2014 at 4:29 AM, Biswajit Nayak biswajit.na...@inmobi.com wrote: Hi All, I had metastore(0.12) running previously. But after upgrading to 0.13 it is failing with below error message. The upgrade was a clean new setup. Additional details:- Mysql Version:- 5.6 Mysql Connector :- 5.1.30 Starting Hive Metastore Server log4j:WARN No such property [maxBackupIndex] in org.apache.log4j.DailyRollingFileAppender. javax.jdo.JDOUserException: Could not create increment/table value-generation container `SEQUENCE_TABLE` since autoCreate flags do not allow it. at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:549) at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732) at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752) at org.apache.hadoop.hive.metastore.ObjectStore.createDatabase(ObjectStore.java:458) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108) at $Proxy4.createDatabase(Unknown Source) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:509) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:524) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:398) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:357) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.init(RetryingHMSHandler.java:54) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59) at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4967) at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:5187) at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:5107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197) NestedThrowablesStackTrace: Could not create increment/table value-generation container `SEQUENCE_TABLE` since autoCreate flags do not allow it. org.datanucleus.exceptions.NucleusUserException: Could not create increment/table value-generation container `SEQUENCE_TABLE` since autoCreate flags do not allow it. at org.datanucleus.store.rdbms.valuegenerator.TableGenerator.createRepository(TableGenerator.java:261) at org.datanucleus.store.rdbms.valuegenerator.AbstractRDBMSGenerator.obtainGenerationBlock(AbstractRDBMSGenerator.java:162) at org.datanucleus.store.valuegenerator.AbstractGenerator.obtainGenerationBlock(AbstractGenerator.java:197) at org.datanucleus.store.valuegenerator.AbstractGenerator.next(AbstractGenerator.java:105) at org.datanucleus.store.rdbms.RDBMSStoreManager.getStrategyValueForGenerator(RDBMSStoreManager.java:2005) at org.datanucleus.store.AbstractStoreManager.getStrategyValue(AbstractStoreManager.java:1386) at org.datanucleus.ExecutionContextImpl.newObjectId(ExecutionContextImpl.java:3827) at org.datanucleus.state.JDOStateManager.setIdentity(JDOStateManager.java:2571) at org.datanucleus.state.JDOStateManager.initialiseForPersistentNew(JDOStateManager.java:513) at org.datanucleus.state.ObjectProviderFactoryImpl.newForPersistentNew(ObjectProviderFactoryImpl.java:232) at org.datanucleus.ExecutionContextImpl.newObjectProviderForPersistentNew(ExecutionContextImpl.java:1414) at org.datanucleus.ExecutionContextImpl.persistObjectInternal(ExecutionContextImpl.java:2218) at org.datanucleus.ExecutionContextImpl.persistObjectWork(ExecutionContextImpl.java:2065) at org.datanucleus.ExecutionContextImpl.persistObject(ExecutionContextImpl.java:1913) at org.datanucleus.ExecutionContextThreadedImpl.persistObject(ExecutionContextThreadedImpl.java:217) at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:727) at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752) at org.apache.hadoop.hive.metastore.ObjectStore.createDatabase(ObjectStore.java:458) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
RE: Vectorization with UDFs returns incorrect results
I downloaded the Hive13 and confirmed this problem existed. My suggestion is to change your query to select first from testTabOrc where first = ten_thousand()-1 and first = ten_thousand()-9995; which works in my environment (Hive 0.13 + hadoop 2.2) You can create a Jira for it. It looks like the problem caused by between with Vectorization and UDF. BTW, your UDF looks fine to me. Yong Date: Fri, 30 May 2014 11:22:18 -0400 Subject: Re: Vectorization with UDFs returns incorrect results From: bbowman...@gmail.com To: user@hive.apache.org Yong, Thanks for the reply. Yes the query select ten_thousand() from testTabOrc does return consistent results regardless of the vectorization setting. I ran that query with the vectorization off and then turned it on and ran it 5 more times. Each time it returned correct results. See output below. Thanks,Ben = 0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++| _c0 |++| 1 | | 1 || 1 || 1 || 1 | | 1 |++6 rows selected (17.807 seconds)0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=true; No rows affected (0.002 seconds)0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc;++ | _c0 |++| 1 || 1 | | 1 || 1 || 1 || 1 | ++6 rows selected (17.23 seconds)0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++| _c0 |++| 1 | | 1 || 1 || 1 || 1 | | 1 |++6 rows selected (14.267 seconds)0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++| _c0 |++| 1 | | 1 || 1 || 1 || 1 | | 1 |++6 rows selected (18.259 seconds)0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc; ++| _c0 |++| 1 | | 1 || 1 || 1 || 1 | | 1 |++6 rows selected (18.778 seconds) On Fri, May 30, 2014 at 10:52 AM, java8964 java8...@hotmail.com wrote: When you turn vectorized on, does the following query consistently return 1 in the output? select ten_thousand() from testTabOrc Yong Date: Fri, 30 May 2014 08:24:43 -0400 Subject: Vectorization with UDFs returns incorrect results From: bbowman...@gmail.com To: user@hive.apache.org Hive 0.13 Hadoop 2.4 I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, my query returns the correct results. Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1 My UDFs seem to be working for everything else except this specific circumstance. Is this a issue in the hive software or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way? I created a test scenario where I was able to reproduce this problem I am seeing: TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1): package com.test; import org.apache.hadoop.hive.ql.exec.Description;import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.LongWritable;import org.apache.hadoop.io.Text;import java.lang.String; import java.lang.*; public class tenThousand extends UDF { private final LongWritable result = new LongWritable(); public LongWritable evaluate() {result.set(1);return result; }} TEST DATA (test.input):1|CBCABC|12 2|DBCABC|133|EBCABC|144|ABCABC|155|BBCABC|16 6|CBCABC|17 CREATING ORC TABLE: 0: jdbc:hive2://server:10002/db create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties (orc.compress = SNAPPY, orc.index = true); CREATE LOADING TABLE:0: jdbc:hive2://server:10002/db create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile; COPY IN DATA:[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/. ORC DATA:[root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e insert into table testTabOrc partition(range) select * from loadingDir; LOAD TEST FUNCTION:0: jdbc:hive2://server:10002/db add jar /opt/hadoop/lib/testFunction.jar 0: jdbc:hive2://server:10002/db create temporary function ten_thousand as 'com.test.tenThousand'; TURN OFF VECTORIZATION:0: jdbc:hive2://server:10002/db set hive.vectorized.execution.enabled=false; QUERY (RESULTS AS EXPECTED):0: jdbc:hive2://server:10002/db select first from testTabOrc where first between ten_thousand()-1 and
Need urgent help on hive query performance
Hi, Does anybody help urgently on optimizing hive query performance? I am looking more Hadoop tuning point of view. Currently, small amount of table takes much time to query? We are running EMR cluster with 1 MASTER node, 2 Core Nodes and Task Nodes. Quick help is much appreciated. Thanks, Shouvanik This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com
Re: Need urgent help on hive query performance
I feel it's pretty hard to answer this without understanding the following: 1. What exactly are you trying to query? CSV? Avro? 2. Where is your data? HDFS? HBase? Local filesystem? 3. What version of hive are you using? 4. What is an example of a query that is slow? Some queries like joins and stuff would be inherently slower than other simpler ones(though can be optimized). Thanks, -- Swarnim On Fri, May 30, 2014 at 5:32 PM, shouvanik.hal...@accenture.com wrote: Can you please give a specific example or blog to refer to. I did not understand *From:* Ashish Garg [mailto:gargcreation1...@gmail.com] *Sent:* Friday, May 30, 2014 3:31 PM *To:* user@hive.apache.org *Subject:* Re: Need urgent help on hive query performance try partitioning the table and run the queries which are partition specific. Hope this helps. Thanks and Regards, Ashish Garg. On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote: Hi, Does anybody help urgently on optimizing hive query performance? I am looking more Hadoop tuning point of view. Currently, small amount of table takes much time to query? We are running EMR cluster with 1 MASTER node, 2 Core Nodes and Task Nodes. Quick help is much appreciated. Thanks, Shouvanik -- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Swarnim
Re: Need urgent help on hive query performance
hive Create External Table Emp( id INT, name STRING, Salary INT) PARTITIONED BY (Country STRING, State STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ LOCATION ‘/user/data/’; Now load the data which is partition specific. For example, hive LOAD DATA LOCAL INPATH ‘---‘ OVERWRITE INTO TABLE Emp PARTITION (Country=’US’ , State=’NJ’); Now try running queries like hive Select Count(*), MAX(Salary) FROM Emp Where Country='US' And State='NJ'; This will optimize your query performance. On Fri, May 30, 2014 at 6:32 PM, shouvanik.hal...@accenture.com wrote: Can you please give a specific example or blog to refer to. I did not understand *From:* Ashish Garg [mailto:gargcreation1...@gmail.com] *Sent:* Friday, May 30, 2014 3:31 PM *To:* user@hive.apache.org *Subject:* Re: Need urgent help on hive query performance try partitioning the table and run the queries which are partition specific. Hope this helps. Thanks and Regards, Ashish Garg. On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote: Hi, Does anybody help urgently on optimizing hive query performance? I am looking more Hadoop tuning point of view. Currently, small amount of table takes much time to query? We are running EMR cluster with 1 MASTER node, 2 Core Nodes and Task Nodes. Quick help is much appreciated. Thanks, Shouvanik -- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com
RE: Need urgent help on hive query performance
Pls find the answers From: kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com] Sent: Friday, May 30, 2014 3:34 PM To: user@hive.apache.org Subject: Re: Need urgent help on hive query performance I feel it's pretty hard to answer this without understanding the following: 1. What exactly are you trying to query? CSV? Avro? HIVE table 2. Where is your data? HDFS? HBase? Local filesystem? Data is in s3 3. What version of hive are you using? Hive 0.12 4. What is an example of a query that is slow? Some queries like joins and stuff would be inherently slower than other simpler ones(though can be optimized). It has innumerable no of joins. Since its client specific query, u understand I cannot share. Sorry about that Thanks, -- Swarnim On Fri, May 30, 2014 at 5:32 PM, shouvanik.hal...@accenture.commailto:shouvanik.hal...@accenture.com wrote: Can you please give a specific example or blog to refer to. I did not understand From: Ashish Garg [mailto:gargcreation1...@gmail.commailto:gargcreation1...@gmail.com] Sent: Friday, May 30, 2014 3:31 PM To: user@hive.apache.orgmailto:user@hive.apache.org Subject: Re: Need urgent help on hive query performance try partitioning the table and run the queries which are partition specific. Hope this helps. Thanks and Regards, Ashish Garg. On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.commailto:shouvanik.hal...@accenture.com wrote: Hi, Does anybody help urgently on optimizing hive query performance? I am looking more Hadoop tuning point of view. Currently, small amount of table takes much time to query? We are running EMR cluster with 1 MASTER node, 2 Core Nodes and Task Nodes. Quick help is much appreciated. Thanks, Shouvanik This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.comhttp://www.accenture.com -- Swarnim
Re: Need urgent help on hive query performance
It has innumerable no of joins. Since its client specific query, u understand I cannot share. Sorry about that Like I said, Joins are slow and in not done correctly could have terrible performance. A couple of handy techniques depend on how exactly are you trying to perform the join. For instance, if you are trying to join a smaller table to a larger one, a map join could work well for you where the smaller table is kept in-memory when the join is performed. Also if you are able to break your table down to smaller buckets, you might as well be able to use a bucketed map join for instance. Following link should be helpful[1][2]. Hope this helps. [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization [2] http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables On Fri, May 30, 2014 at 5:38 PM, shouvanik.hal...@accenture.com wrote: Pls find the answers *From:* kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com] *Sent:* Friday, May 30, 2014 3:34 PM *To:* user@hive.apache.org *Subject:* Re: Need urgent help on hive query performance I feel it's pretty hard to answer this without understanding the following: 1. What exactly are you trying to query? CSV? Avro? HIVE table 2. Where is your data? HDFS? HBase? Local filesystem? Data is in s3 3. What version of hive are you using? Hive 0.12 4. What is an example of a query that is slow? Some queries like joins and stuff would be inherently slower than other simpler ones(though can be optimized). It has innumerable no of joins. Since its client specific query, u understand I cannot share. Sorry about that Thanks, -- Swarnim On Fri, May 30, 2014 at 5:32 PM, shouvanik.hal...@accenture.com wrote: Can you please give a specific example or blog to refer to. I did not understand *From:* Ashish Garg [mailto:gargcreation1...@gmail.com] *Sent:* Friday, May 30, 2014 3:31 PM *To:* user@hive.apache.org *Subject:* Re: Need urgent help on hive query performance try partitioning the table and run the queries which are partition specific. Hope this helps. Thanks and Regards, Ashish Garg. On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote: Hi, Does anybody help urgently on optimizing hive query performance? I am looking more Hadoop tuning point of view. Currently, small amount of table takes much time to query? We are running EMR cluster with 1 MASTER node, 2 Core Nodes and Task Nodes. Quick help is much appreciated. Thanks, Shouvanik -- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Swarnim -- Swarnim
Re: Need urgent help on hive query performance
Another dimension, Try storing Hive table in ORC format. From my experience, it significantly improves the performance compare to other formats. Since you mentioned about join queries, on a side note, as a long term goal, you probably want to explore Hive with Tez. --Bala G. On Fri, May 30, 2014 at 3:59 PM, kulkarni.swar...@gmail.com kulkarni.swar...@gmail.com wrote: It has innumerable no of joins. Since its client specific query, u understand I cannot share. Sorry about that Like I said, Joins are slow and in not done correctly could have terrible performance. A couple of handy techniques depend on how exactly are you trying to perform the join. For instance, if you are trying to join a smaller table to a larger one, a map join could work well for you where the smaller table is kept in-memory when the join is performed. Also if you are able to break your table down to smaller buckets, you might as well be able to use a bucketed map join for instance. Following link should be helpful[1][2]. Hope this helps. [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization [2] http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables On Fri, May 30, 2014 at 5:38 PM, shouvanik.hal...@accenture.com wrote: Pls find the answers *From:* kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com] *Sent:* Friday, May 30, 2014 3:34 PM *To:* user@hive.apache.org *Subject:* Re: Need urgent help on hive query performance I feel it's pretty hard to answer this without understanding the following: 1. What exactly are you trying to query? CSV? Avro? HIVE table 2. Where is your data? HDFS? HBase? Local filesystem? Data is in s3 3. What version of hive are you using? Hive 0.12 4. What is an example of a query that is slow? Some queries like joins and stuff would be inherently slower than other simpler ones(though can be optimized). It has innumerable no of joins. Since its client specific query, u understand I cannot share. Sorry about that Thanks, -- Swarnim On Fri, May 30, 2014 at 5:32 PM, shouvanik.hal...@accenture.com wrote: Can you please give a specific example or blog to refer to. I did not understand *From:* Ashish Garg [mailto:gargcreation1...@gmail.com] *Sent:* Friday, May 30, 2014 3:31 PM *To:* user@hive.apache.org *Subject:* Re: Need urgent help on hive query performance try partitioning the table and run the queries which are partition specific. Hope this helps. Thanks and Regards, Ashish Garg. On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote: Hi, Does anybody help urgently on optimizing hive query performance? I am looking more Hadoop tuning point of view. Currently, small amount of table takes much time to query? We are running EMR cluster with 1 MASTER node, 2 Core Nodes and Task Nodes. Quick help is much appreciated. Thanks, Shouvanik -- This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Swarnim -- Swarnim
ManagedTablePartition
Hello all, Can we create the partitioned table in managed table and can we specify the location other than default warehouse directory? If yes, then what is the benefit of external table. Is it just dropping the external table will not delete the entire data that is it will only delete the metadata. Is this the only benefit? Thanks and Regards, Ashish Garg