Hive Avro union data access

2014-05-30 Thread Valluri, Sathish
Hi,

 

I have a Hive table whose alias_host column is declared as a union of 3 different
data types, as shown below: (array<string>, string, null).

 

CREATE EXTERNAL TABLE array_tests
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
  "name": "sessions",
  "type": "record",
  "fields": [
    {"default": null, "name": "alias_host",
     "type": [{"type": "array", "items": "string"}, "string", "null"]}
  ]
}')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/arrayTests';

 

How can I access and query the contents of this column in a WHERE clause?

Queries like the one below work when the column is not declared as a union, but
once I set the datatype to a union the following queries fail.

 

Eg: select alias_host from array_tests where alias_host like '%test%' limit 1000;

Error: Error while processing statement: FAILED: SemanticException [Error 
10016]: Line 1:32 Argument type mismatch 'alias_host': The 1st argument of 
EQUAL  is expected to a primitive type, but union is found 
(state=42000,code=10016) 

 

Can anyone suggest how to access and query the contents of union data types?

 

Regards

Sathish Valluri



HDFS Overwriting undo

2014-05-30 Thread Amjad ALSHABANI
Hello Everybody,

I know that this question may belong on the Hadoop list, but I made this
mistake while using Hive.
I created a new database, giving it a location on HDFS, and found that it
removed all the other data that already existed there.

=
before creation, the directory on HDFS contains :
pns@app11:~$ hadoop fs -ls /user/hive/warehouse
Found 25 items
drwxr-xr-x   - user1 supergroup  0 2013-11-20 13:40
/user/hive/warehouse/dfy_ans_autres
drwxr-xr-x   - user1 supergroup  0 2013-11-20 13:40
/user/hive/warehouse/dfy_ans_maillog
drwxr-xr-x   - user1 supergroup  0 2013-11-20 14:28
/user/hive/warehouse/dfy_cnx
drwxr-xr-x   - user2   supergroup  0 2014-05-30 06:05
/user/hive/warehouse/pns.db
drwxr-xr-x   - user2  supergroup  0 2014-02-24 17:00
/user/hive/warehouse/pns_fr_integ
drwxr-xr-x   - user2  supergroup  0 2014-05-06 15:33
/user/hive/warehouse/pns_logstat.db


hive -e "CREATE DATABASE my_stats LOCATION 'hdfs://:9000/user/hive/warehouse/mystats.db'"

but now I couldn't see the other directories on HDFS:

pns@app11:~/aalshabani$ hls /user/hive/warehouse
Found 1 items
drwxr-xr-x   - user2 supergroup  0 2014-05-30 11:37
/user/hive/warehouse/mystats.db


Is there any way I could restore the other data?
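
One avenue worth checking, offered only as a sketch: if HDFS trash was enabled
(fs.trash.interval > 0) when the directories disappeared, the removed paths may
still sit under the deleting user's .Trash directory. The user name below is
illustrative:

# Look for the old warehouse directories in the HDFS trash
hadoop fs -ls /user/pns/.Trash/Current/user/hive/warehouse

# If they are there, copy them back, for example:
hadoop fs -cp /user/pns/.Trash/Current/user/hive/warehouse/dfy_cnx /user/hive/warehouse/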


Best regards.

Amjad


Hive Capacity Scheduler Queue in Query Context

2014-05-30 Thread Natarajan, Prabakaran 1. (NSN - IN/Bangalore)
Hi

We want to share a YARN environment, and we are using the Capacity Scheduler.

Now we also want to share the same HiveServer2. How can we do it? Can we
specify a queue name in the query context?
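
For what it's worth, with the MapReduce engine the YARN queue can usually be
chosen per session (or per query via --hiveconf) rather than per HiveServer2
instance. A minimal sketch, assuming a Capacity Scheduler queue named 'etl'
already exists ('etl' and some_table are placeholders):

-- In a Beeline/HiveServer2 session; 'etl' is a hypothetical queue name
set mapreduce.job.queuename=etl;
-- Subsequent queries in this session are submitted to the 'etl' queue
select count(*) from some_table;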

Thanks
Prabakaran N



Vectorization with UDFs returns incorrect results

2014-05-30 Thread Benjamin Bowman
Hive 0.13 and Hadoop 2.4

I am having an issue when using the combination of vectorized query
execution, BETWEEN, and a custom UDF.  When I have vectorization on, my
query returns an empty set.  When I then turn vectorization off, my query
returns the correct results.

Example Query:  SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1
- X) and UDF_1

My UDFs seem to be working for everything else except this specific
circumstance.  Is this an issue in the Hive software, or am I writing my UDFs
in such a way that they do not work with vectorization?  If the latter,
what is the correct way?

I created a test scenario where I was able to reproduce this problem I am
seeing:

*TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1):  *
package com.test;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import java.lang.String;
import java.lang.*;

public class tenThousand extends UDF {

  private final LongWritable result = new LongWritable();

  public LongWritable evaluate() {
result.set(1);
return result;
  }
}

*TEST DATA (test.input):*
1|CBCABC|12
2|DBCABC|13
3|EBCABC|14
4|ABCABC|15
5|BBCABC|16
6|CBCABC|17

*CREATING ORC TABLE:*
0: jdbc:hive2://server:10002/db create table testTabOrc (first bigint,
second varchar(20), third int) partitioned by (range int) clustered by
(first) sorted by (first) into 8 buckets stored as orc tblproperties
("orc.compress" = "SNAPPY", "orc.index" = "true");

*CREATE LOADING TABLE:*
0: jdbc:hive2://server:10002/db create table loadingDir (first bigint,
second varchar(20), third int) partitioned by (range int) row format
delimited fields terminated by '|' stored as textfile;

*COPY IN DATA:*
[root@server]#  hadoop fs -copyFromLocal /tmp/test.input /db/loading/.

*ORC DATA:*
[root@server]#  beeline -u jdbc:hive2://server:10002/db -n root --hiveconf
hive.exec.dynamic.partition.mode=nonstrict --hiveconf
hive.enforce.sorting=true -e "insert into table testTabOrc partition(range)
select * from loadingDir;"

*LOAD TEST FUNCTION:*
0: jdbc:hive2://server:10002/db  add jar /opt/hadoop/lib/testFunction.jar
0: jdbc:hive2://server:10002/db  create temporary function ten_thousand as
'com.test.tenThousand';

*TURN OFF VECTORIZATION:*
0: jdbc:hive2://server:10002/db  set
hive.vectorized.execution.enabled=false;

*QUERY (RESULTS AS EXPECTED):*
0: jdbc:hive2://server:10002/db select first from testTabOrc where first
between ten_thousand()-1 and ten_thousand()-9995;
++
| first  |
++
| 1  |
| 2  |
| 3  |
++
3 rows selected (15.286 seconds)

*TURN ON VECTORIZATION:*
0: jdbc:hive2://server:10002/db  set
hive.vectorized.execution.enabled=true;

*QUERY AGAIN (WRONG RESULTS):*
0: jdbc:hive2://server:10002/db select first from testTabOrc where first
between ten_thousand()-1 and ten_thousand()-9995;
++
| first  |
++
++
No rows selected (17.763 seconds)


RE: Vectorization with UDFs returns incorrect results

2014-05-30 Thread java8964
When you turn vectorization on, does the following query consistently return
1 in the output?
select ten_thousand() from testTabOrc
Yong


RE: Hive Avro union data access

2014-05-30 Thread java8964
Your alias_host column is an array, from your Avro specification, right?
If so, just use [] to access the specified element in the array:
select alias_host[0] from array_tests where alias_host[0] like '%test%'
If you want to query all the elements in the array, google Hive's explode and
lateral view.
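
For illustration, a lateral-view query over the array elements might look like
the sketch below, assuming the column really does resolve to array<string> as
suggested above (table and column names are from the original post):

SELECT t.alias_host, host
FROM array_tests t
LATERAL VIEW explode(t.alias_host) hosts AS host
WHERE host LIKE '%test%';
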
Yong


Re: Vectorization with UDFs returns incorrect results

2014-05-30 Thread Benjamin Bowman
Yong,

Thanks for the reply.  Yes the query select ten_thousand() from testTabOrc
does return consistent results regardless of the vectorization setting.  I
ran that query with the vectorization off and then turned it on and ran it
5 more times.  Each time it returned correct results.  See output below.

Thanks,
Ben

=

0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (17.807 seconds)
0: jdbc:hive2://server:10002/db  set
hive.vectorized.execution.enabled=true;
No rows affected (0.002 seconds)
0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (17.23 seconds)
0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (14.267 seconds)
0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (18.259 seconds)
0: jdbc:hive2://server:10002/db select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (18.778 seconds)






Help with Query

2014-05-30 Thread Software Dev
We have a table with user-entered queries and their IPs. How could we
write a query that will count and order queries by their count, keeping only
queries having a unique IP count > X? For example, if we had the same IP enter
the same query Y times, we wouldn't want to include this in the final
result unless there have been X-Y other IPs that searched for that
query.
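
Purely as a sketch of what such a query might look like in HiveQL, assuming a
hypothetical table search_logs(query_text string, ip string) and a threshold
X of 10:

-- Count searches per query, keeping only queries issued from more than 10 distinct IPs
SELECT query_text,
       COUNT(*)           AS total_searches,
       COUNT(DISTINCT ip) AS distinct_ips
FROM search_logs
GROUP BY query_text
HAVING COUNT(DISTINCT ip) > 10
ORDER BY total_searches DESC;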

Is this perhaps better suited for Pig?

Thanks


Re: Metastore 0.13 is not starting up.

2014-05-30 Thread Thejas Nair
Have you tried using schematool to upgrade the metastore schema?
https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool
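
For reference, the upgrade described on that page is driven by the schematool
CLI; against a MySQL metastore an invocation might look like this (connection
details are read from hive-site.xml):

# Check the current metastore schema version first
$HIVE_HOME/bin/schematool -dbType mysql -info

# Then upgrade the existing 0.12 schema to the 0.13 layout
$HIVE_HOME/bin/schematool -dbType mysql -upgradeSchemaFrom 0.12.0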

On Wed, May 21, 2014 at 4:29 AM, Biswajit Nayak
biswajit.na...@inmobi.com wrote:
 Hi All,

 I had the metastore (0.12) running previously, but after upgrading to 0.13 it
 is failing with the error message below. The upgrade was a clean, new setup.

 Additional details:-
 Mysql Version:- 5.6
 Mysql Connector :- 5.1.30


 Starting Hive Metastore Server

 log4j:WARN No such property [maxBackupIndex] in
 org.apache.log4j.DailyRollingFileAppender.

 javax.jdo.JDOUserException: Could not create increment/table
 value-generation container `SEQUENCE_TABLE` since autoCreate flags do not
 allow it.

 at
 org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:549)

 at
 org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732)

 at
 org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752)

 at
 org.apache.hadoop.hive.metastore.ObjectStore.createDatabase(ObjectStore.java:458)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

 at java.lang.reflect.Method.invoke(Method.java:597)

 at
 org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108)

 at $Proxy4.createDatabase(Unknown Source)

 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:509)

 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:524)

 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:398)

 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:357)

 at
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.init(RetryingHMSHandler.java:54)

 at
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)

 at
 org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4967)

 at
 org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:5187)

 at
 org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:5107)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

 at java.lang.reflect.Method.invoke(Method.java:597)

 at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

 NestedThrowablesStackTrace:

 Could not create increment/table value-generation container
 `SEQUENCE_TABLE` since autoCreate flags do not allow it.

 org.datanucleus.exceptions.NucleusUserException: Could not create
 increment/table value-generation container `SEQUENCE_TABLE` since
 autoCreate flags do not allow it.

 at
 org.datanucleus.store.rdbms.valuegenerator.TableGenerator.createRepository(TableGenerator.java:261)

 at
 org.datanucleus.store.rdbms.valuegenerator.AbstractRDBMSGenerator.obtainGenerationBlock(AbstractRDBMSGenerator.java:162)

 at
 org.datanucleus.store.valuegenerator.AbstractGenerator.obtainGenerationBlock(AbstractGenerator.java:197)

 at
 org.datanucleus.store.valuegenerator.AbstractGenerator.next(AbstractGenerator.java:105)

 at
 org.datanucleus.store.rdbms.RDBMSStoreManager.getStrategyValueForGenerator(RDBMSStoreManager.java:2005)

 at
 org.datanucleus.store.AbstractStoreManager.getStrategyValue(AbstractStoreManager.java:1386)

 at
 org.datanucleus.ExecutionContextImpl.newObjectId(ExecutionContextImpl.java:3827)

 at
 org.datanucleus.state.JDOStateManager.setIdentity(JDOStateManager.java:2571)

 at
 org.datanucleus.state.JDOStateManager.initialiseForPersistentNew(JDOStateManager.java:513)

 at
 org.datanucleus.state.ObjectProviderFactoryImpl.newForPersistentNew(ObjectProviderFactoryImpl.java:232)

 at
 org.datanucleus.ExecutionContextImpl.newObjectProviderForPersistentNew(ExecutionContextImpl.java:1414)

 at
 org.datanucleus.ExecutionContextImpl.persistObjectInternal(ExecutionContextImpl.java:2218)

 at
 org.datanucleus.ExecutionContextImpl.persistObjectWork(ExecutionContextImpl.java:2065)

 at
 org.datanucleus.ExecutionContextImpl.persistObject(ExecutionContextImpl.java:1913)

 at
 org.datanucleus.ExecutionContextThreadedImpl.persistObject(ExecutionContextThreadedImpl.java:217)

 at
 org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:727)

 at
 org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752)

 at
 org.apache.hadoop.hive.metastore.ObjectStore.createDatabase(ObjectStore.java:458)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 

RE: Vectorization with UDFs returns incorrect results

2014-05-30 Thread java8964
I downloaded Hive 0.13 and confirmed this problem exists.
My suggestion is to change your query to
select first from testTabOrc where first >= ten_thousand()-1 and first <= 
ten_thousand()-9995;
which works in my environment (Hive 0.13 + Hadoop 2.2).
You can create a JIRA for it. It looks like the problem is caused by BETWEEN 
combined with vectorization and a UDF.
BTW, your UDF looks fine to me.
Yong


Need urgent help on hive query performance

2014-05-30 Thread shouvanik.haldar
Hi,

Can anybody help urgently with optimizing Hive query performance? I am looking
at this more from a Hadoop tuning point of view. Currently, even a small table
takes a long time to query.

We are running an EMR cluster with 1 master node, 2 core nodes, and task nodes.

Quick help is much appreciated.

Thanks,
Shouvanik



This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited. Where allowed by local law, electronic 
communications with Accenture and its affiliates, including e-mail and instant 
messaging (including content), may be scanned by our systems for the purposes 
of information security and assessment of internal compliance with Accenture 
policy.
__

www.accenture.com


Re: Need urgent help on hive query performance

2014-05-30 Thread kulkarni.swar...@gmail.com
I feel it's pretty hard to answer this without understanding the following:

1. What exactly are you trying to query? CSV? Avro? 
2. Where is your data? HDFS? HBase? Local filesystem?
3. What version of hive are you using?
4. What is an example of a query that is slow? Some queries like joins and
stuff would be inherently slower than other simpler ones(though can be
optimized).

Thanks,

-- 
Swarnim


On Fri, May 30, 2014 at 5:32 PM, shouvanik.hal...@accenture.com wrote:

  Can you please give a specific example or blog to refer to. I did not
 understand



 *From:* Ashish Garg [mailto:gargcreation1...@gmail.com]
 *Sent:* Friday, May 30, 2014 3:31 PM
 *To:* user@hive.apache.org
 *Subject:* Re: Need urgent help on hive query performance



 try partitioning the table and run the queries which are partition
 specific. Hope this helps.

 Thanks and Regards,

 Ashish Garg.



 On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote:

 Hi,



 Does anybody  help urgently on optimizing hive query performance? I am
 looking more Hadoop tuning point of view. Currently, small amount of table
 takes much time to query?



 We are running EMR cluster with 1 MASTER node, 2 Core Nodes and  Task
 Nodes.



 Quick help is much appreciated.



 Thanks,

 Shouvanik








-- 
Swarnim


Re: Need urgent help on hive query performance

2014-05-30 Thread Ashish Garg
hive Create External Table Emp(

id INT,

name STRING,

Salary INT)

PARTITIONED BY (Country STRING, State STRING)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t'

LOCATION '/user/data/';

Now load the data which is partition specific. For example,

hive LOAD DATA LOCAL INPATH '---'

OVERWRITE INTO TABLE Emp

PARTITION (Country='US', State='NJ');

Now try running queries like

hive Select Count(*), MAX(Salary) FROM Emp Where Country='US' And
State='NJ';

This will optimize your query performance.


On Fri, May 30, 2014 at 6:32 PM, shouvanik.hal...@accenture.com wrote:

  Can you please give a specific example or blog to refer to. I did not
 understand



 *From:* Ashish Garg [mailto:gargcreation1...@gmail.com]
 *Sent:* Friday, May 30, 2014 3:31 PM
 *To:* user@hive.apache.org
 *Subject:* Re: Need urgent help on hive query performance



 try partitioning the table and run the queries which are partition
 specific. Hope this helps.

 Thanks and Regards,

 Ashish Garg.



 On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote:

 Hi,



 Does anybody  help urgently on optimizing hive query performance? I am
 looking more Hadoop tuning point of view. Currently, small amount of table
 takes much time to query?



 We are running EMR cluster with 1 MASTER node, 2 Core Nodes and  Task
 Nodes.



 Quick help is much appreciated.



 Thanks,

 Shouvanik







RE: Need urgent help on hive query performance

2014-05-30 Thread shouvanik.haldar
Pls find the answers



From: kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com]
Sent: Friday, May 30, 2014 3:34 PM
To: user@hive.apache.org
Subject: Re: Need urgent help on hive query performance

I feel it's pretty hard to answer this without understanding the following:


1.  What exactly are you trying to query? CSV? Avro? 
HIVE table

2.  Where is your data? HDFS? HBase? Local filesystem?
Data is in s3

3.  What version of hive are you using?
Hive 0.12

4.  What is an example of a query that is slow? Some queries like joins and 
stuff would be inherently slower than other simpler ones(though can be 
optimized).
It has innumerable no of joins. Since its client specific query, u understand I 
cannot share. Sorry about that

Thanks,

--
Swarnim

On Fri, May 30, 2014 at 5:32 PM, shouvanik.hal...@accenture.com wrote:
Can you please give a specific example or blog to refer to. I did not understand

From: Ashish Garg [mailto:gargcreation1...@gmail.com]
Sent: Friday, May 30, 2014 3:31 PM
To: user@hive.apache.org
Subject: Re: Need urgent help on hive query performance

try partitioning the table and run the queries which are partition specific. 
Hope this helps.
Thanks and Regards,
Ashish Garg.

On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote:
Hi,

Does anybody  help urgently on optimizing hive query performance? I am looking 
more Hadoop tuning point of view. Currently, small amount of table takes much 
time to query?

We are running EMR cluster with 1 MASTER node, 2 Core Nodes and  Task Nodes.

Quick help is much appreciated.

Thanks,
Shouvanik







--
Swarnim


Re: Need urgent help on hive query performance

2014-05-30 Thread kulkarni.swar...@gmail.com
 It has innumerable no of joins. Since its client specific query, u
understand I cannot share. Sorry about that

Like I said, joins are slow and, if not done correctly, can have terrible
performance. A couple of handy techniques depend on how exactly you are
trying to perform the join. For instance, if you are trying to join a
smaller table to a larger one, a map join could work well for you, where the
smaller table is kept in memory when the join is performed. Also, if you are
able to break your table down into smaller buckets, you may be able
to use a bucketed map join. The following links should be
helpful [1][2].

Hope this helps.

[1]
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
[2]
http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables
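
As a rough sketch of the map-join route described above (table names are
placeholders; the settings are the usual Hive 0.12-era knobs):

-- Let Hive convert joins against a small table into map joins automatically
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- size threshold for the small side, in bytes

-- Or request it explicitly with the MAPJOIN hint
SELECT /*+ MAPJOIN(d) */ f.*, d.dim_name
FROM fact_table f
JOIN dim_table d ON (f.dim_id = d.dim_id);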


On Fri, May 30, 2014 at 5:38 PM, shouvanik.hal...@accenture.com wrote:

  Pls find the answers







 *From:* kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com]
 *Sent:* Friday, May 30, 2014 3:34 PM

 *To:* user@hive.apache.org
 *Subject:* Re: Need urgent help on hive query performance



 I feel it's pretty hard to answer this without understanding the following:



 1.  What exactly are you trying to query? CSV? Avro? 

 HIVE table

 2.  Where is your data? HDFS? HBase? Local filesystem?

 Data is in s3

 3.  What version of hive are you using?

 Hive 0.12

 4.  What is an example of a query that is slow? Some queries like
 joins and stuff would be inherently slower than other simpler ones(though
 can be optimized).

 It has innumerable no of joins. Since its client specific query, u
 understand I cannot share. Sorry about that



 Thanks,



 --
 Swarnim



 On Fri, May 30, 2014 at 5:32 PM, shouvanik.hal...@accenture.com wrote:

 Can you please give a specific example or blog to refer to. I did not
 understand



 *From:* Ashish Garg [mailto:gargcreation1...@gmail.com]
 *Sent:* Friday, May 30, 2014 3:31 PM
 *To:* user@hive.apache.org
 *Subject:* Re: Need urgent help on hive query performance



 try partitioning the table and run the queries which are partition
 specific. Hope this helps.

 Thanks and Regards,

 Ashish Garg.



 On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote:

 Hi,



 Does anybody  help urgently on optimizing hive query performance? I am
 looking more Hadoop tuning point of view. Currently, small amount of table
 takes much time to query?



 We are running EMR cluster with 1 MASTER node, 2 Core Nodes and  Task
 Nodes.



 Quick help is much appreciated.



 Thanks,

 Shouvanik









 --
 Swarnim




-- 
Swarnim


Re: Need urgent help on hive query performance

2014-05-30 Thread Bala Krishna Gangisetty
Another dimension,

Try storing the Hive table in ORC format. In my experience, it significantly
improves performance compared to other formats.

Since you mentioned join queries, as a side note and a longer-term
goal, you probably want to explore Hive with Tez.
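
A minimal sketch of both suggestions, using a placeholder table name (Tez has
to be installed and configured for the second part, and requires Hive 0.13+):

-- Rewrite an existing table into ORC
CREATE TABLE my_table_orc STORED AS ORC AS SELECT * FROM my_table;

-- Run subsequent queries on the Tez execution engine
set hive.execution.engine=tez;
SELECT COUNT(*) FROM my_table_orc;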

--Bala G.


On Fri, May 30, 2014 at 3:59 PM, kulkarni.swar...@gmail.com 
kulkarni.swar...@gmail.com wrote:

  It has innumerable no of joins. Since its client specific query, u
 understand I cannot share. Sorry about that

 Like I said, Joins are slow and in not done correctly could have terrible
 performance. A couple of handy techniques depend on how exactly are you
 trying to perform the join. For instance, if you are trying to join a
 smaller table to a larger one, a map join could work well for you where the
 smaller table is kept in-memory when the join is performed. Also if you are
 able to break your table down to smaller buckets, you might as well be able
 to use a bucketed map join for instance. Following link should be
 helpful[1][2].

 Hope this helps.

 [1]
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
 [2]
 http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables


 On Fri, May 30, 2014 at 5:38 PM, shouvanik.hal...@accenture.com wrote:

  Pls find the answers







 *From:* kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com]
 *Sent:* Friday, May 30, 2014 3:34 PM

 *To:* user@hive.apache.org
 *Subject:* Re: Need urgent help on hive query performance



 I feel it's pretty hard to answer this without understanding the
 following:



 1.  What exactly are you trying to query? CSV? Avro? 

 HIVE table

 2.  Where is your data? HDFS? HBase? Local filesystem?

 Data is in s3

 3.  What version of hive are you using?

 Hive 0.12

 4.  What is an example of a query that is slow? Some queries like
 joins and stuff would be inherently slower than other simpler ones(though
 can be optimized).

 It has innumerable no of joins. Since its client specific query, u
 understand I cannot share. Sorry about that



 Thanks,



 --
 Swarnim



 On Fri, May 30, 2014 at 5:32 PM, shouvanik.hal...@accenture.com wrote:

 Can you please give a specific example or blog to refer to. I did not
 understand



 *From:* Ashish Garg [mailto:gargcreation1...@gmail.com]
 *Sent:* Friday, May 30, 2014 3:31 PM
 *To:* user@hive.apache.org
 *Subject:* Re: Need urgent help on hive query performance



 try partitioning the table and run the queries which are partition
 specific. Hope this helps.

 Thanks and Regards,

 Ashish Garg.



 On Fri, May 30, 2014 at 6:05 PM, shouvanik.hal...@accenture.com wrote:

 Hi,



 Does anybody  help urgently on optimizing hive query performance? I am
 looking more Hadoop tuning point of view. Currently, small amount of table
 takes much time to query?



 We are running EMR cluster with 1 MASTER node, 2 Core Nodes and  Task
 Nodes.



 Quick help is much appreciated.



 Thanks,

 Shouvanik









 --
 Swarnim




 --
 Swarnim



ManagedTablePartition

2014-05-30 Thread Ashish Garg
Hello all,
Can we create a partitioned table as a managed table, and can we specify a
location other than the default warehouse directory? If yes, then what is the
benefit of an external table? Is it just that dropping an external table will
not delete the data, i.e., it will only delete the metadata? Is this
the only benefit?
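
For reference, both variants being compared can be written as below (paths and
names are illustrative). Yes, a managed table can be partitioned and given a
non-default location; the difference shows up on DROP TABLE, where the managed
table's data directory is deleted along with the metadata while the external
table's data is left in place:

-- Managed (internal) table, partitioned, with a non-default location
CREATE TABLE emp_managed (id INT, name STRING)
PARTITIONED BY (country STRING)
LOCATION '/data/emp_managed';   -- DROP TABLE removes this directory too

-- External table with the same layout
CREATE EXTERNAL TABLE emp_external (id INT, name STRING)
PARTITIONED BY (country STRING)
LOCATION '/data/emp_external';  -- DROP TABLE leaves this directory intact
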
Thanks and Regards,
Ashish Garg