RE: Hive Join returns incorrect results on bigint=string

2014-10-07 Thread java8964
Based on this wiki page:
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-TypeSystem
The string will be implicitly converted to double, as double is the only
common ancestor of bigint and string in Hive's type hierarchy. Large bigint
values cannot be represented exactly as doubles, so the result of the
comparison is unpredictable once both sides are compared as doubles.
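For example, forcing both sides back to bigint (the cast workaround Ashu
describes below) keeps the comparison exact; a minimal sketch with
hypothetical table and column names:

select a.*
from bigint_tbl a
join string_tbl b
  on a.id_big = cast(b.id_str as bigint);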
Yong

Date: Mon, 6 Oct 2014 14:20:57 -0700
Subject: Hive Join returns incorrect results on bigint=string
From: a...@rocketfuelinc.com
To: user@hive.apache.org

Recently, by mistake, I encountered a situation where I ended up doing a join 
key comparison between a string and a bigint. The returned results are 
incorrect even though the strings have exactly the same integer values as the 
bigint values.
When I do a join on bigint = cast(string as bigint), the results are correct. 
Is this the expected behavior, or is Hive supposed to do an automatic cast and 
compare them as strings?
-- 
Thanks and Regards,
Ashu Pachauri
Rocket Scientist, Rocket Fuel Inc.
1-650-200-5390
  

How to put a third-party jar first in the classpath of a Hive UDF

2014-10-02 Thread java8964
Hi, 
Currently our production is using Hive 0.9.0. There is already a complex Hive 
query running on Hadoop daily that generates millions of output records. What 
I want to do is transfer this result to Cassandra.
I tried to do it in a UDF, so I can send the data at the reducer level to 
maximize the transfer speed. Everything worked until I tested it on the 
cluster.
We are using the Netflix/Astyanax Cassandra client driver to write the data, 
which requires Google's Guava 14. I found out that Hive 0.9.0 already includes 
Guava 9, which leads to the following exception in my Hive CLI:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/ibm/biginsights/hive/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/ibm/biginsights/IHC/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" java.lang.NoSuchMethodError: com/google/common/util/concurrent/MoreExecutors.listeningDecorator(Ljava/util/concurrent/ExecutorService;)Lcom/google/common/util/concurrent/ListeningExecutorService;
  at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.init(ThriftKeyspaceImpl.java:114)
  at com.netflix.astyanax.thrift.ThriftFamilyFactory.createKeyspace(ThriftFamilyFactory.java:41)
  at com.netflix.astyanax.AstyanaxContext$Builder.buildKeyspace(AstyanaxContext.java:146)
Now I know what the problem is, but I cannot find a good solution. In Hadoop, I 
can use mapreduce.job.user.classpath.first to have my version of Guava picked 
up first, but setting mapreduce.job.user.classpath.first=true in the Hive CLI 
didn't work. Then I googled around and found another setting that looks 
specific to Hive, so I set mapreduce.task.classpath.user.precedence=true in 
the Hive CLI, but it still didn't work.
It doesn't look like Hive has a way to let a user's third-party jar files be 
put in front of the classpath for a UDF.
Has anyone faced this kind of problem before? What is the best solution?
Thanks
Yong  

RE: python UDF and Avro tables

2014-07-24 Thread java8964
Are you trying to read the Avro file directly in your UDF? If so, that is not 
the correct way to do it in a UDF.
Hive supports Avro files natively. I don't know your UDF requirements, but 
here is what I would normally do:
Create the table in Hive using AvroContainerInputFormat:
create external table foo
row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
stored as
inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
location '/xxx.avro'
tblproperties ('avro.schema.url'='hdfs://.avsc');
In this case, Hive will map the table structure based on the Avro schema file.
Then you can register your UDF and start to use it.
Remember, in this case, when your Python UDF is invoked, the Avro data will be 
wrapped as a JSON string and passed to your Python script through STDIN.
For example, if you do select MYUDF(col1) from foo, then the col1 data from 
Avro will be passed to your Python script as a JSON string, even if col1 is a 
nested structure. Then it is up to your Python script to handle the JSON 
string and return whatever output it wants through STDOUT.
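For completeness, a minimal sketch of wiring the script in through Hive's
streaming (TRANSFORM) syntax; the script name myudf.py and the output column
name are hypothetical:

add file myudf.py;

select transform (col1)
       using 'python myudf.py'
       as result
from foo;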
Yong
From: kevin.wei...@imc-chicago.com
To: user@hive.apache.org
Subject: python UDF and Avro tables
Date: Thu, 24 Jul 2014 15:52:03 +






Hi All,



I hope I’m not duplicating a previous question, but I couldn’t find any search 
functionality for the user list archives.



I have written a relatively simple python script that is meant to take a field 
from a hive query and transform it (just some string processing through a dict) 
given that certain conditions are met. After reading this guide:



http://blog.spryinc.com/2013/09/a-guide-to-user-defined-functions-in.html



it would appear that the python script needs to read from STDIN the native file 
format (in my case Avro) and write to STDOUT. I implemented this functionality 
using the python fastavro deserializer and cStringIO for the STDIN/STDOUT bit. 
I then placed
 the appropriate python modules on all the nodes (which I could probably do a 
bit better by simply storing in HDFS). Unfortunately, I’m still getting errors 
while trying to transform my field which are appended below. I believe the 
problem is that HDFS can
 end up splitting the files at arbitrary points and you could have an Avro file 
with no schema appended to the top. Has anyone had any luck running a python 
UDF on an Avro table? Cheers!




Traceback (most recent call last):
  File "coltoskip.py", line 33, in <module>
    reader = avro.reader(avrofile)
  File "_reader.py", line 368, in fastavro._reader.iter_avro.__init__ (fastavro/_reader.c:6438)
ValueError: cannot read header - is it an avro file?
org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error 
occurred when trying to close the Operator running your custom script.
at 
org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:514)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.Child.main(Child.java:262)

RE: does the HBase-Hive integration support using HBase index (primary key or secondary index) in the JOIN implementation?

2014-07-24 Thread java8964
I don't think the HBase-Hive integration is that smart, i.e. able to utilize 
an existing index in HBase. But I think it depends on the version you are 
using.
From my experience, there is a lot of room for improvement in the HBase-Hive 
integration, especially in pushing logic down into the HBase engine.
Yong

From: tedd...@gmail.com
Date: Thu, 24 Jul 2014 14:03:42 -0700
Subject: does the HBase-Hive integration support using HBase index (primary key 
or secondary index) in the JOIN implementation?
To: user@hive.apache.org

If I do a join of a table based on a text file and a table based on HBase, and 
say the latter is very large, is Hive smart enough to utilize the HBase 
table's index to do the join, instead of implementing this as a regular 
map-reduce job, where each table is scanned fully, bucketed on the join keys, 
and the matching items are then found in the reducer?



thanks
Yang

RE: Vectorization with UDFs returns incorrect results

2014-05-30 Thread java8964
When you turn vectorization on, does the following query consistently return 
1 in the output?
select ten_thousand() from testTabOrc
Yong

Date: Fri, 30 May 2014 08:24:43 -0400
Subject: Vectorization with UDFs returns incorrect results
From: bbowman...@gmail.com
To: user@hive.apache.org

Hive 0.13 and Hadoop 2.4
I am having an issue when using the combination of vectorized query execution, 
BETWEEN, and a custom UDF.  When I have vectorization on, my query returns an 
empty set.  When I then turn vectorization off, my query returns the correct 
results. 

Example Query:  SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) 
and UDF_1
My UDFs seem to be working for everything else except this specific 
circumstance. Is this an issue in the Hive software, or am I writing my UDFs 
in such a way that they do not work with vectorization? If the latter, what is 
the correct way?

I created a test scenario where I was able to reproduce this problem I am 
seeing:
TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1):  
package com.test;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import java.lang.String;
import java.lang.*;

public class tenThousand extends UDF {

  private final LongWritable result = new LongWritable();

  public LongWritable evaluate() {
    result.set(1);
    return result;
  }
}
TEST DATA (test.input):
1|CBCABC|12
2|DBCABC|13
3|EBCABC|14
4|ABCABC|15
5|BBCABC|16
6|CBCABC|17
CREATING ORC TABLE:
0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second 
varchar(20), third int) partitioned by (range int) clustered by (first) sorted 
by (first) into 8 buckets stored as orc tblproperties ("orc.compress" = 
"SNAPPY", "orc.index" = "true");

CREATE LOADING TABLE:
0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint, second 
varchar(20), third int) partitioned by (range int) row format delimited fields 
terminated by '|' stored as textfile;

COPY IN DATA:
[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/.

ORC DATA:
[root@server]# beeline -u jdbc:hive2://server:10002/db -n root 
--hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf 
hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) 
select * from loadingDir;"

LOAD TEST FUNCTION:
0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar;

0: jdbc:hive2://server:10002/db> create temporary function ten_thousand as 
'com.test.tenThousand';


TURN OFF VECTORIZATION:
0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=false;

QUERY (RESULTS AS EXPECTED):
0: jdbc:hive2://server:10002/db> select first from testTabOrc where first 
between ten_thousand()-1 and ten_thousand()-9995;
+--------+
| first  |
+--------+
| 1      |
| 2      |
| 3      |
+--------+
3 rows selected (15.286 seconds)

TURN ON VECTORIZATION:
0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true;

QUERY AGAIN (WRONG RESULTS):
0: jdbc:hive2://server:10002/db> select first from testTabOrc where first 
between ten_thousand()-1 and ten_thousand()-9995;
+--------+
| first  |
+--------+
+--------+
No rows selected (17.763 seconds)


  

RE: Hive Avro union data access

2014-05-30 Thread java8964
Your alias_host column is an array, from your Avro specification, right?
If so, just use [] to access a specific element in the array:
select alias_host[0] from array_tests where alias_host[0] like '%test%'
If you want to query all the elements in the array, google "explode" and 
"lateral view" for Hive.
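A minimal sketch of that lateral view, assuming alias_host behaves as a plain
array<string> column (the alias names are arbitrary):

select t.host
from array_tests
lateral view explode(alias_host) t as host
where t.host like '%test%';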
Yong

From: sathish.vall...@emc.com
To: user@hive.apache.org
Subject: Hive Avro union data access
Date: Fri, 30 May 2014 06:21:19 +

Hi, I have a Hive table created with 3 different union data types for the 
alias_host column, as shown below (array<string>, string, null).

CREATE EXTERNAL TABLE array_tests
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'=
'{"name":"sessions","type":"record","fields":[{"default":null,"name":"alias_host","type":[{"type":"array","items":"string"},"string","null"]}]}')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/arrayTests';

How do I access and query the contents of this table in a where clause? 
Queries like the one below work if the datatype is not a union, but once I set 
the datatype to a union the following queries fail.

Eg: select alias_host from array_tests where alias_host like '%test%' limit 1000;
Error: Error while processing statement: FAILED: SemanticException 
[Error 10016]: Line 1:32 Argument type mismatch 'alias_host': The 1st argument 
of EQUAL is expected to a primitive type, but union is found 
(state=42000,code=10016)

Can anyone suggest how to access and query the contents of union data types?

Regards
Sathish Valluri
 

RE: Vectorization with UDFs returns incorrect results

2014-05-30 Thread java8964
I downloaded Hive 0.13 and confirmed this problem exists.
My suggestion is to change your query to
select first from testTabOrc where first >= ten_thousand()-1 and first <= 
ten_thousand()-9995;
which works in my environment (Hive 0.13 + Hadoop 2.2).
You can create a JIRA for it. It looks like the problem is caused by "between" 
combined with vectorization and a UDF.
BTW, your UDF looks fine to me.
Yong

Date: Fri, 30 May 2014 11:22:18 -0400
Subject: Re: Vectorization with UDFs returns incorrect results
From: bbowman...@gmail.com
To: user@hive.apache.org

Yong,
Thanks for the reply. Yes, the query "select ten_thousand() from testTabOrc" 
does return consistent results regardless of the vectorization setting. I ran 
that query with vectorization off, then turned it on and ran it several more 
times. Each time it returned correct results. See the output below.

Thanks,
Ben

=

0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
+--------+
6 rows selected (17.807 seconds)
0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true;
No rows affected (0.002 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
+--------+
6 rows selected (17.23 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
+--------+
6 rows selected (14.267 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
+--------+
6 rows selected (18.259 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
| 1      |
+--------+
6 rows selected (18.778 seconds)






RE: LEFT SEMI JOIN

2014-05-16 Thread java8964
From the Hive manual, there is only "left semi join"; there is no "semi join" 
or "inner semi join".
In the database world, "LEFT semi join" is just the traditional name for this 
kind of join: it reminds the reader that the result set comes from the LEFT 
table ONLY.
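A small sketch of the semantics, reusing the tables from Lukas's example
below; the left semi join returns columns of the left table only, each
matching row at most once, which is the IN/EXISTS-subquery pattern:

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key);

-- the subquery form whose semantics it implements:
SELECT a.key, a.val
FROM a
WHERE a.key IN (SELECT b.key FROM b);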
Yong

 From: lukas.e...@datageekery.com
 To: user@hive.apache.org
 Subject: LEFT SEMI JOIN
 Date: Tue, 13 May 2014 09:30:52 +
 
 Hello,
 
 We were approached to add support for Hive SQL in jOOQ [1], which might be 
 useful for a greater community in general. I've gone through the Hive SQL 
 syntax and I've encountered this interesting clause: The LEFT SEMI JOIN 
 clause [2]
 
 Example:
 SELECT a.key, a.val
 FROM a LEFT SEMI JOIN b on (a.key = b.key)
 
 My question is: Why LEFT? I find this misleading. Compare this with a LEFT 
 OUTER JOIN, which will always return all tuples from relation A at least 
 once, regardless if there are any tuples in relation B matched by the JOIN 
 predicate in ON. So in other words, a LEFT SEMI JOIN is completely useless 
 as it *should* always return ALL tuples from relation A, compared to a more 
 useful SEMI JOIN or INNER SEMI JOIN.
 
 What do you think?
 Lukas
 
 [1]: http://www.jooq.org
 [2]: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
 
 --
 Lukas Eder - Head of R&D | lukas.e...@datageekery.com | +41 44 586 82 56
 Data Geekery GmbH | Binzstrasse 23 | CH-8045 Zürich | Switzerland
 
 http://www.datageekery.com | Get back in control of your SQL
 
 
 
  

RE: Does hive instantiate new udf object for each record

2014-03-25 Thread java8964
The reason you saw that is that when you provided the evaluate() method, you 
didn't specify the type of column it can be applied to. So Hive will just 
create the test instance again and again for every new row, as it doesn't 
know how or to which column to apply your UDF.
I changed your code as below:
public class test extends UDF {
    private Text t;

    // typed version: takes a column argument
    public Text evaluate (String s) {
        if (t == null) {
            t = new Text("initialization");
        } else {
            t = new Text("OK");
        }
        return t;
    }

    // no-argument version
    public Text evaluate () {
        if (t == null) {
            t = new Text("initialization");
        } else {
            t = new Text("OK");
        }
        return t;
    }
}
Now, if you invoke your UDF like this:
select test(colA) from AnyTable;
You should see one Init and the rest are OK, make sense?
Yong
From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: RE: Does hive instantiate new udf object for each record
Date: Tue, 25 Mar 2014 10:17:46 +0800




I have implemented a simple udf for test.


public class test extends UDF {
    private Text t;

    public Text evaluate () {
        if (t == null) {
            t = new Text("initialization");
        } else {
            t = new Text("OK");
        }
        return t;
    }
}

And the test query: select test() from AnyTable;
I got
initialization
initialization
initialization
...

I have also implemented a similar GenericUDF, and got similar result.

What' wrong with my code?

Best Regards,
ypg

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Does hive instantiate new udf object for each record
Date: Mon, 24 Mar 2014 16:58:49 -0400




Your UDF object will only initialized once per map or reducer. 
When you said your UDF object being initialized for each row, why do you think 
so? Do you have log to make you think that way?
If OK, please provide more information, so we can help you, like your example 
code, log etc
Yong

Date: Tue, 25 Mar 2014 00:30:21 +0800
From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: Does hive instantiate new udf object for each record


Hi all,
I'm trying to implement a udf which makes use of some data structures 
like binary tree. However,  it seems that hive instantiates new udf 
object for each row in the table. Then the data structures would be also 
initialized again and again for each row.Whereas, in the book 
Programming Hive, a geoip function is taken for an example showing that a 
LookupService object is saved in a reference so it only needs to be 
initialized once in the lifetime of a map or reduce task that initializes it. 
The code for this function can be found here 
(https://github.com/edwardcapriolo/hive-geoip/).
Could anyone give me some ideas how to make the udf object initialize 
once in the lifetime of a map or reduce task?

Best Regards,ypg



  

RE: Does hive instantiate new udf object for each record

2014-03-24 Thread java8964
Your UDF object will only be initialized once per map or reduce task.
When you say your UDF object is being initialized for each row, why do you 
think so? Do you have a log that makes you think that way?
If so, please provide more information so we can help you, like your example 
code, logs, etc.
Yong

Date: Tue, 25 Mar 2014 00:30:21 +0800
From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: Does hive instantiate new udf object for each record


Hi all,
I'm trying to implement a UDF which makes use of some data structures, like a 
binary tree. However, it seems that Hive instantiates a new UDF object for 
each row in the table. Then the data structures would also be initialized 
again and again for each row. Whereas, in the book "Programming Hive", a geoip 
function is taken as an example showing that a LookupService object is saved 
in a reference so it only needs to be initialized once in the lifetime of the 
map or reduce task that initializes it. The code for this function can be 
found here (https://github.com/edwardcapriolo/hive-geoip/).
Could anyone give me some ideas how to make the UDF object initialize once in 
the lifetime of a map or reduce task?

Best Regards,
ypg


  

RE: Joins Failing

2014-03-24 Thread java8964
It looks like his job failed with an OOM in the mapper tasks:
Job failed as tasks failed. failedMaps:1 failedReduces:0
So what he needs is to increase the mapper heap size.
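A hedged sketch of raising the mapper memory from the Hive session; the
property names assume MRv2/YARN and the values are only illustrative, so
adjust them to the cluster:

set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3584m;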
Yong

Date: Mon, 24 Mar 2014 16:16:50 -0400
Subject: Re: Joins Failing
From: divakarredd...@gmail.com
To: user@hive.apache.org
CC: jason.herba...@bateswhite.com

I hope, this property will fix your issue.

set mapred.reduce.child.java.opts=-Xmx4096m;






On Mon, Mar 24, 2014 at 3:59 PM, Clay McDonald stuart.mcdon...@bateswhite.com 
wrote:

I believe I found my issue.



2014-03-24 15:49:38,775 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error 
running child : java.lang.OutOfMemoryError: Java heap space





Clay







From: Clay McDonald [mailto:stuart.mcdon...@bateswhite.com]

Sent: Monday, March 24, 2014 2:07 PM

To: 'user@hive.apache.org'

Subject: Joins Failing



My join query is failing. Any suggestions on how I should troubleshoot this?



92651139753_0036_1_conf.xml to 
hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036_conf.xml_tmp

2014-03-24 13:48:58,244 INFO [Thread-65] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Copied to done 
location: 
hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036_conf.xml_tmp

2014-03-24 13:48:58,248 INFO [Thread-65] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to 
done: 
hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036.summary_tmp
 to 
hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036.summary

2014-03-24 13:48:58,249 INFO [Thread-65] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to 
done: 
hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036_conf.xml_tmp
 to 
hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036_conf.xml


2014-03-24 13:48:58,251 INFO [Thread-65] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Moved tmp to 
done: 
hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036-1395683232406-root-SELECT+COUNT%28A.UPC_ID%29+AS+CNT%0D%0AFR...B.UPC_ID%28Stage-1395683335043-0-0-FAILED-default.jhist_tmp
 to 
hdfs://dc-bigdata5.bateswhite.com:8020/mr-history/tmp/root/job_1392651139753_0036-1395683232406-root-SELECT+COUNT%28A.UPC_ID%29+AS+CNT%0D%0AFR...B.UPC_ID%28Stage-1395683335043-0-0-FAILED-default.jhist


2014-03-24 13:48:58,251 INFO [Thread-65] 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped 
JobHistoryEventHandler. super.stop()

2014-03-24 13:48:58,252 INFO [Thread-65] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Setting job 
diagnostics to Task failed task_1392651139753_0036_m_11

Job failed as tasks failed. failedMaps:1 failedReduces:0





Thanks Clay


  

RE: Issue with Querying External Hive Table created on hbase

2014-03-21 Thread java8964
I am not sure about your question.
Do you mean the query runs very fast if you run something like 'select * from 
hbase_table', but very slow for 'select * from hbase_table where row_key = ?'
I think it should be the other way round, right?
Yong

Date: Wed, 19 Mar 2014 11:42:39 -0700
From: sunil_ra...@yahoo.com
Subject: Issue with Querying External Hive Table created on hbase
To: user@hive.apache.org

Hi All
I am trying to query an external Hive table created on HBase (the HBase table 
is compressed using gzip). I get a quick response if I use "select * from 
hbase_acct_pref_dim_", but the query takes forever if I try to retrieve data 
based on the row_key.
hive> select * from hbase_acct_pref_dim_ where key = 30001;

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-03-19 11:14:04,432 Stage-1 map = 0%,  reduce = 0%
2014-03-19 11:15:04,617 Stage-1 map = 0%,  reduce = 0%
2014-03-19 11:16:04,792 Stage-1 map = 0%,  reduce = 0%
2014-03-19 11:17:04,969 Stage-1 map = 0%,  reduce = 0%
2014-03-19 11:18:05,140 Stage-1 map = 0%,  reduce = 0%
2014-03-19 11:19:05,315 Stage-1 map = 0%,  reduce = 0%
2014-03-19 11:20:05,484 Stage-1 map = 0%,  reduce = 0%
2014-03-19 11:21:05,667 Stage-1 map = 0%,  reduce = 0%
2014-03-19 11:22:05,835 Stage-1 map = 0%,  reduce = 0%

Any Help is appreciated. 

 Thanks, 
 Sunil S Ranka
 Blog :: http://sranka.wordpress.com
 Superior BI is the antidote to Business Failure
 
  NOTHING IS IMPOSSIBLE EVEN THE WORD 
 IMPOSSIBLE SAYS ,I M POSSIBLE. 

  

RE: Using an UDF in the WHERE (IN) clause

2014-03-11 Thread java8964
What version of Hive are you using?
It is good to know if it works in a newer version.
Yong

Date: Tue, 11 Mar 2014 08:33:06 +0100
Subject: Re: Using an UDF in the WHERE (IN) clause
From: petter.von.dolw...@gmail.com
To: user@hive.apache.org

Hi Young,

I must argue that the partition pruning does actually work if I don't use the 
IN clause. What I wanted to achieve in my original query was to specify a 
range of partitions in a simple way. The same query can be expressed as

SELECT * FROM mytable WHERE partitionCol >= UDF("2014-03-10") and partitionCol 
<= UDF("2014-03-11");

This UDF returns an INT (rather than an INT array). Both this UDF and the 
original one are annotated with @UDFType(deterministic = true) (if that has 
any impact). This variant works fine and does partition pruning. Note that I 
don't have another column as input to my UDF but a static value.


Thanks,
Petter




2014-03-11 0:16 GMT+01:00 java8964 java8...@hotmail.com:




I don't know from syntax point of view, if Hive will allow to do columnA IN 
UDF(columnB).
What I do know that even let's say above work, it won't do the partition 
pruning.

The partition pruning in Hive is strict static, any dynamic values provided to 
partition column won't enable partition pruning, even though it is a feature I 
missed too.
Yong


Date: Mon, 10 Mar 2014 16:23:01 +0100
Subject: Using an UDF in the WHERE (IN) clause
From: petter.von.dolw...@gmail.com
To: user@hive.apache.org


Hi,

I'm trying to get the following query to work. The parser don't like it. 
Anybody aware of a workaround?

SELECT * FROM mytable WHERE partitionCol IN my_udf(2014-03-10);



partitionCol is my partition column of type INT and I want to achieve early 
pruning. I've tried returning an array of INTs from my_udf and also a plain 
string in the format (1,2,3). It seems like the parser wont allow me to put an 
UDF in this place.



Any help appreciated.

Thanks,
Petter
  

  

RE: Using an UDF in the WHERE (IN) clause

2014-03-10 Thread java8964
I don't know, from a syntax point of view, whether Hive will allow "columnA IN 
UDF(columnB)".
What I do know is that even if the above worked, it won't do the partition 
pruning.
The partition pruning in Hive is strictly static; any dynamic value provided 
for a partition column won't enable partition pruning, even though it is a 
feature I miss too.
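A sketch of the distinction Yong is drawing, using the table from the question
below (whether a deterministic UDF gets folded into a constant may vary by
version, as the follow-up in this thread suggests):

-- static literals: the planner can prune partitions at compile time
SELECT * FROM mytable WHERE partitionCol IN (20140310, 20140311);

-- value computed at run time by a UDF: the planner cannot prune up front
SELECT * FROM mytable WHERE partitionCol = my_udf("2014-03-10");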
Yong

Date: Mon, 10 Mar 2014 16:23:01 +0100
Subject: Using an UDF in the WHERE (IN) clause
From: petter.von.dolw...@gmail.com
To: user@hive.apache.org

Hi,

I'm trying to get the following query to work. The parser doesn't like it. Is 
anybody aware of a workaround?

SELECT * FROM mytable WHERE partitionCol IN my_udf("2014-03-10");


partitionCol is my partition column of type INT and I want to achieve early 
pruning. I've tried returning an array of INTs from my_udf and also a plain 
string in the format (1,2,3). It seems like the parser won't allow me to put a 
UDF in this place.


Any help appreciated.

Thanks,
Petter
  

RE: Setting | Verifying | Hive Query Parameters from Java

2014-03-06 Thread java8964
If you want to set some properties of Hive, just run the set command as-is 
over your JDBC connection.
Any command sent through Hive JDBC goes to the server the same way as if you 
ran "set hive.server2.async.exec.threads=50;" in a Hive session.
Run the command "set hive.server2.async.exec.threads=50;" as a SQL statement 
and it will adjust the value for your JDBC connection.
About setting them through the java.util.Properties object passed to 
DriverManager.getConnection, I am not sure that works with Hive JDBC. Hive 
JDBC is a limited JDBC implementation, so it may not work, but I don't know 
for sure.
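For example, the properties from the question below can be sent as individual
statements once the connection is open (a sketch; the values simply mirror the
ones Rinku listed):

set hive.server2.async.exec.threads=50;
set hive.server2.thrift.max.worker.threads=500;
set hive.groupby.orderby.position.alias=false;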
Yong

From: rinku.g...@fisglobal.com
To: user@hive.apache.org
Subject: RE: Setting | Verifying | Hive Query Parameters from Java
Date: Thu, 6 Mar 2014 11:12:52 +









Hi All,
 
Can anybody help me on below mail trail.
 
Thanks
Rinku Garg
 


From: Garg, Rinku


Sent: Tuesday, March 04, 2014 5:14 PM

To: user@hive.apache.org

Subject: Setting | Verifying | Hive Query Parameters from Java


 
Hi All,
 
We have installed CDH4.2.0 and hive-0.10.0-cdh4.2.0. Both are working as 
desired.
 
We need to set hive configuration parameter from Java while making JDBC 
connection.
 
We have written a java program to execute queries on hive server with some 
configurations properties setting dynamically . We are doing it as below

 
CONNECTION_URL=jdbc:hive://master149:1/default
 
Next, we are doing following method to set properties through java
 
props.setProperty("hive.server2.async.exec.threads", "50");
props.setProperty("hive.server2.thrift.max.worker.threads", "500");
props.setProperty("hive.groupby.orderby.position.alias", "false");
 
and a hive connection is made as given below

 
hiveConnection = DriverManager.getConnection(connectionURL,props);
 
by above steps when a hive connection is made using hive-jdbc and we are 
getting hive query results as desired.
 
QUERY: 
 
1. Are we setting up the Hive properties correctly, and if so, how can we 
verify that?
2. If the above is not the right way, then how can we achieve setting Hive 
configuration parameters from Java using JDBC?
 
Thanks
Rinku Garg
 



_

The information contained in this message is proprietary and/or confidential. 
If you are not the intended recipient, please: (i) delete the message and all 
copies; (ii) do not disclose, distribute or use the message in any manner; and 
(iii) notify the sender immediately. In addition, please be aware that any 
message addressed to our domain is subject to archiving and review by persons 
other than the intended recipient. Thank you.

  

RE: Best way to avoid cross join

2014-03-05 Thread java8964
Hi, Wolli:
A cross join doesn't mean Hive has to use one reducer.
From the query point of view, the following cases will use one reducer:
1) ORDER BY in your query (instead of SORT BY)
2) Only one reducer group, which means all the data has to be sent to one 
reducer, as there is only one reducer group.
In your case, the distinct count of id1 will be the reducer group count. Did 
you explicitly set the reducer count in your Hive session?
Yong 
Date: Wed, 5 Mar 2014 14:17:24 +0100
Subject: Best way to avoid cross join
From: darkwoll...@gmail.com
To: user@hive.apache.org

Hey everyone,
before I write a lot of text, I'll just post something which is already 
written: http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx


The first post addresses a pretty similar problem I also have. Currently my 
implementation looks like this:

SELECT id1,
       MAX(CASE WHEN m.keyword IS NULL THEN 0
                WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
                ELSE 0 END) AS flag
FROM (select id1, keyword from import1) m
CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw
GROUP BY id1;

Since there is a cross join involved, the execution gets pinned down to 1 
reducer only and it takes ages to complete.

The thread I posted solves this with some special SQL Server tactics. But I 
was wondering if anybody has encountered this problem in Hive already and 
found a better way to solve it.

I'm using Hive 0.11 on a MapR distribution, if this is somehow important.
Cheers
Wolli

RE: Best way to avoid cross join

2014-03-05 Thread java8964
Sorry, my mistake. I didn't pay attention that you are using a cross join.
Yes, a cross join will always use one reducer, at least that is my 
understanding.
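One hedged way to sidestep the single-reducer cross join, assuming the
et_keywords table is small enough to fit in memory, is a map-side (broadcast)
join; a sketch using the tables from the query below, with the MAPJOIN hint
naming the small side:

SELECT /*+ MAPJOIN(prep_kw) */
       id1,
       MAX(CASE WHEN m.keyword IS NULL THEN 0
                WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
                ELSE 0 END) AS flag
FROM (select id1, keyword from import1) m
CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw
GROUP BY id1;

The join then happens in the mappers, and only the GROUP BY still needs
reducers.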
Yong

Date: Wed, 5 Mar 2014 15:27:48 +0100
Subject: Re: Best way to avoid cross join
From: darkwoll...@gmail.com
To: user@hive.apache.org

hey Yong,
Even without the group by (pure cross join) the query is only using one 
reducer. Even specifying more reducers doesn't help:
set mapred.reduce.tasks=50;

SELECT id1,   m.keyword, 
   prep_kw.keyword
FROM (select id1, keyword from import1) m
CROSS JOIN  (SELECT keyword FROM et_keywords) prep_kw;

...

Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1


What could be setup wrong here? Or can it be avoided to use this ugly cross 
join at all? I mean my original problem is actually something else ;-)

Cheers
Wolli


2014-03-05 15:07 GMT+01:00 java8964 java8...@hotmail.com:




Hi, Wolli:
Cross join doesn't mean Hive has to use one reduce.
From query point of view, the following cases will use one reducer:

1) Order by in your query (Instead of using sort by)2) Only one reducer group, 
which means all the data have to send to one reducer, as there is only one 
reducer group.
In your case, distinct count of id1 will be the reducer group count. Did you 
explicitly set the reducer count in your hive session?

Yong 
Date: Wed, 5 Mar 2014 14:17:24 +0100
Subject: Best way to avoid cross join
From: darkwoll...@gmail.com
To: user@hive.apache.org


Hey everyone,
before i write a lot of text, i just post something which is already 
written:http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx



The first posts adresses a pretty similar problem i also have. Currently my 
implementation looks like this:
SELECT id1,

  MAX(  CASEWHEN m.keyword IS NULLTHEN 0

WHEN instr(m.keyword, prep_kw.keyword)  0THEN 1

ELSE 0  END) AS flagFROM (select id1, keyword from import1) m

CROSS JOIN  (SELECT keyword FROM et_keywords) prep_kw

GROUP BY id1;
Since there is a cross join involved, the execution gets pinned down to 1 
reducer only and it takes ages to complete. 


The thread i posted is solving this with some special SQLserver tactics. But I 
was wondering if anybody has encountered the problem in Hive already and found 
a better way to solve this.


I'm using Hive 0.11 on a MapR Distribution, if this is somehow important.
CheersWolli   

  

RE: Metastore performance on HDFS-backed table with 15000+ partitions

2014-02-27 Thread java8964
That is good to know.
We are using Hive 0.9. Right now the biggest table contains 2 years of data, 
and we partition it by hour, as the data volume is big.
So right now it has 2*365*24, around 17,000+ partitions. So far we haven't 
seen too many problems yet, but I do have some concerns about it.
We are using IBM BigInsights, which uses Derby as the Hive metastore, not 
MySQL as in most of my experience.
Yong

From: norbert.bur...@gmail.com
Date: Thu, 27 Feb 2014 07:57:05 -0500
Subject: Re: Metastore performance on HDFS-backed table with 15000+ partitions
To: user@hive.apache.org

Thanks everyone for the feedback.  Just to follow up in case someone else runs 
into this: I can confirm that local client works around the OOMEs, but it's 
still very slow.
It does seem like we were hitting some combination of HIVE-4051 and HIVE-5158.  
We'll try reducing partition count first, and then switch to 0.12.0 if that 
doesn't improve things significantly.


Fwiw - http://www.slideshare.net/oom65/optimize-hivequeriespptx also has has 
some good rules-of-thumb.



Norbert

On Sat, Feb 22, 2014 at 1:27 PM, Stephen Sprague sprag...@gmail.com wrote:


yeah. That traceback pretty much spells it out - its metastore related and 
that's where the partitions are stored.





I'm with the others on this. HiveServer2 is still a little jankey on memory 
management.  I bounce mine once a day at midnight just to play it safe (and 
because i can.)





Again, for me, i use the hive local client for production jobs and remote 
client for adhoc stuff.

you may wish to confirm the local hive client has no problem with your query.





other than that you either increase your heap size on the HS2 process and hope 
for the best and/or file a bug report.





bottom line hiveserver2 isn't production bullet proof just yet, IMHO. Others 
may disagree.

Regards,
Stephen.







On Sat, Feb 22, 2014 at 9:50 AM, Norbert Burger norbert.bur...@gmail.com 
wrote:




Thanks all for the quick feedback.
I'm a bit surprised to learn 15k is considered too much, but we can work around 
it.  I guess I'm also curious why the query planner needs to know about all 
partitions even in the case of simple select/limit queries, where the query 
might target only a single partition.






Here's the client-side OOME with HADOOP_HEAPSIZE=2048:
https://gist.githubusercontent.com/nburger/3286d2052060e2efe161/raw/dc30231086803c1d33b9137b5844d2d0e20e350d/gistfile1.txt







This was from a CDH4.3.0 client hitting HIveServer2.  Any idea what's consuming 
the heap?
Norbert





On Sat, Feb 22, 2014 at 10:32 AM, Edward Capriolo edlinuxg...@gmail.com wrote:






Don't make tables with that many partitions. It is an anti-pattern. I have 
tables with 2000 partitions a day and that is really too many. Hive needs to 
load that information into memory to plan the query.



On Saturday, February 22, 2014, Terje Marthinussen tmarthinus...@gmail.com 
wrote:

 Query optimizer in hive is awful on memory consumption. 15k partitions sounds 
 a bit early for it to fail though..

 What is your heap size?

 Regards,
 Terje

 On 22 Feb 2014, at 12:05, Norbert Burger norbert.bur...@gmail.com wrote:








 Hi folks,

 We are running CDH 4.3.0 Hive (0.10.0+121) with a MySQL metastore.

 In Hive, we have an external table backed by HDFS which has a 3-level 
 partitioning scheme that currently has 15000+ partitions.








 Within the last day or so, queries against this table have started failing.  
 A simple query which shouldn't take very long at all (select * from ... 
 limit 10) fails after several minutes with a client OOME.  I get the same 
 outcome on count(*) queries (which I thought wouldn't send any data back to 
 the client).  Increasing heap on both client and server JVMs (via 
 HADOOP_HEAPSIZE) doesn't have any impact.








 We were only able to work around the client OOMEs by reducing the number of 
 partitions in the table.

 Looking at the MySQL querylog, my thought is that the Hive client is quite 
 busy making requests for partitions that doesn't contribute to the query.  
 Has anyone else had similar experience against tables this size?








 Thanks,
 Norbert


-- 
Sorry this was sent from mobile. Will do less grammar and spell check than 
usual.






  

RE: Hive query parser bug resulting in FAILED: NullPointerException null

2014-02-27 Thread java8964
Can you reproduce with an empty table? I can't reproduce it.
Also, can you paste the stack trace?
Yong

From: krishnanj...@gmail.com
Date: Thu, 27 Feb 2014 12:44:28 +
Subject: Hive query parser bug resulting in FAILED: NullPointerException null
To: user@hive.apache.org

Hi all,
we've experienced a bug which seems to be caused by having a query constraint 
involving partitioned columns. The following query results in FAILED: 
NullPointerException null being returned nearly instantly:



EXPLAIN SELECT col1
FROM tbl1
WHERE (part_col1 = 2014 AND part_col2 = 2) OR part_col1  2014;



The exception doesn't happen if any of the conditions are removed. The table is 
defined like the following:
CREATE TABLE tbl1 (
  col1    STRING,
  ...
  col12   STRING)
PARTITIONED BY (part_col1 INT, part_col2 TINYINT, part_col3 TINYINT)
STORED AS SEQUENCEFILE;



Unfortunately I cannot construct a test case to replicate this. Seeing as it 
appears to be a query parser bug, I thought the following would replicate it:

CREATE TABLE tbl2 LIKE tbl1;
EXPLAIN SELECT col1
FROM tbl2
WHERE (part_col1 = 2014 AND part_col2 = 2) OR part_col1  2014;



But it does not. Could it somehow be data specific? Does the query parser use 
partition information?
Are there any logs I could see to investigate this further? Or is this a known 
bug?


We're using hive 0.10.0-cdh4.4.0.

Cheers,
Krishna   

Hive trunk unit test failed

2014-02-26 Thread java8964
Hi, 
I tried to run all the tests of the current Hive trunk code on my local Linux 
x64 box. My "mvn clean package -DskipTests -Phadoop-2 -Pdist" works fine if I 
skip the tests.
The following unit test failed, and then the build stopped.
I traced the code down to a native method invoked at 
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native 
Method), which throws an InvocationTargetException.
My questions are:
1) Does it mean the native code is not available in my environment, causing 
the above error?
2) If so, since the latest Hive build uses Maven, and I can see all the 
hadoop-2.2.0 jar files downloaded into my local repository, why does this 
error still happen?
3) Is it possible that my local environment is 64-bit, but the default 
hadoop-2.2.0 comes with 32-bit native code? If so, how do I fix that during 
the Hive build?
Thanks
Yong
Running org.apache.hadoop.hive.serde2.avro.TestAvroSerdeUtils
Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.802 sec <<< FAILURE! - in org.apache.hadoop.hive.serde2.avro.TestAvroSerdeUtils
detemineSchemaTriesToOpenUrl(org.apache.hadoop.hive.serde2.avro.TestAvroSerdeUtils)  Time elapsed: 0.377 sec  <<< ERROR!
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method)
  at org.apache.hadoop.security.JniBasedUnixGroupsMapping.<clinit>(JniBasedUnixGroupsMapping.java:49)
  at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.<init>(JniBasedUnixGroupsMappingWithFallback.java:38)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
  at org.apache.hadoop.security.Groups.<init>(Groups.java:55)
  at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:182)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:235)
  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:214)
  at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:669)
  at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2590)
  at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2582)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2448)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
  at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:110)
  at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:71)
  at org.apache.hadoop.hive.serde2.avro.TestAvroSerdeUtils.detemineSchemaTriesToOpenUrl(TestAvroSerdeUtils.java:139)
  

RE: Hive trunk unit test failed

2014-02-26 Thread java8964
OK. Now I understand that this error is due to the missing Hadoop native 
library.
If I manually add libhadoop.so to java.library.path for this unit test, it 
passes.
So the hadoop 2.2.0 artifacts coming from the Maven repository either include 
a 32-bit Hadoop native library or miss it entirely.
Now the question is what is the correct way to run the unit tests in the new 
Maven build?
Thanks
Yong


  

RE: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3]

2014-02-25 Thread java8964
Works for me on 0.10.
Yong

Date: Tue, 25 Feb 2014 11:37:32 -0800
From: kumarbuyonl...@yahoo.com
Subject: Re: java.lang.RuntimeException: cannot find field key from [0:_col0, 
1:_col2, 2:_col3]
To: user@hive.apache.org

Hi,
Thanks for looking into it. I am also trying this on Hive 0.11 to see if it 
works there. If you get a chance to reproduce this problem on Hive 0.10, 
please let me know.
Thanks.

On Monday, February 24, 2014 10:59 PM, java8964 java8...@hotmail.com wrote:

My guess is that your UDTF returns an array of struct. I don't have Hive 0.10 
handy right now, but I wrote a simple UDTF returning an array of struct to 
test on the Hive 0.12 release.

hive> desc test;
OK
id      int     None
name    string  None
Time taken: 0.074 seconds, Fetched: 2 row(s)
hive> select * from test;
OK
1       Apples,Bananas,Carrots
Time taken: 0.08 seconds, Fetched: 1 row(s)

The pair UDTF will turn Apples,Bananas,Carrots into
Apples, Bananas
Apples, Carrots
Bananas, Carrots
i.e. an array of 2-element structs.

hive> select id, name, m1, m2 from test lateral view pair(name) p as m1, m2 where m1 is not null;
OK
1       Apples,Bananas,Carrots  Apples  Bananas
1       Apples,Bananas,Carrots  Apples  Carrots
1       Apples,Bananas,Carrots  Bananas Carrots
Time taken: 7.683 seconds, Fetched: 3 row(s)
hive> select id, name, m1, m2 from test lateral view pair(name) p as m1, m2 where m1 = 'Apples';
OK
1       Apples,Bananas,Carrots  Apples  Bananas
1       Apples,Bananas,Carrots  Apples  Carrots
Time taken: 7.726 seconds, Fetched: 2 row(s)
hive> set hive.optimize.ppd=true;
hive> select id, name, m1, m2 from test lateral view pair(name) p as m1, m2 where m1 is not null;
Total MapReduce jobs = 1
OK
1       Apples,Bananas,Carrots  Apples  Bananas
1       Apples,Bananas,Carrots  Apples  Carrots
1       Apples,Bananas,Carrots  Bananas Carrots
Time taken: 7.716 seconds, Fetched: 3 row(s)

I cannot reproduce your error in Hive 0.12, as you can see. I can test on 
Hive 0.10 tomorrow when I have time, but can you test your case in Hive 0.12, 
or review your UDTF again?
Yong

Date: Mon, 24 Feb 2014 07:09:44 -0800
From: kumarbuyonline@yahoo.com
Subject: Re: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3]
To: user@hive.apache.org; kumarbuyonline@yahoo.com

As suggested, I changed the query like this:

select x.f1, x.f2, x.f3, x.f4
from (
  select e.f1 as f1, e.f2 as f2, e.f3 as f3, e.f4 as f4
  from mytable LATERAL VIEW myfunc(p1,p2,p3,p4) e as f1,f2,f3,f4
  where lang=123
) x
where x.f3 is not null;

And it still doesn't work. I am getting the same error. If anyone has any 
ideas, please let me know.
Thanks.

On Friday, February 21, 2014 11:27 AM, Kumar V kumarbuyonl...@yahoo.com wrote:

Line 316 in my UDTF, where it shows the error, is the line where I call 
forward(). The whole trace is:

Caused by: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col6, 3:_col7, 4:_col8, 5:_col9]
  at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:346)
  at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:143)
  at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
  at org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:55)
  at org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:55)
  at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:128)
  at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:128)
  at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:128)
  at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:85)
  at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
  at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
  at org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator.processOp(LateralViewJoinOperator.java:133)
  at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
  at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
  at org.apache.hadoop.hive.ql.exec.UDTFOperator.forwardUDTFOutput(UDTFOperator.java:112)
  at org.apache.hadoop.hive.ql.udf.generic.UDTFCollector.collect(UDTFCollector.java:44)
  at org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.forward(GenericUDTF.java:81)
  at pb2CSVReadFile.FlatTableFileUDTFTx.process(FlatTableFileUDTFTx.java:316)
  at org.apache.hadoop.hive.ql.exec.UDTFOperator.processOp(UDTFOperator.java:98)
  at org.apache.hadoop.hive.ql.exec.Operator.process

RE: hive query to calculate percentage

2014-02-25 Thread java8964
One query alone won't work, as totalcount is not in the group by.
You have 2 options:
1) Use a sub query:

select a.timestamp_dt, a.totalcount / b.total_sum
from daily_count_per_kg_domain a
join (select timestamp_dt, sum(totalcount) as total_sum
      from daily_count_per_kg_domain
      group by timestamp_dt) b
  on (a.timestamp_dt = b.timestamp_dt);

2) If you are using Hive 0.11 or above, use window functions (see the sketch 
below).
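A hedged sketch of the windowed variant (option 2), using the columns of the
table described in the question below:

select timestamp_dt,
       ddnamesyskg,
       100 * totalcount / sum(totalcount) over (partition by timestamp_dt) as pct
from daily_count_per_kg_domain;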
Yong

Date: Tue, 25 Feb 2014 18:27:34 -0600
Subject: Re: hive query to calculate percentage
From: kkrishna...@gmail.com
To: user@hive.apache.org

Modify the query to:
select totalcount / sum(totalcount) from daily_count_per_kg_domain where 
timestamp_dt = '20140219' group by timestamp_dt;

If you don't specify the where clause, you will get results for all 
partitions.


On Tue, Feb 25, 2014 at 3:14 PM, Manish maa...@gmail.com wrote:

I have a partitioned table  on timestamp_dt:



 desc daily_count_per_kg_domain;

OK

ddnamesyskg string

totalcount  int

timestamp_dtstring



hive select * from daily_count_per_kg_domain;

OK

sys_kg_band 224 20140219

sys_kg_event    3435    20140219

sys_kg_movies   44987   20140219

sys_kg_oly      4172    20140219

sys_kg_sp_countr        5499    20140219

sys_kg_sports   3954    20140219

sys_kg_tv       21387   20140219

sys_kg_venue    152     20140219

sys_kgathlete   9000    20140219

sys_kgpeople    300064  20140219





Looking to compute percentages for each row, (per day):



100* totalcount / sum(totalcount)





Intuitively i tried :

 select totalcount / sum(totalcount) from daily_count_per_kg_domain where 
 timestamp_dt = '20140219' ;



FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY 
key 'totalcount'



I am not sure what group by on totalcount means..



Any ides ?






  

RE: part-m-00000 files and their size - Hive table

2014-02-25 Thread java8964
Yes, it is good for the file sizes to be close to even, but it is not very important, unless some files are very small (compared to the block size).
The reasons are:
Your files should be splittable to be used in Hadoop (or in Hive, it is the same thing). If they are splittable, then a 1G file will use 10 blocks (assuming the block size is 128M), and a 256M file will take 2 blocks. So these 2 files will generate 12 mapper tasks, which are spread equally across your cluster. From a performance point of view, you have 12 mapper tasks and they are processed equally in the cluster, so one 1G file plus one 256M file is no big deal. But if you have one file that is very small, like 10M, that file will still consume one mapper task, and that is bad for performance: starting and stopping a task uses quite some resources, but the task only processes 10M of data.
The reason you see uneven file sizes in the output of sqoop is that it is hard for sqoop to split your source data evenly. For example, if you dump table A from a DB to hive, sqoop will do the following:
1) Identify the primary/unique keys of the table.
2) Find out the min/max value of the keys; let's say they are (1 to 1,000,000).
3) Based on the # of your mapper tasks, split them. If you run sqoop with 4 mappers, then the data will be split into 4 groups: (1, 250,000) (250,001, 500,000) (500,001, 750,000) (750,001, 1,000,000). As you can imagine, your data most likely is not evenly distributed over the primary keys in those 4 groups, so you will get unevenly sized part-m-xxxxx files.
Keep in mind that it is not required to use primary keys or unique keys as the split column. You can choose whatever column in your table makes sense. Pick whichever one makes the split more even.
Yong
Date: Tue, 25 Feb 2014 17:42:20 -0800
From: hadoop...@yahoo.com
Subject: part-m-0 files and their size - Hive table
To: user@hive.apache.org

Hi,
I am loading data to HDFS files through sqoop and creating a Hive table to 
point to these files.
The mapper files through sqoop example are generated like this below.


part-m-0


part-m-1
part-m-2
My question is:
1) For Hive query performance, how important or significant is the distribution of the file sizes above?

part-m-00000 say 1 GB
part-m-00001 say 3 GB
part-m-00002 say 0.25 GB

Vs

part-m-00000 say 1.4 GB
part-m-00001 say 1.4 GB
part-m-00002 say 1.45 GB

NOTE: The sizes and number of files are just a sample. The real numbers are far bigger.

I am assuming the uniform distribution has a performance benefit.
If so, what is the reason, and can I know the technical details?
  

RE: java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3]

2014-02-21 Thread java8964
What is your stacktrace? Can you paste it here?
It may be a different bug.
If you put e.f3 <> null in an outer query, does that work?
Or maybe you have to enhance your UDTF to push that filter into the UDTF itself. It is not perfect, but maybe a solution for you for now.
You can create a new Jira if it is a new bug.
You can create a new Jira if it is a new bug.
Yong
Date: Fri, 21 Feb 2014 07:18:32 -0800
From: kumarbuyonl...@yahoo.com
Subject: java.lang.RuntimeException: cannot find field key from [0:_col0, 
1:_col2, 2:_col3]
To: user@hive.apache.org

Hi,
I have a UDTF which works fine except when I do a query like the following:

select e.* from mytable LATERAL VIEW myfunc(p1,p2,p3,p4) e as f1,f2,f3,f4 where lang=123 and e.f3 <> null;

The error I see is:

java.lang.RuntimeException: cannot find field key from [0:_col0, 1:_col2, 2:_col3]

If I remove 'and e.f3 <> null' from the WHERE clause, it works fine.
Also, with e.f3 <> null in the WHERE clause, if I add the setting hive.optimize.ppd=false it works fine, but now, instead of using 600 mappers, it uses about 10,000 mappers and runs for more than 2 hours instead of a few minutes.
I am using hive 0.10. I saw the jira HIVE-3226 which says that it has been fixed in hive 0.10. Is this the bug that I am hitting now? Any other ideas of how to make it work? I am actually on CDH 4.4, which has hive 0.10.
Please let me know.
Thanks,
Murali.

Hbase + Hive scan performance

2014-02-10 Thread java8964
Hi, 
I know this has been asked before. I did google around this topic and tried to 
understand as much as possible, but I kind of got difference answers based on 
different places. So I like to ask what I have faced and if someone can help me 
again on this topic.
I created one table with one column family with 20+ columns in the hive. It is 
populated around 150M records from a 20G csv file. What I want to check if how 
fast I can get for a full scan in MR job from the Hbase table.
It is running in a 10 nodes hadoop cluster (With Hadoop 1.1.1 + Hbase 0.94.3 + 
Hive 0.9) , 8 of them as Data + Task nodes, and one is NN and Hbase master, and 
another one is running 2nd NN.
4 nodes of 8 data nodes also run Hbase region servers.
I use the following code example to get row count from a MR job, 
http://hbase.apache.org/book/mapreduce.example.htmlAt first, the mapper tasks 
run very slow, as I commented out the following 2 lines on purpose:
scan.setCaching(1000);// 1 is the default in Scan, which will be bad 
for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
Then I added the above 2 lines, I almost get 10X faster compared to the first 
run. That's good, it proved to me that above 2 lines are important for Hbase 
full scan.
Now the question comes to in Hive.
I already created the table in the Hive linking to the Hbase table, then I 
started my hive session like this:
hive --auxpath 
$HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar,$HIVE_HOME/lib/hbase-0.94.3.jar,$HIVE_HOME/lib/zookeeper-3.4.5.jar,$HIVE_HOME/lib/guava-r09.jar
 -hiveconf hbase.master=Hbase_master:port
If I run this query select count(*) from table, I can see the mappers 
performance is very bad, almost as bad as my 1st run above.
I searched this mailing list, it looks like there is a setting in Hive session 
to change the scan caching size, same as 1st line of above code base, from here:
http://mail-archives.apache.org/mod_mbox/hbase-user/201110.mbox/%3CCAGpTDNfn11jZAJ2mfboEqkfudXaU9HGsY4b=2x1spwf4qmu...@mail.gmail.com%3E
So I add the following settings in my hive session:
set hbase.client.scanner.caching=1000;
To my surprise, after this setting in the hive session, the new MR job generated from the Hive query is still very slow, the same as before the setting.
Here is what I found so far:
1) In my own MR code, both before and after the 2-line code change, I saw this setting in the job.xml of the MR job: hbase.client.scanner.caching=1. So this setting is the same in both runs, but the performance improved greatly after the code change.
2) In the hive run, I saw the setting hbase.client.scanner.caching changed from 1 to 1000 in the job.xml, which is what I set in the hive session, but the performance did not change much. So the setting was changed, but it didn't help the performance as I expected.
My questions are the following:
1) Is there any setting in hive (0.9) that does the same as the 1st line of the code change? From google and the hbase documents, it looks like the above configuration is the one, but it didn't help me.
2) Even assuming the above setting is correct, why do we have a Hive Jira to fix the Hbase scan cache that is marked as ONLY fixed in Hive 0.12? The Jira ticket is here: https://issues.apache.org/jira/browse/HIVE-3603
3) Is there any hive setting that does the same as the 2nd line of the code change above? If so, what is it? I googled around and cannot find one.
Thanks
Yong  

RE: HiveMetaStoreClient only sees one of my DBs ?

2013-12-30 Thread java8964
Best mailing list for this question is hive, but I will try to give my guess 
here anyway.
If you only see the 'default' database, most likely you are using the hive 'LocalMetaStore'. To help yourself find the problem, try to find out the following information:
1) What kind of Hive metastore are you using? Did you configure anything? The most widely used are Derby or MySQL.
2) If you are using a remote Hive metastore, is the hive conf folder on the classpath of your Java program? From the following log, it looks like the wrong hive conf files are being picked up:
[2013-12-30 11:44:03,261] INFO ugi=myuser_name  ip=unknown-ip-addr  
cmd=get_all_databases   
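One quick way to check is to point the client at the hive-site.xml you expect explicitly and print the metastore URI (just a sketch; the path is whatever your installation uses):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

public class ListDbs {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Load the site file explicitly instead of relying on whatever is on the classpath.
    conf.addResource(new Path("/etc/hive/conf/hive-site.xml"));
    // If this prints null, the client falls back to a local (embedded Derby) metastore,
    // which typically only has the 'default' database.
    System.out.println("hive.metastore.uris = " + conf.get("hive.metastore.uris"));

    HiveMetaStoreClient cli = new HiveMetaStoreClient(conf);
    for (String db : cli.getAllDatabases()) {
      System.out.println(db);
    }
    cli.close();
  }
}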
Yong

From: tedd...@gmail.com
Date: Mon, 30 Dec 2013 11:48:55 -0800
Subject: HiveMetaStoreClient only sees one of my DBs ?
To: hive-u...@hadoop.apache.org

if I log into my hive shell, do show databases; , I see many DBs:
Logging initialized using configuration in 
file:/etc/hive/conf/hive-log4j.propertieshive show databases;

OKconfconfnewdefault
moneytestdbTime taken: 1.57 seconds, Fetched: 6 row(s)

but somehow if I run the following java code using the client API, it only 
shows default

public static void main(String args[]) throws MetaException {
    HiveMetaStoreClient cli = new HiveMetaStoreClient(new HiveConf());

    for (String db : cli.getAllDatabases())
        System.out.println(db);
}

stence)
[2013-12-30 11:44:02,900] INFO Validating 2 unique key(s) for table DBS (DataNucleus.Datastore.Schema)
[2013-12-30 11:44:02,913] INFO Validating 0 foreign key(s) for table DBS (DataNucleus.Datastore.Schema)
[2013-12-30 11:44:02,919] INFO Validating 2 index(es) for table DBS (DataNucleus.Datastore.Schema)
[2013-12-30 11:44:02,926] INFO Validating 1 unique key(s) for table DATABASE_PARAMS (DataNucleus.Datastore.Schema)
[2013-12-30 11:44:02,937] INFO Validating 1 foreign key(s) for table DATABASE_PARAMS (DataNucleus.Datastore.Schema)
[2013-12-30 11:44:02,942] INFO Validating 2 index(es) for table DATABASE_PARAMS (DataNucleus.Datastore.Schema)
[2013-12-30 11:44:03,087] INFO Listener found initialisation for persistable class org.apache.hadoop.hive.metastore.model.MDatabase (DataNucleus.MetaData)
[2013-12-30 11:44:03,211] INFO 0: get_all_databases (org.apache.hadoop.hive.metastore.HiveMetaStore)
[2013-12-30 11:44:03,261] INFO ugi=myuser_name  ip=unknown-ip-addr  cmd=get_all_databases (org.apache.hadoop.hive.metastore.HiveMetaStore.audit)

default





why is it showing only 1 db? what settings of "default" are different from the others to enable it to be shown? also I wonder how that HiveConf() is initialized? how does it even know the hive port and config settings? is it hardcoded to /etc/hive/conf/hive-site.xml?

thanks
Yang




  

Why from_utc_timestamp works for some bigint, but not others

2013-12-06 Thread java8964
Hi, I am using Hive 0.9.0, and I am not sure why from_utc_timestamp gives me an error for the following value, but works for others.
The following example shows 2 bigints as epoch values at millisecond precision. They are only about 11 seconds apart. One works fine in hive 0.9.0 with the from_utc_timestamp UDF, the other is just broken. Why?

yzhang@yzhang-linux:~$ hive
hive> use tracking2;
OK
Time taken: 1.569 seconds
hive> select from_utc_timestamp(1371746476159, 'EST') from track_sent;
OK
2013-06-20 07:41:16.318
2013-06-20 07:41:16.318
Time taken: 8.64 seconds
hive> select from_utc_timestamp(1371746487759, 'EST') from track_sent;
FAILED: Hive Internal Error: java.lang.IllegalArgumentException(Value out of range)
java.lang.IllegalArgumentException: Value out of range
 at java.sql.Timestamp.setNanos(Timestamp.java:287)
 at org.apache.hadoop.hive.ql.udf.generic.GenericUDFFromUtcTimestamp.applyOffset(GenericUDFFromUtcTimestamp.java:101)
at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFFromUtcTimestamp.evaluate(GenericUDFFromUtcTimestamp.java:94)
at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:127)
 at 
org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:214)
 at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:767)
  at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:888)
 at 
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
  at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
   at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
   at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:165)
  at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7755)
 at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:2310)
   at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:2112)
   at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:6165)
  at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:6136)
 at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6762)
 at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7531)
 at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)at 
org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)at 
org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258) at 
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)  at 
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406) at 
org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689) at 
org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611) at 
org.apache.hadoop.util.RunJar.main(RunJar.java:156) 
   

hive partition pruning on joining on partition column

2013-10-11 Thread java8964 java8964
I have a requirement I am trying to support in hive; I am not sure if it is doable.
I have hadoop 1.1.1 with Hive 0.9.0 (using derby as the meta store).
I partition my data by a dt column, so my table 'foo' has partitions like 'dt=2013-07-01' to 'dt=2013-07-30'.
Now the user wants to query the data of Saturdays only.
To make it flexible, instead of asking the end user to find out which dates in that month are Saturdays, I add a lookup table (just call it 'bar') in HIVE with the following columns:
year, month, day, dt_format, week_of_day
So I want to see if I can join foo and bar and still get partition pruning:

select *
from foo
join bar
on (bar.year=2013 and bar.month=7 and bar.day_of_week=6 and bar.dt_format = foo.dt)

I tried several ways, like switching the table order, joining with a subquery, etc.; none of them makes partition pruning work on table foo in this case.
Is this really achievable in hive?
Thanks
Yong  

RE: Question about how to add the debug info into the hive core jar

2013-03-20 Thread java8964 java8964

I am not sure the existing logging information is enough for me.
The exception trace is as following:
Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
    at java.util.ArrayList.rangeCheck(ArrayList.java:604)
    at java.util.ArrayList.get(ArrayList.java:382)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
It is hive 0.9.0, and I looked into the source code of LazySimpleSerDe.java around line 485:
      List<? extends StructField> fields = soi.getAllStructFieldRefs();
      list = soi.getStructFieldsDataAsList(obj);
      if (list == null) {
        out.write(nullSequence.getBytes(), 0, nullSequence.getLength());
      } else {
        for (int i = 0; i < list.size(); i++) {
          if (i > 0) {
            out.write(separator);
          }
          serialize(out, list.get(i), fields.get(i).getFieldObjectInspector(),   // <-- line 485
              separators, level + 1, nullSequence, escaped, escapeChar, needsEscape);
        }
      }
For this exception to happen, the soi (which is my StructObjectInspector class) must return collections of different lengths as "fields" and "list". But I already added a logger in my StructObjectInspector, which proves that collections of the same length are returned from both getAllStructFieldRefs() and getStructFieldsDataAsList(Object). So I really don't know how this exception could happen in the Hive code.
I have 2 options right now:
1) Change the above code to add more debug information, to check at runtime what kind of content is in either the "fields" object or the "list" object, and understand why their lengths are not the same. But I have a problem making my new jar get loaded by hadoop.
2) Enable remote debug. There are very limited examples on the internet about how to enable remote debugging of the hive server side MR jobs; some wiki pages claim it is doable, but without concrete examples.
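For option 2, one possible approach (just a sketch, not something I have verified here; these are standard JVM JDWP options plus stock Hadoop 1.x job properties) is to pass debug options to the task child JVM from the hive session, and run the failing query with a single map task so only one JVM grabs the port:

-- raise the task timeout so the suspended task is not killed while the debugger attaches
set mapred.task.timeout=1800000;
-- make each task JVM wait for a remote debugger on port 8000
set mapred.child.java.opts=-Xmx512m -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000;

Then attach a remote debugger from the IDE to the node running the task, with the modified hive-serde source on the debugger's source path.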
Thanks

From: ashet...@hortonworks.com
Subject: Re: Question about how to add the debug info into the hive core jar
Date: Wed, 20 Mar 2013 17:35:36 -0700
To: user@hive.apache.org

Hi Yong, 
Have you tried running the Hive query in debug mode? The Hive log level can be changed by passing the following conf while the hive client is running:
#hive -hiveconf hive.root.logger=ALL,console -e "DDL statement";
#hive -hiveconf hive.root.logger=ALL,console -f ddl.sql;
Hope this helps
 Thanks

On Mar 20, 2013, at 1:45 PM, java8964 java8964 java8...@hotmail.com wrote:
Hi,
I have hadoop running in pseudo-distributed mode on my linux box. Right now I face a problem with Hive, which throws an Exception on a table for some data that uses my custom SerDe and InputFormat class.
To help me trace the root cause, I need to modify the code of org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe to add more debug logging information, to understand why the exception happens.
After I modify the hive code, I can compile it and generate a new 
hive-serde.jar file, with the same name as the release version, just size 
changed.
Now I put my new hive-serde.jar under $HIVE_HOME/lib folder, replace the old 
one, and run the query which failed. But after the failure, if I check the 
$HADOOP_HOME/logs/user_logs/, I saw the Exception stacktrace still looked like 
generated by the original hive-serde class. The reason is that the line number 
shown in the log doesn't match with the new code I changed to add the debug 
information.
My question is: if I have this new compiled hive-serde.jar file, besides $HIVE_HOME/lib, where else should I put it?
1) This is a pseudo environment. Everything (namenode, data node, job tracker and tasktracker) is running on one box.
2) After I replace hive-serde.jar with my new jar, I even stop all the hadoop java processes and restart them.
3) But when I run the query in the hive session, I still see the log generated by the old hive-serde.jar class. Why?
Thanks for any help
Yong
  

RE: difference between add jar in hive session and hive --auxpath

2013-03-08 Thread java8964 java8964

This is in HIVE-0.9.0
hive> list jars;
/nfs_home/common/userlibs/google-collections-1.0.jar
/nfs_home/common/userlibs/elephant-bird-hive-3.0.7.jar
/nfs_home/common/userlibs/protobuf-java-2.3.0.jar
/nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar
file:/usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar
hive> desc table;
java.lang.NoClassDefFoundError: com/twitter/elephantbird/mapreduce/io/ProtobufConverter
    at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.initialize(ProtobufDeserializer.java:45)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:203)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:260)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:253)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:490)at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:162)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:930)at 
org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:844)at 
org.apache.hadoop.hive.ql.exec.DDLTask.describeTable(DDLTask.java:2545)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:309)at 
org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57) 
   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1331)at 
org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1117)at 
org.apache.hadoop.hive.ql.Driver.run(Driver.java:950)at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:744)at 
org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607)at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)   
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)at 
org.apache.hadoop.util.RunJar.main(RunJar.java:208)Caused by: 
java.lang.ClassNotFoundException: 
com.twitter.elephantbird.mapreduce.io.ProtobufConverterat 
java.net.URLClassLoader$1.run(URLClassLoader.java:202)at 
java.security.AccessController.doPrivileged(Native Method)at 
java.net.URLClassLoader.findClass(URLClassLoader.java:190)at 
java.lang.ClassLoader.loadClass(ClassLoader.java:307)at 
java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    ... 25 more
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> exit;
[y130zhan@daca2 userlibs]$ jar tvf /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar | grep ProtobufConverter
  4825 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter.class
   732 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter$1.class

From: vkavul...@outlook.com
To: user@hive.apache.org
Subject: RE: difference between add jar in hive session and hive --auxpath
Date: Thu, 7 Mar 2013 16:44:41 -0800




If properly done, add jar jar-file should work the same as passing the jar 
with --auxpath. Can you run list jars; command from CLI or Hue and check if 
you see the jar file.

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: difference between add jar in hive session and hive --auxpath
Date: Thu, 7 Mar 2013 17:47:26 -0500





Hi, 
I have a hive table which uses the jar file provided from the elephant-bird, 
which is a framework integrated between lzo and google protobuf data and 
hadoop/hive.
If I use the hive command like this:
hive --auxpath path_to_jars, it works fine to query my table, 
but if I use the add jar after I started the hive session, I will get 
ClassNotFoundException in the runtime of my query of the classes in those jars.
My questions are:
1) What is the difference between hive --auxpath and add jar in the hive session?
2) This problem makes it hard to access my table in HUE, as it only supports add jar, but not the --auxpath option. Any suggestions?

Thanks
Yong
  

A bug belongs to Hive or Elephant-bird

2013-03-08 Thread java8964 java8964

Hi, 
Hive 0.9.0 + Elephant-Bird 3.0.7
I faced a problem using elephant-bird with hive. I know what may cause this problem, but I don't know which side this bug belongs to. Let me explain what the problem is.
If we define a google protobuf file with a field name like 'dateString' (the field contains an uppercase 'S'), then when I query the table like this: select dateString from table ...

I will get the following exception trace:
Caused by: java.lang.RuntimeException: cannot find field datestring from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@49aacd5f ...
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:321)
    at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:96)
    at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:878)
    at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:904)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:73)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
    at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98)

Here is the code for the method throws this error:
  public static StructField getStandardStructFieldRef(String fieldName,
      List<? extends StructField> fields) {
    fieldName = fieldName.toLowerCase();
    for (int i = 0; i < fields.size(); i++) {
      if (fields.get(i).getFieldName().equals(fieldName)) {
        return fields.get(i);
      }
    }
    // For backward compatibility: fieldNames can also be integer Strings.
    try {
      int i = Integer.parseInt(fieldName);
      if (i >= 0 && i < fields.size()) {
        return fields.get(i);
      }
    } catch (NumberFormatException e) {
      // ignore
    }
    throw new RuntimeException("cannot find field " + fieldName + " from " + fields);
    // return null;
  }
I understand the problem happens because at this point the fieldName is "datestring" (all lowercase characters), but the list of fields contains the fieldName for that field as "dateString", and that is why the RuntimeException happens.
But I don't know which side this bug belongs to, and I want to know more inside detail about the Hive implementation contract.
From this link:
https://cwiki.apache.org/Hive/user-faq.html#UserFAQ-AreHiveQLidentifiers%2528e.g.tablenames%252Ccolumnnames%252Cetc%2529casesensitive%253F
I know that in hive the table name and column name should be case insensitive, so even though in my query I used "select dateString", the fieldName is changed to "datestring" in the code, but the StructField of the ObjectInspector from elephant-bird returns the EXACT field name defined in the code, "dateString" in this case. Of course, I can change my protobuf file to only use lowercase field names to bypass this bug, but my questions are:
1) If I implement my own ObjectInspector, should I pay attention to the field name? Does it need to be lowercase?
2) I would consider this a bug of hive, right? If this line:
fieldName = fieldName.toLowerCase();
lowercases the data, then the comparison should also lowercase it, by changing
if (fields.get(i).getFieldName().equals(fieldName))
to
if (fields.get(i).getFieldName().toLowerCase().equals(fieldName))
right?
Thanks
Yong
  

difference between add jar in hive session and hive --auxpath

2013-03-07 Thread java8964 java8964

Hi, 
I have a hive table which uses the jar files provided by elephant-bird, which is a framework integrating lzo and google protobuf data with hadoop/hive.
If I use the hive command like this:
hive --auxpath path_to_jars, it works fine to query my table,
but if I use "add jar" after I start the hive session, I will get a ClassNotFoundException at the runtime of my query for the classes in those jars.
My questions are:
1) What is the difference between hive --auxpath and add jar in the hive session?
2) This problem makes it hard to access my table in HUE, as it only supports add jar, but not the --auxpath option. Any suggestions?

Thanks
Yong  

RE: reg : getting table values in inputFormat in serde

2012-12-21 Thread java8964 java8964

Actually I am backing this question up. In addition to that, I wonder if it is possible to access the table properties from a UDF too.
I also have XML data, but with namespaces in it. The XPATH UDF coming with HIVE doesn't support namespaces. Supporting namespaces in XML is simple; you just need a NamespaceContextImpl. But the tricky part is that to have a generic solution, we want some properties to be passed to the underlying InputFormat or an xpath_ns UDF, for example the prefix/URI mapping, or the real class name that implements the NamespaceContext interface. But right now it is not easy to pass this kind of information, even if we define it in the table properties.
Right now I have to kind of hard-code it into the UDF or InputFormat, instead of providing a generic solution.
Yong

From: mohit_chaudhar...@infosys.com
To: user@hive.apache.org
Subject: RE: reg : getting table values in inputFormat in serde
Date: Fri, 21 Dec 2012 11:39:54 +









Actually I am storing data from an xml file into a hive table using a serde.
Now I want the table properties which I gave during table creation to be available in the inputformat.
 
 
From: Nitin Pawar [mailto:nitinpawar...@gmail.com]


Sent: Friday, December 21, 2012 3:16 PM

To: user@hive.apache.org

Subject: Re: reg : getting table values in inputFormat in serde
 

What kind of table properties do you want to access? And what do you want to achieve from it in the serde?


 

On Fri, Dec 21, 2012 at 11:04 AM, Mohit Chaudhary01 
mohit_chaudhar...@infosys.com wrote:


Hi

In serde program I need table properties in inputformat file .

 So can anyone please tell me how can I do this?
 
 
thanks
















 

-- 

Nitin Pawar

  

xpath UDF in hive support namespace?

2012-12-19 Thread java8964 java8964

Hi, I have a question related to the XPATH UDF currently in HIVE.
From the original Jira story about this UDF (https://issues.apache.org/jira/browse/HIVE-1027), it looks like the UDF won't support namespaces in the XML. Is that true?
Does any later HIVE version support namespaces? If so, which version? And if this UDF does support namespaces, does anyone have some examples of how to do it with the xpath UDF of hive? All the examples on the wiki page are without namespaces.

Thanks
Yong  

RE: xpath UDF in hive support namespace?

2012-12-19 Thread java8964 java8964

Hi, Mark:
Thanks for your response. I am thinking about how to change it to support namespaces. The easy way is to allow the end user to pass in properties mapping prefixes to URIs, but from a HIVE UDF it is not easy to allow end users to pass in properties the way a SerDe can. Maybe that's the reason why it wasn't supported initially.
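For what it's worth, the NamespaceContextImpl piece itself is small; here is a minimal sketch with a hard-coded prefix/URI map (the class name is just for illustration):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

public class SimpleNamespaceContext implements NamespaceContext {
  private final Map<String, String> prefixToUri;

  public SimpleNamespaceContext(Map<String, String> prefixToUri) {
    this.prefixToUri = prefixToUri;
  }

  public String getNamespaceURI(String prefix) {
    String uri = prefixToUri.get(prefix);
    return uri != null ? uri : XMLConstants.NULL_NS_URI;
  }

  public String getPrefix(String namespaceURI) {
    for (Map.Entry<String, String> e : prefixToUri.entrySet()) {
      if (e.getValue().equals(namespaceURI)) {
        return e.getKey();
      }
    }
    return null;
  }

  public Iterator<String> getPrefixes(String namespaceURI) {
    List<String> prefixes = new ArrayList<String>();
    for (Map.Entry<String, String> e : prefixToUri.entrySet()) {
      if (e.getValue().equals(namespaceURI)) {
        prefixes.add(e.getKey());
      }
    }
    return prefixes.iterator();
  }
}

The hard part, as I said above, is getting that prefix/URI map from the table properties (or from a UDF argument) instead of hard-coding it.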
Yong

 Date: Wed, 19 Dec 2012 14:04:20 -0800
 Subject: Re: xpath UDF in hive support namespace?
 From: grover.markgro...@gmail.com
 To: user@hive.apache.org
 
 Hi, the source code for the xpath UDFs is at
 https://github.com/apache/hive/tree/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/xml
 
 I checked and saw that there were no changes made to these UDFs after
 there initial creation by HIVE-1027. So, that's all we have. Sorry.
 However, if you would like make additions to those UDFs, you are
 welcome to do so by creating a JIRA and posting a patch. UDFs are an
 easy and excellent way to contribute back to the Hive community.
 
 Thanks!
 
 Mark
 
 On Wed, Dec 19, 2012 at 8:52 AM, java8964 java8964 java8...@hotmail.com 
 wrote:
  Hi, I have a question related to the XPATH UDF currently in HIVE.
 
  From the original Jira story about this UDF:
  https://issues.apache.org/jira/browse/HIVE-1027, It looks like the UDF won't
  support namespace in the XML, is that true?
 
  Any later HIVE version does support namespace, if so, what is the version?
  And also if this UDF does support namespace, does anyone have some examples
  how to do it in xpath UDF of hive? All the examples in the wiki page are
  without name space.
 
 
  Thanks
 
  Yong
  

RE: Array index support non-constant expresssion

2012-12-12 Thread java8964 java8964

OK.
I followed the hive source code of org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the UDF. It is quite simple.
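For reference, here is a minimal sketch of what such a UDF can look like (this is not my exact code; the class name is illustrative, and it follows the usual GenericUDF pattern):

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

// darray(arr, idx): returns arr[idx], where idx can be any int-valued expression,
// not just a constant.
public class GenericUDFDynamicArrayIndex extends GenericUDF {
  private ListObjectInspector listOI;
  private PrimitiveObjectInspector indexOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 2) {
      throw new UDFArgumentException("darray(array, index) takes exactly 2 arguments");
    }
    if (!(arguments[0] instanceof ListObjectInspector)) {
      throw new UDFArgumentException("the first argument of darray must be an array");
    }
    listOI = (ListObjectInspector) arguments[0];
    indexOI = (PrimitiveObjectInspector) arguments[1];
    // the result type is whatever the array elements are
    return listOI.getListElementObjectInspector();
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    Object list = arguments[0].get();
    Object index = arguments[1].get();
    if (list == null || index == null) {
      return null;
    }
    int i = PrimitiveObjectInspectorUtils.getInt(index, indexOI);
    // getListElement returns null when the index is out of range
    return listOI.getListElement(list, i);
  }

  @Override
  public String getDisplayString(String[] children) {
    return "darray(" + children[0] + ", " + children[1] + ")";
  }
}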
It works fine as I expected for simple cases, but when I try to run it in a more complex query, the hive MR jobs fail with some strange errors. What I mean is that it fails inside the HIVE code base; from the stack trace, I cannot see that this failure has anything to do with my custom code.
I would like some help if someone can tell me what went wrong.
For example, I created this UDF called darray, standing for dynamic array, which supports a non-constant value as the index location of the array.
The following query works fine as I expected:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.index_loc) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a limit 5;
POI
ADDRESS   some address
POI
POI
ADDRESSS  some address

Of course, in this case I only want the rows with provider_str = 'POI' returned, and I want to filter out any rows with provider_str != 'POI', so it sounds simple. I changed the query to the following:

hive> select c_poi.provider_str as provider_str, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from search_table lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_201212031001_0100, Tracking URL = http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
Kill Command = /home/yzhang/hadoop/bin/hadoop job -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
2012-12-12 11:45:24,090 Stage-1 map = 0%,  reduce = 0%
2012-12-12 11:45:43,173 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201212031001_0100 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
I only added a WHERE condition, but to my surprise, the MR job generated by HIVE failed. I am testing this on my local standalone cluster, which is running the CDH3U3 release. When I check the hadoop userlogs, here is what I got:
2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 
SELECT 
struct_col0:bigint,_col1:string,_col2:string,_col3:string,_col4:string,_col5:string,_col6:boolean,_col7:boolean,_col8:boolean,_col9:boolean,_col10:boolean,_col11:boolean,_col12:string,_col13:string,_col14:structlat:double,lon:double,query_text_raw:string,query_text_normalized:string,query_string:string,llcountry:string,ipcountry:string,request_cnt:int,address:structcountry:string,state:string,zip:string,city:string,street:string,house:string,categories_id:arrayint,categories_name:arraystring,lang_raw:string,lang_rose:string,lang:string,viewport:structtop_lat:double,left_lon:double,bottom_lat:double,right_lon:double,_col15:structversions:int,physical_host:string,nose_request_id:string,client_type:string,ip:int,time_taken:int,user_agent:string,http_host:string,http_referrer:string,http_status:smallint,http_size:int,accept_language:string,md5:string,datacenter:string,tlv_map_data_version:string,tlv_devide_software_version:string,csid:int,rid:string,xncrid:string,cbfn:string,sources:arraystructtm:bigint,tm_date:string,tm_time:string,md5:string,time_taken:int,_col16:arraystructprovider_str:string,name:string,lat:double,lon:double,dyn:boolean,authoritative:boolean,search_center:boolean,_col17:arraystructrank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean,_col18:string,_col19:structrank:int,action:int,tm:bigint,event:string,is_csid:boolean,is_rid:boolean,is_pbapi:boolean,is_nac:boolean2012-12-12
 11:40:22,440 WARN org.apache.hadoop.mapred.Child: Error running 
childjava.lang.RuntimeException: Error in configuring objectat 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)  
  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)at 
org.apache.hadoop.mapred.Child$4.run(Child.java:270)at 
java.security.AccessController.doPrivileged(Native Method)at 
javax.security.auth.Subject.doAs(Subject.java:396)at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)Caused by: 
java.lang.reflect.InvocationTargetExceptionat 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

RE: Array index support non-constant expresssion

2012-12-12 Thread java8964 java8964

Hi,
I played with my query further, and found it very puzzling to explain the following behaviors:
1) The following query works:

select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a

I get all the results from the above query without any problem.
2) The following query does NOT work:

select c_poi.provider_str, c_poi.name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI'

As long as I add the where criteria on provider_str, or even add another level of sub query like the following:

select ps, name
from (select c_poi.provider_str as ps, c_poi.name as name from (select darray(search_results, c.rank) as c_poi from nulf_search lateral view explode(search_clicks) clickTable as c) a) b
where ps = 'POI'

with any kind of criteria I try to add on provider_str, the hive MR jobs fail with the same error shown below.
Any idea why this happens? Is it related to the data? But provider_str is just a simple String type.
Thanks
Yong
From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Array index support non-constant expresssion
Date: Wed, 12 Dec 2012 12:15:27 -0500





OK. 
I followed the hive source code of 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the 
UDF. It is quite simple. 
It works fine as I expected for simple case, but when I try to run it under 
some complex query, the hive MR jobs failed with some strange errors. What I 
mean is that it failed in HIVE code base, from stuck trace, I can not see this 
failure has anything to do with my custom code.
I would like some help if some one can tell me what went wrong.
For example, I created this UDF called darray, stand for dynamic array, which 
supports the non-constant value as the index location of the array.
The following query works fine as I expected:
hive select c_poi.provider_str as provider_str, c_poi.name as name from 
(select darray(search_results, c.index_loc) as c_poi from search_table lateral 
view explode(search_clicks) clickTable as c) a limit 5;POI  
   ADDRESS   some addressPOIPOI 
   ADDRESSS some address
Of course, in this case, I only want the provider_str = 'POI' returned, and 
filter out any rows with provider_str != 'POI', so it sounds simple, I changed 
the query to the following:
hive select c_poi.provider_str as provider_str, c_poi.name as name from 
(select darray(search_results, c.rank) as c_poi from search_table lateral view 
explode(search_clicks) clickTable as c) a where c_poi.provider_str = 'POI' 
limit 5;Total MapReduce jobs = 1Launching Job 1 out of 1Number of reduce tasks 
is set to 0 since there's no reduce operatorCannot run job locally: Input Size 
(= 178314025) is larger than hive.exec.mode.local.auto.inputbytes.max (= 
134217728)Starting Job = job_201212031001_0100, Tracking URL = 
http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100Kill 
Command = /home/yzhang/hadoop/bin/hadoop job  
-Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_01002012-12-12 
11:45:24,090 Stage-1 map = 0%,  reduce = 0%2012-12-12 11:45:43,173 Stage-1 map 
= 100%,  reduce = 100%Ended Job = job_201212031001_0100 with errorsFAILED: 
Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
I am only add a Where limitation, but to my surprise, the MR jobs generated by 
HIVE failed. I am testing this in my local standalone cluster, which is running 
CDH3U3 release. When I check the hadoop userlog, here is what I got:
2012-12-12 11:40:22,421 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 
SELECT 

RE: Array index support non-constant expresssion

2012-12-12 Thread java8964 java8964
) 
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) 
   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433) 
   at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)   
 at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133)
at 
org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444)   
 at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)   
 at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98) 
   ... 22 more2012-12-12 20:36:21,365 INFO org.apache.hadoop.mapred.Task: 
Runnning cleanup for the task


 Date: Thu, 13 Dec 2012 09:06:25 +0900
 Subject: Re: Array index support non-constant expresssion
 From: navis@nexr.com
 To: user@hive.apache.org
 
 Could you try it with CP/PPD disabled?
 
 set hive.optimize.cp=false;
 set hive.optimize.ppd=false;
 
 2012/12/13 java8964 java8964 java8...@hotmail.com:
  Hi,
 
  I played my query further, and found out it is very puzzle to explain the
  following behaviors:
 
  1) The following query works:
 
  select c_poi.provider_str, c_poi.name from (select darray(search_results,
  c.rank) as c_poi from nulf_search lateral view explode(search_clicks)
  clickTable as c) a
 
  I get get all the result from the above query without any problem.
 
  2) The following query NOT works:
 
  select c_poi.provider_str, c_poi.name from (select darray(search_results,
  c.rank) as c_poi from nulf_search lateral view explode(search_clicks)
  clickTable as c) a where c_poi.provider_str = 'POI'
 
  As long as I add the where criteria on provider_str, or even I added another
  level of sub query like following:
 
  select
  ps, name
  from
  (select c_poi.provider_str as ps, c_poi.name as name from (select
  darray(search_results, c.rank) as c_poi from nulf_search lateral view
  explode(search_clicks) clickTable as c) a ) b
  where ps = 'POI'
 
  any kind of criteria I tried to add on provider_str, the hive MR jobs failed
  in the same error I shown below.
 
  Any idea why this happened? Is it related to the data? But provider_str is
  just a simple String type.
 
  Thanks
 
  Yong
 
  
  From: java8...@hotmail.com
  To: user@hive.apache.org
  Subject: RE: Array index support non-constant expresssion
  Date: Wed, 12 Dec 2012 12:15:27 -0500
 
 
  OK.
 
  I followed the hive source code of
  org.apache.hadoop.hive.ql.udf.generic.GenericUDFArrayContains and wrote the
  UDF. It is quite simple.
 
  It works fine as I expected for simple case, but when I try to run it under
  some complex query, the hive MR jobs failed with some strange errors. What I
  mean is that it failed in HIVE code base, from stuck trace, I can not see
  this failure has anything to do with my custom code.
 
  I would like some help if some one can tell me what went wrong.
 
  For example, I created this UDF called darray, stand for dynamic array,
  which supports the non-constant value as the index location of the array.
 
  The following query works fine as I expected:
 
  hive select c_poi.provider_str as provider_str, c_poi.name as name from
  (select darray(search_results, c.index_loc) as c_poi from search_table
  lateral view explode(search_clicks) clickTable as c) a limit 5;
  POI 
  ADDRESS   some address
  POI
  POI
  ADDRESSS some address
 
  Of course, in this case, I only want the provider_str = 'POI' returned, and
  filter out any rows with provider_str != 'POI', so it sounds simple, I
  changed the query to the following:
 
  hive select c_poi.provider_str as provider_str, c_poi.name as name from
  (select darray(search_results, c.rank) as c_poi from search_table lateral
  view explode(search_clicks) clickTable as c) a where c_poi.provider_str =
  'POI' limit 5;
  Total MapReduce jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks is set to 0 since there's no reduce operator
  Cannot run job locally: Input Size (= 178314025) is larger than
  hive.exec.mode.local.auto.inputbytes.max (= 134217728)
  Starting Job = job_201212031001_0100, Tracking URL =
  http://blevine-desktop:50030/jobdetails.jsp?jobid=job_201212031001_0100
  Kill Command = /home/yzhang/hadoop/bin/hadoop job
  -Dmapred.job.tracker=blevine-desktop:8021 -kill job_201212031001_0100
  2012-12-12 11:45:24,090 Stage-1 map = 0%,  reduce = 0%
  2012-12-12 11:45:43,173 Stage-1 map = 100%,  reduce = 100%
  Ended Job = job_201212031001_0100 with errors
  FAILED: Execution Error, return code 2 from
  org.apache.hadoop.hive.ql.exec.MapRedTask
 
  I am only add a Where limitation, but to my surprise, the MR jobs generated
  by HIVE failed. I am testing this in my local standalone cluster, which is
  running CDH3U3 release. When I check the hadoop userlog, here is what

Array index support non-constant expresssion

2012-12-11 Thread java8964 java8964

Hi, in our project we use HIVE on the CDH3U4 release (Hive 0.7.1), and I have a hive table like the following:

Table foo (
  search_results array<struct<id:bigint, ...>>,
  search_clicks array<struct<index_loc:int, ...>>
)

As you can see, the 2nd column, which represents the list of search results clicked, contains the index location of which result was clicked (starting from 0). Now I need to find out the click count of some IDs, but these IDs only exist in the search_results struct.
When I tried to write a query for this, I first tried the following:

select search_results[c.index_loc] from foo lateral view explode(search_clicks) clickTable as c;

But it won't work in hive, as the following error message is returned:
FAILED: Error in semantic analysis: Line 1:7 Non-constant expressions for array indexes not supported rank.
It looks like Hive (at least 0.7) currently does NOT support non-constant expressions as the array index.
I searched on google, and found the following HIVE jira ticket:
https://issues.apache.org/jira/browse/HIVE-1955
It looks like someone has made the same request to support it, but it is not there yet.
But there is a comment in the above ticket that it can be done in a UDF. My question is, can anyone share some ideas about how to achieve this in a UDF, as it may be the only option for me right now?
Thanks

RE: need help on writing hive query

2012-10-31 Thread java8964 java8964

If you don't need to join current_web_page and previous_web_page, assuming you can just trust the timestamp, as Phil points out, a custom collect_list() UDAF is the way to go.
You need to implement the collect_list() UDAF yourself; hive doesn't have one by default. But it should be straightforward. In fact, you can reuse the code of collect_set() and replace the internal set with a Java ArrayList, then:

select user_id, collect_list(user_current_web_page)
from
(select user_id, user_current_web_page
 order by user_id asc, user_visiting_time asc) a
group by user_id
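If it helps, here is a minimal sketch of such a collect_list using the older, simpler UDAF interface (collect_set() itself is written against the GenericUDAF framework, so this is a simplification; the class name and string-only signature are just for illustration):

import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public final class UDAFCollectList extends UDAF {

  public static class CollectListEvaluator implements UDAFEvaluator {
    private ArrayList<String> buffer;

    public CollectListEvaluator() {
      init();
    }

    public void init() {
      buffer = new ArrayList<String>();
    }

    // called once per row of the group
    public boolean iterate(String value) {
      if (value != null) {
        buffer.add(value);
      }
      return true;
    }

    // partial aggregation result handed from the map side to the reduce side
    public ArrayList<String> terminatePartial() {
      return buffer;
    }

    // combine a partial result produced by terminatePartial()
    public boolean merge(ArrayList<String> other) {
      if (other != null) {
        buffer.addAll(other);
      }
      return true;
    }

    // final result for the group
    public ArrayList<String> terminate() {
      return buffer;
    }
  }
}

Unlike collect_set(), this keeps duplicates and the order in which rows reach iterate(), which is why the subquery above sorts by user_visiting_time within each user_id.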
Yong
 Subject: Re: need help on writing hive query
 From: matthewt...@gmail.com
 Date: Wed, 31 Oct 2012 17:53:06 -0400
 To: user@hive.apache.org
 
 I did a similar query a few months ago.  In short, I left-padded the page 
 name with the time stamp, grouped with collect_set, and then used 
 sort_array().  There was some other cleanup work and converting back to 
 string to remove the time stamps, but it remained in order.
 
 If there's an easier way, please let me know.
 
 Matt Tucker
 
 On Oct 31, 2012, at 5:37 PM, Tom Brown tombrow...@gmail.com wrote:
 
  It wouldn't retrieve the user's path in a single string, but you could
  simply select the user id and current page, ordered by the timestamp.
  
  It would require a second step to turn it into the single string path,
  so that might be a deal-breaker.
  
  --Tom
  
  On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans
  philip.j.trom...@gmail.com wrote:
  You could use collect_set() and GROUP BY. That wouldn't preserve order
  though.
  
  Phil.
  
  On Oct 31, 2012 9:18 PM, qiaoresearcher qiaoresearc...@gmail.com wrote:
  
  Hi all,
  
  here is the question. Assume we have a table like:
  
  --
  user_id||  user_visiting_time||  user_current_web_page ||
  user_previous_web_page
  user 1 time (1,1)   page 1
  page 0
  user 1 time (1,2)   page 2
  page 1
  user 1 time (1,3 )  page 3
  page 2
  .  ..
  
  user n time (n,1)   page 1
  page 0
  user n time (n,2)   page 2
  page 1
  user n time (n,3)   page 3
  page 2
  
  that is, in each row, we know the current web page that user is viewing,
  and we know the previous web page the user coming from
  
  now we want to generate a list for each user that recorded the complete
  path the user is taking:
  i.e., how can we use hive to generate output like:
  
  
  user 1 :  page 1   page 2 page 3  page 4  .. (till reach the
  beginning page of user 1)
  user 2:   page 1 page 2 page 3  page 4 page 5  ...  ( till reach
  the beginning page of user 2)
  the web pages viewed by user 1 and user 2 might be different.
  
  can we generate this using hive?
  
  thanks,
  

is it possible to disable running termiatePartial and merge() methods in UDAF

2012-10-01 Thread java8964 java8964

Hi,
I am trying to implement a UDAF for Kurtosis (http://en.wikipedia.org/wiki/Kurtosis) in hive.
I already found a library to do it, from Apache commons math (http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/moment/Kurtosis.html).

But it does NOT support merging partial results.
I am not a Math guru, so I don't know if this Kurtosis can be done in parallel 
or not.
My question is simple: if I use the above library to implement this UDAF in hive, is there a configuration in hive (either at runtime or at the UDAF level) to make sure that for the UDAF I implemented, terminatePartial/merge will NEVER be invoked?
I have 2 settings that I think may do the trick:
1) set hive.exec.reducers.max=1
2) set hive.map.aggr=false
Will either one do what I want? Or is there anything else? Or is it even possible?
Thanks
Yong  

RE: How can I get the constant value from the ObjectInspector in the UDF

2012-09-26 Thread java8964 java8964

I understand your message. But in this situation, I want to do the following:
1) I want to get the value 10 in the initialization stage. I understand your point that the value is only available in the evaluate stage, but keep in mind that this 10 in my example is a constant value. It won't change for every evaluation. It is the kind of value I should be able to get in the initialization stage, right? The hive query analyzer should understand that this parameter of the function is in fact a constant value, and should be able to provide it to me during the initialization stage.
2) A further question: can I get more information from the object inspector? For example, when I write the UDF, I want to make sure the first parameter is a numeric type. I can get the type, and I am able to validate the argument based on the type. But if I want to raise an error in some case, I want to show the end user the NAME of the parameter in my error message, instead of just its position.
For example, in the UDF call msum(column_name, 10), if I find out that the type of column_name is NOT a numeric type, I want the error message I give to the end user to say that 'column_name' should be a numeric type. But right now, in the API, I cannot get this information. The only thing I can get is the category type information, but I want more.
Is it possible to do that in hive 0.7.1?
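For reference, later Hive versions expose this through the ConstantObjectInspector interface: when the planner recognizes an argument as a constant, the object inspector passed to initialize() also carries the value. A sketch of how that looks (I have not confirmed whether this interface is available in 0.7.1, so treat that as an assumption; the class name is illustrative):

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.IntWritable;

public class GenericUDFMovingSum extends GenericUDF {
  private int windowSize;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (!(arguments[1] instanceof ConstantObjectInspector)) {
      throw new UDFArgumentException("the second argument of msum() must be a constant integer");
    }
    // The constant object inspector carries the literal value, so it is readable
    // here in the initialization stage.
    IntWritable w = (IntWritable) ((ConstantObjectInspector) arguments[1]).getWritableConstantValue();
    windowSize = w.get();
    // ... allocate the ring buffer of size windowSize here ...
    return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    // the actual moving-sum logic is omitted in this sketch
    return null;
  }

  @Override
  public String getDisplayString(String[] children) {
    return "msum(" + children[0] + ", " + children[1] + ")";
  }
}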
Thanks for your help.
Yong

Date: Thu, 27 Sep 2012 02:32:19 +0900
Subject: Re: How can I get the constant value from the ObjectInspector in the 
UDF
From: chen.song...@gmail.com
To: user@hive.apache.org

Hi Yong
The way GenericUDF works is as follows.
ObjectInspector initialize(ObjectInspector[] arguments) is called only once for 
one GenericUDF instance used in your Hive query. This phase is for preparation 
steps of UDF, such as syntax check and type inference.









Object evaluate(DeferredObject[] arguments) is called to evaluate against 
actual arguments. This should be where the actual calculation happens and where 
you can get the real values you talked about.

Thanks,Chen

On Wed, Sep 26, 2012 at 4:17 AM, java8964 java8964 java8...@hotmail.com wrote:





Hi, I am using Cloudera release cdh3u3, which has the hive 0.7.1 version.
I am trying to write a hive UDF function to calculate a moving sum. Right now, 
I am having trouble getting the constant value passed in at the initialization 
stage.

For example, let's assume the function is called in the following format:
msum(salary, 10) - salary is an int type column
which means the end user wants to calculate the sum over the last 10 rows of 
salary.

I kind of know how to implement this UDF. But I have one problem right now.
1) This is not a UDAF, as each row will return one value back as the moving 
sum.
2) I created a UDF class extending GenericUDF.
3) I can get the column type from the ObjectInspector[] passed to me in the 
initialize() method to verify that 'salary' and 10 both need to be numeric 
types (the latter needs to be an integer).
4) But I also want to get the real value of 10, in this case, at the 
initialize() stage, so I can create the corresponding data structure based on 
the value the end user specified here.
5) I looked around the javadoc of the ObjectInspector class. I know that at 
run time the real class of the 2nd parameter is WritableIntObjectInspector. I 
can get the type, but how can I get the real value of it?
6) This is a kind of constant ObjectInspector; it should be able to give the 
value to me, as it already knows the type is int. But how?
7) I don't want to try to get the value at the evaluate stage. Can I get this 
value at the initialize stage?
Thanks
Yong  


-- 
Chen Song



  

How can I get the constant value from the ObjectInspector in the UDF

2012-09-25 Thread java8964 java8964

Hi, I am using Cloudera release cdh3u3, which has the hive 0.7.1 version.
I am trying to write a hive UDF function to calculate a moving sum. Right now, 
I am having trouble getting the constant value passed in at the initialization 
stage.
For example, let's assume the function is called in the following format:
msum(salary, 10) - salary is an int type column
which means the end user wants to calculate the sum over the last 10 rows of 
salary.
I kind of know how to implement this UDF. But I have one problem right now.
1) This is not a UDAF, as each row will return one value back as the moving 
sum.
2) I created a UDF class extending GenericUDF.
3) I can get the column type from the ObjectInspector[] passed to me in the 
initialize() method to verify that 'salary' and 10 both need to be numeric 
types (the latter needs to be an integer).
4) But I also want to get the real value of 10, in this case, at the 
initialize() stage, so I can create the corresponding data structure based on 
the value the end user specified here.
5) I looked around the javadoc of the ObjectInspector class. I know that at 
run time the real class of the 2nd parameter is WritableIntObjectInspector. I 
can get the type, but how can I get the real value of it?
6) This is a kind of constant ObjectInspector; it should be able to give the 
value to me, as it already knows the type is int. But how?
7) I don't want to try to get the value at the evaluate stage. Can I get this 
value at the initialize stage?
Thanks
Yong  

Question about org.apache.hadoop.hive.contrib.serde2.RegexSerDe

2012-04-03 Thread java8964 java8964

Hi, 
I have a question about the behavior of the class 
org.apache.hadoop.hive.contrib.serde2.RegexSerDe. Here is the example I tested 
using the Cloudera hive-0.7.1-cdh3u3 release. The above class did NOT do what I 
expected; does anyone know the reason?
user:~/tmp> more Test.java
import java.io.*;
import java.text.*;

class Test {
    public static void main(String[] argv) throws Exception {
        String line = "aaa,\"bbb\",\"cc,c\"";
        String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        int i = 1;
        for (String t : tokens) {
            System.out.println(i + " " + t);
            i++;
        }
    }
}

user:~/tmp> java Test
1 aaa
2 "bbb"
3 "cc,c"
As you can see, the Java regular expression ,(?=([^\"]*\"[^\"]*\")*[^\"]*$) 
did what I want it to do: it parses the string aaa,"bbb","cc,c" into 3 tokens: 
(aaa), ("bbb"), and ("cc,c"). So the regular expression works fine.
Now in the hive:
:~> more test.txt
aaa,"bbb","cc,c"
:~> hive
Hive history file=/tmp/user/hive_job_log_user_201204031242_591028210.txt
hive> create table test (
        c1 string,
        c2 string,
        c3 string
      ) row format
      SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES ( "input.regex" = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)" )
      STORED AS TEXTFILE;
OK
Time taken: 0.401 seconds
hive> load data local inpath 'test.txt' overwrite into table test;
Copying data from file:/home/user/test.txt
Copying file: file:/home/user/test.txt
Loading data to table dev.test
Deleted hdfs://host/user/hive/warehouse/dev.db/test
OK
Time taken: 0.282 seconds
hive> select * from test;
OK
NULL    NULL    NULL
When I query this table, I don't get what I expected. I expected the output to 
be the 3 strings: (aaa), ("bbb"), and ("cc,c").
Why does the output give me 3 NULLs?
Thanks for your help.
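A likely explanation, offered as a guess rather than a confirmed answer: the 
contrib RegexSerDe does not split() each row on input.regex; it matches the 
whole line against the pattern and pulls column i out of capturing group i. A 
split-style lookahead pattern describes a lone comma, not the full row, and has 
no per-column groups, so nothing is extracted and every column comes back NULL. 
The standalone snippet below illustrates the difference; the match-style 
pattern in it is only an example, not a recommended production regex.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSerDeStyle {
    public static void main(String[] args) {
        String line = "aaa,\"bbb\",\"cc,c\"";

        // The split-style pattern from the question: fine for String.split(),
        // but it does not describe the whole row, so matches() is false.
        Pattern splitStyle = Pattern.compile(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        System.out.println("split-style matches the whole line? "
                + splitStyle.matcher(line).matches());

        // A match-style pattern with one capturing group per column, which is
        // the shape RegexSerDe expects for its input.regex.
        Pattern matchStyle = Pattern.compile("([^,]*),(\"[^\"]*\"),(\"[^\"]*\")");
        Matcher m = matchStyle.matcher(line);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                System.out.println("c" + i + " = " + m.group(i));
            }
        }
    }
}

If this is indeed the cause, an input.regex built from capturing groups that 
cover the entire line should make the three columns come back populated 
instead of NULL.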