Re: Regex and serde with hive

2011-12-23 Thread Mark Grover
Hi Ranjith,
Like Loren, I don't think the regex you are using is correct.

If you use a create table command like the following, it should work:
create external table
my_table(a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
"output.format.string" = "a:%1$s,b:%2$s,c:%3$s"
)
STORED AS TEXTFILE LOCATION 'my_location';

In general, it's good practice to test your regex before using it in the create 
table statement. Write a small test or use a website like this 
(http://www.regexplanet.com/simple/index.html) to test your regex on the data.
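That kind of quick check can be a few lines of script; a sketch in Python (note that RegexSerDe uses Java regex syntax, though this particular pattern behaves the same in both):

```python
import re

# The pattern the SerDe would use, against a sample line like "Xyz"~"qsd"~"1234"
pattern = r'"([^"]*)"~"([^"]*)"~"([^"]*)"'
sample = '"Xyz"~"qsd"~"1234"'

m = re.match(pattern, sample)
print(m.groups())  # → ('Xyz', 'qsd', '1234')
```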

Keep in mind that the Regex SerDe is not the most optimal SerDe in terms of 
performance, and at a later stage you might want to rethink which other 
SerDes you could use.

As far as using FIELDS TERMINATED BY goes, you could use something like 
FIELDS TERMINATED BY '~', but you will still have to take care of the 
double quote(s).
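If you go the delimited-table route, one option is to strip the quotes at query time with the built-in regexp_replace UDF; a sketch, where the table name raw_t and its columns are assumptions for illustration:

```sql
-- raw_t(f1, f2, f3) loaded with FIELDS TERMINATED BY '~';
-- each field still carries its surrounding double quotes.
SELECT regexp_replace(f1, '"', '') AS a,
       regexp_replace(f2, '"', '') AS b,
       regexp_replace(f3, '"', '') AS c
FROM raw_t;
```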

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

Best Trading Platform - World Finance's Forex Awards 2009.
The One to Watch - Treasury Today's Adam Smith Awards 2009.


- Original Message -
From: Loren Siebert lo...@siebert.org
To: user@hive.apache.org
Sent: Friday, December 23, 2011 2:27:15 AM
Subject: Re: Regex and serde with hive


The input regexp does not look right to me. You are expecting a space between 
groups, but your example contains no spaces. And where do you handle the 
first/last quotes? Wouldn’t it look more like this:
input.regex = "\"([^\"~]*)[\"~]*([^\"~]*)[\"~]*([^\"~]*)\""


Rather than trying to tackle it all at once, I find it easier to start with a 
table of one column and then build up from there until I have all my columns.


On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:





I have been struggling with this for a while, so I would appreciate any advice 
that any of you may have.

I have a file of the format

"Xyz"~"qsd"~"1234"

I created the following table definition to get the data loaded

CREATE TABLE dummy
(f1 string,
f2 string,
f3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^\"~]*) ([^\"~]*) ([^\"~]*)?",
"output.format.string" = "%1$s %2$s %3$s");

When I load the data in and try to perform a select, I get NULL values. Thanks 
again.

Thank you,
Ranjith



Efficient ways to parse xml from hive column(for selection/filters based on xml node values)

2011-12-23 Thread ravikumar visweswara
Hello All,

One of my Hive columns has text data in XML format. What are all the
efficient ways to parse the XML and query based on certain node values? Business
user select/filter query requirements are based on 6 or 7 nodes in the XML. Is
there any built-in support, or are there supporting libraries, for this in Hive?
I have used a SerDe for unstructured log parsing, but wanted to check the
most efficient way without writing specific UDFs to parse the XML.

Could some of you share your experiences and best practices?

Thanks and Regards
R


Re: Efficient ways to parse xml from hive column(for selection/filters based on xml node values)

2011-12-23 Thread Mark Grover
You might want to take a look at this:
https://cwiki.apache.org/Hive/languagemanual-xpathudf.html
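For a concrete feel for what that page describes, the built-in XPath UDFs can be called directly in a query; a sketch, with the table name, column name, and XML shape all assumed for illustration:

```sql
-- Assuming events(xml_col STRING) holds documents like
-- <rec><user>a</user><qty>3</qty></rec>.
SELECT xpath_string(xml_col, '/rec/user') AS user_name,
       xpath_int(xml_col, '/rec/qty')     AS qty
FROM events
WHERE xpath_string(xml_col, '/rec/user') = 'a';
```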


Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

Best Trading Platform - World Finance's Forex Awards 2009. 
The One to Watch - Treasury Today's Adam Smith Awards 2009. 




Cannot recognize input near..

2011-12-23 Thread Periya.Data
Hi,
 I am trying to run a simple query in my JDBC-Client java code:

=
// create mon_tbl - Monthly aggregated data
String mon_tblName = "month_tbl";
sql = "DROP TABLE IF EXISTS " + mon_tblName;
res = stmt.executeQuery(sql);
res = stmt.executeQuery("CREATE TABLE " + mon_tblName + " (ServerIP STRING, "
    + "Month INT, BytesSent BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

sql = "INSERT OVERWRITE " + mon_tblName + " SELECT ServerIP, month(StartTime), "
    + "SUM(BytesSent) from " + totallogs_tbl + " GROUP BY ServerIP, month(StartTime)";

res = stmt.executeQuery(sql);
=

*ERROR Message:*

Exception in thread "main" java.sql.SQLException: Query returned non-zero
code: 11, cause: FAILED: Parse Error: line 1:17 cannot recognize input near
'month_tbl' 'SELECT' 'ServerIP' in destination specification

at
org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:189)
at jtv_jdbc_client.main(jtv_jdbc_client.java:72)
root@ip-10-12-97-109:/home/users/jtv/hive_jdbc#
=

I checked the field names and made sure there are commas between the
attributes, checked the spelling, etc. Can we use built-in aggregate functions
(like SUM) and date functions (like month(StartTime))? To use them, should I do
anything special? I do not understand the error statement.

Please advise.

Thanks,
PD.
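For what it's worth, Hive's INSERT syntax requires the TABLE keyword before the destination table name, which is consistent with the parser stopping right at 'month_tbl' in the destination specification. A sketch of the corrected statement (the source table name totallogs stands in for whatever the totallogs_tbl variable holds, and is an illustration only):

```sql
-- The TABLE keyword is mandatory in the INSERT OVERWRITE destination spec.
INSERT OVERWRITE TABLE month_tbl
SELECT ServerIP, month(StartTime), SUM(BytesSent)
FROM totallogs
GROUP BY ServerIP, month(StartTime);
```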


Exception related to adding a jar in Hive

2011-12-23 Thread Mark Grover
Hi all,
I would really appreciate your thoughts and comments on this.

I am using Hive 0.7.1 and Hadoop 0.20.203 on Amazon Elastic MapReduce.

My .hiverc file looks like:
add jar /home/hadoop/hive/lib/hive-contrib-0.7.1.jar
add jar my_dir/my_jar.jar

However, with the "add jar my_dir/my_jar.jar" line in my .hiverc, the Hive CLI is 
unable to access the metastore.
When I do a "show tables", I get the following on my CLI:
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected 
exception caught.
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask

(See the complete exception from Hive log at the end of this email).

If I take out the 2nd add jar line, the Hive CLI works as expected. Afterwards, 
if I type the add jar command in the CLI (by typing "add jar my_dir/my_jar.jar"), it 
successfully gets added and the "show tables" command returns successfully. Also, 
this only started happening when I upgraded to Hadoop 0.20.203 from 0.20.2 
(the Hive version remains the same at 0.7.1). I am still able to use the .hiverc with 
the 2nd add jar line successfully with the older Hadoop version.

Does anyone have any thoughts on why this could be happening? Below is the 
complete exception from the Hive log. Thank you in advance!

NestedThrowables:
java.lang.reflect.InvocationTargetException
org.apache.hadoop.hive.ql.metadata.HiveException: 
javax.jdo.JDOFatalInternalException: Unexpected exception caught.
NestedThrowables:
java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1033)
at 
org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1018)
at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:1704)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:296)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:171)
at 
org.apache.hadoop.hive.cli.CliDriver.processLineInternal(CliDriver.java:253)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:234)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:487)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught.
NestedThrowables:
java.lang.reflect.InvocationTargetException
at 
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1186)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:234)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:261)
at 
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:196)
at 
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:171)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:354)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.executeWithRetry(HiveMetaStore.java:306)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:451)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:232)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:197)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:108)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:1914)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:1924)
at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1029)
... 17 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

RE: Regex and serde with hive

2011-12-23 Thread Raghunath, Ranjith
Thanks Mark, Loren, and Vijay for helping out with this. I did try the serde 
configuration and it worked well. I did come across the performance 
implications as I increased the volume so I went ahead and cleansed the data 
prior to inserting it. Thanks again. 

Thank you, 
Ranjith 
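For anyone taking the same pre-cleansing route, the idea can be sketched in a few lines (an illustration, not Ranjith's actual code; it assumes ~ never occurs inside a quoted field):

```python
# Turn lines like "Xyz"~"qsd"~"1234" into Xyz~qsd~1234 so a plain
# FIELDS TERMINATED BY '~' table can load them without a Regex SerDe.
def cleanse(line: str) -> str:
    fields = line.rstrip("\n").split("~")
    return "~".join(f.strip('"') for f in fields)

print(cleanse('"Xyz"~"qsd"~"1234"'))  # → Xyz~qsd~1234
```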
