Re: Regex and serde with hive

2011-12-23 Thread Mark Grover
Hi Ranjith,
Like Loren, I don't think the regex you are using is correct.

If you use a create table command like the following, it should work:
create external table
my_table(a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
input.regex = \([^\]*)\~\([^\]*)\~\([^\]*)\,
output.format.string = a:%1$s,b:%2$s,c:%3$s
)
STORED AS TEXTFILE LOCATION 'my_location';

In general, it's good practice to test your regex before using it in the create 
table statement. Write a small test or use a website like this 
(http://www.regexplanet.com/simple/index.html) to test your regex on the data.

Keep in mind that Regex SerDe is not the most optimal SerDe (in terms of 
performance) and you might at a later stage want to re-think of what other 
SerDe's you can use.

As far as using FIELDS TERMINATED BY goes, you could use something like 
FIELDS TERMINATED BY ~ (or ~), but you will still have to take care of the 
double quote(s).

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

Best Trading Platform - World Finance's Forex Awards 2009.
The One to Watch - Treasury Today's Adam Smith Awards 2009.


- Original Message -
From: Loren Siebert lo...@siebert.org
To: user@hive.apache.org
Sent: Friday, December 23, 2011 2:27:15 AM
Subject: Re: Regex and serde with hive


The input regexp does not look right to me. You are expecting a space between 
groups, but your example contains no spaces. And where do you handle the 
first/last quotes? Wouldn’t it look more like this:
input.regex = “\([^\~]*)[\~]*([^\~]*)[\~]*([^\~]*)\


Rather than trying to tackle it all at once, I find it easier to start with a 
table of one column and then build up from there until I have all my columns.


On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:





I have been struggling with this for a while so I would appreciate any advice 
that you any of you may have.

I have a file of the format

“Xyz”~”qsd”~”1234”

I created the following table definition to get the data loaded

CREATE TABLE dummy
(f1 string,
f2 string,
f3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
input.regex = ([^\~]*) ([^\~]*) ([^\~]*)?,
output.format.string = %1$s %2$s %3$s);

When I load the data in and try to perform a select get NULL values. Thanks 
again.

Thank you,
Ranjith



RE: Regex and serde with hive

2011-12-23 Thread Raghunath, Ranjith
Thanks Mark, Loren, and Vijay for helping out with this. I did try the serde 
configuration and it worked well. I did come across the performance 
implications as I increased the volume so I went ahead and cleansed the data 
prior to inserting it. Thanks again. 

Thank you, 
Ranjith 

-Original Message-
From: Mark Grover [mailto:mgro...@oanda.com] 
Sent: Friday, December 23, 2011 9:29 AM
To: user@hive.apache.org
Subject: Re: Regex and serde with hive

Hi Ranjith,
Like Loren, I don't think the regex you are using is correct.

If you use a create table command like the following, it should work:
create external table
my_table(a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
input.regex = \([^\]*)\~\([^\]*)\~\([^\]*)\,
output.format.string = a:%1$s,b:%2$s,c:%3$s
)
STORED AS TEXTFILE LOCATION 'my_location';

In general, it's good practice to test your regex before using it in the create 
table statement. Write a small test or use a website like this 
(http://www.regexplanet.com/simple/index.html) to test your regex on the data.

Keep in mind that Regex SerDe is not the most optimal SerDe (in terms of 
performance) and you might at a later stage want to re-think of what other 
SerDe's you can use.

As far as using FIELDS TERMINATED BY goes, you could use something like 
FIELDS TERMINATED BY ~ (or ~), but you will still have to take care of the 
double quote(s).

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

Best Trading Platform - World Finance's Forex Awards 2009.
The One to Watch - Treasury Today's Adam Smith Awards 2009.


- Original Message -
From: Loren Siebert lo...@siebert.org
To: user@hive.apache.org
Sent: Friday, December 23, 2011 2:27:15 AM
Subject: Re: Regex and serde with hive


The input regexp does not look right to me. You are expecting a space between 
groups, but your example contains no spaces. And where do you handle the 
first/last quotes? Wouldn’t it look more like this:
input.regex = “\([^\~]*)[\~]*([^\~]*)[\~]*([^\~]*)\


Rather than trying to tackle it all at once, I find it easier to start with a 
table of one column and then build up from there until I have all my columns.


On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:





I have been struggling with this for a while so I would appreciate any advice 
that you any of you may have.

I have a file of the format

“Xyz”~”qsd”~”1234”

I created the following table definition to get the data loaded

CREATE TABLE dummy
(f1 string,
f2 string,
f3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
input.regex = ([^\~]*) ([^\~]*) ([^\~]*)?,
output.format.string = %1$s %2$s %3$s);

When I load the data in and try to perform a select get NULL values. Thanks 
again.

Thank you,
Ranjith



Re: Regex and serde with hive

2011-12-22 Thread Loren Siebert
The input regexp does not look right to me. You are expecting a space between 
groups, but your example contains no spaces. And where do you handle the 
first/last quotes? Wouldn’t it look more like this:
input.regex = “\([^\~]*)[\~]*([^\~]*)[\~]*([^\~]*)\

Rather than trying to tackle it all at once, I find it easier to start with a 
table of one column and then build up from there until I have all my columns.

On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:

 I have been struggling with this for a while so I would appreciate any advice 
 that you any of you may have.
  
 I have a file of the format
  
 “Xyz”~”qsd”~”1234”
  
 I created the following table definition to get the data loaded
  
 CREATE TABLE dummy
 (f1   string,
   f2string,
   f3 string)
 ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
 WITH SERDEPROPERTIES  (
 input.regex = ([^\~]*) ([^\~]*) ([^\~]*)?,
 output.format.string = %1$s %2$s %3$s);
  
 When I load the data in and try to perform a select get NULL values. Thanks 
 again.
 Thank you,
 Ranjith
  



Re: Regex and serde with hive

2011-12-22 Thread Vijay
If the format is simply delimited like this, you don't need to use the
RegexSerde. Hive's default format with the right FIELDS TERMINATED
BY setting will work great.

-Vijay

On Thu, Dec 22, 2011 at 8:49 PM, Raghunath, Ranjith
ranjith.raghuna...@usaa.com wrote:
 I have been struggling with this for a while so I would appreciate any
 advice that you any of you may have.



 I have a file of the format



 “Xyz”~”qsd”~”1234”



 I created the following table definition to get the data loaded



 CREATE TABLE dummy

 (f1   string,

   f2    string,

   f3 string)

 ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'

 WITH SERDEPROPERTIES  (

 input.regex = ([^\~]*) ([^\~]*) ([^\~]*)?,

 output.format.string = %1$s %2$s %3$s);



 When I load the data in and try to perform a select get NULL values. Thanks
 again.

 Thank you,

 Ranjith