Re: Regex and serde with hive
Hi Ranjith, Like Loren, I don't think the regex you are using is correct. If you use a create table command like the following, it should work: create external table my_table(a STRING, b STRING, c STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( input.regex = \([^\]*)\~\([^\]*)\~\([^\]*)\, output.format.string = a:%1$s,b:%2$s,c:%3$s ) STORED AS TEXTFILE LOCATION 'my_location'; In general, it's good practice to test your regex before using it in the create table statement. Write a small test or use a website like this (http://www.regexplanet.com/simple/index.html) to test your regex on the data. Keep in mind that Regex SerDe is not the most optimal SerDe (in terms of performance) and you might at a later stage want to re-think of what other SerDe's you can use. As far as using FIELDS TERMINATED BY goes, you could use something like FIELDS TERMINATED BY ~ (or ~), but you will still have to take care of the double quote(s). Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: mgro...@oanda.com Best Trading Platform - World Finance's Forex Awards 2009. The One to Watch - Treasury Today's Adam Smith Awards 2009. - Original Message - From: Loren Siebert lo...@siebert.org To: user@hive.apache.org Sent: Friday, December 23, 2011 2:27:15 AM Subject: Re: Regex and serde with hive The input regexp does not look right to me. You are expecting a space between groups, but your example contains no spaces. And where do you handle the first/last quotes? Wouldn’t it look more like this: input.regex = “\([^\~]*)[\~]*([^\~]*)[\~]*([^\~]*)\ Rather than trying to tackle it all at once, I find it easier to start with a table of one column and then build up from there until I have all my columns. On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote: I have been struggling with this for a while so I would appreciate any advice that you any of you may have. I have a file of the format “Xyz”~”qsd”~”1234” I created the following table definition to get the data loaded CREATE TABLE dummy (f1 string, f2 string, f3 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( input.regex = ([^\~]*) ([^\~]*) ([^\~]*)?, output.format.string = %1$s %2$s %3$s); When I load the data in and try to perform a select get NULL values. Thanks again. Thank you, Ranjith
RE: Regex and serde with hive
Thanks Mark, Loren, and Vijay for helping out with this. I did try the serde configuration and it worked well. I did come across the performance implications as I increased the volume so I went ahead and cleansed the data prior to inserting it. Thanks again. Thank you, Ranjith -Original Message- From: Mark Grover [mailto:mgro...@oanda.com] Sent: Friday, December 23, 2011 9:29 AM To: user@hive.apache.org Subject: Re: Regex and serde with hive Hi Ranjith, Like Loren, I don't think the regex you are using is correct. If you use a create table command like the following, it should work: create external table my_table(a STRING, b STRING, c STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( input.regex = \([^\]*)\~\([^\]*)\~\([^\]*)\, output.format.string = a:%1$s,b:%2$s,c:%3$s ) STORED AS TEXTFILE LOCATION 'my_location'; In general, it's good practice to test your regex before using it in the create table statement. Write a small test or use a website like this (http://www.regexplanet.com/simple/index.html) to test your regex on the data. Keep in mind that Regex SerDe is not the most optimal SerDe (in terms of performance) and you might at a later stage want to re-think of what other SerDe's you can use. As far as using FIELDS TERMINATED BY goes, you could use something like FIELDS TERMINATED BY ~ (or ~), but you will still have to take care of the double quote(s). Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: mgro...@oanda.com Best Trading Platform - World Finance's Forex Awards 2009. The One to Watch - Treasury Today's Adam Smith Awards 2009. - Original Message - From: Loren Siebert lo...@siebert.org To: user@hive.apache.org Sent: Friday, December 23, 2011 2:27:15 AM Subject: Re: Regex and serde with hive The input regexp does not look right to me. You are expecting a space between groups, but your example contains no spaces. And where do you handle the first/last quotes? Wouldn’t it look more like this: input.regex = “\([^\~]*)[\~]*([^\~]*)[\~]*([^\~]*)\ Rather than trying to tackle it all at once, I find it easier to start with a table of one column and then build up from there until I have all my columns. On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote: I have been struggling with this for a while so I would appreciate any advice that you any of you may have. I have a file of the format “Xyz”~”qsd”~”1234” I created the following table definition to get the data loaded CREATE TABLE dummy (f1 string, f2 string, f3 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( input.regex = ([^\~]*) ([^\~]*) ([^\~]*)?, output.format.string = %1$s %2$s %3$s); When I load the data in and try to perform a select get NULL values. Thanks again. Thank you, Ranjith
Re: Regex and serde with hive
The input regexp does not look right to me. You are expecting a space between groups, but your example contains no spaces. And where do you handle the first/last quotes? Wouldn’t it look more like this: input.regex = “\([^\~]*)[\~]*([^\~]*)[\~]*([^\~]*)\ Rather than trying to tackle it all at once, I find it easier to start with a table of one column and then build up from there until I have all my columns. On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote: I have been struggling with this for a while so I would appreciate any advice that you any of you may have. I have a file of the format “Xyz”~”qsd”~”1234” I created the following table definition to get the data loaded CREATE TABLE dummy (f1 string, f2string, f3 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( input.regex = ([^\~]*) ([^\~]*) ([^\~]*)?, output.format.string = %1$s %2$s %3$s); When I load the data in and try to perform a select get NULL values. Thanks again. Thank you, Ranjith
Re: Regex and serde with hive
If the format is simply delimited like this, you don't need to use the RegexSerde. Hive's default format with the right FIELDS TERMINATED BY setting will work great. -Vijay On Thu, Dec 22, 2011 at 8:49 PM, Raghunath, Ranjith ranjith.raghuna...@usaa.com wrote: I have been struggling with this for a while so I would appreciate any advice that you any of you may have. I have a file of the format “Xyz”~”qsd”~”1234” I created the following table definition to get the data loaded CREATE TABLE dummy (f1 string, f2 string, f3 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( input.regex = ([^\~]*) ([^\~]*) ([^\~]*)?, output.format.string = %1$s %2$s %3$s); When I load the data in and try to perform a select get NULL values. Thanks again. Thank you, Ranjith