RE: Regex and serde with hive
Thanks Mark, Loren, and Vijay for helping out with this. I did try the serde configuration and it worked well. I did run into the performance implications as I increased the volume, so I went ahead and cleansed the data prior to inserting it. Thanks again.

Thank you,
Ranjith

-----Original Message-----
From: Mark Grover [mailto:mgro...@oanda.com]
Sent: Friday, December 23, 2011 9:29 AM
To: user@hive.apache.org
Subject: Re: Regex and serde with hive

Hi Ranjith,

Like Loren, I don't think the regex you are using is correct. If you use a create table command like the following, it should work:

create external table my_table(a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
  "output.format.string" = "a:%1$s,b:%2$s,c:%3$s"
)
STORED AS TEXTFILE
LOCATION 'my_location';

In general, it's good practice to test your regex before using it in the create table statement. Write a small test or use a website like http://www.regexplanet.com/simple/index.html to test your regex on the data. Keep in mind that the Regex SerDe is not the most optimal SerDe (in terms of performance), and you might at a later stage want to rethink what other SerDes you can use.

As far as using "FIELDS TERMINATED BY" goes, you could use something like FIELDS TERMINATED BY ~ (or "~"), but you will still have to take care of the double quote(s).

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com
e: mgro...@oanda.com
"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "Loren Siebert"
To: user@hive.apache.org
Sent: Friday, December 23, 2011 2:27:15 AM
Subject: Re: Regex and serde with hive

The input regexp does not look right to me. You are expecting a space between groups, but your example contains no spaces.
And where do you handle the first/last quotes? Wouldn't it look more like this:

"input.regex" = "\"([^\"~]*)[\"~]*([^\"~]*)[\"~]*([^\"~]*)\""

Rather than trying to tackle it all at once, I find it easier to start with a table of one column and then build up from there until I have all my columns.

On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:

I have been struggling with this for a while, so I would appreciate any advice that any of you may have. I have a file of the format:

"Xyz"~"qsd"~"1234"

I created the following table definition to get the data loaded:

CREATE TABLE dummy (f1 string, f2 string, f3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^\"~]*) ([^\"~]*) ([^\"~]*)?",
  "output.format.string" = "%1$s %2$s %3$s");

When I load the data in and try to perform a select, I get NULL values. Thanks again.

Thank you,
Ranjith
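Mark's suggestion above, testing the regex before putting it in the CREATE TABLE statement, can also be done in a few lines outside Hive. A minimal sketch using Python's re module against the sample row Ranjith posted (the pattern is the one Mark proposed, unescaped from its Java-string form):

```python
import re

# Mark's RegexSerDe pattern as a raw Python string; the backslashes in the
# SERDEPROPERTIES value only escape the quotes inside the Java string literal.
pattern = re.compile(r'"([^"]*)"~"([^"]*)"~"([^"]*)"')

sample = '"Xyz"~"qsd"~"1234"'
match = pattern.fullmatch(sample)
print(match.groups())  # ('Xyz', 'qsd', '1234')
```

Each capturing group maps to one column of the table, which is why the three groups must match the three declared columns.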
Exception related to adding a jar in Hive
Hi all,

I would really appreciate your thoughts and comments on this. I am using Hive 0.7.1 and Hadoop 0.20.203 on Amazon Elastic MapReduce. My .hiverc file looks like:

add jar /home/hadoop/hive/lib/hive-contrib-0.7.1.jar
add jar my_dir/my_jar.jar

However, with the "add jar my_dir/my_jar.jar" line in my .hiverc, the Hive CLI is unable to access the metastore. When I do "show tables", I get the following on my CLI:

FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught.
NestedThrowables: java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

(See the complete exception from the Hive log at the end of this email.)

If I take out the 2nd add jar line, the Hive CLI works as expected. Afterwards, if I type the add jar command in the CLI (by typing add jar my_dir/my_jar.jar), it successfully gets added and the show tables command returns successfully. Also, this only started happening when I upgraded to Hadoop 0.20.203 from 0.20.2 (the Hive version remains the same at 0.7.1). I am still able to use the .hiverc with the 2nd add jar line successfully with the older Hadoop version. Does anyone have any thoughts on why this could be happening? Below is the complete exception from the Hive log. Thank you in advance!

org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOFatalInternalException: Unexpected exception caught.
NestedThrowables: java.lang.reflect.InvocationTargetException
	at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1033)
	at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1018)
	at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:1704)
	at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:296)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:130)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:171)
	at org.apache.hadoop.hive.cli.CliDriver.processLineInternal(CliDriver.java:253)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:234)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:487)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught.
NestedThrowables: java.lang.reflect.InvocationTargetException
	at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1186)
	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803)
	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:234)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:261)
	at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:196)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:171)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:354)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.executeWithRetry(HiveMetaStore.java:306)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:451)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:232)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:197)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:108)
	at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:1914)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:1924)
	at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1029)
	... 17 more
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at s
Cannot recognize input near..
Hi,

I am trying to run a simple query in my JDBC-client Java code:

// create mon_tbl - monthly aggregated data
String mon_tblName = "month_tbl";
sql = "DROP TABLE IF EXISTS " + mon_tblName;
res = stmt.executeQuery(sql);
res = stmt.executeQuery("CREATE TABLE " + mon_tblName + " (ServerIP STRING, Month INT, BytesSent BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
sql = "INSERT OVERWRITE " + mon_tblName + " SELECT ServerIP, month(StartTime), SUM(BytesSent) from " + totallogs_tbl + " GROUP BY ServerIP, month(StartTime)";
res = stmt.executeQuery(sql);

ERROR message:

Exception in thread "main" java.sql.SQLException: Query returned non-zero code: 11, cause: FAILED: Parse Error: line 1:17 cannot recognize input near 'month_tbl' 'SELECT' 'ServerIP' in destination specification
	at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:189)
	at jtv_jdbc_client.main(jtv_jdbc_client.java:72)

I checked the field names, made sure there are commas between the attributes, spell-checked, etc. Can we use built-in aggregate functions (like SUM) and date functions (like month(StartTime))? To use them, should I do anything special? I do not understand the error statement. Please advise.

Thanks,
PD.
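For what it's worth, the parse error points at the destination specification: Hive's DML syntax requires the TABLE keyword there, i.e. INSERT OVERWRITE TABLE <name> SELECT ... (the built-in SUM and month() functions themselves need nothing special). A sketch of how the corrected statement could be assembled, with the source table name as a placeholder since it is built elsewhere in the code:

```python
mon_tblName = "month_tbl"
totallogs_tbl = "total_logs"  # placeholder; the real name comes from elsewhere in the code

# Hive expects "INSERT OVERWRITE TABLE <name>", not "INSERT OVERWRITE <name>".
sql = ("INSERT OVERWRITE TABLE " + mon_tblName +
       " SELECT ServerIP, month(StartTime), SUM(BytesSent) FROM " + totallogs_tbl +
       " GROUP BY ServerIP, month(StartTime)")
print(sql)
```

With the TABLE keyword in place, the parser no longer trips over 'month_tbl' 'SELECT' in the destination clause.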
Re: Efficient ways to parse xml from hive column(for selection/filters based on xml node values)
You might want to take a look at this: https://cwiki.apache.org/Hive/languagemanual-xpathudf.html

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com
e: mgro...@oanda.com
"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "ravikumar visweswara"
To: user@hive.apache.org
Sent: Friday, December 23, 2011 10:35:59 AM
Subject: Efficient ways to parse xml from hive column(for selection/filters based on xml node values)

Hello All,

One of my hive columns has text data in XML format. What are the most efficient ways to parse the XML and query based on certain node values? Business-user select/filter query requirements are based on 6 or 7 nodes in the XML. Is there any built-in support or are there supporting libraries for this in Hive? I have used a SerDe for unstructured log parsing, but wanted to check the most efficient way without writing specific UDFs to parse the XML. Could some of you share your experiences and best practices?

Thanks and Regards
R
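The xpath UDFs on that wiki page (xpath, xpath_string, xpath_int, and friends) evaluate an XPath expression against an XML string column, which matches the select/filter-by-node-value requirement without a custom UDF. The idea can be prototyped outside Hive with Python's standard library; the XML shape and node names below are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical value of an XML-bearing Hive column.
row = "<event><user>alice</user><status>ok</status></event>"

root = ET.fromstring(row)
# Roughly what xpath_string(col, 'event/user') would extract in Hive.
print(root.findtext("user"))    # alice
print(root.findtext("status"))  # ok
```

In Hive the equivalent filter would go straight into the WHERE clause, e.g. filtering rows on the extracted node value.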
Efficient ways to parse xml from hive column(for selection/filters based on xml node values)
Hello All,

One of my hive columns has text data in XML format. What are the most efficient ways to parse the XML and query based on certain node values? Business-user select/filter query requirements are based on 6 or 7 nodes in the XML. Is there any built-in support or are there supporting libraries for this in Hive? I have used a SerDe for unstructured log parsing, but wanted to check the most efficient way without writing specific UDFs to parse the XML. Could some of you share your experiences and best practices?

Thanks and Regards
R
Re: Regex and serde with hive
Hi Ranjith,

Like Loren, I don't think the regex you are using is correct. If you use a create table command like the following, it should work:

create external table my_table(a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
  "output.format.string" = "a:%1$s,b:%2$s,c:%3$s"
)
STORED AS TEXTFILE
LOCATION 'my_location';

In general, it's good practice to test your regex before using it in the create table statement. Write a small test or use a website like http://www.regexplanet.com/simple/index.html to test your regex on the data. Keep in mind that the Regex SerDe is not the most optimal SerDe (in terms of performance), and you might at a later stage want to rethink what other SerDes you can use.

As far as using "FIELDS TERMINATED BY" goes, you could use something like FIELDS TERMINATED BY ~ (or "~"), but you will still have to take care of the double quote(s).

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com
e: mgro...@oanda.com
"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "Loren Siebert"
To: user@hive.apache.org
Sent: Friday, December 23, 2011 2:27:15 AM
Subject: Re: Regex and serde with hive

The input regexp does not look right to me. You are expecting a space between groups, but your example contains no spaces. And where do you handle the first/last quotes? Wouldn't it look more like this:

"input.regex" = "\"([^\"~]*)[\"~]*([^\"~]*)[\"~]*([^\"~]*)\""

Rather than trying to tackle it all at once, I find it easier to start with a table of one column and then build up from there until I have all my columns.

On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:

I have been struggling with this for a while, so I would appreciate any advice that any of you may have. I have a file of the format:

"Xyz"~"qsd"~"1234"

I created the following table definition to get the data loaded:

CREATE TABLE dummy (f1 string, f2 string, f3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^\"~]*) ([^\"~]*) ([^\"~]*)?",
  "output.format.string" = "%1$s %2$s %3$s");

When I load the data in and try to perform a select, I get NULL values. Thanks again.

Thank you,
Ranjith