Hive crashing after an upgrade - issue with existing larger tables
Hi experts,

I was working with large-volume data on Hive 0.7. Recently my Hive installation was upgraded to 0.7.1. Since the upgrade I'm having a lot of issues with queries that previously worked fine on large data. Queries that took seconds to return results now take hours, and for most large tables the MapReduce jobs are not even triggered. Queries like SELECT * and DESCRIBE work fine since they don't involve MapReduce jobs.

For the jobs that didn't even get triggered, I got the following error from the JobTracker:

Job initialization failed: java.io.IOException: Split metadata size exceeded 1000. Aborting job job_201106061630_6993
    at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
    at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:807)
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:701)
    at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4013)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

It looks like some metadata issue. My cluster is on CDH3-u0. Has anyone faced similar issues before? Please share your thoughts on the probable cause of this error.

Thank you
Re: Hive crashing after an upgrade - issue with existing larger tables
A small correction to my previous post: the CDH version is CDH3u1, not u0. Sorry for the confusion.

Regards,
Bejoy K S
Re: Hive crashing after an upgrade - issue with existing larger tables
Hi,

The original CDH3u1 release of Hive contained a configuration bug which we recently fixed in an update. You can get the update by refreshing your Hive packages. Afterwards, please verify that you are using the following Hive package: hive-0.7.1+42.9

You can also fix the problem by modifying your hive-site.xml file to include the following setting: mapred.max.split.size=25600

Thanks,
Carl
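Carl's hive-site.xml workaround above, expressed as a property block (the name and value are taken from his message; whether 25600 is right for your cluster depends on your data sizes):

```xml
<!-- Workaround from Carl's reply; verify the value suits your cluster -->
<property>
  <name>mapred.max.split.size</name>
  <value>25600</value>
  <description>Upper bound, in bytes, on the size of an input split.</description>
</property>
```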
Re: Hive DDL issue
Hive does not work on Cygwin.

On Wed, Aug 17, 2011 at 3:38 PM, Siddharth Tiwari siddharth.tiw...@live.com wrote:

Encountering the following issue, please help. On Cygwin (Windows):

hive> show tables;
FAILED: Hive Internal Error: java.lang.IllegalArgumentException(java.net.URISyntaxException: Relative path in absolute URI: file:C:/cygwin/tmp//siddharth/hive_2011-08-18_04-08-25_850_5502285238716420526)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/cygwin/tmp//siddharth/hive_2011-08-18_04-08-25_850_5502285238716420526
    at org.apache.hadoop.fs.Path.initialize(Path.java:140)
    at org.apache.hadoop.fs.Path.<init>(Path.java:132)
    at org.apache.hadoop.hive.ql.Context.getScratchDir(Context.java:142)
    at org.apache.hadoop.hive.ql.Context.getLocalScratchDir(Context.java:168)
    at org.apache.hadoop.hive.ql.Context.getLocalTmpFileURI(Context.java:282)
    at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:205)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:238)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:340)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:736)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:164)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:C:/cygwin/tmp//siddharth/hive_2011-08-18_04-08-25_850_5502285238716420526
    at java.net.URI.checkPath(URI.java:1787)
    at java.net.URI.<init>(URI.java:735)
    at org.apache.hadoop.fs.Path.initialize(Path.java:137)
    ... 16 more

Cheers!
Siddharth Tiwari
RE: Hive DDL issue
Hey Carl,

Isn't there any way to enable it? If not, what is this error about? What is the problem?

Cheers!
Siddharth Tiwari

From: c...@cloudera.com
Date: Thu, 18 Aug 2011 11:34:03 -0700
Subject: Re: Hive DDL issue
To: user@hive.apache.org

Hive does not work on Cygwin.
Re: Hive DDL issue
Adding to what Ed said, we don't run regression tests on Cygwin, so Hive on Cygwin is de facto unmaintained.

On Thu, Aug 18, 2011 at 12:37 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

It did work with Cygwin at one point, but since it is rarely used in that environment it is not well supported. Your best bet is QEMU or VMware emulating a Linux environment.
RE: Hive DDL issue
Okay Ed and Carl, I get the point. The only thing that bothered me was whether it would be able to run on Cygwin, and what actually was wrong.

Cheers!
Siddharth Tiwari
Ignore subdirectories when querying external table
Hi,

I have a partitioned external table in Hive, and the partition directories contain other subdirectories that are not related to the table itself. Hive seems to want to scan those directories, as I get an error message when trying to do a SELECT on the table:

Failed with exception java.io.IOException:java.io.IOException: Not a file: hdfs://path/to/partition/path/to/subdir

It does, however, seem to ignore directories prefixed by an underscore (_directory). I am using Hive 0.7.1 on Hadoop 0.20.2. Is there a way to force Hive to ignore all subdirectories in external tables and only look at files?

Thanks in advance,
-Dave
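One workaround consistent with the behaviour Dave observed (Hadoop's default input filter skips path names beginning with '_' or '.'): rename the unrelated subdirectories with an underscore prefix so the input format never considers them. A sketch with hypothetical paths, not a guaranteed fix for every input format:

```
# Hypothetical table and partition paths; names starting with '_' are
# filtered out by FileInputFormat's default hidden-path filter.
hadoop fs -mv /user/hive/warehouse/mytable/dt=2011-08-18/extras \
              /user/hive/warehouse/mytable/dt=2011-08-18/_extras
```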
Re: Setting up stats database
Maybe you should use the 'hive.stats.jdbcdriver=org.apache.mysql.jdbc.EmbeddedDriver' setting? Via http://mail-archives.apache.org/mod_mbox/hive-user/201103.mbox/%3c42360b00-72ec-437a-9d95-93f3ad9f1...@fb.com%3E

On Fri, Aug 19, 2011 at 5:45 AM, bharath vissapragada bharathvissapragada1...@gmail.com wrote:

Hi, I am also getting the same error, though I am using MySQL for stats. The thing is, I configured MySQL for the metastore and it works fine: all the metadata gets populated normally. If the metastore classes can find the MySQL jar on the classpath, why can't the stats publisher find it? I looked at the stats source and everything looks fine. My connection string is: jdbc:mysql://ip:3306/TempStatsStore&user=name&password=pwd. Am I missing something? Thanks

On Thu, Aug 18, 2011 at 8:19 AM, wd w...@wdicc.com wrote:

The error in the log is 'java.lang.ClassNotFoundException: org.postgresql.Driver', not a connection, username, or password error.

On Wed, Aug 17, 2011 at 3:53 PM, Jander g jande...@gmail.com wrote:

Hi wd, you should configure hive.stats.dbconnectionstring as follows:

<property>
  <name>hive.stats.dbconnectionstring</name>
  <value>jdbc:postgresql://localhost/hive_statsdb?createDatabaseIfNotExist=true&amp;user=hive&amp;password=pwd</value>
  <description>The default connection string for the database that stores temporary hive statistics.</description>
</property>

Regards, Jander.

On Mon, Aug 15, 2011 at 3:09 PM, wd w...@wdicc.com wrote:

Hi, I'm trying to use Postgres as the stats database, and made the following settings in hive-site.xml:

<property>
  <name>hive.stats.dbclass</name>
  <value>jdbc:postgresql</value>
  <description>The default database that stores temporary hive statistics.</description>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>true</value>
  <description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
</property>
<property>
  <name>hive.stats.jdbcdriver</name>
  <value>org.postgresql.Driver</value>
  <description>The JDBC driver for the database that stores temporary hive statistics.</description>
</property>
<property>
  <name>hive.stats.dbconnectionstring</name>
  <value>jdbc:postgresql://localhost/hive_statsdb?createDatabaseIfNotExist=true;user=hive;password=pwd</value>
  <description>The default connection string for the database that stores temporary hive statistics.</description>
</property>

I use Postgres as the Hive meta database, so there is a postgresql-9.0-801.jdbc4.jar file in lib. After running 'analyze table t1 partition(dt) compute statistics;' in the Hive CLI, it outputs some stats info in the CLI, but nothing in the db. And I found the following errors:

2011-08-15 14:54:54,767 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: Stats Gathering found a new partition spec = dt=20110805
2011-08-15 14:54:54,767 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 1 rows
2011-08-15 14:54:54,767 INFO ExecMapper: ExecMapper: processing 1 rows: used memory = 39953640
2011-08-15 14:54:54,768 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 1 finished. closing...
2011-08-15 14:54:54,768 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 1 forwarded 2 rows
2011-08-15 14:54:54,768 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2011-08-15 14:54:54,768 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 finished. closing...
2011-08-15 14:54:54,768 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarded 2 rows
2011-08-15 14:54:54,772 ERROR org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher: Error during JDBC connection to jdbc:postgresql://localhost/hive_statsdb?createDatabaseIfNotExist=true;user=hive;password=pwd.
java.lang.ClassNotFoundException: org.postgresql.Driver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:169)
    at org.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher.connect(JDBCStatsPublisher.java:55)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.publishStats(TableScanOperator.java:202)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.closeOp(TableScanOperator.java:164)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:557)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:566)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
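Note that the stack trace shows the ClassNotFoundException is raised inside the MapReduce task (JDBCStatsPublisher.connect is called from TableScanOperator via ExecMapper), not on the client. That would explain why the metastore can load the driver while the stats publisher cannot: the JDBC jar also has to reach the task classpath. One hedged way to do that is the hive.aux.jars.path setting; the jar path below is an example, adjust it to where the driver actually lives on your nodes:

```xml
<!-- Example path; aux jars are added to the classpath of MapReduce tasks -->
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hive/lib/postgresql-9.0-801.jdbc4.jar</value>
</property>
```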
How to skip the malformatted records while loading data
Hi everyone,

Is there an option to ignore malformatted records while loading data into a Hive table? Or an option to ignore bad rows while querying data? For instance:

1. Specify a row format explicitly for a new table:

hive> create table tb (id int, pref string, zip string) row format delimited fields terminated by ',' lines terminated by '\n';

2. Load data into the table from a CSV file with bad records:

hive> load data local inpath 'data.csv' overwrite into table tb;

The data.csv might look like:

32,aaa,422
              -- blank line
33:bbb:423    -- invalid field delimiter ':'
aa,ccc,424    -- non-int value 'aa'

3. Select the data:

hive> select * from tb;
OK
32      aaa     422
NULL    NULL    NULL
NULL    NULL    NULL
NULL    ccc     424
Time taken: 0.196 seconds

I have tried setting mapred.skip.map.max.skip.records, but it seems not to work.

Thanks in advance.

Regards,
Xie Xianshan
Dept. IV of Technology and Development
Nanjing Fujitsu Nanda Software Tech. Co., Ltd. (FNST)
No. 6 Wenzhu Road, Nanjing, China
PostCode: 210012
PHONE: +86-25-86630566-8522
FUJITSU INTERNAL: 7998-8522
MAIL: xi...@cn.fujitsu.com

This communication is for use by the intended recipient(s) only and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If you are not an intended recipient of this communication, you are hereby notified that any dissemination, distribution or copying hereof is strictly prohibited. If you have received this communication in error, please notify me by reply e-mail, permanently delete this communication from your system, and destroy any hard copies you may have printed.
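Since Hive maps unparseable fields to NULL at read time (as the SELECT output above shows), one common workaround, a sketch rather than an official "skip" option, is to filter on a column that should never legitimately be NULL, or to re-materialize a cleaned copy of the table:

```sql
-- Assumes id is never legitimately NULL; rows where id failed to parse are dropped
CREATE TABLE tb_clean AS
SELECT * FROM tb WHERE id IS NOT NULL;
```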
Re: Setting up stats database[SOLVED]
Hi,

I solved this by placing the jar in ${java_home}/jre/lib and ${java_home}/jre/lib/ext. This is the workaround whenever JDBC drivers won't load, and the same thing worked here too (I hope it works with your Postgres as well). I am still wondering why Hive didn't recognize it on the classpath.

Also, there is a parsing problem with my connection string: it gets terminated at the ';' in jdbc:mysql://ip:3306/TempStatsStore&user=name&password=pwd. I got it to work by adding two properties, stats.username and stats.password (just like the metastore db user and password), and replacing conn = DriverManager.getConnection(connectionString) with conn = DriverManager.getConnection(connectionString, uname, pwd), reading them from the Conf variable inside the JDBCStatsPublisher class. Is this worth filing a JIRA, or am I the only one facing this problem?

Thanks

On Fri, Aug 19, 2011 at 8:05 AM, wd w...@wdicc.com wrote:

Maybe you should use the 'hive.stats.jdbcdriver=org.apache.mysql.jdbc.EmbeddedDriver' setting?
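Bharath's patch above implies two extra hive-site.xml entries. These property names are his own additions and hypothetical (not part of stock Hive 0.7); they are shown here only to illustrate the shape of the workaround, and only take effect with his modified JDBCStatsPublisher:

```xml
<!-- Hypothetical properties read by the patched JDBCStatsPublisher -->
<property>
  <name>stats.username</name>
  <value>name</value>
</property>
<property>
  <name>stats.password</name>
  <value>pwd</value>
</property>
```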