Re: Hive built-in functions not working
Try to add the jar explicitly from hive prompt and see if that works. Regards, Ramki. On Fri, Jun 21, 2013 at 7:32 AM, fab wol darkwoll...@gmail.com wrote: I'm using Hadoop 0.20.2 with Hive 0.11. I have succesfully inserted into hive/hdfs some csv-files in seperate tables. selects and joins work flawlessly. When trying to analyse some data, i needed to make use of the built in functions of hive like: - substr - to_date - rand - etc. for example: select sid, request_id, to_date(times), to_unix_timestamp(times) from contents where sid = '500032066010373'; sid and request id are strings here, times is a timestamp column Unfortanetely i only get errors (always the same error stack) when using these functions: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 5 more Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) ... 10 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 13 more Caused by: java.lang.RuntimeException: Map operator initialization failed at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:121) ... 
18 more Caused by: java.lang.NoClassDefFoundError: org/codehaus/jackson/JsonFactory at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple.clinit(GenericUDTFJSONTuple.java:56) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.registerGenericUDTF(FunctionRegistry.java:526) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.registerGenericUDTF(FunctionRegistry.java:520) at org.apache.hadoop.hive.ql.exec.FunctionRegistry.clinit(FunctionRegistry.java:423) at org.apache.hadoop.hive.ql.exec.DefaultUDFMethodResolver.getEvalMethod(DefaultUDFMethodResolver.java:59) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.initialize(GenericUDFBridge.java:154) at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:111) at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:141) at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:970) at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:996) at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:451) at
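The NoClassDefFoundError above points at a missing org/codehaus/jackson/JsonFactory, so the jar to add explicitly from the Hive prompt would be the Jackson jars shipped with Hive. For example (paths and versions below are illustrative and depend on the installation):

hive> ADD JAR /usr/lib/hive/lib/jackson-core-asl-1.8.8.jar;
hive> ADD JAR /usr/lib/hive/lib/jackson-mapper-asl-1.8.8.jar;
hive> LIST JARS;
hive> SELECT sid, request_id, to_date(times), to_unix_timestamp(times) FROM contents WHERE sid = '500032066010373';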
Re: Is there a mechanism similar to hadoop -archive in hive (add archive is not apparently)
In the *Attempt two, *are you not supposed to use hivetry as the directory? May be you should try giving the full path /opt/am/ver/1.0/hive/hivetry/classifier_wf.py and see if it works. Regards, Ramki. On Thu, Jun 20, 2013 at 9:28 AM, Stephen Boesch java...@gmail.com wrote: Stephen: would you be willing to share an example of specifying a directory as the add file target?I have not seen this working I have attempted to use it as follows: *We will access a script within the hivetry directory located here:* hive ! ls -l /opt/am/ver/1.0/hive/hivetry/classifier_wf.py; -rwxrwxr-x 1 hadoop hadoop 11241 Jun 18 19:37 /opt/am/ver/1.0/hive/hivetry/classifier_wf.py *Add the directory to hive:* hive add file /opt/am/ver/1.0/hive/hivetry; Added resource: /opt/am/ver/1.0/hive/hivetry *Attempt to run transform query using that script:* * * *Attempt one: use the script name unqualified:* hivefrom (select transform (aappname,qappname) using 'classifier_wf.py' as (aappname2 string, qappname2 string) from eqx ) o insert overwrite table c select o.aappname2, o.qappname2; (Failed: Caused by: java.io.IOException: Cannot run program classifier_wf.py: java.io.IOException: error=2, No such file or directory) *Attempt two: use the script name with the directory name prefix: * hivefrom (select transform (aappname,qappname) using 'hive/classifier_wf.py' as (aappname2 string, qappname2 string) from eqx ) o insert overwrite table c select o.aappname2, o.qappname2; (Failed: Caused by: java.io.IOException: Cannot run program hive/classifier_wf.py: java.io.IOException: error=2, No such file or directory) 2013/6/20 Stephen Sprague sprag...@gmail.com yeah. the archive isn't unpacked on the remote side. I think add archive is mostly used for finding java packages since CLASSPATH will reference the archive (and as such there is no need to expand it.) On Thu, Jun 20, 2013 at 9:00 AM, Stephen Boesch java...@gmail.comwrote: thx for the tip on add file where file is directory. I will try that. 2013/6/20 Stephen Sprague sprag...@gmail.com i personally only know of adding a .jar file via add archive but my experience there is very limited. i believe if you 'add file' and the file is a directory it'll recursively take everything underneath but i know of nothing that inflates or un tars things on the remote end automatically. i would 'add file' your python script and then within that untar your tarball to get at your model data. its just the matter of figuring out the path to that tarball that's kinda up in the air when its added as 'add file'. Yeah. local downlooads directory. What's the literal path is what i'd like to know. :) On Thu, Jun 20, 2013 at 8:37 AM, Stephen Boesch java...@gmail.comwrote: @Stephen: given the 'relative' path for hive is from a local downloads directory on each local tasktracker in the cluster, it was my thought that if the archive were actually being expanded then somedir/somefileinthearchive should work. I will go ahead and test this assumption. In the meantime, is there any facility available in hive for making archived files available to hive jobs? archive or hadoop archive (har) etc? 2013/6/20 Stephen Sprague sprag...@gmail.com what would be interesting would be to run a little experiment and find out what the default PATH is on your data nodes. How much of a pain would it be to run a little python script to print to stderr the value of the environmental variable $PATH and $PWD (or the shell command 'pwd') ? that's of course going through normal channels of add file. 
the thing is given you're using a relative path hive/parse_qx.py you need to know what the current directory is when the process runs on the data nodes. On Thu, Jun 20, 2013 at 5:32 AM, Stephen Boesch java...@gmail.comwrote: We have a few dozen files that need to be made available to all mappers/reducers in the cluster while running hive transformation steps . It seems the add archive does not make the entries unarchived and thus available directly on the default file path - and that is what we are looking for. To illustrate: add file modelfile.1; add file modelfile.2; .. add file modelfile.N; Then, our model that is invoked during the transformation step *does *have correct access to its model files in the defaul path. But .. those model files take low *minutes* to all load.. instead when we try: add archive modelArchive.tgz. The problem is the archive does not get exploded apparently .. I have an archive for example that contains shell scripts under the hive directory stored inside. I am *not *able to access hive/my-shell-script.sh after adding the archive. Specifically the following fails: $ tar -tvf appm*.tar.gz | grep launch-quixey_to_xml -rwxrwxr-x stephenb/stephenb664 2013-06-18 17:46 appminer/bin/launch-quixey_to_xml.sh from (select
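One way to find out the working directory and PATH on the task nodes, along the lines suggested above, is a throwaway probe run in place of the real transform script. The thread suggests a small Python script; the shell sketch below does the same thing and is only an illustration:

$ cat env_probe.sh
#!/bin/bash
# print the task's working directory and PATH to stderr, then pass rows through unchanged
echo "PWD=$(pwd) PATH=$PATH" 1>&2
cat

hive> add file /opt/am/ver/1.0/hive/hivetry/env_probe.sh;
hive> select transform (aappname) using 'sh env_probe.sh' as (aappname2 string) from eqx limit 1;

The stderr output shows up in the task logs, which should also reveal whether a directory added with add file appears under its own name on the remote side (i.e. whether 'hivetry/classifier_wf.py' rather than 'hive/classifier_wf.py' is the relative path to use in Attempt two).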
Re: Hive External Table issue
1. I was under the impression that you cannot point the table location to a file. But, it looks like it works. Please see the discussion in the thread http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/% 3c556325346ca26341b6f0530e07f90d96017084360...@gbgh-exch-cms.sig.ads%3e 2. If there is more than one file in the directory, your query gets the data from all the files in that directory. In your case, the regex may not be parsing the data properly. Regards, Ramki. On Thu, Jun 20, 2013 at 11:03 AM, sanjeev sagar sanjeev.sa...@gmail.com wrote: Two issues: 1. I've created external tables in hive based on a file location before and it worked without any issue. It doesn't have to be a directory. 2. If there is more than one file in the directory, and you create the external table based on the directory, then how does the table know which file it needs to look at for the data? I tried to create the table based on the directory; it created the table but all the rows were NULL. -Sanjeev On Thu, Jun 20, 2013 at 10:30 AM, Nitin Pawar nitinpawar...@gmail.com wrote: in hive when you create a table and use the location to refer to an hdfs path, that path is supposed to be a directory. If the directory does not exist it will try to create it, and if it's a file it will throw an error as it's not a directory. That's the error you are getting: the location you referred to is a file. Change it to the directory and see if that works for you On Thu, Jun 20, 2013 at 10:57 PM, sanjeev sagar sanjeev.sa...@gmail.com wrote: I did mention in my mail that the hdfs file exists in that location. See below In HDFS: file exists hadoop fs -ls /user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 Found 1 items -rw-r--r-- 3 hdfs supergroup 2242037226 2013-06-13 11:14 /user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 so the directory and file both exist. On Thu, Jun 20, 2013 at 10:24 AM, Nitin Pawar nitinpawar...@gmail.com wrote: MetaException(message:hdfs:// h1.vgs.mypoints.com:8020/user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 is not a directory or unable to create one) it clearly says it's not a directory. Point to the directory and it will work On Thu, Jun 20, 2013 at 10:52 PM, sanjeev sagar sanjeev.sa...@gmail.com wrote: Hello Everyone, I'm running into the following Hive external table issue.
hive CREATE EXTERNAL TABLE access( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( input.regex = ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \]*|\[^\]*\) (-|[0-9]*) (-|[0-9]*)(?: ([^ \]*|\[^\]*\) ([^ \]*|\[^\]*\))?, output.format.string = %1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s ) STORED AS TEXTFILE LOCATION '/user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033'; FAILED: Error in metadata: MetaException(message:hdfs:// h1.vgs.mypoints.com:8020/user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 is not a directory or unable to create one) FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask In HDFS: file exists hadoop fs -ls /user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 Found 1 items -rw-r--r-- 3 hdfs supergroup 2242037226 2013-06-13 11:14 /user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 I've download the serde2 jar file too and install it in /usr/lib/hive/lib/hive-json-serde-0.2.jar and I've bounced all the hadoop services after that. I even added the jar file manually in hive and run the above sql but still failing. ive add jar /usr/lib/hive/lib/hive-json-serde-0.2.jar ; Added /usr/lib/hive/lib/hive-json-serde-0.2.jar to class path Added resource: /usr/lib/hive/lib/hive-json-serde-0.2.jar Any help would be highly appreciable. -Sanjeev -- Sanjeev Sagar ***Separate yourself from everything that separates you from others ! - Nirankari Baba Hardev Singh ji * ** -- Nitin Pawar -- Sanjeev Sagar ***Separate yourself from everything that separates you from others !- Nirankari Baba Hardev Singh ji * ** -- Nitin Pawar -- Sanjeev Sagar ***Separate yourself from everything that separates you from others ! - Nirankari Baba Hardev Singh ji * **
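If the data really has to stay in a single file, one workaround consistent with the replies above is to give that file its own directory and point the table at the directory. The commands below reuse the paths from this thread and are only a sketch:

$ hadoop fs -mkdir /user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13/flumedata
$ hadoop fs -mv /user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 /user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13/flumedata/

and then in the CREATE EXTERNAL TABLE statement:

LOCATION '/user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13/flumedata';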
Re: Hive External Table issue
Nitin, Can you go through the thread with subject S3/EMR Hive: Load contents of a single file on Tue, 26 Mar, 17:11 at http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/thread?1 This gives the whole discussion about the topic of table location pointing to a filename vs. directory. Can you give your insight from this discussion and the discussion you mentioned at stackoverflow link? Regards, Ramki. On Thu, Jun 20, 2013 at 11:14 AM, Nitin Pawar nitinpawar...@gmail.comwrote: Also see this JIRA https://issues.apache.org/jira/browse/HIVE-951 I think issue you are facing is due to the JIRA On Thu, Jun 20, 2013 at 11:41 PM, Nitin Pawar nitinpawar...@gmail.comwrote: Mark has answered this before http://stackoverflow.com/questions/11269203/when-creating-an-external-table-in-hive-can-i-point-the-location-to-specific-fil If this link does not answer your question, do let us know On Thu, Jun 20, 2013 at 11:33 PM, sanjeev sagar sanjeev.sa...@gmail.comwrote: Two issues: 1. I've created external tables in hive based on file location before and it work without any issue. It don't have to be a directory. 2. If there are more than one file in the directory, and you create external table based on directory then how the table knows that which file it need to look for the data? I tried to create the table based on directory, it created the table but all the rows were NULL. -Sanjeev On Thu, Jun 20, 2013 at 10:30 AM, Nitin Pawar nitinpawar...@gmail.comwrote: in hive when you create table and use the location to refer hdfs path, that path is supposed to be a directory. If the directory is not existing it will try to create it and if its a file it will throw an error as its not a directory thats the error you are getting that location you referred is a file. Change it to the directory and see if that works for you On Thu, Jun 20, 2013 at 10:57 PM, sanjeev sagar sanjeev.sa...@gmail.com wrote: I did mention in my mail the hdfs file exists in that location. See below In HDFS: file exists hadoop fs -ls /user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 Found 1 items -rw-r--r-- 3 hdfs supergroup 2242037226 2013-06-13 11:14 /user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 so the directory and file both exists. On Thu, Jun 20, 2013 at 10:24 AM, Nitin Pawar nitinpawar...@gmail.com wrote: MetaException(message:hdfs:// h1.vgs.mypoints.com:8020/user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 is not a directory or unable to create one) it clearly says its not a directory. Point to the dictory and it will work On Thu, Jun 20, 2013 at 10:52 PM, sanjeev sagar sanjeev.sa...@gmail.com wrote: Hello Everyone, I'm running into the following Hive external table issue. 
hive CREATE EXTERNAL TABLE access( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( input.regex = ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \]*|\[^\]*\) (-|[0-9]*) (-|[0-9]*)(?: ([^ \]*|\[^\]*\) ([^ \]*|\[^\]*\))?, output.format.string = %1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s ) STORED AS TEXTFILE LOCATION '/user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033'; FAILED: Error in metadata: MetaException(message:hdfs:// h1.vgs.mypoints.com:8020/user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 is not a directory or unable to create one) FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask In HDFS: file exists hadoop fs -ls /user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 Found 1 items -rw-r--r-- 3 hdfs supergroup 2242037226 2013-06-13 11:14 /user/flume/events/request_logs/ ar1.vgs.mypoints.com/13-06-13/FlumeData.1371144648033 I've download the serde2 jar file too and install it in /usr/lib/hive/lib/hive-json-serde-0.2.jar and I've bounced all the hadoop services after that. I even added the jar file manually in hive and run the above sql but still failing. ive add jar /usr/lib/hive/lib/hive-json-serde-0.2.jar ; Added /usr/lib/hive/lib/hive-json-serde-0.2.jar to class path Added resource: /usr/lib/hive/lib/hive-json-serde-0.2.jar Any help would be highly appreciable. -Sanjeev -- Sanjeev Sagar ***Separate yourself from everything that separates you from others ! - Nirankari Baba Hardev Singh ji * ** -- Nitin Pawar -- Sanjeev
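If the table has already been created, its location can also be repointed at the directory (rather than the file) after the fact; the URI below is assembled from the paths in this thread and may need adjusting:

hive> ALTER TABLE access SET LOCATION 'hdfs://h1.vgs.mypoints.com:8020/user/flume/events/request_logs/ar1.vgs.mypoints.com/13-06-13';

Note that all-NULL rows usually mean the SerDe's input.regex is not matching the log lines, not that the wrong file was picked up.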
Re: LZO compression implementation in Hive
Hi Sanjay, Can you quickly give your insights on this topic, if possible? Regards, Ramki. On Mon, May 20, 2013 at 2:51 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: Hi Programming Hive Book authors Maybe a lot of you have already successfully implemented this, but only in these last two weeks we implemented our aggregations using LZO compression in Hive - MR jobs creating LZO files as input for Hive --- thereafter Hive aggregations creating more LZO files as output. As usual nothing was straightforward :-) Also the other challenge was to neatly tie it all into actions in Oozie workflows… but after being underwater for weeks I think I am able to rise above water and breathe! In the next version of the book, if you are planning to add detailed sections on using LZO compression in Hive, let me know… my experiences might be useful :-) Thanks sanjay
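For reference, a minimal sketch of the kind of LZO setup being described, assuming the hadoop-lzo package is installed and its codecs are registered in core-site.xml (the table name and paths are placeholders, not Sanjay's actual pipeline):

-- aggregations writing LZO output
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

-- a table reading LZO text files produced by the MR jobs
CREATE EXTERNAL TABLE lzo_input (line STRING)
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/path/to/lzo/files';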
Re: Table present in HDFS but 'show table' Returns Empty
Maybe you are using Derby as your metastore. It creates the metastore in the current directory from where you started your hive session. You may have started your hive session from a different directory the next time. Please either use MySQL as your metastore, or set a definite directory in your config file as the metastore location if you continue to use Derby. Regards, Ramki. On Mon, Apr 22, 2013 at 9:48 PM, Xun TANG tangxun.al...@gmail.com wrote: Hi guys, I've created hive tables via 'hive -f' or the hive interactive shell. Data are loaded into the tables right after, so the tables are not empty. However, when I log out and log in to the hive shell again, 'show tables' returns 0 tables, while the 'hadoop list' command shows the hdfs files are still where they were. So my questions are, 1. What triggered hive to 'forget' the tables? 2. How to make hive permanently (or within the lifecycle of the hadoop daemons) remember the tables until they are dropped? This problem has puzzled me for a while, and I could not find a similar question/answer online... Did I miss some configuration? Thanks, Alice
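For example, the Derby metastore location can be pinned in hive-site.xml so it no longer depends on the directory the CLI was started from, or the metastore can be moved to MySQL. The values below are a sketch (paths, host and database names are only examples), written as name = value pairs that go into <property> elements:

javax.jdo.option.ConnectionURL = jdbc:derby:;databaseName=/home/alice/hive/metastore_db;create=true

or, for MySQL:

javax.jdo.option.ConnectionURL = jdbc:mysql://dbhost:3306/hivemeta?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName / ConnectionPassword = site-specific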
Re: Exception comes out when counting the rows
Normally, you do not use load data when you use external table. You either specify location in the create external table query or use alter table tablename set location syntax to set the location of the files that the external table stores the data. In this case, try describe extended NMS_CMTS_CPU_CDX_TEST and see what is the value for location. It is better to use the location as I mentioned above for external tables instead of using load data. This way there is no confusion between managed tables and external tables locations. Regards, Ramki. On Mon, Apr 22, 2013 at 2:01 AM, YouPeng Yang yypvsxf19870...@gmail.comwrote: Hi hive users Sorry for missing the title on the previous mail. This is my first time to post a question here. I have gotten an exception when I count the rows of my hive table after I have loaded the data: hivecreate EXTERNAL TABLE NMS_CMTS_CPU_CDX_TEST (CMTSID INT,MSEQ INT,GOTTIME BIGINT,CMTSINDEX INT,CPUTOTAL INT,DESCR STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE; hiveload data inpath '/user/sqoop/NMS_CMTS_CPU_CDX3/NMS_CMTS_CPU_CDX3' into table NMS_CMTS_CPU_CDX_TEST; hive select count(1) from NMS_CMTS_CPU_CDX_TEST; I get an exception on step 3,logs are as follows. Any helps will be gratefull. Regards --- WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files. Execution log at: /tmp/hive/hive_20130422162020_791a7b61-6ba0-466d-99ba-5c2556bafaa4.log Job running in-process (local Hadoop) Hadoop job information for null: number of mappers: 1; number of reducers: 1 2013-04-22 16:24:21,604 null map = 0%, reduce = 0% 2013-04-22 16:25:21,965 null map = 0%, reduce = 0% 2013-04-22 16:26:22,902 null map = 0%, reduce = 0% 2013-04-22 16:26:27,312 null map = 100%, reduce = 0% Ended Job = job_1364348895095_0055 with errors Error during job, obtaining debugging information... Examining task ID: task_1364348895095_0055_m_00 (and more) from job job_1364348895095_0055 Unable to retrieve URL for Hadoop Task logs. 
Does not contain a valid host:port authority: local Task with the most failures(4): - Task ID: task_1364348895095_0055_m_00 URL: Unavailable - Diagnostic Messages for this Task: Error: java.lang.RuntimeException: java.io.FileNotFoundException: /tmp/hive/hive_2013-04-22_16-20-45_720_3839682514463028560/-mr-10001/89dd576e-fb9d-409a-8b46-2e46b7d21160 (No such file or directory) at org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:224) at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255) at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:381) at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:374) at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:536) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.init(MapTask.java:160) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:381) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147) Caused by: java.io.FileNotFoundException: /tmp/hive/hive_2013-04-22_16-20-45_720_3839682514463028560/-mr-10001/89dd576e-fb9d-409a-8b46-2e46b7d21160 (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:120) at java.io.FileInputStream.init(FileInputStream.java:79) at org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:215) ... 12 more Execution failed with exit status: 2 13/04/22 16:26:28 ERROR exec.Task: Execution failed with exit status: 2 Obtaining error information 13/04/22 16:26:28 ERROR exec.Task: Obtaining error information Task failed! Task ID: Stage-1 Logs: 13/04/22 16:26:28 ERROR exec.Task: Task failed! Task ID: Stage-1 Logs: 13/04/22 16:26:28 ERROR exec.ExecDriver: Execution failed with exit status: 2 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask 13/04/22 16:26:28 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask 13/04/22 16:26:28 INFO ql.Driver: /PERFLOG method=Driver.execute start=1366618840573 end=1366619188646 duration=348073 13/04/22 16:26:28 INFO ql.Driver: PERFLOG method=releaseLocks 13/04/22 16:26:28 INFO ql.Driver: /PERFLOG method=releaseLocks
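A sketch of the directory-based approach described above, using the path the data was loaded from (the namenode URI is a placeholder, and depending on the version SET LOCATION may insist on a full hdfs:// URI):

hive> DESCRIBE EXTENDED NMS_CMTS_CPU_CDX_TEST;   -- check the current location field
hive> ALTER TABLE NMS_CMTS_CPU_CDX_TEST SET LOCATION 'hdfs://namenode:8020/user/sqoop/NMS_CMTS_CPU_CDX3';
hive> SELECT COUNT(1) FROM NMS_CMTS_CPU_CDX_TEST;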
Re: Partition performance
Can you tell how many map tasks are there in each scenario? If my assumption is correct, you should have 336 in the first case and 14 in second case. It looks like it is combing all small files in a folder and running as one map task for all 24 files in a folder, whereas it is running a separate task in these files are there in different partitions (folders). You can try to reuse the JVM and see if the response time is similar. Can you please try the following and let us know how long each strategy takes? hive set mapred.job.reuse.jvm.num.tasks = 24; Run your query that has more partitions and see if the response time is lower. Regards, Ramki. On Fri, Apr 5, 2013 at 11:36 AM, Ian liu...@yahoo.com wrote: Thanks. This is just a test from my local box. So each file is only 1kb. I shared the query plans of these two tests at: http://codetidy.com/paste/raw/5198 http://codetidy.com/paste/raw/5199 Also in the Hadoop log, there is this line for each partition: org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to work list for file hdfs://localhost:8020/test1/2011/02/01/01 Does that mean each partition will become a map task? I'm still new in Hive, just wondering what are the common strategy for partitioning the hourly logs? I know we shouldn't have too many partitions but I'm wondering what's the reason behind it? If I run this on a real cluster, maybe it won't perform so differently? Thanks. *From:* Dean Wampler dean.wamp...@thinkbiganalytics.com *To:* user@hive.apache.org *Sent:* Thursday, April 4, 2013 4:28 PM *Subject:* Re: Partition performance Also, how big are the files in each directory? Are they roughly the size of one HDFS block or a multiple. Lots of small files will mean lots of mapper tasks will little to do. You can also compare the job tracker console output for each job. I bet the slow one has a lot of very short map and reduce tasks, while the faster one has fewer tasks that run longer. A rule of thumb is that any one task should take 20 seconds or more to amortize over the few seconds spent in start up per task. In other words, if you think about what's happening at the HDFS and MR level, you can learn to predict how fast or slow things will run. Learning to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this. dean On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley omal...@apache.org wrote: See slide #9 from my Optimizing Hive Queries talk http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will improve it, but for now you are much better off with 1,000 partitions than 10,000. -- Owen On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle ramki.pa...@gmail.com wrote: Is it possible for you to send the explain plan of these two queries? Regards, Ramki. On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: The slow down is most possibly due to large number of partitions. I believe the Hive book authors tell us to be cautious with large number of partitions :-) and I abide by that. Users Please add your points of view and experiences Thanks sanjay From: Ian liu...@yahoo.com Reply-To: user@hive.apache.org user@hive.apache.org, Ian liu...@yahoo.com Date: Thursday, April 4, 2013 4:01 PM To: user@hive.apache.org user@hive.apache.org Subject: Partition performance Hi, I created 3 years of hourly log files (totally 26280 files), and use External Table with partition to query. I tried two partition methods. 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per hour). Use date and hour as partition keys. 
Add 3 years of directories to the table partitions. So there are 26280 partitions. CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int); ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16'; 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per day, 24 files in each directory). Use date as partition key. Add 3 years of directories to the table partitions. So there are 1095 partitions. CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string); ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02'; When doing a simple query like SELECT * FROM test1/test2 WHERE dt = '2013-02-01' and dt = '2013-02-14' Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. I'm wondering why there is a big performance difference between these two? These two approaches have the same number of files, only the directory structure is different. So Hive is going to load the same amount of files. Why does the number of partitions have such big impact? Does that mean #2 is a better partition strategy? Thanks. CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended
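Besides JVM reuse, it is worth checking how the splits are being formed over the 26,280 small files; a few settings to experiment with (the values are only examples):

set mapred.job.reuse.jvm.num.tasks=24;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=268435456;            -- 256 MB, example value
set mapred.min.split.size.per.node=134217728;   -- example value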
Re: Partition performance
Is it possible for you to send the explain plan of these two queries? Regards, Ramki. On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: The slow down is most possibly due to large number of partitions. I believe the Hive book authors tell us to be cautious with large number of partitions :-) and I abide by that. Users Please add your points of view and experiences Thanks sanjay From: Ian liu...@yahoo.com Reply-To: user@hive.apache.org user@hive.apache.org, Ian liu...@yahoo.com Date: Thursday, April 4, 2013 4:01 PM To: user@hive.apache.org user@hive.apache.org Subject: Partition performance Hi, I created 3 years of hourly log files (totally 26280 files), and use External Table with partition to query. I tried two partition methods. 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per hour). Use date and hour as partition keys. Add 3 years of directories to the table partitions. So there are 26280 partitions. CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int); ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16'; 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per day, 24 files in each directory). Use date as partition key. Add 3 years of directories to the table partitions. So there are 1095 partitions. CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string); ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02'; When doing a simple query like SELECT * FROM test1/test2 WHERE dt = '2013-02-01' and dt = '2013-02-14' Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. I'm wondering why there is a big performance difference between these two? These two approaches have the same number of files, only the directory structure is different. So Hive is going to load the same amount of files. Why does the number of partitions have such big impact? Does that mean #2 is a better partition strategy? Thanks. CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
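For reference, the plans being asked for can be generated with EXPLAIN (assuming the intended predicate is the date range from the question):

hive> EXPLAIN EXTENDED SELECT * FROM test1 WHERE dt BETWEEN '2013-02-01' AND '2013-02-14';
hive> EXPLAIN EXTENDED SELECT * FROM test2 WHERE dt BETWEEN '2013-02-01' AND '2013-02-14';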
Re: S3/EMR Hive: Load contents of a single file
First of all, you cannot point a table to a file. Each table has a corresponding directory. If you want the table to contain the data from only one file, simply copy that one file into the directory. The table does not need to know the name of the file. It only matters whether the structure of the data in the file matches the table structure. When you query the table, it gets the data from whatever files are in the corresponding directory. Regards, Ramki. On Tue, Mar 26, 2013 at 10:11 AM, Tony Burton tbur...@sportingindex.com wrote: Hi list, I've been using hive to perform queries on data hosted on AWS S3, and my tables point at data by specifying the directory in which the data is stored, eg $ create external table myData (str1 string, str2 string, count1 int) partitioned by snip row format snip stored as textfile location 's3://mybucket/path/to/data'; where s3://mybucket/path/to/data is the directory that contains the files I'm interested in. My use case now is to create a table with data pointing to a specific file in a directory: $ create external table myData (str1 string, str2 string, count1 int) partitioned by snip row format snip stored as textfile location 's3://mybucket/path/to/data/src1.txt'; and I get the error: FAILED: Error in metadata: MetaException(message:Got exception: java.io.IOException Can't make directory for path 's3://spinmetrics/global/counter_Fixture.txt' since it is a file.). Ok, let's try to create the table without specifying the data source: $ create external table myData (str1 string, str2 string, count1 int) partitioned by snip row format snip stored as textfile Ok, no problem. Now let's load the data $ LOAD DATA INPATH 's3://mybucket/path/to/data/src1.txt' INTO TABLE myData; (referring to https://cwiki.apache.org/Hive/languagemanual-dml.html - ...filepath can refer to a file (in which case hive will move the file into the table)) Error message is: FAILED: Error in semantic analysis: Line 1:17 Path is not legal ''s3://mybucket/path/to/data/src1.txt: Move from: s3:// mybucket/path/to/data/src1.txt to: hdfs:// 10.48.97.97:9000/mnt/hive_081/warehouse/gfix is not valid. Please check that values for params default.fs.name and hive.metastore.warehouse.dir do not conflict. So I check my default.fs.name and hive.metastore.warehouse.dir (which have never caused problems before): $ set fs.default.name; fs.default.name=hdfs://10.48.97.97:9000 $ set hive.metastore.warehouse.dir; hive.metastore.warehouse.dir=/mnt/hive_081/warehouse Clearly different, but which is correct? Is there an easier way to load a single file into a hive table? Or should I just put each file in a directory and proceed as before? Thanks! Tony Tony Burton Senior Software Engineer e: tbur...@sportingindex.com
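Following the advice above, a sketch of giving the single file its own prefix and pointing the table there (the bucket and file names are the poster's; the row format clause is only a placeholder for the snipped ones):

$ hadoop fs -mkdir s3://mybucket/path/to/data/src1/
$ hadoop fs -cp s3://mybucket/path/to/data/src1.txt s3://mybucket/path/to/data/src1/

hive> create external table myData (str1 string, str2 string, count1 int)
    > row format delimited fields terminated by '\t'
    > stored as textfile
    > location 's3://mybucket/path/to/data/src1/';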
Re: Using TABLESAMPLE on inner queries
You may use percent based (block sampling) sampling for non-bucketed tables, though there are some restrictions. https://cwiki.apache.org/Hive/languagemanual-sampling.html Regards, Ramki. On Wed, Mar 20, 2013 at 12:27 PM, Mark Grover grover.markgro...@gmail.comwrote: Hey Dean, I am not a power user of the sampling feature but my understanding was that sampling in Hive only works on bucketed tables. I am happy to be corrected though. Mark On Wed, Mar 20, 2013 at 12:20 PM, Dean Wampler dean.wamp...@thinkbiganalytics.com wrote: Mark, Aside from what might be wrong here, isn't it true that sampling with the bucket clause still works on non-bucketed tables; it's just inefficient because it still scans the whole table? Or am I an idiot? ;) dean On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover grover.markgro...@gmail.com wrote: Hi Robert, Sampling in Hive is based on buckets. Therefore, you table needs to be appropriately bucketed. I would recommend storing the results of your inner query in a bucketed table. See how to populate a bucketed table at https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html Then you will be able to be sample through it. If the predicate is on partition column, you may be able to get around the intermediate table requirement but in general, as far as I know, intermediate bucketed table might be the only choice. Mark On Wed, Mar 20, 2013 at 9:56 AM, Robert Li robert...@kontagent.comwrote: Hi Everyone I'm trying to use the TABLESAMPLE function to sample data, however it's a little more complicated and I am having trouble getting it to run. I know that this works fine and it will give me about 25% of the whole dataset select distinct s from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) where month = 201211 However, in my situation I need to do a TABLESAMPLE on an outer query, a simple example is *select mytest.s * *from * * (select distinct s from testtable where month = 201211)mytest* or something like *select table1.s* *from * * (select distinct s from testtable)table1* * join* * (select distinct s from test2table)table2* * on table1.s=table2.s* How do I use TABLESAMPLE in this case to sample the results of the outer query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various places of my query but it always returns some sort of syntax error and thus not allowing the query to run. Any help is appreciated. Robert ** -- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330
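A sketch of both routes (names follow Robert's example; block sampling is subject to the restrictions on the wiki page above):

-- route 1: materialize the inner query into a bucketed table, then sample it
CREATE TABLE testtable_s (s STRING) CLUSTERED BY (s) INTO 4 BUCKETS;
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE testtable_s SELECT DISTINCT s FROM testtable WHERE month = 201211;
SELECT s FROM testtable_s TABLESAMPLE(BUCKET 1 OUT OF 4 ON s);

-- route 2: block sampling on a non-bucketed table
SELECT DISTINCT s FROM testtable TABLESAMPLE(10 PERCENT) t WHERE month = 201211;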
Re: how to handle variable format data of text file?
One way you can try is to make your ldata column a map field, since it contains variable-format data, and write a UDF to get whatever information you need. Regards, Ramki. On Mon, Mar 18, 2013 at 1:23 AM, Zhiwen Sun pens...@gmail.com wrote: As you defined in the create table hql, fields are delimited by blank space, so the rest of the data is dropped if you want to keep it at the end of the line. I suggest you use the org.apache.hadoop.hive.contrib.serde2.RegexSerDe row format instead of the default delimited format. Zhiwen Sun On Mon, Mar 11, 2013 at 12:04 PM, 周梦想 abloz...@gmail.com wrote: I have files like this: 03/11/13 10:59:52 0ec0 1009 180538126 92041 2300 0 0 7 21|47|20|33|11 0:2775 03/11/13 10:59:52 0744 1010 178343610 92042 350 1 0 -1 NULL NULL 22 45 the format is separated by blank space: date time threadid gid userid [variable formatted data grouped by fields separated by space] I'd like to create a table like: hive create external table handresult (hdate string,htime string, thid string, gid int, userid string,ldata string) row format delimited fields terminated by ; OK but the above table will only have a part of the data. select * from handresult; 03/11/13 10:59:52 0ec0 1009 180538126 92041 03/11/13 10:59:52 0744 1010 178343610 92042 The remaining data, like 2300 0 0 7 21|47|20|33|11 0:2775, I can't get. ldata may have variable length and format, separated by spaces or an array, and we will parse ldata differently for each gid. How do I do this? Thanks, Andy Zhou
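A sketch of the RegexSerDe route suggested in the thread above, which keeps the first five fields and captures the variable remainder into ldata for a UDF (or split()) to parse per gid. The regex is an assumption about the layout, and the contrib RegexSerDe requires all columns to be strings, so gid is declared string here:

hive> add jar /usr/lib/hive/lib/hive-contrib-0.10.0.jar;   -- path and version illustrative
hive> create external table handresult (hdate string, htime string, thid string, gid string, userid string, ldata string)
    > row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    > with serdeproperties ("input.regex" = "(\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (.*)")
    > stored as textfile;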
Re: hive issue with sub-directories
One way it was solved by an user earlier was by subclassing the InputFormat class and overriding the listStatus method so that you can ignore subdirectories. This was done in 0.7.1 version. Not sure if there is any better way in later versions. At least you can use this approach until someone comes up with a better way. Please see the message at http://mail-archives.apache.org/mod_mbox/hive-user/201108.mbox/%3CCAC80dVSwRz3GpTSAzpS1eQw_P79ko+-=jmQVkOE2VMvMam=0...@mail.gmail.com%3E -Ramki. On Sun, Mar 10, 2013 at 11:36 PM, bejoy...@yahoo.com wrote: ** Hi Suresh AFAIK as of now a partition cannot contain sub directories, it can contain only files. You may have to move the sub dirs out of the parent dir 'a' and create separate partitions for those. Regards Bejoy KS Sent from remote device, Please excuse typos -- *From: * Suresh Krishnappa suresh.krishna...@gmail.com *Date: *Mon, 11 Mar 2013 10:58:05 +0530 *To: *user@hive.apache.org *ReplyTo: * user@hive.apache.org *Subject: *Re: hive issue with sub-directories Hi Mark, I am using external table in HIVE. This is how I am adding the partition alter table mytable add partition (pt=1) location '/test/a/'; I am able to run HIVE queries only if '/test/a/b' folder is deleted. How can I retain this folder structure and still issue queries? Thanks Suresh On Sun, Mar 10, 2013 at 12:48 AM, Mark Grover grover.markgro...@gmail.com wrote: Suresh, By default, the partition column name has to be appear in HDFS directory structure. e.g. /user/hive/warehouse/table name/partition col name=partition col value/data1.txt /user/hive/warehouse/table name/partition col name=partition col value/data2.txt On Thu, Mar 7, 2013 at 7:20 AM, Suresh Krishnappa suresh.krishna...@gmail.com wrote: Hi All, I have the following directory structure in hdfs /test/a/ /test/a/1.avro /test/a/2.avro /test/a/b/ /test/a/b/3.avro I created an external HIVE table using Avro Serde and added /test/a as a partition to this table. I am not able to run a select query. Always getting the error 'not a file' on '/test/a/b' Is this by design, a bug or am I missing some configuration? I am using HIVE 0.10 Thanks Suresh
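For reference, wiring such a subclass into a table definition would look roughly like this; the jar and class name are hypothetical, standing in for a class that extends the text input format and overrides listStatus() to skip sub-directories:

hive> add jar /path/to/files-only-inputformat.jar;
hive> create external table mytable (line string)
    > stored as inputformat 'com.example.FilesOnlyTextInputFormat'
    > outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

In newer releases it is also worth checking whether set hive.mapred.supports.subdirectories=true; together with set mapred.input.dir.recursive=true; is honored, which would avoid the custom class altogether.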
Re: Error while table creation
Please use if not exists clause and let us know if it works. I am wondering if a table exists already in your database with the same name. -Ramki. On Sat, Mar 9, 2013 at 11:23 PM, Abhishek Gayakwad a.gayak...@gmail.comwrote: it is an old installation, we recently upgraded hive 0.7.0 to 0.9.0 and we are not using if not exists clause On Sun, Mar 10, 2013 at 4:37 AM, Ramki Palle ramki.pa...@gmail.comwrote: Just wondering if your create table syntax include if not exists such as CREATE EXTERNAL TABLE IF NOT EXISTS my_table ( ... ... ... ) On Sun, Mar 10, 2013 at 2:34 AM, Viral Bajaria viral.baja...@gmail.comwrote: Is this is a new installation of Hive or did you upgrade ? How many tables do you already have ? On Sat, Mar 9, 2013 at 12:00 PM, Abhishek Gayakwad a.gayak...@gmail.com wrote: while trying to create external table in oozie hive action, I am getting following error, hive version is 0.9.0 Caused by: javax.jdo.JDODataStoreException: Insert of object org.apache.hadoop.hive.metastore.model.MColumnDescriptor@46c1dc0b using statement INSERT INTO `CDS` (`CD_ID`) VALUES (?) failed : Duplicate entry 'X' for key 'PRIMARY' NestedThrowables: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'X' for key 'PRIMARY' at org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:313) at org.datanucleus.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:660) at org.datanucleus.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:680) at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:617) ... 43 more Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'X' for key 'PRIMARY' Thanks, Abhishek
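To check for a name clash and make the Oozie action idempotent while investigating the duplicate-key error, the DDL can be guarded along these lines (the table name is a placeholder):

hive> show tables 'my_table*';
hive> create external table if not exists my_table ( ... );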
Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil
When you execute the following query, hive select * from twitter limit 5; Hive runs it in local mode and not use MapReduce. For the query, hive select tweet_id from twitter limit 5; I think you need to add JSON jars to overcome this error. You might have added these in a previous session. If you want these jars available for all sessions, insert the add jar statements to your $HOME/.hiverc file. To bypass MapReduce set hive.exec.mode.local.auto = true; to suggest Hive to use local mode to execute the query. If it still uses MR, try set hive.fetch.task.conversion = more;. -Ramki. On Sun, Mar 10, 2013 at 12:19 AM, Sai Sai saigr...@yahoo.in wrote: Just wondering if anyone has any suggestions: This executes successfully: hive select * from twitter limit 5; This does not work: hive select tweet_id from twitter limit 5; // I have given the exception info below: Here is the output of this: hive select * from twitter limit 5; OK tweet_idcreated_attextuser_iduser_screen_nameuser_lang 122106088022745088Fri Oct 07 00:28:54 + 2011wkwkw -_- ayo saja mba RT @yullyunet: Sepupuuu, kita lanjalan yok.. Kita karokoe-an.. Ajak mas galih jg kalo dia mau.. @Dindnf: doremifas124735434Dindnfen 122106088018558976Fri Oct 07 00:28:54 + 2011@egg486 특별히 준비했습니다!252828803CocaCola_Koreako 122106088026939392Fri Oct 07 00:28:54 + 2011My offer of free gobbies for all if @amityaffliction play Blair snitch project still stands.168590073SarahYoungBlooden 122106088035328001Fri Oct 07 00:28:54 + 2011the girl nxt to me in the lib got her headphones in dancing and singing loud af like she the only one here haha267296295MONEYyDREAMS_en 122106088005971968Fri Oct 07 00:28:54 + 2011@KUnYoong_B2UTY Bị lsao đấy269182160b2st_b2utyhpen Time taken: 0.154 seconds This does not work: hive select tweet_id from twitter limit 5; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201303050432_0094, Tracking URL = http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094 Kill Command = /home/satish/work/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201303050432_0094 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2013-03-10 00:14:44,509 Stage-1 map = 0%, reduce = 0% 2013-03-10 00:15:14,613 Stage-1 map = 100%, reduce = 100% Ended Job = job_201303050432_0094 with errors Error during job, obtaining debugging information... 
Job Tracking URL: http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094 Examining task ID: task_201303050432_0094_m_02 (and more) from job job_201303050432_0094 Task with the most failures(4): - Task ID: task_201303050432_0094_m_00 URL: http://ubuntu:50030/taskdetails.jsp?jobid=job_201303050432_0094tipid=task_201303050432_0094_m_00 - Diagnostic Messages for this Task: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 9 more Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) ... 14 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at
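An example $HOME/.hiverc along the lines suggested above (the jar names are placeholders for whatever JSON SerDe/UDF jars the twitter table was created with):

add jar /usr/lib/hive/lib/hive-json-serde-0.2.jar;
add jar /usr/lib/hive/lib/json-path-0.5.4.jar;
set hive.exec.mode.local.auto=true;
set hive.fetch.task.conversion=more;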
Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil
Well, you get the results faster. Please check this: https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-Runtimeconfiguration Under section Hive, Map-Reduce and Local-Mode, it says This can be very useful to run queries over small data sets - in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. -Ramki. On Sun, Mar 10, 2013 at 5:26 AM, Sai Sai saigr...@yahoo.in wrote: Ramki/John Many Thanks, that really helped. I have run the add jars in the new session and it appears to be running. However i was wondering about by passing MR, why would we do it and what is the use of it. Will appreciate any input. Thanks Sai -- *From:* Ramki Palle ramki.pa...@gmail.com *To:* user@hive.apache.org; Sai Sai saigr...@yahoo.in *Sent:* Sunday, 10 March 2013 4:22 AM *Subject:* Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil When you execute the following query, hive select * from twitter limit 5; Hive runs it in local mode and not use MapReduce. For the query, hive select tweet_id from twitter limit 5; I think you need to add JSON jars to overcome this error. You might have added these in a previous session. If you want these jars available for all sessions, insert the add jar statements to your $HOME/.hiverc file. To bypass MapReduce set hive.exec.mode.local.auto = true; to suggest Hive to use local mode to execute the query. If it still uses MR, try set hive.fetch.task.conversion = more;. -Ramki. On Sun, Mar 10, 2013 at 12:19 AM, Sai Sai saigr...@yahoo.in wrote: Just wondering if anyone has any suggestions: This executes successfully: hive select * from twitter limit 5; This does not work: hive select tweet_id from twitter limit 5; // I have given the exception info below: Here is the output of this: hive select * from twitter limit 5; OK tweet_idcreated_attextuser_iduser_screen_nameuser_lang 122106088022745088Fri Oct 07 00:28:54 + 2011wkwkw -_- ayo saja mba RT @yullyunet: Sepupuuu, kita lanjalan yok.. Kita karokoe-an.. Ajak mas galih jg kalo dia mau.. @Dindnf: doremifas124735434Dindnfen 122106088018558976Fri Oct 07 00:28:54 + 2011@egg486 특별히 준비했습니다!252828803CocaCola_Koreako 122106088026939392Fri Oct 07 00:28:54 + 2011My offer of free gobbies for all if @amityaffliction play Blair snitch project still stands.168590073SarahYoungBlooden 122106088035328001Fri Oct 07 00:28:54 + 2011the girl nxt to me in the lib got her headphones in dancing and singing loud af like she the only one here haha267296295MONEYyDREAMS_en 122106088005971968Fri Oct 07 00:28:54 + 2011@KUnYoong_B2UTY Bị lsao đấy269182160b2st_b2utyhpen Time taken: 0.154 seconds This does not work: hive select tweet_id from twitter limit 5; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_201303050432_0094, Tracking URL = http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094 Kill Command = /home/satish/work/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201303050432_0094 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2013-03-10 00:14:44,509 Stage-1 map = 0%, reduce = 0% 2013-03-10 00:15:14,613 Stage-1 map = 100%, reduce = 100% Ended Job = job_201303050432_0094 with errors Error during job, obtaining debugging information... 
Job Tracking URL: http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094 Examining task ID: task_201303050432_0094_m_02 (and more) from job job_201303050432_0094 Task with the most failures(4): - Task ID: task_201303050432_0094_m_00 URL: http://ubuntu:50030/taskdetails.jsp?jobid=job_201303050432_0094tipid=task_201303050432_0094_m_00 - Diagnostic Messages for this Task: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke
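Note that local-mode auto only kicks in for small jobs; the thresholds can be tuned if needed (the values shown are the usual defaults, worth verifying for your release):

set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.tasks.max=4;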
Re: HIVE issues when using large number of partitions
Check this for your first question: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions Please post if you find any solution for your 2nd and 3rd questions. Regards, Ramki. On Thu, Mar 7, 2013 at 8:01 PM, Suresh Krishnappa suresh.krishna...@gmail.com wrote: Hi All, I have a hadoop cluster with data present in large number of directories ( 10,000) To run HIVE queries over this data I created an external partitioned table and pointed each directory as a partition to the external table using 'alter table add partition' command. Is there a better way to create a HIVE external table over large number of directories? Also I am facing the following issues due to the large number of partitions 1) The DDL operations of creating the table and adding partitions to the table takes a very long time. Takes about an hour to add around 10,000 partitions 2) Getting 'out of memory' java exception while adding partitions 5 3) Sometimes getting 'out of memory' java exception for select queries for partitions 1 What is the recommended limit to the number of partitions that we can create with an HIVE table? Are there any configuration settings in hive/hadoop to support large number of partitions? I am using HIVE 0.10.0. I re-ran the tests by replacing derby with postgresql as metastore and still faced similar issues. Would appreciate any inputs on this Thanks Suresh
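For the first question, the recover-partitions mechanism in the link above boils down to one statement, provided the directories follow the partition key=value naming convention (the table name is a placeholder):

hive> MSCK REPAIR TABLE my_partitioned_table;
-- on Amazon EMR the equivalent is:
hive> ALTER TABLE my_partitioned_table RECOVER PARTITIONS;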
Re: difference between add jar in hive session and hive --auxpath
I am not sure whether it works but try this: Put your ADD JAR commands into a file and invoke hive with -i file option. or insert your ADD JAR commands in your $HOME/.hiverc file and start hive. -Ramki. On Fri, Mar 8, 2013 at 11:55 PM, Edward Capriolo edlinuxg...@gmail.comwrote: Essentially anything that is part of the InputFormat needs to be in auxlib/auxpath. Anything part of a UDF can be added with 'add jar'. On Fri, Mar 8, 2013 at 1:01 PM, Dean Wampler dean.wamp...@thinkbiganalytics.com wrote: --auxpath adds more jars to Hive's classpath before invoking Hive. ADD JARS copies jars around the cluster and adds them to the task classpath, so the jars you add aren't visible to hive itself. Annoying, but... On Fri, Mar 8, 2013 at 11:53 AM, java8964 java8964 java8...@hotmail.comwrote: This is in HIVE-0.9.0 hive list jars; /nfs_home/common/userlibs/google-collections-1.0.jar /nfs_home/common/userlibs/elephant-bird-hive-3.0.7.jar /nfs_home/common/userlibs/protobuf-java-2.3.0.jar /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar file:/usr/lib/hive/lib/hive-builtins-0.9.0-cdh4.1.2.jar hive desc table; java.lang.NoClassDefFoundError: com/twitter/elephantbird/mapreduce/io/ProtobufConverter at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.initialize(ProtobufDeserializer.java:45) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:203) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:260) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:253) at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:490) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:162) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:930) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:844) at org.apache.hadoop.hive.ql.exec.DDLTask.describeTable(DDLTask.java:2545) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:309) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1331) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1117) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:950) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:744) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:607) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:208) Caused by: java.lang.ClassNotFoundException: com.twitter.elephantbird.mapreduce.io.ProtobufConverter at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 
25 more FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.DDLTask hive> exit; [y130zhan@daca2 userlibs]$ jar tvf /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar | grep ProtobufConverter 4825 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter.class 732 Mon Mar 04 16:50:46 UTC 2013 com/twitter/elephantbird/mapreduce/io/ProtobufConverter$1.class -- From: vkavul...@outlook.com To: user@hive.apache.org Subject: RE: difference between add jar in hive session and hive --auxpath Date: Thu, 7 Mar 2013 16:44:41 -0800 If properly done, 'add jar <jar-file>' should work the same as passing the jar with --auxpath. Can you run the 'list jars;' command from the CLI or Hue and check whether you see the jar file? -- From: java8...@hotmail.com To: user@hive.apache.org Subject: difference between add jar in hive session and hive --auxpath Date: Thu, 7 Mar 2013 17:47:26 -0500 Hi, I have a Hive table which uses the jar files provided by elephant-bird, a framework that integrates LZO and Google protobuf data with Hadoop/Hive. If I use the hive command like this: hive --auxpath path_to_jars, it works fine
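A sketch of the two options Ramki suggests, reusing the jar paths already listed in the thread; the file name init.hql is illustrative:

    -- init.hql (or $HOME/.hiverc): executed at the start of each CLI session
    ADD JAR /nfs_home/common/userlibs/google-collections-1.0.jar;
    ADD JAR /nfs_home/common/userlibs/elephant-bird-core-3.0.7.jar;
    ADD JAR /nfs_home/common/userlibs/elephant-bird-hive-3.0.7.jar;
    ADD JAR /nfs_home/common/userlibs/protobuf-java-2.3.0.jar;

Invoked as hive -i init.hql, this saves retyping the ADD JAR commands, but per Dean's and Edward's point it mainly helps the task classpath; a SerDe or InputFormat that the CLI itself must load (as in the DESCRIBE failure above) still needs hive --auxpath /nfs_home/common/userlibs/... or a copy of the jars in Hive's auxlib directory.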
Re: Rename external table, including HDFS directory
Since you can run hadoop dfs commands from inside Hive, you can put both commands in a script file and run the script with hive. They are still two separate commands, but they sit together in one place and can be executed in the same environment in one go. -Ramki. On Fri, Mar 8, 2013 at 5:32 AM, Dean Wampler dean.wamp...@thinkbiganalytics.com wrote: You can rename the LOCATION with an ALTER TABLE command. I don't believe there's a single, atomic statement to rename both the table and the directory. On Thu, Mar 7, 2013 at 5:29 PM, Keith Wiley kwi...@keithwiley.com wrote: My understanding is that renaming an external table doesn't rename the corresponding HDFS directory to match. I would like to do this; I want to rename the table in Hive and also rename the HDFS directory so they match (they already match with the old name). I'm not sure how to do this. I'm worried that if I rename the HDFS directory through Hadoop, not Hive, Hive will lose track of the directory, regardless of whether I rename the table in Hive as well. I also don't want to drop and recreate the table, because I don't want to notify Hive of all the partitions all over again. Thoughts? Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda -- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330
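A hedged sketch of Ramki's script approach, with table names, paths, and the namenode URI purely illustrative; run it in one go with hive -f rename.hql:

    -- rename.hql
    ALTER TABLE old_name RENAME TO new_name;
    dfs -mv /user/hive/external/old_name /user/hive/external/new_name;
    -- adjust the namenode URI to your cluster
    ALTER TABLE new_name SET LOCATION 'hdfs://namenode:8020/user/hive/external/new_name';

Note that SET LOCATION changes only the table-level location in the metastore; partitions registered with explicit locations keep pointing at the old paths and would each need ALTER TABLE ... PARTITION (...) SET LOCATION, which is exactly the re-registration work Keith is hoping to avoid.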
Re: Error while table creation
Just wondering if your CREATE TABLE syntax includes IF NOT EXISTS, such as: CREATE EXTERNAL TABLE IF NOT EXISTS my_table ( ... ... ... ) On Sun, Mar 10, 2013 at 2:34 AM, Viral Bajaria viral.baja...@gmail.com wrote: Is this a new installation of Hive or did you upgrade? How many tables do you already have? On Sat, Mar 9, 2013 at 12:00 PM, Abhishek Gayakwad a.gayak...@gmail.com wrote: While trying to create an external table in an Oozie Hive action, I am getting the following error; the Hive version is 0.9.0. Caused by: javax.jdo.JDODataStoreException: Insert of object org.apache.hadoop.hive.metastore.model.MColumnDescriptor@46c1dc0b using statement INSERT INTO `CDS` (`CD_ID`) VALUES (?) failed : Duplicate entry 'X' for key 'PRIMARY' NestedThrowables: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'X' for key 'PRIMARY' at org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:313) at org.datanucleus.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:660) at org.datanucleus.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:680) at org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:617) ... 43 more Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'X' for key 'PRIMARY' Thanks, Abhishek
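For reference, the shape of DDL Ramki is asking about; the column names, row format, and location are illustrative, not taken from Abhishek's job:

    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
      id    STRING,
      value STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/abhishek/my_table';

With IF NOT EXISTS, a re-run of the same statement (for example an Oozie retry) becomes a no-op instead of attempting to create the table in the metastore a second time.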
Re: Hive sample test
If none of the 100 rows that the sub-query returns satisfy the WHERE clause, there would be no rows in the overall result. Do we still consider the Hive query verified in that case? Regards, Ramki. On Wed, Mar 6, 2013 at 1:14 AM, Dean Wampler dean.wamp...@thinkbiganalytics.com wrote: Nice, yeah that would do it. On Tue, Mar 5, 2013 at 1:26 PM, Mark Grover grover.markgro...@gmail.com wrote: I typically change my query to query from a limited version of the whole table. Change select really_expensive_select_clause from really_big_table where something=something group by something=something to select really_expensive_select_clause from ( select * from really_big_table limit 100 )t where something=something group by something=something On Tue, Mar 5, 2013 at 10:57 AM, Dean Wampler dean.wamp...@thinkbiganalytics.com wrote: Unfortunately, it will still go through the whole thing, then just limit the output. However, there's a flag that I think only works in more recent Hive releases: set hive.limit.optimize.enable=true This is supposed to apply limiting earlier in the data stream, so it will give different results than limiting just the output. Like Chuck said, you might consider sampling, but unless your table is organized into buckets, you'll at least scan the whole table, though perhaps not do all the computation over it. Also, if you have a small sample data set: set hive.exec.mode.local.auto=true will cause Hive to bypass the Job and Task Trackers, calling APIs directly, when it can do the whole thing in a single process. Not lightning fast, but faster. dean On Tue, Mar 5, 2013 at 12:48 PM, Joey D'Antoni jdant...@yahoo.com wrote: Just add a limit 1 to the end of your query. On Mar 5, 2013, at 1:45 PM, Kyle B kbi...@gmail.com wrote: Hello, I was wondering if there is a way to quick-verify a Hive query before it is run against a big dataset? The tables I am querying against have millions of records, and I'd like to verify my Hive query before I run it against all records. Is there a way to test the query against a small subset of the data, without going into full MapReduce? As silly as this sounds, is there a way to MapReduce without the overhead of MapReduce? That way I can check my query is doing what I want before I run it against all records. Thanks, -Kyle -- Dean Wampler, Ph.D. thinkbiganalytics.com +1-312-339-1330 -- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330
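Putting Mark's subquery trick and Dean's flags together, a sketch using placeholder table and column names (col_a, col_b, and the literal are not from the thread):

    SET hive.exec.mode.local.auto=true;    -- run small jobs in-process when possible
    SET hive.limit.optimize.enable=true;   -- newer releases only: apply the LIMIT earlier

    -- test the expensive query against a 100-row slice first
    SELECT col_a, COUNT(*)
    FROM (
      SELECT * FROM really_big_table LIMIT 100
    ) t
    WHERE col_b = 'some_value'
    GROUP BY col_a;

The LIMIT inside the subquery bounds how many rows the expensive outer logic ever sees, so the test run finishes quickly; as Ramki points out, though, an empty result only proves the query runs, not that it does what you intended.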
Re: Find current db we r using in Hive
Sai, I do not think there is a command to show the current db in Hive. One alternative is to set a property so that the current database is shown as part of the prompt: set hive.cli.print.current.db=true; This shows your current db as part of your hive prompt. Regards, Ramki. On Fri, Mar 8, 2013 at 11:13 AM, Sai Sai saigr...@yahoo.in wrote: Just wondering if there is any command in Hive which will show us the current db we are using, similar to pwd in Unix. Thanks Sai
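A minimal illustration of Ramki's suggestion; my_db is a placeholder database name:

    SET hive.cli.print.current.db=true;   -- can also live in $HOME/.hiverc so it applies to every session
    USE my_db;
    -- the CLI prompt now shows the database, e.g.  hive (my_db)>

Setting the property back to false returns the prompt to plain hive>.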