Re: External Table to Sequence File on HDFS
Check this out: http://stackoverflow.com/questions/13203770/reading-hadoop-sequencefiles-with-hive

From: Ranjitha Chandrashekar <ranjitha...@hcl.com>
Reply-To: user@hive.apache.org
Date: Wednesday, April 3, 2013 10:43 PM
To: user@hive.apache.org
Subject: External Table to Sequence File on HDFS

Hi,

I want to create an external Hive table over a sequence file (each record a key/value pair) on HDFS. How will the field names be mapped to the column names? Please suggest.

Thanks,
Ranjitha

::DISCLAIMER:: The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free, as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e-mail and its contents (with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this message without the prior written consent of an authorized representative of HCL is strictly prohibited. If you have received this email in error, please delete it and notify the sender immediately. Before opening any email and/or attachments, please check them for viruses and other defects.

CONFIDENTIALITY NOTICE: This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
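To summarize the linked answer with a hedged sketch: Hive's SequenceFile support reads only the value of each key/value record and ignores the key, so column names come from the table definition and its row format, not from field names inside the file. The table name, delimiter, and path below are illustrative, not from the thread:

```sql
-- Hive ignores the SequenceFile key; each record's value is parsed by the
-- table's SerDe, and columns are mapped positionally from the row format.
CREATE EXTERNAL TABLE seq_logs (logline string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/data/seq_logs';
```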
Re: Lookup objects in distributed cache
Hello Vivek,

GenericUDTF has a method initialize() that is called only once per task. So if you read your files in this method and store the structures in memory, the overhead is relatively small (reading 15 MB per mapper is negligible compared to several GB of processed data).

Best regards,
Jan

On Wed, Apr 3, 2013 at 10:35 PM, vivek thakre <vivek.tha...@gmail.com> wrote:

Hello,

I want to implement some functionality as a UDTF. It involves reading 7 different text files and building lookup structures (Map, Set, List, Map of String to List, etc.) to be used in the logic. The files are small, about 15 MB each on average. I can add these files to the distributed cache and access them from the UDTF (read the files and build the necessary lookup data structures), but this would mean the files are opened, read, and closed every time the UDTF is invoked. Is there a way I can read the files just once, build the data structures, put them in the distributed cache, and access them from the UDTF? I don't think creating Hive tables from these files and doing a map-side join is possible, as the functionality I want to implement is fairly complex and I am not sure it can be done with just a Hive query and joins, without a UDTF. Thanks in advance.
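To make the mechanics concrete, here is a minimal sketch of the pattern Jan describes; the class name com.example.MyLookupUDTF, the file paths, and the column names are hypothetical, not from the thread. Files registered with ADD FILE are shipped via the distributed cache once per job, and a GenericUDTF's initialize() runs once per task, so parsing the files there costs one read per mapper rather than one per row:

```sql
-- Ship the lookup files to every task via the distributed cache.
-- Inside the UDTF's initialize(), open them by their base names
-- (they appear in the task's working directory) and build the maps once.
ADD FILE /local/path/lookup1.txt;
ADD FILE /local/path/lookup2.txt;

-- Register the (hypothetical) UDTF implementation class.
CREATE TEMPORARY FUNCTION my_lookup AS 'com.example.MyLookupUDTF';

-- Each mapper pays the ~15 MB parse cost once in initialize(), then
-- reuses the in-memory structures for every row it processes.
SELECT my_lookup(col_a, col_b) AS (out1, out2)
FROM source_table;
```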
Huge join performance issue
Hi all,

I have two tables I need to join and then summarize. They are both huge (about 1B rows each in the relevant partitions) and the query runs for over 2 hours, creating 5 TB of intermediate data. The current query looks like this:

select t1.b, t1.c, t2.d, t2.e, count(*)
from (select a, b, c from baseTB1 where ...) t1   -- filter by partition as well
join (select a, d, e from baseTB2 where ...) t2   -- filter by partition as well
on t1.a = t2.a
group by t1.b, t1.c, t2.d, t2.e

Two questions:

1. Would joining baseTB1 and baseTB2 directly (instead of via subqueries) be better in any way? (I know subqueries cause a lot of writes of intermediate data, but we also understand it's best to filter down the data being joined; which is more correct?)
2. Can I use 'distribute by' and/or 'sort by' in some way that would help? My understanding at the moment is that the problem lies in the fact that the reducers are keyed on column a while the group by is on column b ...

Any thoughts would be appreciated.
Re: Huge join performance issue
You don't really need subqueries to join tables on their common columns; they are additional overhead. The best way to filter your data and speed up processing is how you lay out your data: when you have larger tables, use partitioning and bucketing to trim down the data and improve join performance.

DISTRIBUTE BY is mainly used when you have custom map-reduce scripts and want to use Hive's TRANSFORM functionality. I have not used it a lot, so I'm not sure on that part.

Also, it's helpful to write WHERE clauses in the join statements to reduce the dataset you want to join.

On Thu, Apr 4, 2013 at 5:53 PM, Gabi D <gabi...@gmail.com> wrote:

-- Nitin Pawar
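A sketch of the direct-join form suggested above, using the table and column names from the original question; the partition column name dt and its values are assumptions for illustration, since the actual partition filters were elided in the thread:

```sql
-- Join the base tables directly: the partition predicates in the WHERE
-- clause are pushed down so only the relevant partitions are scanned,
-- without materializing subquery results first.
SELECT t1.b, t1.c, t2.d, t2.e, COUNT(*)
FROM baseTB1 t1
JOIN baseTB2 t2
  ON t1.a = t2.a
WHERE t1.dt = '2013-04-01'   -- hypothetical partition filter
  AND t2.dt = '2013-04-01'   -- hypothetical partition filter
GROUP BY t1.b, t1.c, t2.d, t2.e;
```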
builtins submodule - is it still needed?
Hey Hive gurus,

Is the builtins Hive submodule in use? The submodule was added in HIVE-2523 as a location for builtin UDFs, but it appears not to have taken off. Any objections to removing it?

DETAILS

For HIVE-4278 I'm making some build changes for the HCatalog integration. The builtins submodule causes issues because it delays building until the packaging phase, so HCatalog can't depend on builtins, which it does transitively. While investigating a path forward I discovered the builtins submodule contains very little code and likely could either go away entirely or merge into ql, simplifying things both for users and developers.

Thoughts? Can anyone with context help me understand builtins, both in general and around its non-standard build? For your trouble I'll either make the submodule go away/merge into another submodule, or update the docs with what we learn.

Thanks!
Travis
Partition performance
Hi,

I created 3 years of hourly log files (26,280 files in total) and use an external table with partitions to query them. I tried two partitioning methods.

1) Log files are stored as /test1/2013/04/02/16/00_0 (a directory per hour). Date and hour are the partition keys; adding 3 years of directories to the table gives 26,280 partitions.

CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, hr int);
ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION '/test1/2013/04/02/16';

2) Log files are stored as /test2/2013/04/02/16_00_0 (a directory per day, 24 files in each directory). Date is the only partition key; 3 years of directories gives 1,095 partitions.

CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION '/test2/2013/04/02';

When doing a simple query like

SELECT * FROM test1 WHERE dt >= '2013-02-01' AND dt <= '2013-02-14'

(and the same against test2), approach #1 takes 320 seconds but #2 only takes 70 seconds. I'm wondering why there is such a big performance difference between the two. Both approaches have the same number of files; only the directory structure differs, so Hive is going to load the same amount of data. Why does the number of partitions have such a big impact? Does that mean #2 is the better partitioning strategy?

Thanks.
Re: Partition performance
The slowdown is most likely due to the large number of partitions. I believe the Hive book authors tell us to be cautious with large numbers of partitions :-) and I abide by that. Users, please add your points of view and experiences.

Thanks,
sanjay

From: Ian <liu...@yahoo.com>
Date: Thursday, April 4, 2013 4:01 PM
Subject: Partition performance
Re: Partition performance
Is it possible for you to send the explain plans of these two queries?

Regards,
Ramki

On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> wrote:
Re: Partition performance
See slide #9 from my Optimizing Hive Queries talk: http://www.slideshare.net/oom65/optimize-hivequeriespptx. Certainly we will improve it, but for now you are much better off with 1,000 partitions than 10,000.

-- Owen

On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle <ramki.pa...@gmail.com> wrote:
Re: Partition performance
Also, how big are the files in each directory? Are they roughly the size of one HDFS block, or a multiple of it? Lots of small files mean lots of mapper tasks with little to do.

You can also compare the job tracker console output for each job. I bet the slow one has a lot of very short map and reduce tasks, while the faster one has fewer tasks that run longer. A rule of thumb is that any one task should take 20 seconds or more, to amortize the few seconds spent in startup per task.

In other words, if you think about what's happening at the HDFS and MR level, you can learn to predict how fast or slow things will run. Learning to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this.

dean

On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley <omal...@apache.org> wrote:
-- Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330
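As a starting point, the plans Ramki and Dean mention can be produced directly in the CLI. This uses the test2 table from the question, with the date predicate written as a range on the assumption that the comparison operators were stripped by the mail archive:

```sql
-- EXPLAIN EXTENDED shows, among other things, which partition paths the
-- query will read, so the two layouts can be compared side by side.
EXPLAIN EXTENDED
SELECT * FROM test2
WHERE dt >= '2013-02-01' AND dt <= '2013-02-14';
```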
Correct syntax for EXPLAIN DEPENDENCY
Hi,

What's the correct syntax for EXPLAIN DEPENDENCY?

Query:

/usr/lib/hive/bin/hive -e "explain dependency select * from channel_market_lang where channelid > 29000"

org.apache.hadoop.hive.ql.parse.ParseException: line 1:8 cannot recognize input near 'plan' 'dependency' 'select' in statement
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:440)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:338)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:637)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

I was referring to this doc; is there another doc?
https://cwiki.apache.org/Hive/languagemanual-explain.html#LanguageManualExplain-EXPLAINSyntax

Thanks,
sanjay
Re: Correct syntax for EXPLAIN DEPENDENCY
Ah, it's available only in 0.10.0 :-( and I am still using 0.9.x from the CDH 4.1.2 distribution.

From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
Date: Thursday, April 4, 2013 6:40 PM
Subject: Correct syntax for EXPLAIN DEPENDENCY
Re: Correct syntax for EXPLAIN DEPENDENCY
Hi Sanjay,

You can upgrade to CDH 4.2.0, which contains Hive 0.10.

Jarcec

On Fri, Apr 05, 2013 at 01:48:39AM +, Sanjay Subramanian wrote:
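For reference, once on Hive 0.10 or later the statement from the wiki page works with no extra keyword between EXPLAIN and DEPENDENCY; the table name and predicate below are taken from Sanjay's query:

```sql
-- EXPLAIN DEPENDENCY (Hive 0.10+) emits, as JSON, the tables and
-- partitions the query would read, without executing it.
EXPLAIN DEPENDENCY
SELECT * FROM channel_market_lang WHERE channelid > 29000;
```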
Re: Lookup objects in distributed cache
Thanks, Jan, for your reply. This is helpful.

Vivek

On Thu, Apr 4, 2013 at 12:11 AM, Jan Dolinár <dolik@gmail.com> wrote: