Re: External Table to Sequence File on HDFS

2013-04-04 Thread Sanjay Subramanian
Check this out
http://stackoverflow.com/questions/13203770/reading-hadoop-sequencefiles-with-hive
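To make the mapping concrete, here is a minimal sketch (table, column, and path names are illustrative, not from the question): Hive ignores the SequenceFile *key* and parses only the *value* as the row, so the column names come entirely from the DDL and the value is split into fields by the declared delimiters.

```sql
-- Illustrative names. With STORED AS SEQUENCEFILE, Hive reads each
-- record's value as one row and ignores the key; the value is split
-- into columns by the ROW FORMAT delimiters, in DDL order.
CREATE EXTERNAL TABLE seq_logs (
  user_id STRING,
  event   STRING,
  ts      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/data/seq_logs';
```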

From: Ranjitha Chandrashekar ranjitha...@hcl.com
Reply-To: user@hive.apache.org
Date: Wednesday, April 3, 2013 10:43 PM
To: user@hive.apache.org
Subject: External Table to Sequence File on HDFS


Hi



I want to create an external Hive table over a SequenceFile (each record is a 
key-value pair) on HDFS. How will the field names be mapped to the column names?



Please suggest.



Thanks

Ranjitha.



::DISCLAIMER::

The contents of this e-mail and any attachment(s) are confidential and intended 
for the named recipient(s) only.
E-mail transmission is not guaranteed to be secure or error-free as information 
could be intercepted, corrupted,
lost, destroyed, arrive late or incomplete, or may contain viruses in 
transmission. The e mail and its contents
(with or without referred errors) shall therefore not attach any liability on 
the originator or HCL or its affiliates.
Views or opinions, if any, presented in this email are solely those of the 
author and may not necessarily reflect the
views or opinions of HCL or its affiliates. Any form of reproduction, 
dissemination, copying, disclosure, modification,
distribution and / or publication of this message without the prior written 
consent of authorized representative of
HCL is strictly prohibited. If you have received this email in error please 
delete it and notify the sender immediately.
Before opening any email and/or attachments, please check them for viruses and 
other defects.


CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.


Re: Lookup objects in distributed cache

2013-04-04 Thread Jan Dolinár
Hello Vivek,

GenericUDTF has a method initialize() that is called only once per task. So
if you read your files in this method and store the structures in memory,
the overhead is relatively small (reading 15 MB per mapper is negligible
compared to the several GB of data being processed).

Best regards,
Jan
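A sketch of the pattern Jan describes (file and function names are hypothetical): ship the lookup files to every task via the distributed cache with ADD FILE, then open them once from the task's working directory inside initialize().

```sql
-- Hypothetical file names. ADD FILE ships each file to every task via
-- the distributed cache, where it appears in the task's working
-- directory (e.g. "./countries.txt").
ADD FILE /local/lookups/countries.txt;
ADD FILE /local/lookups/products.txt;

-- Inside my_udtf's initialize(), read "./countries.txt" etc. once per
-- task and build the in-memory lookup maps there.
SELECT my_udtf(a, b) AS (x, y) FROM some_table;
```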


On Wed, Apr 3, 2013 at 10:35 PM, vivek thakre vivek.tha...@gmail.com wrote:

 Hello,

 I want to write a UDTF. The functionality involves
 reading 7 different text files and creating lookup structures (Map,
 Set, List, Map of String to List, etc.) to be used in the logic.

 These files are small, averaging about 15 MB each.

 I can add these files in distributed cache and access them in UDTF, read
 the files, and create the necessary lookup data structures, but this would
 mean that the files will be opened, read and closed every time the UDTF is
 invoked.

 Is there a way that I can just read the files once, create the data
 structures needed , put them in distributed cache and access them from UDTF?

 I don't think creating hive tables from these files and doing a map side
 join is possible, as the functionality that I want to implement is fairly
 complex and I am not sure if it can be done just using hive query and join
 without using UDTF.

 Thanks in advance.



Huge join performance issue

2013-04-04 Thread Gabi D
Hi all,
I have two tables I need to join and then summarize.
They are both huge (about 1B rows each, in the relevant partitions) and the
query runs for over 2 hours creating 5T intermediate data.

The current query looks like this:

select t1.b,t1.c,t2.d,t2.e, count(*)
from (select a,b,c from baseTB1 where ... ) t1  -- filter by
partition as well
  join
(select a,d,e from baseTB2 where ...) t2-- filter by partition
as well
on t1.a=t2.a
group by t1.b,t1.c,t2.d,t2.e


two questions:
1. would joining baseTB1 and baseTB2 directly (instead of subqueries) be
better in any way?
  (I know subqueries cause a lot of writes of the intermediate data
but we also understand it's best to filter down the data that is being
joined, which is more correct?)
2. can I use 'distribute by' and/or 'sort by' in some way that would help
this? My understanding at the moment is that the problem lies in the fact
that the reducers key on column a while the group by is on column b ...

Any thoughts would be appreciated.


Re: Huge join performance issue

2013-04-04 Thread Nitin Pawar
You don't really need subqueries to join tables on their common
columns; they add overhead.
The best way to filter your data and speed up processing is how you
lay out your data: when you have larger tables, use partitioning and
bucketing to trim down the data and improve join performance.

DISTRIBUTE BY is mainly useful when you have custom map-reduce scripts
and want to use Hive's TRANSFORM functionality; I have not used it a
lot, so I am not sure about that part. It also helps to put WHERE
clauses in the join subqueries to reduce the dataset being joined.
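A sketch of the layout idea (table names are adapted from the thread; bucket count and settings are illustrative): bucketing and sorting both tables on the join key `a` lets Hive use a bucket map join / sort-merge-bucket join instead of a full shuffle of both 1B-row inputs.

```sql
-- Illustrative DDL: both tables bucketed and sorted on the join key.
CREATE TABLE baseTB1_b (a STRING, b STRING, c STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

CREATE TABLE baseTB2_b (a STRING, d STRING, e STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (a) SORTED BY (a) INTO 256 BUCKETS;

-- Settings that enable the bucketed sort-merge join:
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
```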







-- 
Nitin Pawar


builtins submodule - is it still needed?

2013-04-04 Thread Travis Crawford
Hey hive gurus -

Is the builtins hive submodule in use? The submodule was added in
HIVE-2523 as a location for builtin-UDFs, but it appears to not have
taken off. Any objections to removing it?

DETAILS

For HIVE-4278 I'm making some build changes for the HCatalog
integration. The builtins submodule causes issues because it delays
building until the packaging phase - so HCatalog can't depend on
builtins, which it does transitively.

While investigating a path forward I discovered the builtins
submodule contains very little code, and likely could either go away
entirely or merge into ql, simplifying things both for users and
developers.

Thoughts? Can anyone with context help me understand builtins, both
in general and around its non-standard build? For your trouble I'll
either make the submodule go away/merge into another submodule, or
update the docs with what we learn.

Thanks!
Travis


Partition performance

2013-04-04 Thread Ian
Hi,
 
I created 3 years of hourly log files (26,280 files in total) and use an external 
table with partitions to query them. I tried two partitioning methods.
 
1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per 
hour). Use date and hour as partition keys. Add 3 years of directories to the 
table partitions. So there are 26280 partitions.
CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, 
hr int);
ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION 
'/test1/2013/04/02/16';
 
2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per day, 
24 files in each directory). Use date as partition key. Add 3 years of 
directories to the table partitions. So there are 1095 partitions.
CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION 
'/test2/2013/04/02';
 
When doing a simple query like 
SELECT * FROM test1/test2 WHERE dt >= '2013-02-01' AND dt <= '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. 
 
I'm wondering why there is a big performance difference between these two? 
These two approaches have the same number of files, only the directory 
structure is different. So Hive is going to load the same amount of files. Why 
does the number of partitions have such a big impact? Does that mean #2 is a 
better partition strategy?
 
Thanks.

Re: Partition performance

2013-04-04 Thread Sanjay Subramanian
The slowdown is most likely due to the large number of partitions.
I believe the Hive book authors tell us to be cautious with large numbers of 
partitions :-)  and I abide by that.

Users
Please add your points of view and experiences

Thanks
sanjay






Re: Partition performance

2013-04-04 Thread Ramki Palle
Is it possible for you to send the explain plan of these two queries?

Regards,
Ramki.





Re: Partition performance

2013-04-04 Thread Owen O'Malley
See slide #9 from my Optimizing Hive Queries talk
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
will improve it, but for now you are much better off with 1,000 partitions
than 10,000.

-- Owen







Re: Partition performance

2013-04-04 Thread Dean Wampler
Also, how big are the files in each directory? Are they roughly the size of
one HDFS block, or a multiple of it? Lots of small files mean lots of mapper
tasks with little to do.

You can also compare the job tracker console output for each job. I bet the
slow one has a lot of very short map and reduce tasks, while the faster one
has fewer tasks that run longer. A rule of thumb is that any one task
should take 20 seconds or more, to amortize the few seconds spent in
start-up per task.

In other words, if you think about what's happening at the HDFS and MR
level, you can learn to predict how fast or slow things will run. Learning
to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this.
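As a sketch (table names taken from earlier messages in this thread), the two plans can be compared side by side with:

```sql
-- EXTENDED output also lists the input paths/partitions each plan
-- will actually scan, which shows how much pruning is happening.
EXPLAIN EXTENDED
SELECT * FROM test1 WHERE dt >= '2013-02-01' AND dt <= '2013-02-14';

EXPLAIN EXTENDED
SELECT * FROM test2 WHERE dt >= '2013-02-01' AND dt <= '2013-02-14';
```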

dean







-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Correct syntax for EXPLAIN DEPENDENCY

2013-04-04 Thread Sanjay Subramanian
Hi
What's the correct syntax for EXPLAIN DEPENDENCY?

Query
==
/usr/lib/hive/bin/hive -e "explain dependency select * from channel_market_lang 
where channelid > 29000"

org.apache.hadoop.hive.ql.parse.ParseException: line 1:8 cannot recognize input 
near 'plan' 'dependency' 'select' in statement

at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:440)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:338)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:637)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)


I was referring to this doc. Is there another doc?
https://cwiki.apache.org/Hive/languagemanual-explain.html#LanguageManualExplain-EXPLAINSyntax

Thanks
sanjay



Re: Correct syntax for EXPLAIN DEPENDENCY

2013-04-04 Thread Sanjay Subramanian
Ah, it's available only in 0.10.0 :-(
And I am still using 0.9.x from the CDH 4.1.2 distribution.
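For reference, on Hive 0.10+ a working invocation looks like the sketch below (table name from the message; the quotes around the -e string are required, and the comparison operator is an assumption, since the archive stripped it from the original query):

```sql
-- From the shell:
--   hive -e "EXPLAIN DEPENDENCY SELECT * FROM channel_market_lang WHERE channelid > 29000"
-- or inside the Hive CLI:
EXPLAIN DEPENDENCY
SELECT * FROM channel_market_lang WHERE channelid > 29000;
```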


From: Sanjay Subramanian 
sanjay.subraman...@wizecommerce.commailto:sanjay.subraman...@wizecommerce.com
Reply-To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org
Date: Thursday, April 4, 2013 6:40 PM
To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org
Subject: Correct syntax for EXPLAIN DEPENDENCY

Hi
Whats the correct syntax for EXPLAIN DEPENDENCY ?

Query
==
/usr/lib/hive/bin/hive -e explain dependency select * from channel_market_lang 
where channelid  29000

org.apache.hadoop.hive.ql.parse.ParseException: line 1:8 cannot recognize input 
near 'plan' 'dependency' 'select' in statement

at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:440)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:416)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:338)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:637)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)


I was referring to this doc..is there another doc ?
https://cwiki.apache.org/Hive/languagemanual-explain.html#LanguageManualExplain-EXPLAINSyntax

Thanks
sanjay

CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.

CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.


Re: Correct syntax for EXPLAIN DEPENDENCY

2013-04-04 Thread Jarek Jarcec Cecho
Hi Sanjay,
you can upgrade to CDH 4.2.0, which contains Hive 0.10.

Jarcec





Re: Lookup objects in distributed cache

2013-04-04 Thread vivek thakre
Thanks Jan for your reply. This is helpful

Vivek

