Re: [ANN] Hivemall: Hive scalable machine learning library

2013-10-03 Thread Dean Wampler
This is great news! I know that Twitter has done something similar with
UDFs for Pig, as described in this paper:
http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf

I'm glad to see the same thing start with Hive.

Dean


On Wed, Oct 2, 2013 at 10:21 AM, Makoto YUI  wrote:

> Hello all,
>
> My employer, AIST, has given the thumbs up to open source our machine
> learning library, named Hivemall.
>
> Hivemall is a scalable machine learning library running on Hive/Hadoop,
> licensed under the LGPL 2.1.
>
>   https://github.com/myui/hivemall
>
> Hivemall provides machine learning functionality as well as feature
> engineering functions through UDFs/UDAFs/UDTFs of Hive. It is designed
> to be scalable to the number of training instances as well as the number
> of training features.
>
> Hivemall is very easy to use as every machine learning step is done
> within HiveQL.
>
> -- Installation is just as follows:
> add jar /tmp/hivemall.jar;
> source /tmp/define-all.hive;
>
> -- Logistic regression is performed by a query.
> SELECT
>   feature,
>   avg(weight) as weight
> FROM
>  (SELECT logress(features,label) as (feature,weight) FROM
> training_features) t
> GROUP BY feature;
>
> You can find detailed examples on our wiki pages.
> https://github.com/myui/hivemall/wiki/_pages
>
> Though we consider that Hivemall is much easier to use and more scalable
> than Mahout for classification/regression tasks, please check it by
> yourself. If you have a Hive environment, you can evaluate Hivemall
> within 5 minutes or so.
>
> Hope you enjoy the release! Feedback (and pull request) is always welcome.
>
> Thank you,
> Makoto
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: [ANNOUNCE]: Apache Sentry 1.2.0 released

2013-09-26 Thread Dean Wampler
If you were hunting around for the sentry web site like me, it's here:

http://sentry.incubator.apache.org/


On Wed, Sep 25, 2013 at 5:03 PM, Shreepadma Venugopalan <
shreepa...@apache.org> wrote:

> Corrected version with the necessary disclaimers.
>
> -- Forwarded message --
> From: Shreepadma Venugopalan 
> Date: Wed, Sep 25, 2013 at 2:40 PM
> Subject: Fwd: [ANNOUNCE]: Apache Sentry 1.2.0 released
> To: annou...@apache.org
>
>
> The Apache Sentry team is happy to announce the release of version
> 1.2.0-incubating from the Apache Incubator.
>
> Apache Sentry is a system to enforce fine grained role based authorization
> to data and metadata stored on a Hadoop cluster.
>
> The release bits are available at:
> http://www.apache.org/dyn/closer.cgi/incubator/sentry
>
> The change list is available at: http://s.apache.org/VlU
>
> We would like to thank all contributors who made the release possible.
>
> Disclaimer
>
> Apache Sentry is an effort undergoing incubation at The Apache Software
> Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
> required of all newly accepted projects until a further review indicates
> that the infrastructure, communications, and decision making process have
> stabilized in a manner consistent with other successful ASF projects. While
> incubation status is not necessarily a reflection of the completeness or
> stability of the code, it does indicate that the project has yet to be
> fully endorsed by the ASF.
>
> Regards,
>
> Sentry team
>
>
>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Inner Map key and value separators

2013-09-13 Thread Dean Wampler
Unfortunately, I believe there's no way to do this. 

Sent from my rotary phone. 


On Sep 13, 2013, at 6:42 PM, Sanjay Subramanian 
 wrote:

> Hi guys
> 
> I have to load data into the following data type in hive 
> 
> map<string, map<string, string>> 
> 
> Is there a way to define custom SEPARATORS (while creating the table) for 
> - Inner map collection item
> - Inner map key 
> delimiters for 2nd-level maps are \004 and \005 per this 
> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/201009.mbox/%3c4f6b25afffcafe44b6259a412d5f9b101c07a...@exchmbx104.netflix.com%3E
> 
> Thanks
> 
> sanjay
> 


Re: question about partition table in hive

2013-09-13 Thread Dean Wampler
Flume might be able to invoke Hive to do this as the data is ingested, but
I don't know anything about Flume.

Brent has a nice blog post describing many of the details of partitioning.

http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/

We also cover them in our book. The key steps for taking the file(s) you
created and transforming them into partitioned data are the following:

1. Create an "external" table where the location is the directory you wrote
that big HDFS file (or files).
2. Create the final target table with the partitioning, as described in
Brent's blog post.
3. Run a query against the first table to populate the second. Again, Brent
covers the details.

See the Hive wiki for additional details on external tables, etc.
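
For instance, a rough sketch of those three steps (the table names, column,
and paths are made up for illustration; derive the partition key however your
log format allows):

CREATE EXTERNAL TABLE raw_logs (logline STRING)
LOCATION '/flume/collected';    -- 1. external table over the big file(s)

CREATE TABLE logs (logline STRING)
PARTITIONED BY (dt STRING);     -- 2. the partitioned target table

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE logs PARTITION (dt)
SELECT logline, substr(logline, 1, 10) AS dt
FROM raw_logs;                  -- 3. populate; the dynamic partition column comes last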

Dean



On Thu, Sep 12, 2013 at 7:55 PM, ch huang  wrote:

> hi,all:
> i use flume collect log data and put it in hdfs ,i want to use
> hive to do some caculate, query based on timerange,i want to use parttion
> table ,
> but the data file in hdfs is a big file ,how can i put it into pratition
> table in hive?
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Interesting claims that seem untrue

2013-09-12 Thread Dean Wampler
I'd be more impressed with who's deleted the most lines of code...


On Thu, Sep 12, 2013 at 11:52 AM, Sanjay Subramanian <
sanjay.subraman...@wizecommerce.com> wrote:

>  I have not read the full blogs but in the year 2013 , IMHO , LOC is a
> very old metric that doesn't define good software any more...
>
>   From: Edward Capriolo 
> Reply-To: "user@hive.apache.org" 
> Date: Thursday, September 12, 2013 7:19 AM
> To: "hive-u...@hadoop.apache.org" , "<
> hive-...@hadoop.apache.org>" 
> Subject: Interesting claims that seem untrue
>
>   I was reading the horton-works blog and found an interesting article.
>
> http://hortonworks.com/blog/stinger-phase-2-the-journey-to-100x-faster-hive/#comment-160753
>
>  There is a very interesting graphic which attempts to demonstrate lines
> of code in the 12 release.
> http://hortonworks.com/wp-content/uploads/2013/09/hive4.png
>
>  Although I do not know how they are calculated, they are probably
> counting code generated by tests output, but besides that they are wrong.
>
>  One claim is that Cloudera contributed 4,244 lines of code.
>
>  So to debunk that claim:
>
>  In https://issues.apache.org/jira/browse/HIVE-4675 Brock Noland from
> cloudera, created the ptest2 testing framework. He did all the work for
> ptest2 in hive 12, and it is clearly more than 4,244
>
>  This consists of 84 java files
> [edward@desksandra ptest2]$ find . -name "*.java" | wc -l
> 84
>  and by itself is 8001 lines of code.
> [edward@desksandra ptest2]$ find . -name "*.java" | xargs cat | wc -l
> 8001
>
>  [edward@desksandra hive-trunk]$ wc -l HIVE-4675.patch
> 7902 HIVE-4675.patch
>
>  This is not the only feature from cloudera in hive 12.
>
>  There is also a section of the article that talks of a "ROAD MAP" for
> hive features. I did not know we (hive) had a road map. I have advocated
> switching to feature based release and having a road map before, but it was
> suggested that might limit people from itch-scratching.
>
>
>
>
>
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: DISCUSS: Hive language manual to be source control managed

2013-09-02 Thread Dean Wampler
I suggest moving it to GitHub Pages where you use Markdown.

http://pages.github.com/


On Sun, Sep 1, 2013 at 4:39 PM, Edward Capriolo wrote:

> Over the past few weeks I have taken several looks over documents in our
> wiki.
> The page that strikes me as alarmingly poor is the:
> https://cwiki.apache.org/Hive/languagemanual.html
>
> This page has several critical broken links such as
> https://cwiki.apache.org/Hive/languagemanual-groupby.html
> https://cwiki.apache.org/Hive/languagemanual-transform.html
>
> The language manual used to be in decent shape. At times it had omissions
> or was not clear about what version something appeared in, but it was very
> usable.
>
> A long time ago I had begun and completed moving the wiki documentation
> inside the project as xdoc. After completion, several had a problem with
> the xdocs approach. The main complaint was the xdoc approach was too
> cumbersome. (However we have basically had a 'turn over' and since that
> time I am one of the few active committers)
>
> The language manual is in very poor shape at the moment with broken links,
> incorrect content, incomplete content, and poor coverage of the actual
> languages. IMHO the attempts to crowd-source this documentation have failed.
> Having a good concise language manual is critical to the success and
> adoption of hive.
>
> I do not believe all of our documentation needs to be in xdoc (as in every
> udf, or every input format) but I believe the language manual surely does.
>
> Please review the current wiki and discuss the concept of moving the
> language manual to source control, or suggest other options.
>
> Thank you,
> Edward
>
>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: FAILED: Error in metadata: MetaException

2013-08-16 Thread Dean Wampler
I wonder if the message is misleading. Could there be a problem with the
metastore?

1. MySQL isn't running or the JDBC connection is wrong. 
2. You upgraded to a newer Hive and didn't migrate the metadata
3. ...?

I'm speculating here. Perhaps the logs have useful info. 
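
For example (purely a sanity check, not a fix), from the Hive CLI you can
print the metastore connection settings the client is actually using:

SET javax.jdo.option.ConnectionURL;
SET javax.jdo.option.ConnectionDriverName;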

Dean

Sent from my rotary phone. 


On Aug 16, 2013, at 3:38 PM, Ankit Bhatnagar  wrote:

> 
> Query is pretty simple 
> 
> CREATE TABLE SRC(ID INT, NAME STRING);
> 
> Yes HADOOP_HOME is set
> 
> 
> From: Nitin Pawar 
> Date: Friday, August 16, 2013 12:49 PM
> To: Ankit Bhatnagar 
> Cc: "user@hive.apache.org" 
> Subject: Re: FAILED: Error in metadata: MetaException
> 
> Not sure what's wrong in your query but have you set variable "HADOOP_HOME" ? 
> 
> its deprecated in hadoop but needed in hive (as per my understanding) 
> 
> 
> On Sat, Aug 17, 2013 at 12:55 AM, Ankit Bhatnagar  
> wrote:
>> Creating new table
>> 
>> From: Nitin Pawar 
>> Reply-To: "user@hive.apache.org" 
>> Date: Friday, August 16, 2013 12:17 PM
>> To: "user@hive.apache.org" 
>> Subject: Re: FAILED: Error in metadata: MetaException
>> 
>> Can you tell us what  were you trying to do? 
>> create table? or executing a query on a table someone already created ? 
>> 
>> 
>> On Sat, Aug 17, 2013 at 12:32 AM, Ankit Bhatnagar  
>> wrote:
>>> Hi folks,
>>> 
>>> 
>>> FAILED: Error in metadata: 
>>> MetaException(message:file:/user/hive/warehouse/src is not a directory or 
>>> unable to create one)
>>> 
>>> 
>>> drwxrwxrwx   - hadoop supergroup  0 2013-08-16 18:31 
>>> /user/hive/warehouse/src
>>> 
>>> 
>>> Is it some permission issue?
>>> 
>>> Ankit
>> 
>> 
>> 
>> -- 
>> Nitin Pawar
> 
> 
> 
> -- 
> Nitin Pawar


Re: documentation issue for RLIKE/REGEXP

2013-08-11 Thread Dean Wampler
I just confirmed that for Hive v0.10 (and probably all versions)  "foobar"
rlike "foo" returns true, just to be clear. In other words, the Java
regular expression does NOT have to match the whole string.
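
The check itself is a one-liner (use whatever table you have handy for the
FROM clause, which this Hive version still requires):

SELECT 'foobar' RLIKE 'foo',      -- true: 'foo' matches a substring
       'foobar' RLIKE '^f.*r$'    -- true: the anchored pattern matches it all
FROM some_table LIMIT 1;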

The wiki should be changed. If someone wants to give me permission, I'll do
it ;)

dean

On Fri, Aug 9, 2013 at 7:02 PM, Lefty Leverenz wrote:

> 1) Did you miss the first word?  ("NULL if A or B is NULL")
> 2) Darren Yin posted a similar message to the user 
> list<http://mail-archives.apache.org/mod_mbox/hive-user/201308.mbox/%3cCAPWDJtUZRaPgeP=0xnsbaos7+154a978wz1omkazqg-jmuf...@mail.gmail.com%3e>on
>  August 1st (or July 31st in some time zones):
>
> 
> from here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-RelationalOperators
>  A RLIKE B strings NULL if A or B is NULL, TRUE if any (possibly empty)
> substring of A matches the Java regular expression B, otherwise FALSE. E.g.
> 'foobar' RLIKE 'foo' evaluates to FALSE whereas 'foobar' RLIKE '^f.*r$'
> evaluates to TRUE.
> 'foobar' RLIKE 'foo' evaluates to TRUE doesn't it?
> 
>
>
> The page history shows this:
>
> *June 24, 2011 (first version, unchanged until July 4 this year):*
> NULL if A or B is NULL, TRUE if string A matches the Java regular
> expression B(See Java regular expressions syntax), otherwise FALSE e.g.
> 'foobar' rlike 'foo' evaluates to FALSE where as 'foobar' rlike '^f.*r$'
> evaluates to TRUE
>
> *July 4, 2013 (version 47 changed by Siyang 
> Chen<https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=27362046&selectedPageVersions=47&selectedPageVersions=46>
> )*
> NULL if A or B is NULL, TRUE if any (possibly empty) substring of A
> matches the Java regular expression B, otherwise FALSE. E.g. 'foobar' RLIKE
> 'foo' evaluates to FALSE whereas 'foobar' RLIKE '^f.*r$' evaluates to TRUE.
>
>
> -- Lefty Leverenz
>
>
>
> On Thu, Aug 8, 2013 at 7:28 PM, Sergey Shelukhin 
> wrote:
>
>> Double checking before I try to edit.
>>
>> The page here:
>> https://cwiki.apache.org/Hive/languagemanual-udf.html
>>
>> says:
>>
>> A RLIKE B
>>  if A or B is NULL, TRUE if any (possibly empty) substring of A
>> matches the Java regular expression B, otherwise FALSE. E.g. 'foobar'
>> RLIKE 'foo' evaluates to FALSE whereas 'foobar' RLIKE '^f.*r$'
>> evaluates to TRUE.
>>
>> 1) "if A or B is NULL" seems like an unfinished part.
>> 2) "any (possibly empty) substring of A [that] matches the Java
>> regular expression B" should be "foo" at 0 for 'foobar' RLIKE 'foo',
>> and result in TRUE, right?
>>
>
>
>
> --
> Lefty
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Hive - external (dynamically) partitioned table

2013-07-26 Thread Dean Wampler
dynamic partitions?
>
> Regards,
>
> Omkar Joshi
>
>
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Hive Query

2013-07-12 Thread Dean Wampler
Use a semi-join, which is more or less the same thing. You might also see
if the having clause will help.
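
For example, a rough sketch; the lookup table is hypothetical, and you'd load
your ~250 ids into it, since this Hive version won't accept them as an inline
list:

CREATE TABLE wanted_ids (employee_id INT);

SELECT e.*
FROM employee e
LEFT SEMI JOIN wanted_ids w ON (e.employee_id = w.employee_id);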

dean

On Fri, Jul 12, 2013 at 6:13 AM, Manickam P  wrote:

> Hi,
>
> I need to run hive query like select * from employee where employee_id IN
> (100,102). I came to know that hive does not support IN clause.
> Then what is the effective replacement for this? i need to execute around
> 250 inputs.  I'm using hive 0.9.0 version.
>
> Please guide me.
>
>
> Thanks,
> Manickam P
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Hive - UDF

2013-07-09 Thread Dean Wampler
The problem might be Java's limitation of allowing only a single public
top-level class in each file (as opposed to classes nested within a top-level
class). You would have to nest your UDFs in such a top-level class. That would
work fine, but when you define a TEMPORARY FUNCTION in Hive, I don't know if
the syntax supports referencing nested UDFs within a class. Perhaps someone
else can comment, or you could do the experiment yourself and see ;) If it
doesn't work, you'll just refactor the code into separate files.

Actually, since you should build a jar with your UDFs anyway, it probably
doesn't matter a lot how many files you have, other than the slight
inconvenience of managing more than one.
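
For what it's worth, here's a sketch of the separate-files route (package and
class names are made up); each UDF is its own top-level class packaged into
one jar, and each gets registered individually:

ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.udf.ToUpper';
CREATE TEMPORARY FUNCTION to_lower AS 'com.example.udf.ToLower';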

dean

On Tue, Jul 9, 2013 at 7:49 AM, Manickam P  wrote:

> Hi,
>
> Can we write more than one function like to_upper and to_lower in same UDF
> ? Or do we need write separate UDF for each?
> Please let me know.
>
>
>
> Thanks,
> Manickam P
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Partition performance

2013-07-03 Thread Dean Wampler
How big were the files in each case in your experiment? Having lots of
small files will add Hadoop overhead.

Also, it would be useful to know the execution times of the map and reduce
tasks. The rule of thumb is that under 20 seconds each, or so, you're
paying a significant fraction of the execution time in startup and shutdown
overhead.

Of course, another factor is the number of tasks your cluster can run in
parallel. Scanning 20K partitions with a 1K MapReduce slot capacity over
the cluster will obviously take ~20 passes vs. ~1 pass for 1K partitions.

dean

On Tue, Jul 2, 2013 at 4:34 AM, Peter Marron <
peter.mar...@trilliumsoftware.com> wrote:

>  ...
>
> ** **
>
> *From: *Ian 
> *Reply-To: *"user@hive.apache.org" , Ian <
> liu...@yahoo.com>
> *Date: *Thursday, April 4, 2013 4:01 PM
> *To: *"user@hive.apache.org" 
> *Subject: *Partition performance
>
> ** **
>
> Hi,
>
>  
>
> I created 3 years of hourly log files (totally 26280 files), and use
> External Table with partition to query. I tried two partition methods.
>
>  
>
> 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per
> hour). Use date and hour as partition keys. Add 3 years of directories to
> the table partitions. So there are 26280 partitions.
>
> CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
> string, hr int);
>
> ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
> '/test1/2013/04/02/16';
>
>  
>
> 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per
> day, 24 files in each directory). Use date as partition key. Add 3 years of
> directories to the table partitions. So there are 1095 partitions.
>
> CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
> string);
>
> ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
> '/test2/2013/04/02';
>
>  
>
> When doing a simple query like 
>
> SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <=
> '2013-02-14'
>
> Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. 
>
>  
>
> I'm wondering why there is a big performance difference between these two?
> These two approaches have the same number of files, only the directory
> structure is different. So Hive is going to load the same amount of files.
> Why does the number of partitions have such big impact? Does that mean #2
> is a better partition strategy?
>
>  
>
> Thanks.
>
>  
>
>  
>
> ** **
>
>
> ** **
>
> ** **
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Performance difference between tuning reducer num and partition table

2013-06-29 Thread Dean Wampler
What happens if you don't set the number of reducers in the 1st run? How
many reducers are executed? If it's a much smaller number, the extra
overhead could matter. Another clue is the size of the files the first run
produced, i.e., do you have 30 small (much less than a block size) files?
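
For reference, these are the knobs in play (the values here are only examples):

SET mapred.reduce.tasks=30;       -- forcing the count, as in the first run
SET mapred.reduce.tasks=-1;       -- or let Hive estimate it, based on
SET hive.exec.reducers.bytes.per.reducer=1000000000;   -- ~1 GB of input per reducer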

On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐  wrote:

> Hi Stephen,
>
> My query is actually more complex , hive will generate 2 mapreduces,
> in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
> 30 reducers (reducer num is set manually)
> in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
> reducers for each partition
>
> I do not know whether they could achieve the same performance if the
> reducers num is set properly.
>
>
> 2013/6/29 Stephen Sprague 
>
>> great question.  your parallelization seems to trump hadoop's.I guess
>> i'd ask what are the _total_ number of Mappers and Reducers that run on
>> your cluster for these two scenarios?   I'd be curious if there are the
>> same.
>>
>>
>>
>>
>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐  wrote:
>>
>>> Hi all,
>>>
>>> Here is the scenario, suppose I have 2 tables A and B, I would like to
>>> perform a simple join on them,
>>>
>>> We can do it like this:
>>>
>>> INSERT OVERWRITE TABLE C
>>> SELECT  FROM A JOIN B on A.id=B.id
>>>
>>> In order to speed up this query since table A and B have lots of data,
>>> another approach is :
>>>
>>> Say I partition table A and B into 10 partitions respectively, and write
>>> the query like this
>>>
>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>> SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>>
>>> then I run this query 10 times concurrently (pid ranges from 1 to 10)
>>>
>>> And my question is that , in my observation of some more complex
>>> queries, the second solution is about 15% faster than the first solution,
>>> is it simply because the setting of reducer num is not optimal?
>>> If the resource is not a limit and it is possible to set the proper
>>> reducer nums in the first solution , can they achieve the same performance?
>>> Is there any other fact that can cause performance difference between
>>> them(non-partition VS partition+concurrent) besides the job parameter
>>> issues?
>>>
>>> Thanks!
>>>
>>
>>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Question regarding nested complex data type

2013-06-21 Thread Dean Wampler
;) I actually thought it was a clever choice on Hive's part. There's no
real need for the 2nd tier separators, despite the nested collections!

However, it's still tricky to know what Hive expects when you're generating
table data with other apps.

dean

On Thu, Jun 20, 2013 at 9:34 PM, Stephen Sprague  wrote:

> look at it the other around if you want.  knowing an array of a two
> element struct is topologically the same as a map - they  darn well better
> be the same. :)
>
>
>
> On Thu, Jun 20, 2013 at 7:00 PM, Dean Wampler wrote:
>
>> It's not as "simple" as it seems, as I discovered yesterday, to my
>> surprise. I created a table like this:
>>
>> CREATE TABLE t (
>>   name STRING,
>>   stuff   ARRAY<STRUCT<label:STRING, num:INT>>);
>>
>> I then used an insert statement to see how Hive would store the records,
>> so I could populate the real table with another process. Hive used ^A for
>> the field separator, ^B for the collection separator, in this case, to
>> separate structs in the array, and ^C to separate the elements in each
>> struct, e.g.,:
>>
>> Dean Wampler^Afirst^C1^Bsecond^C2^Bthird^C3
>>
>> In other words, the structure you would expect for this table:
>>
>> CREATE TABLE t (
>>   name STRING,
>>   stuff   MAP<STRING, INT>);
>>
>> We should have covered the permutations of nested structures in our book,
>> but we didn't. It would be great to document them, for realz, somewhere.
>>
>> dean
>>
>> On Thu, Jun 20, 2013 at 9:56 AM, Stephen Sprague wrote:
>>
>>> you only get three.  field separator, array elements separator (aka
>>> collection delimiter), and map key/value separator (aka map key
>>> delimiter).
>>>
>>> when you  nest deeper then you gotta use the default '^D', '^E' etc for
>>> each level.  At least that's been my experience which i've found has worked
>>> successfully.
>>>
>>>
>>> On Thu, Jun 20, 2013 at 7:45 AM, neha  wrote:
>>>
>>>> Thanks a lot for your reply, Stephen.
>>>> To answer your question - I was not aware of the fact that we could use
>>>> delimiter (in my example, '|') for first level of nesting. I tried now and
>>>> it worked fine.
>>>>
>>>> My next question - Is there any way to provide delimiter in DDL for
>>>> second level of nesting?
>>>> Thanks again!!
>>>>
>>>>
>>>> On Thu, Jun 20, 2013 at 8:02 PM, Stephen Sprague wrote:
>>>>
>>>>> its all there in the documentation under "create table" and it seems
>>>>> you got everything right too except one little thing - in your second
>>>>> example there for 'sample data loaded' - instead of '^B' change that to
>>>>> '|'  and you should be good. That's the delimiter that separates your two
>>>>> array elements - ie collections.
>>>>>
>>>>> i guess the real question for me is when you say 'since there is no
>>>>> way to use given delimiter "|" ' what did you mean by that?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 20, 2013 at 1:42 AM, neha  wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have 2 questions about complex data types in nested composition.
>>>>>>
>>>>>> 1 >> I did not find a way to provide delimiter information in DDL if
>>>>>> one or more column has nested array/struct. In this case, default 
>>>>>> delimiter
>>>>>> has to be used for complex type column.
>>>>>> Please let me know if this is a limitation as of now or I am missing
>>>>>> something.
>>>>>>
>>>>>> e.g.:
>>>>>> *DDL*:
>>>>>> hive> create table example(col1 int, col2
>>>>>> array<struct<st1:int,st2:string>>) row format delimited fields terminated
>>>>>> by ',';
>>>>>> OK
>>>>>> Time taken: 0.226 seconds
>>>>>>
>>>>>> *Sample data loaded:*
>>>>>> 1,1^Cstring1^B2^Cstring2
>>>>>>
>>>>>> *O/P:*
>>>>>> hive> select * from example;
>>>>>> OK
>>>>>> 1[{"st1":1,"st2":"string1"},{"st1":2,"st2":"string2"}]
>>>>>> Time taken: 0.288 seconds
>>>>>>
>>>>>> 2 >> For the same DDL given above, if we provide clause* collection
>>>>>> items terminated by '|' *and still use default delimiters (since
>>>>>> there is no way to use given delimiter '|') then the select query shows
>>>>>> incorrect data.
>>>>>> Please let me know if this is something expected.
>>>>>>
>>>>>> e.g.
>>>>>> *DDL*:
>>>>>> hive> create table example(col1 int, col2
>>>>>> array<struct<st1:int,st2:string>>) row format delimited fields terminated
>>>>>> by ',' collection items terminated by '|';
>>>>>> OK
>>>>>> Time taken: 0.175 seconds
>>>>>>
>>>>>> *Sample data loaded:*
>>>>>> 1,1^Cstring1^B2^Cstring2
>>>>>>
>>>>>> *O/P:
>>>>>> *hive> select * from
>>>>>> example;
>>>>>>
>>>>>> OK
>>>>>> 1[{"st1":1,"st2":"string1\u00022"}]
>>>>>> Time taken: 0.141 seconds
>>>>>> **
>>>>>> Thanks & Regards.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Dean Wampler, Ph.D.
>> @deanwampler
>> http://polyglotprogramming.com
>>
>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Hive built-in functions not working

2013-06-21 Thread Dean Wampler
at
>>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
>>> at
>>> org.apache.hadoop.hive.ql.exec.FunctionRegistry.registerGenericUDTF(FunctionRegistry.java:526)
>>> at
>>> org.apache.hadoop.hive.ql.exec.FunctionRegistry.registerGenericUDTF(FunctionRegistry.java:520)
>>> at
>>> org.apache.hadoop.hive.ql.exec.FunctionRegistry.(FunctionRegistry.java:423)
>>> at
>>> org.apache.hadoop.hive.ql.exec.DefaultUDFMethodResolver.getEvalMethod(DefaultUDFMethodResolver.java:59)
>>> at
>>> org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.initialize(GenericUDFBridge.java:154)
>>> at
>>> org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:111)
>>> at
>>> org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:141)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:970)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:996)
>>> at
>>> org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:451)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:407)
>>> at
>>> org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:78)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:451)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:407)
>>> at
>>> org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:186)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>>> at
>>> org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:543)
>>> at
>>> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>>> at
>>> org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:100)
>>> ... 18 more
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.codehaus.jackson.JsonFactory
>>>
>>> what am i doing wrong here? the jackson-core-asl-1.8.8.jar is in the
>>> $HIVE_HOME/lib directory ...
>>>
>>> SHOW FUNCTIONS;
>>>
>>> shows me that these functions are in there ... i already tried
>>> downgrading to hive 0.10 but the error is the same over there. i need to
>>> work with hadoop 0.20, so unfortunately i can't try hadoop 1.x.x
>>>
>>> thanks in advance
>>> cheers
>>> Wolli
>>>
>>
>>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Question regarding nested complex data type

2013-06-20 Thread Dean Wampler
It's not as "simple" as it seems, as I discovered yesterday, to my
surprise. I created a table like this:

CREATE TABLE t (
  name STRING,
  stuff   ARRAY<STRUCT<label:STRING, num:INT>>);

I then used an insert statement to see how Hive would store the records, so
I could populate the real table with another process. Hive used ^A for the
field separator, ^B for the collection separator, in this case, to separate
structs in the array, and ^C to separate the elements in each struct, e.g.,:

Dean Wampler^Afirst^C1^Bsecond^C2^Bthird^C3

In other words, the structure you would expect for this table:

CREATE TABLE t (
  name STRING,
  stuff   MAP<STRING, INT>);

We should have covered the permutations of nested structures in our book,
but we didn't. It would be great to document them, for realz, somewhere.
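
For the record, here is a sketch that spells out the defaults Hive applied
above (the struct field names are made up; \001, \002, and \003 are the octal
escapes for ^A, ^B, and ^C, and the third slot is what the DDL calls the map
keys delimiter, which also separates struct fields at that nesting level):

CREATE TABLE t (
  name  STRING,
  stuff ARRAY<STRUCT<label:STRING, num:INT>>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003';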

dean

On Thu, Jun 20, 2013 at 9:56 AM, Stephen Sprague  wrote:

> you only get three.  field separator, array elements separator (aka
> collection delimiter), and map key/value separator (aka map key
> delimiter).
>
> when you  nest deeper then you gotta use the default '^D', '^E' etc for
> each level.  At least that's been my experience which i've found has worked
> successfully.
>
>
> On Thu, Jun 20, 2013 at 7:45 AM, neha  wrote:
>
>> Thanks a lot for your reply, Stephen.
>> To answer your question - I was not aware of the fact that we could use
>> delimiter (in my example, '|') for first level of nesting. I tried now and
>> it worked fine.
>>
>> My next question - Is there any way to provide delimiter in DDL for
>> second level of nesting?
>> Thanks again!!
>>
>>
>> On Thu, Jun 20, 2013 at 8:02 PM, Stephen Sprague wrote:
>>
>>> its all there in the documentation under "create table" and it seems you
>>> got everything right too except one little thing - in your second example
>>> there for 'sample data loaded' - instead of '^B' change that to '|'  and
>>> you should be good. That's the delimiter that separates your two array
>>> elements - ie collections.
>>>
>>> i guess the real question for me is when you say 'since there is no way
>>> to use given delimiter "|" ' what did you mean by that?
>>>
>>>
>>>
>>> On Thu, Jun 20, 2013 at 1:42 AM, neha  wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have 2 questions about complex data types in nested composition.
>>>>
>>>> 1 >> I did not find a way to provide delimiter information in DDL if
>>>> one or more column has nested array/struct. In this case, default delimiter
>>>> has to be used for complex type column.
>>>> Please let me know if this is a limitation as of now or I am missing
>>>> something.
>>>>
>>>> e.g.:
>>>> *DDL*:
>>>> hive> create table example(col1 int, col2
>>>> array<struct<st1:int,st2:string>>) row format delimited fields terminated
>>>> by ',';
>>>> OK
>>>> Time taken: 0.226 seconds
>>>>
>>>> *Sample data loaded:*
>>>> 1,1^Cstring1^B2^Cstring2
>>>>
>>>> *O/P:*
>>>> hive> select * from example;
>>>> OK
>>>> 1[{"st1":1,"st2":"string1"},{"st1":2,"st2":"string2"}]
>>>> Time taken: 0.288 seconds
>>>>
>>>> 2 >> For the same DDL given above, if we provide clause* collection
>>>> items terminated by '|' *and still use default delimiters (since there
>>>> is no way to use given delimiter '|') then the select query shows incorrect
>>>> data.
>>>> Please let me know if this is something expected.
>>>>
>>>> e.g.
>>>> *DDL*:
>>>> hive> create table example(col1 int, col2
>>>> array<struct<st1:int,st2:string>>) row format delimited fields terminated
>>>> by ',' collection items terminated by '|';
>>>> OK
>>>> Time taken: 0.175 seconds
>>>>
>>>> *Sample data loaded:*
>>>> 1,1^Cstring1^B2^Cstring2
>>>>
>>>> *O/P:
>>>> *hive> select * from
>>>> example;
>>>>
>>>> OK
>>>> 1[{"st1":1,"st2":"string1\u00022"}]
>>>> Time taken: 0.141 seconds
>>>> **
>>>> Thanks & Regards.
>>>>
>>>
>>>
>>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Create table like with partitions

2013-06-11 Thread Dean Wampler
I confirmed it is a pirate site.

Sent from my rotary phone. 

On Jun 11, 2013, at 10:33 AM, Edward Capriolo  wrote:

> For reference, any site that puts the entire book online like this is likely 
> pirated.
> 
> 
> 
> 
> On Tue, Jun 11, 2013 at 8:34 AM, Richa Sharma  
> wrote:
>> Hi all,
>> 
>> Found a very useful guide online. Link -> http://it-ebooks.info/book/941/
>> 
>> Richa
>> 
>> 
>> On Tue, Jun 11, 2013 at 1:53 PM, Richa Sharma  
>> wrote:
>>> Thanks for sharing!
>>> 
>>> I looked at these links .. Is there any documentation with more examples 
>>> with both static and dynamic partitions covered together.
>>> 
>>> Richa
>>> 
>>> 
>>> On Tue, Jun 11, 2013 at 12:33 PM, Lefty Leverenz  
>>> wrote:
 Dynamic partitions are described in the Hive design docs here:  
 https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions. 
 
 For the configuration parameters, though, you need to look in the language 
 manual here:  
 https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties 
 (search for "dynamic" to find various parameters related to dynamic 
 partitions). 
 
 – Lefty
 
 
 
 On Mon, Jun 10, 2013 at 7:06 AM, Owen O'Malley  wrote:
> You need to create the partitioned table and then copy the rows into it.
> 
> create table foo_staging (x int, y int);
> 
> create table foo(x int) partitioned by (y int) clustered by (x) into 16 
> buckets;
> 
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict; 
> set hive.enforce.bucketing = true;
> 
> insert overwrite table foo partition (y) select * from foo_staging; 
> 
> 
> On Mon, Jun 10, 2013 at 6:38 AM, Nitin Pawar  
> wrote:
>> If a table is not partitioned and then you want to partition the table 
>> on the data already written but data is not in partition format, that is 
>> not doable. 
>> 
>> Best approach would be, create a new table definition with the partition 
>> columns you want.
>> turn on the dynamic partitioning system before you load data into new 
>> table 
>> set hive.exec.dynamic.partition=true;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>> insert overwrite table partitioned(columns) select * from oldtable
>> 
>> remove old table 
>> 
>> PS: wait for others to add more suggestions. I may be very well wrong in 
>> suggesting this 
>> 
>> 
>> On Mon, Jun 10, 2013 at 7:01 PM, Peter Marron 
>>  wrote:
>>> Hi,
>>> 
>>>  
>>> 
>>> Using hive 0.10.0 over hadoop 1.0.4
>>> 
>>>  
>>> 
>>> I have a (non-partitioned) table with loads of columns.
>>> 
>>> I would like to create a partitioned table with the same set of columns.
>>> 
>>> So the approach that I have been taking is to use “CREATE TABLE copy 
>>> LIKE original;”
>>> 
>>> then I can use ALTER TABLE to change the location and the INPUTFORMAT
>>> 
>>> and the OUTPUTFORMAT and the SERDE and properties and pretty much
>>> 
>>> everything else. However I don’t seem to be able to make it partitioned.
>>> 
>>> Sure I can add partitions if it’s already partitioned but I don’t seem
>>> 
>>> to be able to make it partitioned if it’s not already. I get errors 
>>> like this:
>>> 
>>>  
>>> 
>>> hive> ALTER TABLE customerShortValues ADD PARTITION (aid='1') LOCATION 
>>> 'E7/phase2/values/aid=1';
>>> 
>>> FAILED: Error in metadata: table is not partitioned but partition spec 
>>> exists: {aid=1}
>>> 
>>> FAILED: Execution Error, return code 1 from 
>>> org.apache.hadoop.hive.ql.exec.DDLTask
>>> 
>>>  
>>> 
>>> So, I guess that I could create the table I want by hand copying over 
>>> all the
>>> 
>>> column definitions. But is there an easier way?
>>> 
>>>  
>>> 
>>> Z
>>> 
>> 
>> 
>> 
>> -- 
>> Nitin Pawar
> 


Re: Difference between like %A% and %a%

2013-05-24 Thread Dean Wampler
If backwards compatibility weren't an issue, the Hive code that implements
LIKE could be changed to convert the fields and LIKE patterns to lower case
before comparing ;) Of course, there is overhead in doing that.
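
In the meantime the workarounds are cheap to write, if not to run (the table
and column names here are hypothetical):

SELECT * FROM states WHERE lower(state_name) LIKE '%a%';

SELECT * FROM states WHERE state_name RLIKE '(?i)a';   -- Java regex, case-insensitive flag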

On Fri, May 24, 2013 at 9:50 AM, Edward Capriolo wrote:

> Also I am thinking that the rlike is based on regex and can be told to do
> case insensitive matching.
>
>
> On Fri, May 24, 2013 at 9:16 AM, Dean Wampler wrote:
>
>> Hortonworks has announced plans to make Hive more SQL compliant. I
>> suspect bugs like this will be addressed sooner or later. It will be
>> necessary to handle backwards compatibility, but that could be handled with
>> a hive property that enables one or the other behaviors.
>>
>> On Fri, May 24, 2013 at 8:07 AM, John Omernik  wrote:
>>
>>> I have mentioned this before, and I think this a big miss by the Hive
>>> team.  Like, by default in many SQL RDBMS (like MSSQL or MYSQL)  is not
>>> case sensitive. Thus when you have new users moving over to Hive, if they
>>> see a command like "like" they will assume similarity (like many other SQL
>>> like qualities) and thus false negatives may ensue.  Even though it's
>>> different by default (I am ok with this ... I guess, my personal preference
>>> is that it matches the defaults on other systems, and outside of that
>>> (which I am, in in the end fine with, just grumbly :) ) give us the ability
>>> to set that behavior in the hive-site.xml.  That way when an org realizes
>>> that it is different, and their users are all getting false negatives, they
>>> can just update the hive-site and fix the problem rather than have to
>>> include it in training that may or may not work.  I've added this comment
>>> to https://issues.apache.org/jira/browse/HIVE-4070#comment-13666278 for 
>>> fun. :)
>>>
>>> Please? :)
>>>
>>>
>>>
>>>
>>> On Fri, May 24, 2013 at 7:53 AM, Dean Wampler wrote:
>>>
>>>> Your where clause looks at the abbreviation, requiring 'A', not the
>>>> state name. You got the correct answer.
>>>>
>>>>
>>>> On Fri, May 24, 2013 at 6:21 AM, Sai Sai  wrote:
>>>>
>>>>> But it should get more results for this:
>>>>>
>>>>> %a%
>>>>>
>>>>> than for
>>>>>
>>>>> %A%
>>>>>
>>>>> Please let me know if i am missing something.
>>>>> Thanks
>>>>> Sai
>>>>>
>>>>>
>>>>>------
>>>>>  *From:* Jov 
>>>>> *To:* user@hive.apache.org; Sai Sai 
>>>>> *Sent:* Friday, 24 May 2013 4:39 PM
>>>>> *Subject:* Re: Difference between like %A% and %a%
>>>>>
>>>>>
>>>>> 2013/5/24 Sai Sai 
>>>>>
>>>>> abbreviation l
>>>>>
>>>>>
>>>>> unlike MySQL, string in Hive is case sensitive,so '%A%' is not equal
>>>>> with '%a%'.
>>>>>
>>>>>
>>>>> --
>>>>> Jov
>>>>> blog: http:amutu.com/blog <http://amutu.com/blog>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dean Wampler, Ph.D.
>>>> @deanwampler
>>>> http://polyglotprogramming.com
>>>>
>>>
>>>
>>
>>
>> --
>> Dean Wampler, Ph.D.
>> @deanwampler
>> http://polyglotprogramming.com
>
>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Difference between like %A% and %a%

2013-05-24 Thread Dean Wampler
Hortonworks has announced plans to make Hive more SQL compliant. I suspect
bugs like this will be addressed sooner or later. It will be necessary to
handle backwards compatibility, but that could be handled with a hive
property that enables one or the other behaviors.

On Fri, May 24, 2013 at 8:07 AM, John Omernik  wrote:

> I have mentioned this before, and I think this a big miss by the Hive
> team.  Like, by default in many SQL RDBMS (like MSSQL or MYSQL)  is not
> case sensitive. Thus when you have new users moving over to Hive, if they
> see a command like "like" they will assume similarity (like many other SQL
> like qualities) and thus false negatives may ensue.  Even though it's
> different by default (I am ok with this ... I guess, my personal preference
> is that it matches the defaults on other systems, and outside of that
> (which I am, in in the end fine with, just grumbly :) ) give us the ability
> to set that behavior in the hive-site.xml.  That way when an org realizes
> that it is different, and their users are all getting false negatives, they
> can just update the hive-site and fix the problem rather than have to
> include it in training that may or may not work.  I've added this comment
> to https://issues.apache.org/jira/browse/HIVE-4070#comment-13666278  for
> fun. :)
>
> Please? :)
>
>
>
>
> On Fri, May 24, 2013 at 7:53 AM, Dean Wampler wrote:
>
>> Your where clause looks at the abbreviation, requiring 'A', not the state
>> name. You got the correct answer.
>>
>>
>> On Fri, May 24, 2013 at 6:21 AM, Sai Sai  wrote:
>>
>>> But it should get more results for this:
>>>
>>> %a%
>>>
>>> than for
>>>
>>> %A%
>>>
>>> Please let me know if i am missing something.
>>> Thanks
>>> Sai
>>>
>>>
>>>--
>>>  *From:* Jov 
>>> *To:* user@hive.apache.org; Sai Sai 
>>> *Sent:* Friday, 24 May 2013 4:39 PM
>>> *Subject:* Re: Difference between like %A% and %a%
>>>
>>>
>>> 2013/5/24 Sai Sai 
>>>
>>> abbreviation l
>>>
>>>
>>> unlike MySQL, string in Hive is case sensitive,so '%A%' is not equal
>>> with '%a%'.
>>>
>>>
>>> --
>>> Jov
>>> blog: http:amutu.com/blog <http://amutu.com/blog>
>>>
>>>
>>>
>>
>>
>> --
>> Dean Wampler, Ph.D.
>> @deanwampler
>> http://polyglotprogramming.com
>>
>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Difference between like %A% and %a%

2013-05-24 Thread Dean Wampler
Your where clause looks at the abbreviation, requiring 'A', not the state
name. You got the correct answer.


On Fri, May 24, 2013 at 6:21 AM, Sai Sai  wrote:

> But it should get more results for this:
>
> %a%
>
> than for
>
> %A%
>
> Please let me know if i am missing something.
> Thanks
> Sai
>
>
>   --
>  *From:* Jov 
> *To:* user@hive.apache.org; Sai Sai 
> *Sent:* Friday, 24 May 2013 4:39 PM
> *Subject:* Re: Difference between like %A% and %a%
>
>
> 2013/5/24 Sai Sai 
>
> abbreviation l
>
>
> unlike MySQL, string in Hive is case sensitive,so '%A%' is not equal with
> '%a%'.
>
>
> --
> Jov
> blog: http:amutu.com/blog <http://amutu.com/blog>
>
>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Hive skipping first line

2013-05-23 Thread Dean Wampler
Since you have a text file, have you visually inspected the first few lines
to make sure there's nothing unusual, like a missing line feed, field
separator, or anything like that? The missing line feed would be my first
suspicion.

Just because you're paranoid doesn't mean Pig isn't out to get you! ;)

dean

On Thu, May 23, 2013 at 3:43 AM, Tim Bittersohl  wrote:

> Executing "hive -e select * from tablename" gives me back all my sample
> rows without skipping one
>
> ** **
>
> Thanks
>
> Tim
>
> ** **
>
> *Von:* Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
> *Gesendet:* Mittwoch, 22. Mai 2013 19:16
> *An:* user@hive.apache.org
> *Betreff:* Re: Hive skipping first line
>
> ** **
>
> Can u try doing a 
>
> Hive -e "select * from tablename"
>
> See how many rows this gets u 
>
> Thanks
>
> ** **
>
> sanjay
>
> ** **
>
> *From: *Tim Bittersohl 
> *Reply-To: *"user@hive.apache.org" 
> *Date: *Wednesday, May 22, 2013 9:57 AM
> *To: *"user@hive.apache.org" 
> *Subject: *Hive skipping first line
>
> ** **
>
> Hi,
>
>  
>
> I got some little problem with the content of hive tables.
>
> These table’s contents are generated by pig and do so have no header line,
> they are accessed as “TextFile” with an external hive table. When I fetch
> records through the java hive library (Version: 0.10.0-cdh4.2.0) with the
> “hiveClient.fetchN(rowncount)” command, it seems like he always skip the
> first line of data. (perhaps he’s expecting a header row?)
>
> How can I avoid this?
>
>  
>
> Greetings,
>
> Tim Bittersohl
>
> ** **
>
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: hive.metastore.warehouse.dir - Should it point to a physical directory

2013-05-21 Thread Dean Wampler
No, you only need a directory in HDFS, which will be "virtually located"
somewhere in your cluster automatically by HDFS.

Also there's a typo in your hive-site.xml:

  <value></value>

Should be

  <value>/correct/path/in/hdfs/to/your/warehouse/directory</value>

On Tue, May 21, 2013 at 1:04 PM, Raj Hadoop  wrote:

> Thanks Sanjay.
>
> My environment is  like this.
>
> $ echo $HADOOP_HOME
> /software/home/hadoop/hadoop/hadoop-1.1.2
>
> $ echo $HIVE_HOME
> /software/home/hadoop/hive/hive-0.9.0
>
> $ id
> uid=50052(hadoop) gid=600(apps) groups=600(apps)
>
> So can i do like this:
>
> $pwd
> /software/home/hadoop/hive/hive-0.9.0
>
> $mkdir warehouse
>
> $cd /software/home/hadoop/hive/hive-0.9.0/warehouse
>
> $ in hive-site.xml
> <property>
>   <name>hive.metastore.warehouse.dir</name>
>   <value></value>
>   <description>location of default database for the
> warehouse</description>
> </property>
>
> Where should I create the HDFS directory ?
>
>
>   *From:* Sanjay Subramanian 
> *To:* "user@hive.apache.org" ; Raj Hadoop <
> hadoop...@yahoo.com>; Dean Wampler 
> *Cc:* User 
> *Sent:* Tuesday, May 21, 2013 1:53 PM
>
> *Subject:* Re: hive.metastore.warehouse.dir - Should it point to a
> physical directory
>
>  Notes below
>
> From: Raj Hadoop 
> Reply-To: "user@hive.apache.org" , Raj Hadoop <
> hadoop...@yahoo.com>
> Date: Tuesday, May 21, 2013 10:49 AM
> To: Dean Wampler , "user@hive.apache.org" <
> user@hive.apache.org>
> Cc: User 
> Subject: Re: hive.metastore.warehouse.dir - Should it point to a physical
> directory
>
>   Ok.I got it. My questions -
>
> 1) Should a local physical directory be created before using this property?
>  I created a directory in HDFS during Hive installation
> /user/hive/warehouse
>
> My hive-site.xml has the following property defined
>
> <property>
>   <name>hive.metastore.warehouse.dir</name>
>   <value>/user/hive/warehouse</value>
>   <description>location of default database for the
> warehouse</description>
> </property>
>
>   2) Should a HDFS file directory be created from Hadoop before using
> this property?
> hdfs dfs -mkdir /user/hive/warehouse
> Change the owner:group to hive:hive
>
>
>
>   *From:* Dean Wampler 
> *To:* user@hive.apache.org; Raj Hadoop 
> *Cc:* User 
> *Sent:* Tuesday, May 21, 2013 1:44 PM
> *Subject:* Re: hive.metastore.warehouse.dir - Should it point to a
> physical directory
>
> The name is misleading; this is the directory within HDFS where Hive
> stores the data, by default. (External tables can go elsewhere). It doesn't
> really have anything to do with the metastore.
>
> dean
>
> On Tue, May 21, 2013 at 12:42 PM, Raj Hadoop  wrote:
>
>  Can some one help me on this ? I am stuck installing and configuring
> Hive with Oracle. Your timely help is really aprreciated.
>
>   *From:* Raj Hadoop 
> *To:* Hive ; User 
> *Sent:* Tuesday, May 21, 2013 1:08 PM
> *Subject:* hive.metastore.warehouse.dir - Should it point to a physical
> directory
>
>   Hi,
>
> I am configurinig Hive. I ahve a question on the property
> hive.metastore.warehouse.dir.
>
> Should this point to a physical directory. I am guessing it is a logical
> directory under Hadoop fs.default.name. Please advise whether I need to
> create any directory for the variable hive.metastore.warehouse.dir
>
> Thanks,
> Raj
>
>
>
>
>
> --
> Dean Wampler, Ph.D.
> @deanwampler
> http://polyglotprogramming.com/
>
>
>
>
>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: hive.metastore.warehouse.dir - Should it point to a physical directory

2013-05-21 Thread Dean Wampler
The name is misleading; this is the directory within HDFS where Hive stores
the data, by default. (External tables can go elsewhere). It doesn't really
have anything to do with the metastore.

dean

On Tue, May 21, 2013 at 12:42 PM, Raj Hadoop  wrote:

> Can some one help me on this ? I am stuck installing and configuring Hive
> with Oracle. Your timely help is really aprreciated.
>
>   *From:* Raj Hadoop 
> *To:* Hive ; User 
> *Sent:* Tuesday, May 21, 2013 1:08 PM
> *Subject:* hive.metastore.warehouse.dir - Should it point to a physical
> directory
>
>   Hi,
>
> I am configurinig Hive. I ahve a question on the property
> hive.metastore.warehouse.dir.
>
> Should this point to a physical directory. I am guessing it is a logical
> directory under Hadoop fs.default.name. Please advise whether I need to
> create any directory for the variable hive.metastore.warehouse.dir
>
> Thanks,
> Raj
>
>
>


-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: [ANNOUNCE] Apache Hive 0.11.0 Released

2013-05-16 Thread Dean Wampler
Congratulations!

On Thu, May 16, 2013 at 4:19 PM, Owen O'Malley  wrote:

> The Apache Hive team is proud to announce the the release of Apache
> Hive version 0.11.0.
>
> The Apache Hive data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top
> of Apache Hadoop, it provides:
>
> * Tools to enable easy data extract/transform/load (ETL)
>
> * A mechanism to impose structure on a variety of data formats
>
> * Access to files stored either directly in Apache HDFS or in other
>   data storage systems such as Apache HBase
>
> * Query execution via MapReduce
>
> For Hive release details and downloads, please visit:
> http://hive.apache.org/releases.html
>
> Hive 0.11.0 Release Notes are available here:
>
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12323587&styleName=Html&projectId=12310843
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
>
> The Apache Hive Team
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Hive query problem on S3 table

2013-04-18 Thread Dean Wampler
java.lang.RuntimeException: org.apache.hadoop.hive.ql.io.HiveInputFormat
> 
>
> at
> org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:333)
>
> at
> org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
>
> at
> org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
>
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
> 
>
> at
> org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1352)
>
> at
> org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1138)
>
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:951)**
> **
>
> at
> org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:198)
> 
>
> at
> org.apache.hadoop.hive.service.ThriftHive$Processor$execute.getResult(ThriftHive.java:644)
> 
>
> at
> org.apache.hadoop.hive.service.ThriftHive$Processor$execute.getResult(ThriftHive.java:628)
> 
>
> at
> org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>
> at
> org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
> 
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 
>
> at java.lang.Thread.run(Thread.java:722)
>
> 13/04/18 15:37:14 ERROR exec.ExecDriver: Exception:
> org.apache.hadoop.hive.ql.io.HiveInputFormat   
>
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.MapRedTask
>
> 13/04/18 15:37:14 ERROR ql.Driver: FAILED: Execution Error, return code 1
> from org.apache.hadoop.hive.ql.exec.MapRedTask
>
> ** **
>
> ** **
>
> I’m using the Cloudera “0.10.0-cdh4.2.0” version of the Hive libraries.***
> *
>
> ** **
>
> Greetings
>
> Tim Bittersohl 
>
> Software Engineer 
>
>
>
> Innoplexia GmbH
> Mannheimer Str. 175 
>
> 69123 Heidelberg
>
> Tel.: +49 (0) 6221 7198033
> Mobiltel.: +49 (0) 160 99186759
> Fax: +49 (0) 6221 7198034
> Web: www.innoplexia.com
>
> Sitz: 69123 Heidelberg, Mannheimer Str. 175 - Steuernummer 32494/62606 -
> USt. IdNr.: DE 272 871 728 - Geschäftsführer: Prof. Dr. Herbert Schuster *
> ***
>
> ** **
>



-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com

Re: Partition performance

2013-04-04 Thread Dean Wampler
Also, how big are the files in each directory? Are they roughly the size of
one HDFS block, or a multiple? Lots of small files will mean lots of mapper
tasks with little to do.

You can also compare the job tracker console output for each job. I bet the
slow one has a lot of very short map and reduce tasks, while the faster one
has fewer tasks that run longer. A rule of thumb is that any one task
should take 20 seconds or more, to amortize the few seconds spent in
startup per task.

In other words, if you think about what's happening at the HDFS and MR
level, you can learn to predict how fast or slow things will run. Learning
to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this.
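
For example, using the two tables from the message below:

EXPLAIN
SELECT * FROM test2 WHERE dt >= '2013-02-01' AND dt <= '2013-02-14';

EXPLAIN EXTENDED
SELECT * FROM test1 WHERE dt >= '2013-02-01' AND dt <= '2013-02-14';

The EXTENDED form also lists the partition paths that survive pruning, which
makes the difference in input directories and tasks easy to see.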

dean

On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley  wrote:

> See slide #9 from my Optimizing Hive Queries talk
> http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
> will improve it, but for now you are much better off with 1,000 partitions
> than 10,000.
>
> -- Owen
>
>
> On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle  wrote:
>
>> Is it possible for you to send the explain plan of these two queries?
>>
>> Regards,
>> Ramki.
>>
>>
>> On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian <
>> sanjay.subraman...@wizecommerce.com> wrote:
>>
>>>  The slow down is most possibly due to large number of partitions.
>>> I believe the Hive book authors tell us to be cautious with large number
>>> of partitions :-)  and I abide by that.
>>>
>>>  Users
>>> Please add your points of view and experiences
>>>
>>>  Thanks
>>> sanjay
>>>
>>>   From: Ian 
>>> Reply-To: "user@hive.apache.org" , Ian <
>>> liu...@yahoo.com>
>>> Date: Thursday, April 4, 2013 4:01 PM
>>> To: "user@hive.apache.org" 
>>> Subject: Partition performance
>>>
>>>   Hi,
>>>
>>> I created 3 years of hourly log files (totally 26280 files), and use
>>> External Table with partition to query. I tried two partition methods.
>>>
>>> 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory
>>> per hour). Use date and hour as partition keys. Add 3 years of directories
>>> to the table partitions. So there are 26280 partitions.
>>> CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
>>> string, hr int);
>>> ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16)
>>> LOCATION '/test1/2013/04/02/16';
>>>
>>> 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory
>>> per day, 24 files in each directory). Use date as partition key. Add 3
>>> years of directories to the table partitions. So there are 1095 partitions.
>>>  CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY
>>> (dt string);
>>> ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
>>> '/test2/2013/04/02';
>>>
>>> When doing a simple query like
>>> SELECT * FROM  test1/test2  WHERE  dt >= '2013-02-01' and dt <=
>>> '2013-02-14'
>>>  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.
>>>
>>> I'm wondering why there is a big performance difference between these
>>> two? These two approaches have the same number of files, only the directory
>>> structure is different. So Hive is going to load the same amount of files.
>>> Why does the number of partitions have such big impact? Does that mean #2
>>> is a better partition strategy?
>>>
>>> Thanks.
>>>
>>>
>>>
>>> CONFIDENTIALITY NOTICE
>>> ==
>>> This email message and any attachments are for the exclusive use of the
>>> intended recipient(s) and may contain confidential and privileged
>>> information. Any unauthorized review, use, disclosure or distribution is
>>> prohibited. If you are not the intended recipient, please contact the
>>> sender by reply email and destroy all copies of the original message along
>>> with any attachments, from your computer system. If you are the intended
>>> recipient, please be advised that the content of this message is subject to
>>> access, review and disclosure by the sender's Email System Administrator.
>>>
>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Metastore question

2013-04-03 Thread Dean Wampler
Yes, the property name is misleading. The DATA is stored under that
directory (for managed tables, as opposed to external tables where you
explicitly tell Hive the location). The METADATA is stored in MySQL (or other
database).
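
For example (a sketch; database and table names are placeholders), after

  USE mydb;
  CREATE TABLE mytable (id INT, name STRING);

the rows you load land under the warehouse directory in HDFS, e.g.
/user/hive/warehouse/mydb.db/mytable/ with the default setting, while the
column, partition, and SerDe definitions live only in the metastore
database (MySQL here).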

On Wed, Apr 3, 2013 at 7:26 PM, Mark  wrote:

> Trying to make sure I understand this correctly. All databases and tables
> are stored in hive.metastore.warehouse.dir but the actual metadata for
> the database and tables (columns, types, partitions, etc) are stored in the
> hive database (ie.. mysql)?
>
> Is that correct?
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Bucketing external tables

2013-03-30 Thread Dean Wampler
The table can be external. You should be able to use this data with other
tools, because all bucketing does is ensure that all records with a given
key are written into the same bucket (file). This is why
clustered/bucketed data can be joined on those keys using map-side joins;
Hive knows it can cache an individual bucket in memory and that bucket will
hold all records across the table for the keys in that bucket.

So, Java MR apps and Pig can still read the records, but they won't
necessarily understand how the data is organized. I.e., it might appear
unsorted. Perhaps HCatalog will allow other tools to exploit the structure,
but I'm not sure.
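
If it helps, here is a rough sketch of the usual pattern (table and column
names are made up; the key detail is hive.enforce.bucketing, which makes
Hive write one file per bucket):

CREATE EXTERNAL TABLE events_bucketed (
  event_id BIGINT, user_id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
LOCATION '/data/events_bucketed';

SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE events_bucketed PARTITION (dt = '2013-03-30')
SELECT event_id, user_id, payload
FROM events_staging
WHERE dt = '2013-03-30';

Once both sides of a join are bucketed on the join key like this, you can
experiment with SET hive.optimize.bucketmapjoin = true; for the map-side
join.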

dean

On Sat, Mar 30, 2013 at 5:44 PM, Sadananda Hegde wrote:

> Thanks, Dean.
>
> Does that mean, this bucketing is exclusively Hive feature and not
> available to others like Java, Pig, etc?
>
> And also, my final tables have to be managed tables; not external tables,
> right?
>  .
> Thank again for your time and help.
>
> Sadu
>
>
>
> On Fri, Mar 29, 2013 at 5:57 PM, Dean Wampler <
> dean.wamp...@thinkbiganalytics.com> wrote:
>
>> I don't know of any way to avoid creating new tables and moving the data.
>> In fact, that's the official way to do it, from a temp table to the final
>> table, so Hive can ensure the bucketing is done correctly:
>>
>>  https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>>
>> In other words, you might have a big move now, but going forward, you'll
>> want to stage your data in a temp table, use this procedure to put it in
>> the final location, then delete the temp data.
>>
>> dean
>>
>> On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde wrote:
>>
>>> Hello,
>>>
>>> We run M/R jobs to parse and process large and highly complex xml files
>>> into AVRO files. Then we build external Hive tables on top the parsed Avro
>>> files. The hive tables are partitioned by day; but they are still huge
>>> partitions and joins do not perform that well. So I would like to try
>>> out creating buckets on the join key. How do I create the buckets on the
>>> existing HDFS files? I would prefer to avoid creating another set of tables
>>> (bucketed) and load data from non-bucketed table to bucketed tables if at
>>> all possible. Is it possible to do the bucketing in Java as part of the M/R
>>> jobs while creating the Avro files?
>>>
>>> Any help / insight would greatly be appreciated.
>>>
>>> Thank you very much for your time and help.
>>>
>>> Sadu
>>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Bucketing external tables

2013-03-29 Thread Dean Wampler
I don't know of any way to avoid creating new tables and moving the data.
In fact, that's the official way to do it, from a temp table to the final
table, so Hive can ensure the bucketing is done correctly:

 https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html

In other words, you might have a big move now, but going forward, you'll
want to stage your data in a temp table, use this procedure to put it in
the final location, then delete the temp data.

dean

On Fri, Mar 29, 2013 at 4:58 PM, Sadananda Hegde wrote:

> Hello,
>
> We run M/R jobs to parse and process large and highly complex xml files
> into AVRO files. Then we build external Hive tables on top the parsed Avro
> files. The hive tables are partitioned by day; but they are still huge
> partitions and joins do not perform that well. So I would like to try
> out creating buckets on the join key. How do I create the buckets on the
> existing HDFS files? I would prefer to avoid creating another set of tables
> (bucketed) and load data from non-bucketed table to bucketed tables if at
> all possible. Is it possible to do the bucketing in Java as part of the M/R
> jobs while creating the Avro files?
>
> Any help / insight would greatly be appreciated.
>
> Thank you very much for your time and help.
>
> Sadu
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Noob question on creating tables

2013-03-29 Thread Dean Wampler
That's a drawback of external tables, but it's actually not as difficult as
it sounds. It's easy to write a nightly "cron" job that creates the
partition for the next day (or a job per month...), if someone on your team
has some bash experience. Other job scheduling tools should support this
too. Here's an example. First, a hive script that uses parameters for the
date (Hive v0.8 or newer):

-- addlogpartition.hql
ALTER TABLE log ADD IF NOT EXISTS PARTITION (year = ${YEAR}, month =
${MONTH}, day = ${DAY});

Then, run this bash script AFTER MIDNIGHT:

#!/bin/bash
YEAR=$(date +%Y)   # returns the string "2013" today.
MONTH=$(date +%m)   # returns the string "03" today, with the leading zero.
DAY=$(date +%d)  # returns the string "29" today. Will prefix with
0 for dates < 10.

# Assumes /path/to/2013/03/29 is the correct directory name:
/path/to/hive -f /path/to/addlogpartition.hql \
  -d YEAR=$YEAR -d MONTH=$MONTH -d DAY=$DAY


(Of course, all the /path/to will be different...)

So, be careful of how "03" vs. "3" is handled in both the ALTER
TABLE statement and the path. Off hand, I don't know if Hive will complain
if you use 03 as an integer value in the ALTER TABLE statement.


On Fri, Mar 29, 2013 at 1:16 PM, Mark  wrote:

> Thanks
>
> Does this mean I need to create a partition for each day manually? There
> is no way to have infer that from my directory structure?
>
> On Mar 29, 2013, at 10:40 AM, Sanjay Subramanian <
> sanjay.subraman...@wizecommerce.com> wrote:
>
> > Hi
> >
> > CREATE EXTERNAL TABLE IF NOT EXISTS log_data(col1 datatype1, col2
> > datatype2, . . . colN datatypeN) PARTITIONED BY (YEAR INT, MONTH INT, DAY
> > INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> >
> >
> > ALTER table log_data ADD PARTITION (YEAR=2013 , MONTH=2, DAY=27) LOCATION
> > '/path/to/YEAR/MONTH/DAY/directory/ON/HDFS';"
> >
> > Hive will read gzip and bz2 files out of the box.(so suppose you had
> > hourly log files in gzip format in your /YEAR/MONTH/DAY directory then it
> > will be read)
> > Snappy and LZO will need some jar installs and configs
> > https://github.com/toddlipcon/hadoop-lzo
> >
> > https://code.google.com/p/snappy/
> >
> >
> > Note that for example - gzip format is not splittable..so huge gzip files
> > without splits are not recommended as input to maps
> >
> > Hope this helps
> >
> > sanjay
> >
> >
> > On 3/29/13 10:19 AM, "Mark"  wrote:
> >
> >> We have existing log data in directories in the format of
> YEAR/MONTH/DAY.
> >>
> >> - How can we create a table over this table without hive modifying
> and/or
> >> moving it?
> >> - How can we tell Hive to partition this data so it knows about each day
> >> of logs?
> >> - Does hive out of the box work with reading compressed files?
> >>
> >> Thanks
> >
> >
> > CONFIDENTIALITY NOTICE
> > ==
> > This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
> >
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Noob question on creating tables

2013-03-29 Thread Dean Wampler
On Fri, Mar 29, 2013 at 12:19 PM, Mark  wrote:

> We have existing log data in directories in the format of YEAR/MONTH/DAY.
>
> - How can we create a table over this table without hive modifying and/or
> moving it?
>

create external table foo (...) partitioned by (year  int, month int, day
int);
...

- How can we tell Hive to partition this data so it knows about each day of
> logs?
>

alter table foo add partition(year = 2013, month = 3, day = 29) location
'/path/to/2013/03/29';


> - Does hive out of the box work with reading compressed files?
>

yes, if you're using a compression scheme supported by Hadoop.


>
> Thanks




-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Using TABLESAMPLE on inner queries

2013-03-20 Thread Dean Wampler
Mark,

Aside from what might be wrong here, isn't it true that sampling with the
bucket clause still works on non-bucketed tables; it's just inefficient
because it still scans the whole table? Or am I an idiot? ;)
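
For anyone following along, here is a rough sketch of the intermediate
bucketed table approach Mark describes below (reusing the names from
Robert's example):

CREATE TABLE distinct_s (s STRING)
CLUSTERED BY (s) INTO 4 BUCKETS;

SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE distinct_s
SELECT DISTINCT s FROM testtable WHERE month <= 201211;

SELECT s FROM distinct_s TABLESAMPLE(BUCKET 1 OUT OF 4 ON s);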

dean

On Wed, Mar 20, 2013 at 2:17 PM, Mark Grover wrote:

> Hi Robert,
> Sampling in Hive is based on buckets. Therefore, you table needs to be
> appropriately bucketed.
>
> I would recommend storing the results of your inner query in a bucketed
> table. See how to populate a bucketed table at
> https://cwiki.apache.org/Hive/languagemanual-ddl-bucketedtables.html
>
> Then you will be able to be sample through it.
>
> If the predicate is on partition column, you may be able to get around the
> intermediate table requirement but in general, as far as I know,
> intermediate bucketed table might be the only choice.
>
> Mark
>
> On Wed, Mar 20, 2013 at 9:56 AM, Robert Li wrote:
>
>> Hi Everyone
>>
>> I'm trying to use the TABLESAMPLE function to sample data, however it's a
>> little more complicated and I am having trouble getting it to run.
>>
>> I know that this works fine and it will give me about 25% of the whole
>> dataset
>>
>> select distinct s
>> from testtable TABLESAMPLE(BUCKET 1 OUT OF 4 ON s)
>> where month <= 201211
>>
>> However, in my situation I need to do a TABLESAMPLE on an outer query, a
>> simple example is
>>
>> *select mytest.s *
>> *from *
>> * (select distinct s from testtable where month <= 201211)mytest*
>>
>> or something like
>>
>> *select table1.s*
>> *from *
>> * (select distinct s from testtable)table1*
>> * join*
>> * (select distinct s from test2table)table2*
>> * on table1.s=table2.s*
>>
>>
>> How do I use TABLESAMPLE in this case to sample the results of the outer
>> query? I tried placing TABLESAMPLE(BUCKET 1 OUT OF 4 ON s) in various
>> places of my query but it always returns some sort of syntax error and thus
>> not allowing the query to run.
>>
>> Any help is appreciated.
>>
>> Robert
>> **
>>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Hive 0.10.0 metastore thrift server installation error

2013-03-14 Thread Dean Wampler
You have to upgrade the existing metastore, although I'm not sure this is
the cause. Did you do that?

See the README in the Hive installation directory, and the upgrade scripts
under scripts/metastore/upgrade, for how to do the upgrade.

dean

On Thu, Mar 14, 2013 at 3:51 AM, Shangzhong zhu  wrote:

> Hi,
>
> We are trying to upgrade hive from 0.9.0 to 0.10.0 (apache). When we
> deploy the 0.10.0 binaries and start meta thrift server, we got the
> following error:
>
> hadoop@ip-10-4-106-145:~/hive$ ./bin/hive --service metastore
> Starting Hive Metastore Server
> WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please
> use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties
> files.
> java.lang.NoSuchFieldError: METASTORE_USE_THRIFT_FRAMED_TRANSPORT
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:4178)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:4138)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Exception in thread "main" java.lang.NoSuchFieldError:
> METASTORE_USE_THRIFT_FRAMED_TRANSPORT
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:4178)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:4138)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> Any clue what's going on here?
>
> Thanks,
> Shanzhong
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: UDFs and Thread Safety?

2013-03-10 Thread Dean Wampler
Hadoop tasks use a single thread, so there won't be multiple threads
accessing the UDF.

However, there's a flip side of thread safety if your UDF maintains state;
is it receiving all the data it should or is the data being sharded over
multiple processes in a way that defeats the UDF? My favorite example is a
moving average calculator (like you might use in Finance). Most
full-featured SQLs have window functions for this purpose.

Suppose I'm averaging over the last 50 closing prices for a given financial
instrument. To do this I cache the last 50 I've seen in the UDF as each
record is passed to me (keeping the data for each instrument properly
separated). If some records go to one mapper task and other records go to a
different mapper task, then at least some of my averages will be wrong due
to missing data.
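
If you do hit that situation and can restructure the query, one workaround
is to push the stateful logic to the reduce side and force all rows for an
instrument through one reducer, in order. A sketch only; the table, the
columns, and the moving_avg.py script are all made up:

ADD FILE /path/to/moving_avg.py;   -- hypothetical script holding the window logic

FROM (
  FROM quotes
  MAP symbol, trade_ts, close_price
  USING '/bin/cat'                 -- identity mapper; just passes rows through
  AS symbol, trade_ts, close_price
  DISTRIBUTE BY symbol
  SORT BY symbol, trade_ts
) q
REDUCE q.symbol, q.trade_ts, q.close_price
USING 'python moving_avg.py'
AS symbol, trade_ts, moving_avg_50;

DISTRIBUTE BY guarantees every row for a given symbol reaches the same
reducer, and SORT BY hands the script those rows in time order, so its
cached window is never missing data.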

dean

On Sun, Mar 10, 2013 at 10:12 PM, Shaun Clowes wrote:

> Hi All,
>
> Could anyone describe what the required thread safety for a UDF is? I
> understand that one is instantiated for each use of the function in an
> expression, but can there be multiple threads executing the methods of a
> single UDF object at once?
>
> Thanks,
> Shaun
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Error while table creation

2013-03-10 Thread Dean Wampler
If you upgrade Hive, you have to do an upgrade/migration on the metastore
database. There are scripts that come with hive distributions for this. If
you're starting from scratch, you could just drop the old database.

On Sun, Mar 10, 2013 at 1:23 AM, Abhishek Gayakwad wrote:

> it is an old installation, we recently upgraded hive 0.7.0 to 0.9.0
>
> and we are not using "if not exists" clause
>
>
> On Sun, Mar 10, 2013 at 4:37 AM, Ramki Palle wrote:
>
>> Just wondering if your create table syntax include "if not exists" such
>> as
>>
>> CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
>>
>> ...
>> ...
>> ...
>> )
>>
>>
>>
>> On Sun, Mar 10, 2013 at 2:34 AM, Viral Bajaria 
>> wrote:
>>
>>> Is this is a new installation of Hive or did you upgrade ? How many
>>> tables do you already have ?
>>>
>>>
>>> On Sat, Mar 9, 2013 at 12:00 PM, Abhishek Gayakwad >> > wrote:
>>>
>>>> while trying to create external table in oozie hive action, I am getting 
>>>> following error, hive version is 0.9.0
>>>>
>>>>
>>>>
>>>> Caused by: javax.jdo.JDODataStoreException: Insert of object 
>>>> "org.apache.hadoop.hive.metastore.model.MColumnDescriptor@46c1dc0b" using 
>>>> statement "INSERT INTO `CDS` (`CD_ID`) VALUES (?)" failed : Duplicate 
>>>> entry 'X' for key 'PRIMARY'
>>>> NestedThrowables:
>>>> com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException:
>>>>  Duplicate entry 'X' for key 'PRIMARY'
>>>>at 
>>>> org.datanucleus.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:313)
>>>>at 
>>>> org.datanucleus.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:660)
>>>>at 
>>>> org.datanucleus.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:680)
>>>>at 
>>>> org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:617)
>>>>... 43 more
>>>> Caused by: 
>>>> com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException:
>>>>  Duplicate entry 'X' for key 'PRIMARY'
>>>>
>>>>
>>>> Thanks,
>>>> Abhishek
>>>>
>>>>
>>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Hive query started map task being killed during execution

2013-03-08 Thread Dean Wampler
Do you have more than one hive process running? It looks like you're using
Derby, which only supports one process at a time. Also, you have to start
Hive from the same directory every time, where the metastore "database" is
written, unless you edit the JDBC connection property in the Hive config
file to point to a particular path. Here's what I use:


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/Users/somedufus/hive/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>


On Fri, Mar 8, 2013 at 4:09 PM, Dileep Kumar wrote:

> Hi All,
>
> I am running a hive query which does insert into a table.
> What I noticed from the symptom it looks like it got to do with some
> settings but  I am not able to figure out what settings.
>
> When I submit the query it starts 2130 map tasks in the job and 150 of
> them completes fine without any error and then next batch of 75 gets killed
> and all of them after that gets killed.
> While I submit a similar query based on smaller table its starts around
> only 135 map tasks and it runs till completion without any error and does
> the insert into appropriate table.
>
> I don't find any obvious error messages in any of the tasks log apart form
> this:
>
>
> ./hadoop-0.20-mapreduce/userlogs/job_201303080834_0001/attempt_201303080834_0001_m_001636_0/syslog:2013-03-08
> 08:54:06,910 INFO orapache.hadoop.hive.ql.exec.MapOperator:
> DESERIALIZE_ERRORS:0
> ./hadoop-0.20-mapreduce/userlogs/job_201303080834_0001/attempt_201303080834_0001_m_001646_0/syslog:2013-03-08
> 08:41:06,060 INFO orapache.hadoop.hive.ql.exec.MapOperator:
> DESERIALIZE_ERRORS:0
> ./hadoop-0.20-mapreduce/userlogs/job_201303080834_0001/attempt_201303080834_0001_m_001646_0/syslog:2013-03-08
> 08:46:54,390 ERROR o.apache.hadoop.hive.ql.stats.jdbc.JDBCStatsPublisher:
> Error during instantiating JDBC driver org.apache.derby.jdbc.EmbeddedDriver.
> ./hadoop-0.20-mapreduce/userlogs/job_201303080834_0001/attempt_201303080834_0001_m_001646_0/syslog:2013-03-08
> 08:46:54,394 ERROR o.apache.hadoop.hive.ql.exec.FileSinkOperator:
> StatsPublishing error: cannot connect to database
>
> Please suggest if I need to set anything in Hive when I invoke this query.
> The query that runs successfully has lot less rows compared to on that
> fails.
>
> Thanks,
> DK
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: difference between add jar in hive session and hive --auxpath

2013-03-08 Thread Dean Wampler
the runtime of my query of the classes in those
> jars.
>
> My questions are:
>
> 1) What is the different between hive --auxpath and "add jar" in the hive
> session?
> 2) This problem makes it is hard to access my table in the HUE, as it only
> supports "add jar", but not --auxpath option. Any suggestions?
>
>
> Thanks
>
> Yong
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Find current db we r using in Hive

2013-03-08 Thread Dean Wampler
It's odd that there is no such command. The trick Ramki mentioned is the
only one I know of. Two points about it, though:

1. It only works on Hive v0.8+.
2. I've seen a few cases where the prompt did NOT change when first used,
but started working a little later! I have no idea why and of course, it
happened while teaching a class where I'm supposed to be the expert ;)

dean

On Fri, Mar 8, 2013 at 12:36 AM, Ramki Palle  wrote:

> Sai,
>
> I do not think there is any command to show the current db in Hive. One
> alternative for you is to set a property so that the current database is
> shown as part of the prompt:
>
> set hive.cli.print.current.db=true;
>
> This one shows your current db as part of your hive prompt.
>
> Regards,
> Ramki.
>
>
> On Fri, Mar 8, 2013 at 11:13 AM, Sai Sai  wrote:
>
>> Just wondering if there is any command in Hive which will show us the
>> current db we r using similar to pwd in Unix.
>> Thanks
>> Sai
>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil

2013-03-08 Thread Dean Wampler
Unfortunately, you have to also add the json jars to Hive's class path
before it starts, e.g.,

env HADOOP_CLASSPATH=/path/to/lib/*.jar hive

Use the appropriate path to your lib directory.

On Fri, Mar 8, 2013 at 4:53 AM, Sai Sai  wrote:

> I have added the jar files successfully like this:
>
>
> hive (testdb)> ADD JAR lib/hive-json-serde-0.3.jar;
>Added lib/hive-json-serde-0.3.jar to class path
>Added resource: lib/hive-json-serde-0.3.jar
>
>
> hive (testdb)> ADD JAR lib/json-path-0.5.4.jar;
>Added lib/json-path-0.5.4.jar to class path
>Added resource: lib/json-path-0.5.4.jar
>
>
> hive (testdb)> ADD JAR lib/json-smart-1.0.6.3.jar;
>Added lib/json-smart-1.0.6.3.jar to class path
>Added resource: lib/json-smart-1.0.6.3.jar
>
>
> After this i am getting this error:
>
>
> CREATE EXTERNAL TABLE IF NOT EXISTS twitter (tweet_id BIGINT,created_at
> STRING,text STRING,user_id BIGINT, user_screen_name STRING,user_lang
> STRING) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
> WITH SERDEPROPERTIES (
> "tweet_id"="$.id","created_at"="$.created_at","text"="$.text","user_id"="$.
> user.id","user_screen_name"="$.user.screen_name",
> "user_lang"="$.user.lang") LOCATION '/home/satish/data/twitter/input';
> java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil
> at org.apache.hadoop.hive.contrib.serde2.JsonSerde.initialize(Unknown
> Source)
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:207)
> at
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:266)
> at
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:259)
> at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:585)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:550)
> at
> org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3698)
> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:253)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
> at
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1336)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1122)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:935)
> at
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:755)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> Caused by: java.lang.ClassNotFoundException: com.jayway.jsonpath.PathUtil
> at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
> ... 23 more
> FAILED: Execution Error, return code -101 from
> org.apache.hadoop.hive.ql.exec.DDLTask
>
>
> Any help would be really appreciated.
> Thanks
> Sai
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Accessing sub column in hive

2013-03-08 Thread Dean Wampler
I recognize this example ;)

You reference struct elements with the dot notation, as Bejoy said, map
elements with what you tried, deductions['Federal taxes'], and arrays by
index, starting from zero, subordinates[0].
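
Putting those together (the map key is just the one from the example):

SELECT name,
       address.country,
       deductions['Federal taxes'],
       subordinates[0]
FROM employees;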

On Fri, Mar 8, 2013 at 6:35 AM,  wrote:

> Hi Sai
>
>
> You can do it as
> Select address.country from employees;
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> --
> *From: * Bennie Schut 
> *Date: *Fri, 8 Mar 2013 09:09:49 +0100
> *To: *user@hive.apache.org; 'Sai Sai'<
> saigr...@yahoo.in>
> *ReplyTo: * user@hive.apache.org
> *Subject: *RE: Accessing sub column in hive
>
> Perhaps worth posting the error. Some might know what the error means.
>
> ** **
>
> Also a bit unrelated to hive but please do yourself a favor and don’t use
> float to store monetary values like salary. You will get rounding issues at
> some point in time when you do arithmetic on them. Considering you are
> using hadoop you probably have a lot of data so adding it all up will get
> you there really really fast.
> http://stackoverflow.com/questions/3730019/why-not-use-double-or-float-to-represent-currency
> 
>
> ** **
>
> ** **
>
> *From:* Sai Sai [mailto:saigr...@yahoo.in]
> *Sent:* Thursday, March 07, 2013 12:54 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Accessing sub column in hive
>
> ** **
>
> I have a table created like this successfully:
>
> ** **
>
> CREATE TABLE IF NOT EXISTS employees (name STRING,salary
> FLOAT,subordinates ARRAY,deductions   MAP,address
> STRUCT)
> 
>
> ** **
>
> I would like to access/display country column from my address struct.
>
> I have tried this:
>
> ** **
>
> select address["country"] from employees;
>
> ** **
>
> I get an error.
>
> ** **
>
> Please help.
>
> ** **
>
> Thanks
>
> Sai
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Rename external table, including HDFS directory

2013-03-07 Thread Dean Wampler
You can change the LOCATION with an ALTER TABLE ... SET LOCATION command,
and rename the table itself with ALTER TABLE ... RENAME TO. I don't believe
there's a single, "atomic" statement to rename both the table and the
directory.
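
A rough sketch of the separate steps (names and paths are hypothetical):

ALTER TABLE old_name RENAME TO new_name;

-- move the data yourself, outside Hive, e.g.:
--   hadoop fs -mv /data/old_name /data/new_name

ALTER TABLE new_name SET LOCATION 'hdfs:///data/new_name';
-- some versions want a full URI here, e.g. hdfs://namenode:8020/data/new_name

One caveat: partitions keep the locations they were registered with, so each
one may need its own ALTER TABLE new_name PARTITION (...) SET LOCATION,
which works against your goal of not touching every partition. Worth trying
on a throwaway table first.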

On Thu, Mar 7, 2013 at 5:29 PM, Keith Wiley  wrote:

> My understanding is that renaming and external table doesn't rename the
> corresponding HDFS directory to match.  I would like to do this; I want to
> rename the table in Hive and also rename the HDFS directory so they match
> (they already match with the old name).  I'm not sure how to do this.  I'm
> worried that if I rename the HDFS directory through Hadoop, not Hive, that
> Hive will lose track of the directory, regardless of whether I rename the
> table in Hive as well.
>
> I also don't want to drop and recreate the table because I don't want to
> notify Hive of all the partitions all over again.
>
> Thoughts?
>
>
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>    --  Yoda
>
> 
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Variable Substitution

2013-03-06 Thread Dean Wampler
Even newer versions of Hive do this. Any reason you don't want to provide a
definition for all of them? You could argue that an undefined variable is a
bug and leaving the literal text in place makes it easier to notice.
Although, Unix shells would insert an empty string, so never mind ;)

On Wed, Mar 6, 2013 at 3:13 PM, Matt Tucker  wrote:

> Using CDH3u3 (Hive 0.7.1), it appears that variable substitution becomes
> disabled when I use a variable in a query that hasn't been defined.
>
> For instance, using the following script:
>
> set var2=2013-02-01;
> set var3=2013-02-10;
>
> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "${hiveconf:var1}" OR
> (clndr_dt >= "${hiveconf:var2}" AND clndr_dt <= "${hiveconf:var3}") LIMIT 1;
>
>
> I would expect the results job configuration to list the hive.query.string
> as:
>
> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "${hiveconf:var1}" OR
> (clndr_dt >= "2013-02-01" AND clndr_dt <= "2013-02-10") LIMIT 1;
>
>
> (or with ${hiveconf:var1} removed, leaving an empty string).
>
> Instead, it prints the query as-is (with the variable placeholders).  If I
> set var1=2012-01-01, it properly substitutes the variables, but it only
> returns the '2012-01-01' record (not the 2013 records).
>
> SELECT clndr_dt FROM calendar WHERE clndr_dt LIKE "2012-01-01" OR (
> clndr_dt >= "2013-02-01" AND clndr_dt <= "2013-02-10" ) LIMIT 1
>
> I was originally planning to use this for partition pruning, but it
> doesn't appear to be the cause as the calendar table is not partitioned.
>
> Is there something that I've overlooked?
>
> Thanks!
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Combine two overlapping schema?

2013-03-06 Thread Dean Wampler
Off the top of my head, I think UNION ALL should work if you explicitly
project out the missing columns with NULL or other values, e.g. using
nested SELECTs, something like

SELECT * FROM (
 SELECT a,b,c, Y, NULL AS Z FROM table1
 UNION ALL
 SELECT a,b,c, NULL AS Y, Z FROM table2
) table12;



On Wed, Mar 6, 2013 at 12:03 PM, Keith Wiley  wrote:

> I have two tables which have overlapping but nonidentical schema.  I want
> to creating a new table that unions them, leaving nulls in any given row
> where a column name doesn't occur in the other table:
>
> SCHEMA 1: { a, b, c, Y }
> row:  { 1, 2, 3, 4 }
>
> SCHEMA 2: { a, b, c, Z }
> row:  { 5, 6, 7, 8 }
>
> NEW SCHEMA: { a, b, c, Y, Z }
> new row:  { a:1, b:2, c:3, Y:4, Z:null }
> new row:  { a:5, b:6, c:7, Y:null, Z:8 }
>
> I don't think either "full outer join" or "union all" works.  I'm not sure
> how to do this.  Any ideas?
>
> Thanks.
>
>
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "You can scratch an itch, but you can't itch a scratch. Furthermore, an
> itch can
> itch but a scratch can't scratch. Finally, a scratch can itch, but an itch
> can't
> scratch. All together this implies: He scratched the itch from the scratch
> that
> itched but would never itch the scratch from the itch that scratched."
>    --  Keith Wiley
>
> 
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Where is the location of hive queries

2013-03-06 Thread Dean Wampler
Or use a variant of the INSERT statement to write to a directory or a table.
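
For example (paths and table names are placeholders):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/mytable_out'
SELECT * FROM myTable;

The LOCAL form writes Ctrl-A-delimited text files to the local filesystem;
drop LOCAL to write to an HDFS directory instead, or use INSERT OVERWRITE
TABLE some_existing_table SELECT ... to keep the results in another table.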

On Wed, Mar 6, 2013 at 10:05 AM, Nitin Pawar wrote:

> the results are not stored to any file .. they are available on console
> only
>
> if you want to save to the results then write execute your query like hive
> -e "query" > file
>
>
> On Wed, Mar 6, 2013 at 9:32 PM, Sai Sai  wrote:
>
>> After we run a query in hive shell as:
>> Select * from myTable;
>>
>> Are these results getting saved to any file apart from the
>> console/terminal display.
>> If so where is the location of the results.
>> Thanks
>> Sai
>>
>
>
>
> --
> Nitin Pawar
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Best table storage for analytical use case

2013-03-06 Thread Dean Wampler
MapReduce is very coarse-grained. It might seem that more cores are better,
but once the data sizes get well below the block size, the overhead of
starting JVM processes and all the other background work becomes a
significant percentage of the overall runtime. So, you quickly reach the
point of diminishing returns. Hadoop wasn't designed for small data, so
this issue was never optimized. For big data, you amortize that overhead
over the actual processing time.

It sounds like 1 day is a good granularity. Your production cluster might
use a larger block size, a multiple of 64MB, which is increasingly common
these days.

On Wed, Mar 6, 2013 at 9:44 AM, Sékine Coulibaly wrote:

> Hi Dean,
>
> Indeed, switching from RCFiles to SequenceFiles yield a query duration
> down 35% (82secs down to 53secs) ! I added Snappy/Gzip block compression
> altogether. Things are getting better, down to 30secs (sequenceFile+snappy).
>
> Yes, most request have a WHERE clause with a time range, will have
> partitionning a try. For now, my tests span over 1 day long log data. I
> will ingest more of them and partition and see how it goes.
>
> However, it's not clear to me why I should minimize the mappers ? Having
> 16 cores, would it make sens to use as many of them as possible to
> parallelize ? So far, 1 day worth log is 256 MB. In my understanding,
> provided that HDFS has 64MB blocks, I should use 4 mappers right ?
>
> If this is the case, since I'm in pseudo distrib for the moment, my number
> of mappers =1, so I could try to configure my setup with additional mappers.
>
>
> Does this make sense ?
>
> Thank you for your help !
>
> Sekine
>
>
>
>
> 2013/3/4 Dean Wampler 
>
>> RCFile won't help much (and apparently not all in this case ;) unless you
>> have a lot of columns and you always query just a few of them. However, you
>> should get better results with Sequence Files (binary format) and usually
>> with a compression scheme like BZip that supports block-level (as opposed
>> to file-level) compression.  Why? compressed files and also using sequence
>> files reduces the amount of disk IO and hence improves IO performance (a
>> bottleneck).
>>
>> Do you almost always query with a WHERE clause with a time range? If so,
>> consider partitioning your data by time ranges, e.g., year/month/day. Your
>> actual timestamp granularity would be chosen so that each folder (and yes,
>> they'll be individual folders) has data files at least 64MB or whatever
>> multiple of 64MB your using in your cluster. It could be that per-day is
>> the finest granularity or even per hour or minute, if you really have a lot
>> of data. Briefly, you want to minimize the number of mapper processes used
>> to process the data, and this is the granularity per mapper. Why partition,
>> because when you do SELECT * FROM mytable WHERE year = 2012 AND month = 3
>> AND day = 4, Hive knows it only has to read the contents of that single
>> directory, not all the directories...
>>
>> You might also consider clustering by URL. This feature (and the others)
>> is described on the Hive wiki. It can also speed up sampling of large data
>> sets and joins.
>>
>> I assume you're just using the virtual machine for experimenting. Lots of
>> overhead there, too!
>>
>> Hope this helps.
>> dean
>>
>> On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly wrote:
>>
>>> Hi there,
>>>
>>> I've setup a virtual machine hosting Hive.
>>> My use case is a Web traffic analytics, hence most of requests are :
>>>
>>> - how many requests today ?
>>> - how many request today, grouped by country ?
>>> - most requested urls ?
>>> - average http server response time (5 minutes slots) ?
>>>
>>> In other words, lets consider :
>>> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
>>> and
>>>
>>> SELECT COUNT(*) FROM logs;
>>> SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
>>> SELECT COUNT(*),url FROM logs BROUP BY url;
>>> SELECT AVG(http_rt) FROM logs ...
>>>
>>> 2 questions here :
>>> - How to generate 5 minutes slots to make my averages (in Postgresql, I
>>> used to generate_series() and JOIN) ? I wish I could avoid doing multiple
>>> requests each with a 'WHERE date>... AND date <...'. Maybe a mapper,
>>> mapping the date string to a aslot number ?
>>>
>>> - What is the best storage method pour this table ? Since it's purpose
>>> is analytical, I thought columnar format was the

Re: Hive sample test

2013-03-05 Thread Dean Wampler
Nice, yeah, that would do it.

On Tue, Mar 5, 2013 at 1:26 PM, Mark Grover wrote:

> I typically change my query to query from a limited version of the whole
> table.
>
> Change
>
> select really_expensive_select_clause
> from
> really_big_table
> where
> something=something
> group by something=something
>
> to
>
> select really_expensive_select_clause
> from
> (
> select
> *
> from
> really_big_table
> limit 100
> )t
> where
> something=something
> group by something=something
>
>
> On Tue, Mar 5, 2013 at 10:57 AM, Dean Wampler
>  wrote:
> > Unfortunately, it will still go through the whole thing, then just limit
> the
> > output. However, there's a flag that I think only works in more recent
> Hive
> > releases:
> >
> > set hive.limit.optimize.enable=true
> >
> > This is supposed to apply limiting earlier in the data stream, so it will
> > give different results that limiting just the output.
> >
> > Like Chuck said, you might consider sampling, but unless your table is
> > organized into buckets, you'll at least scan the whole table, but maybe
> not
> > do all computation over it ??
> >
> > Also, if you have a small sample data set:
> >
> > set hive.exec.mode.local.auto=true
> >
> > will cause Hive to bypass the Job and Task Trackers, calling APIs
> directly,
> > when it can do the whole thing in a single process. Not "lightning fast",
> > but faster.
> >
> > dean
> >
> > On Tue, Mar 5, 2013 at 12:48 PM, Joey D'Antoni 
> wrote:
> >>
> >> Just add a limit 1 to the end of your query.
> >>
> >>
> >>
> >>
> >> On Mar 5, 2013, at 1:45 PM, Kyle B  wrote:
> >>
> >> Hello,
> >>
> >> I was wondering if there is a way to quick-verify a Hive query before it
> >> is run against a big dataset? The tables I am querying against have
> millions
> >> of records, and I'd like to verify my Hive query before I run it
> against all
> >> records.
> >>
> >> Is there a way to test the query against a small subset of the data,
> >> without going into full MapReduce? As silly as this sounds, is there a
> way
> >> to MapReduce without the overhead of MapReduce? That way I can check my
> >> query is doing what I want before I run it against all records.
> >>
> >> Thanks,
> >>
> >> -Kyle
> >
> >
> >
> >
> > --
> > Dean Wampler, Ph.D.
> > thinkbiganalytics.com
> > +1-312-339-1330
> >
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Hive sample test

2013-03-05 Thread Dean Wampler
Unfortunately, it will still go through the whole thing, then just limit
the output. However, there's a flag that I think only works in more recent
Hive releases:

set hive.limit.optimize.enable=true

This is supposed to apply limiting earlier in the data stream, so it will
give different results than limiting just the output.

Like Chuck said, you might consider sampling, but unless your table is
organized into buckets, you'll at least scan the whole table, but maybe not
do all computation over it ??

Also, if you have a small sample data set:

set hive.exec.mode.local.auto=true

will cause Hive to bypass the Job and Task Trackers, calling APIs directly,
when it can do the whole thing in a single process. Not "lightning fast",
but faster.

dean

On Tue, Mar 5, 2013 at 12:48 PM, Joey D'Antoni  wrote:

> Just add a limit 1 to the end of your query.
>
>
>
>
> On Mar 5, 2013, at 1:45 PM, Kyle B  wrote:
>
> Hello,
>
> I was wondering if there is a way to quick-verify a Hive query before it
> is run against a big dataset? The tables I am querying against have
> millions of records, and I'd like to verify my Hive query before I run it
> against all records.
>
> Is there a way to test the query against a small subset of the data,
> without going into full MapReduce? As silly as this sounds, is there a way
> to MapReduce without the overhead of MapReduce? That way I can check my
> query is doing what I want before I run it against all records.
>
> Thanks,
>
> -Kyle
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Location of external table in hdfs

2013-03-05 Thread Dean Wampler
/tmp/states in HDFS.
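
You can confirm what Hive recorded for it with

DESCRIBE FORMATTED states;

and check the Location: field in the output, or just run hadoop fs -ls
/tmp/states.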

On Tue, Mar 5, 2013 at 10:56 AM, Sai Sai  wrote:

> I have created an external table like below and wondering where (folder)
> in hdfs i can find this:
>
> CREATE EXTERNAL TABLE states(abbreviation string, full_name string) ROW
> FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/tmp/states' ;
>
> Any help is really appreciated.
> Thanks
> Sai
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Error while exporting table data from hive to Oracle through Sqoop

2013-03-05 Thread Dean Wampler
.java:36)*
> ***
>
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>
> at
> org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:182)
> 
>
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> 
>
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:396)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> 
>
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Caused by: java.sql.BatchUpdateException: ORA-1: unique constraint
> (HDFSUSER.BTTN_BKP_PK) violated
>
>
>
> at
> oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:10345)
> 
>
> at
> oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:230)
> 
>
> at
> org.apache.sqoop.mapreduce.AsyncSqlOutputFormat$AsyncSqlExecThread.run(AsyncSqlOutputFormat.java:228)
> 
>
>
>
> 13/03/05 19:21:48 WARN mapred.JobClient: Error reading task
> outputConnection timed out
>
> 13/03/05 19:22:09 WARN mapred.JobClient: Error reading task
> outputConnection timed out
>
> 13/03/05 19:22:09 INFO mapred.JobClient: Job complete:
> job_201303051835_0010
>
> 13/03/05 19:22:09 INFO mapred.JobClient: Counters: 8
>
> 13/03/05 19:22:09 INFO mapred.JobClient:   Job Counters
>
> 13/03/05 19:22:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=77152
>
> 13/03/05 19:22:09 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 13/03/05 19:22:09 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 13/03/05 19:22:09 INFO mapred.JobClient: Rack-local map tasks=3
>
> 13/03/05 19:22:09 INFO mapred.JobClient: Launched map tasks=4
>
> 13/03/05 19:22:09 INFO mapred.JobClient: Data-local map tasks=1
>
> 13/03/05 19:22:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
>
> 13/03/05 19:22:09 INFO mapred.JobClient: Failed map tasks=1
>
> 13/03/05 19:22:09 INFO mapreduce.ExportJobBase: Transferred 0 bytes in
> 110.4837 seconds (0 bytes/sec)
>
> 13/03/05 19:22:09 INFO mapreduce.ExportJobBase: Exported 0 records.
>
> 13/03/05 19:22:09 ERROR tool.ExportTool: Error during export: Export job
> failed!
>
> [hadoop@NHCLT-PC44-2 sqoop-oper]$
>
> Regards,
>
> Ajit Kumar Shreevastava
>
>
>
> ::DISCLAIMER::
>
> 
>
> The contents of this e-mail and any attachment(s) are confidential and
> intended for the named recipient(s) only.
> E-mail transmission is not guaranteed to be secure or error-free as
> information could be intercepted, corrupted,
> lost, destroyed, arrive late or incomplete, or may contain viruses in
> transmission. The e mail and its contents
> (with or without referred errors) shall therefore not attach any liability
> on the originator or HCL or its affiliates.
> Views or opinions, if any, presented in this email are solely those of the
> author and may not necessarily reflect the
> views or opinions of HCL or its affiliates. Any form of reproduction,
> dissemination, copying, disclosure, modification,
> distribution and / or publication of this message without the prior
> written consent of authorized representative of
> HCL is strictly prohibited. If you have received this email in error
> please delete it and notify the sender immediately.
> Before opening any email and/or attachments, please check them for viruses
> and other defects.
>
>
> 
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: doubt with LEFT OUTER JOIN

2013-03-04 Thread Dean Wampler
None that I know of, but I haven't tried too many cases. Of course, there
shouldn't be ;)

On Mon, Mar 4, 2013 at 6:38 PM, kumar mr  wrote:

>
>
>  Hi dean,
>
>  Are there any known issues if we do LEFT OUTER JOIN between different
> table formats like Text and RCfile? Here the tables are of different format
> types.
>
>  Regards,
> Kumar
>
>
>
> -Original Message-
> From: Dean Wampler 
> To: user 
> Sent: Fri, Mar 1, 2013 12:23 pm
> Subject: Re: doubt with LEFT OUTER JOIN
>
>  I just tried an experiment where the right-hand table was empty. It
> worked fine. Could you post more details, like the query, create table
> statements, etc? What version of Hive?
>
>
> On Fri, Mar 1, 2013 at 2:13 PM, kumar mr  wrote:
>
>> Hi,
>>
>>  Does LEFT OUTER JOIN returns zero rows when RIGHT SIDE table doesn't
>> have at least one record that matches JOIN condition in Hive?
>>
>>  Regards,
>> Kumar
>>
>>
>
>
>  --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Best table storage for analytical use case

2013-03-04 Thread Dean Wampler
RCFile won't help much (and apparently not at all in this case ;) unless you
have a lot of columns and you always query just a few of them. However, you
should get better results with Sequence Files (a binary format), usually
combined with a compression scheme like BZip2 that supports block-level (as
opposed to file-level) compression. Why? Compression and sequence files
reduce the amount of disk IO and hence improve IO performance (a common
bottleneck).

Do you almost always query with a WHERE clause with a time range? If so,
consider partitioning your data by time ranges, e.g., year/month/day. Your
actual timestamp granularity would be chosen so that each folder (and yes,
they'll be individual folders) has data files totaling at least 64MB, or
whatever multiple of 64MB you're using in your cluster. It could be that
per-day is the finest granularity, or even per hour or minute if you really
have a lot of data. Briefly, you want to minimize the number of mapper
processes used to process the data, and this is the granularity per mapper.
Why partition? Because when you do SELECT * FROM mytable WHERE year = 2012
AND month = 3
AND day = 4, Hive knows it only has to read the contents of that single
directory, not all the directories...
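
Putting those two suggestions together, a rough sketch (the staging table,
the extra date column, and the codec are placeholders; use whatever codec
your cluster actually has installed):

SET hive.exec.compress.output = true;
SET mapred.output.compression.type = BLOCK;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
-- or GzipCodec / BZip2Codec

CREATE TABLE logs_part (url STRING, orig_country STRING, http_rt INT)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_part PARTITION (year = 2013, month = 3, day = 4)
SELECT url, orig_country, http_rt
FROM logs_staging
WHERE log_date = '2013-03-04';   -- log_date is hypothetical; use your raw timestamp column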

You might also consider clustering by URL. This feature (and the others) is
described on the Hive wiki. It can also speed up sampling of large data
sets and joins.

I assume you're just using the virtual machine for experimenting. Lots of
overhead there, too!

Hope this helps.
dean

On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly wrote:

> Hi there,
>
> I've setup a virtual machine hosting Hive.
> My use case is a Web traffic analytics, hence most of requests are :
>
> - how many requests today ?
> - how many request today, grouped by country ?
> - most requested urls ?
> - average http server response time (5 minutes slots) ?
>
> In other words, lets consider :
> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
> and
>
> SELECT COUNT(*) FROM logs;
> SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
> SELECT COUNT(*),url FROM logs BROUP BY url;
> SELECT AVG(http_rt) FROM logs ...
>
> 2 questions here :
> - How to generate 5 minutes slots to make my averages (in Postgresql, I
> used to generate_series() and JOIN) ? I wish I could avoid doing multiple
> requests each with a 'WHERE date>... AND date <...'. Maybe a mapper,
> mapping the date string to a aslot number ?
>
> - What is the best storage method pour this table ? Since it's purpose is
> analytical, I thought columnar format was the way to go. So I tried RCFILE
> buy the results are as follow for around 1 million rows (quite small, I
> know) and are quite the opposite I was expecting :
>
> Storage / query duration / disk table size
> TEXTFILE / 22 seconds / 250MB
> RCFILE / 31 seconds / 320 MB
>
>  I thought getting values in columns would speed up the aggregate process.
> Maybe the dataset is too small to tell, or I missed something ? Will adding
> Snappy compression help (not sure whether RCFiles are compressed or not) ?
>
> Thank you !
>
>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: doubt with LEFT OUTER JOIN

2013-03-01 Thread Dean Wampler
I just tried an experiment where the right-hand table was empty. It worked
fine. Could you post more details, like the query, create table statements,
etc? What version of Hive?
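
For reference, a minimal version of that kind of experiment (made-up names):

CREATE TABLE left_t (id INT, v STRING);
CREATE TABLE right_t (id INT, w STRING);   -- never loaded, stays empty

-- load a few rows into left_t, then:
SELECT l.id, l.v, r.w
FROM left_t l
LEFT OUTER JOIN right_t r
ON (l.id = r.id);

Every row of left_t comes back, with NULL for r.w, rather than zero rows.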


On Fri, Mar 1, 2013 at 2:13 PM, kumar mr  wrote:

> Hi,
>
>  Does LEFT OUTER JOIN returns zero rows when RIGHT SIDE table doesn't
> have at least one record that matches JOIN condition in Hive?
>
>  Regards,
> Kumar
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: regexp_replace with unicode chars

2013-03-01 Thread Dean Wampler
Anyone know if translate takes ranges, like some implementations? e.g.,

translate ('[a-z]', '[A-Z]')

Of course, that probably doesn't work for non-ascii characters.

On Fri, Mar 1, 2013 at 11:24 AM, Tom Hall  wrote:

> Thanks Dean,
>
> I dont think translate would work as the set of things to remove is
> massive.
> Yeah, it's a one-off cleanup job while exporting to try redshift on our
> datasets.
> My guess is it's something about the way hive handles strings? Tried
> "\\ufffd" as the replacement str but no joy either.
>
> Cheers again,
> Tom
>
>
>
> On 1 March 2013 17:08, Dean Wampler wrote:
>
>> I think this should work, but you might investigate using the translate
>> function instead. I suspect it will provide much better performance than
>> using regexps. Also, Are you planning to do this once to create your final
>> tables? If so, the performance overhead won't matter much.
>>
>> dean
>>
>>
>> On Fri, Mar 1, 2013 at 10:52 AM, Tom Hall wrote:
>>
>>> I would like to remove unicode chars that are outside the Basic
>>> Multilingual Plane [1]
>>>
>>> I thought
> select regexp_replace(some_column,"[^\\u0000-\\uFFFF]","\ufffd") from
>>> my_table
>>> would work but while the regexp does work the replacement str does not
>>> (I can paste in the literal �, which you may or may not be able to see here
>>> but it somehow did not fell right)
>>>
>>> I saw Deans previous post on using octals [2] but I think \ufffd is
>>> outside the allowable range.
>>>
>>> Cheers,
>>> Tom
>>>
>>>
>>> [1]
>>> http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
>>> [2]
>>> http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter
>>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: regexp_replace with unicode chars

2013-03-01 Thread Dean Wampler
I think this should work, but you might investigate using the translate
function instead. I suspect it will provide much better performance than
using regexps. Also, are you planning to do this once to create your final
tables? If so, the performance overhead won't matter much.

dean

On Fri, Mar 1, 2013 at 10:52 AM, Tom Hall  wrote:

> I would like to remove unicode chars that are outside the Basic
> Multilingual Plane [1]
>
> I thought
> select regexp_replace(some_column,"[^\\u0000-\\uFFFF]","\ufffd") from
> my_table
> would work but while the regexp does work the replacement str does not (I
> can paste in the literal �, which you may or may not be able to see here
> but it somehow did not fell right)
>
> I saw Deans previous post on using octals [2] but I think \ufffd is
> outside the allowable range.
>
> Cheers,
> Tom
>
>
> [1]
> http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
> [2]
> http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Books and good starting point for Hive

2013-02-24 Thread Dean Wampler
Wow! You guys are my new best friends!

Seriously, I'm grateful you've found my participation in the list and the
book helpful. I'm sure Ed and Jason would agree (at least about the book ;)

Yours,

dean

-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330

On Sun, Feb 24, 2013 at 12:10 PM, Lefty Leverenz wrote:

> +1 for *Programming Hive* (Capriolo. Wampler & Rutherglen, 2012 O'Reilly
> although the author order on the cover is Rutherglen, Wampler & Capriolo).
>  Available in various formats at O'Reilly and Amazon.
>
> Did you already look at the Hive wiki (
> https://cwiki.apache.org/confluence/display/Hive/Home)?
>
> In particular:
>
>- https://cwiki.apache.org/confluence/display/Hive/GettingStarted
>- https://cwiki.apache.org/confluence/display/Hive/Tutorial
>
>
> – Lefty Leverenz
>
>
>
> On Sun, Feb 24, 2013 at 9:04 AM, Joey D'Antoni  wrote:
>
>> I second this recommendation. Dean's Hive book helped me more with Hadoop
>> than the o'reilly Hadoop book did.
>>
>> Joey D'Antoni
>>
>>
>>
>>
>> On Feb 24, 2013, at 11:57 AM, John Omernik  wrote:
>>
>> Hello William - Dean Wampler posts quit often on this list and has done
>> (to my eye) a great job of separating his business (he and other authors
>> have written a Hive book)  from the community aspect of (he participates
>> freely on the list without a lot of self promotion).
>>
>> Therefore, I will give him some unself (is that a word?) promotion for
>> his book.  Dean, and others have written a Hive book published by O'Reilly
>> (I included a link to it on Amazon)
>>
>> A few thoughts. Prior to getting the book, I had used Hive from a
>> standpoint of a strong SQL background and no big data experience for about
>> a year. Much of what Hive is and can do was abstracted away from me, and I
>> just wrote queries. After getting the Programming Hive book I truly
>> understood Hive AND I also understod Hadoop better. I had purchased the
>> Hadoop O'Reilly book when I started my new position, but found Hadoop to be
>> pretty beyond me (at the time).  Since then, experience and the O'Reilly
>> Hive book have been awesome for me understanding what makes Hive great.
>>  Give it a shout. I have no financial interest, and Dean and the Other
>> Authors don't know me at all, this is just a freely given thought on the
>> book.
>>
>>
>>
>> http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335/
>>
>>
>>
>>
>>
>>
>>
>> On Sun, Feb 24, 2013 at 10:43 AM, William Kang wrote:
>>
>>> Hi All,
>>> I just get started with Hive.
>>>
>>> I already installed Hive and finished a few examples.
>>>
>>> Would you please give me some suggestions on a few good books or
>>> tutorials for me to continue?
>>>
>>> Many thanks.
>>>
>>>
>>> William
>>>
>>
>>
>


Re: Loading json files into hive table is giving NULL as output(data is in s3 bucket)

2013-02-18 Thread Dean Wampler
I don't know what changed. I didn't work on this SerDe myself. I'm glad
it's working now for you, though!

dean

On Mon, Feb 18, 2013 at 9:31 AM, Chunky Gupta wrote:

> Hi Dean,
>
> I was using *hive-json-serde-0.2.jar* earlier. Now I tried 
> *hive-json-serde-0.3.jar
> *as you suggested and it is working fine, I am getting the output as
> expected.
>
> Can you please tell me that what code change from 0.2 to 0.3 could have
> solved this problem ?
>
>
> Thanks,
> Chunky.
>
>
> On Mon, Feb 18, 2013 at 8:47 PM, Chunky Gupta wrote:
>
>> Hi Dean,
>>
>> I tried with removing underscore too, and getting the same output which
>> means problem is not with underscore. Yes, it was an example.
>>
>> Actual json file is like :-
>>
>>
>> {"colnamec":"ColNametest","colnamets":"2013-01-14","colnameip":"10.10.10.10","colnameid":"10","colnameid2":"100","colnamep":0,"colnamecp":0,"colnamep":1,"colnameed":"31509","colnamesw":0,"colnamesu2":3,"colnameqq":"0","colnameppaa":0,"colnameqwe1":0,"colnamerty2":0,"colnameiop":"1000","colnamebnm":"23425253RFDSE","colnamefgh":2,"colnameagl":"","colnameyhgb":["1234","12345","2345","56789"],"colnamepoix":["12","4567","123","5678"],"colnamedswer":["100","567","123","678"],"colnamewerui":["10","10","10","10"]}
>>
>> I tried extracting one column only as I mentioned in last mail.
>>
>> There are values not in double quotes, some are null and some keys are
>> having multiple values.
>> Dean, is this json file correct for HIVE to handle it ?
>>
>> Thanks,
>> Chunky.
>>
>>
>>
>>
>>
>> On Mon, Feb 18, 2013 at 6:23 PM, Dean Wampler <
>> dean.wamp...@thinkbiganalytics.com> wrote:
>>
>>> The "uname="$._u" is the correct form. We also hacked on this SerDe at
>>> Think Big Analytics. I don't know if you'll see an improvement though.
>>>
>>> https://github.com/thinkbiganalytics/hive-json-serde
>>>
>>> I wonder if there's a problem handling the leading underscore?
>>>
>>> Also, I know it's just an example, but in case it was taken from a real
>>> situation, the dates in your example are for January.
>>>
>>> dean
>>>
>>> On Mon, Feb 18, 2013 at 6:43 AM, Chunky Gupta 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have data in s3 bucket, which is in json format and is a zip file. I
>>>> have added this jar file in hive console :-
>>>>
>>>> http://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar&can=2&q=
>>>>
>>>> I tried the following steps to create table and load data :-
>>>>
>>>> 1. CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
>>>> STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
>>>> WITH SERDEPROPERTIES ( "uname"="$._u" ) LOCATION
>>>> 's3://BUCKET_NAME/test_data/'
>>>>
>>>>I tried this also :-
>>>>
>>>> CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
>>>> STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
>>>> WITH SERDEPROPERTIES ( "uname"="_u" ) LOCATION
>>>> 's3://BUCKET_NAME/test_data/'
>>>>
>>>>
>>>>
>>>> 2. alter table table_test add partition (dt='13Feb2012') location
>>>> 's3n://BUCKET_NAME/test_data/13Feb2012';
>>>>
>>>> and json file is like this :-
>>>> -
>>>> {"_u":"test_name1","_ts":"2012-01-13","_ip":"IP1"}
>>>> {"_u":"test_name2","_ts":"2012-01-13","_ip":"IP2"}
>>>> {"_u":"test_name3","_ts":"2012-01-13","_ip":"IP3"}
>>>>
>>>>
>>>> When I query :-
>>>> select uname from table_test;
>>>>
>>>> Output :-
>>>> NULL 13Feb2012
>>>> NULL 13Feb2012
>>>> NULL 13Feb2012
>>>>
>>>>
>>>> Please help me and let me know how to add json data in a table.
>>>>
>>>> Thanks,
>>>> Chunky.
>>>>
>>>
>>>
>>>
>>> --
>>> *Dean Wampler, Ph.D.*
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>>
>>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Loading json files into hive table is giving NULL as output(data is in s3 bucket)

2013-02-18 Thread Dean Wampler
The "uname="$._u" is the correct form. We also hacked on this SerDe at
Think Big Analytics. I don't know if you'll see an improvement though.

https://github.com/thinkbiganalytics/hive-json-serde

I wonder if there's a problem handling the leading underscore?

Also, I know it's just an example, but in case it was taken from a real
situation, the dates in your example are for January.

dean

On Mon, Feb 18, 2013 at 6:43 AM, Chunky Gupta wrote:

> Hi,
>
> I have data in s3 bucket, which is in json format and is a zip file. I
> have added this jar file in hive console :-
>
> http://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar&can=2&q=
>
> I tried the following steps to create table and load data :-
>
> 1. CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
> STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
> WITH SERDEPROPERTIES ( "uname"="$._u" ) LOCATION
> 's3://BUCKET_NAME/test_data/'
>
>I tried this also :-
>
> CREATE EXTERNAL TABLE table_test ( uname STRING ) PARTITIONED BY (dt
> STRING ) ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
> WITH SERDEPROPERTIES ( "uname"="_u" ) LOCATION
> 's3://BUCKET_NAME/test_data/'
>
>
>
> 2. alter table table_test add partition (dt='13Feb2012') location
> 's3n://BUCKET_NAME/test_data/13Feb2012';
>
> and json file is like this :-
> -
> {"_u":"test_name1","_ts":"2012-01-13","_ip":"IP1"}
> {"_u":"test_name2","_ts":"2012-01-13","_ip":"IP2"}
> {"_u":"test_name3","_ts":"2012-01-13","_ip":"IP3"}
>
>
> When I query :-
> select uname from table_test;
>
> Output :-
> NULL 13Feb2012
> NULL 13Feb2012
> NULL 13Feb2012
>
>
> Please help me and let me know how to add json data in a table.
>
> Thanks,
> Chunky.
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Install / Download of Hive 0.7.0 or 0.7.1

2013-02-15 Thread Dean Wampler
Any particular reason you want to use such an old version?

I see it's not even listed in the Apache releases anymore.
http://hive.apache.org/releases.html

(and following the links to the ftp sites doesn't show it either...)

dean

On Fri, Feb 15, 2013 at 1:08 PM, Vince George wrote:

>  Hi,
>
> Where can I get an install/download of Hive 0.7.0 or 0.7.1?
>
> Thx…
>
> Regards,
>
> Vince George
>
> Composite Software
>
> Mobile: 201-519-3777
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: CREATE EXTERNAL TABLE Fails on Some Directories

2013-02-15 Thread Dean Wampler
Something's odd about this output; why is there no / in front of 715? I
always get the full path when I run a -ls command. I would expect either:

/715/file.csv
or
/user//715/file.csv

Or is that what you meant by "(didn't leave rest of ls results)"?

dean

On Fri, Feb 15, 2013 at 10:45 AM, Joseph D Antoni wrote:

> [cloudera@localhost data]$ hdfs dfs -ls 715
> Found 13 items
> -rw-r--r--   1 cloudera cloudera7853975 2013-02-15 00:41 715/file.csv
> (didn't leave rest of ls results)
>
>
> Thanks on the directory--wasn't clear on that..
>
> Joey
>
>
>
>   --
> *From:* Dean Wampler 
> *To:* user@hive.apache.org; Joseph D Antoni 
> *Sent:* Friday, February 15, 2013 11:37 AM
> *Subject:* Re: CREATE EXTERNAL TABLE Fails on Some Directories
>
> You confirmed that 715 is an actual directory? It didn't become a file by
> accident?
>
> By the way, you don't need to include the file name in the LOCATION. It
> will read all the files in the directory.
>
> dean
>
> On Fri, Feb 15, 2013 at 10:29 AM, Joseph D Antoni wrote:
>
> I'm trying to create a series of external tables for a time series of data
> (using the prebuilt Cloudera VM).
>
> The directory structure in HDFS is as such:
>
> /711
> /712
> /713
> /714
> /715
> /716
> /717
>
> Each directory contains the same set of files, from a different day. They
> were all put into HDFS using the following script:
>
> for i in *;do hdfs dfs -put $i in $dir;done
>
> They all show up with the same ownership/perms in HDFS.
>
> Going into Hive to build the tables, I built a set of scripts to do the
> loads--then did a sed (changing 711 to 712,713, etc) to a file for each
> day. All of my loads work, EXCEPT for 715 and 716.
>
> Script is as follows:
>
> create external table 715_table_name
> (col1 string,
> col2 string)
> row format
> delimited fields terminated by ','
> lines terminated by '\n'
> stored as textfile
> location '/715/file.csv';
>
> This is failing with:
>
> Error in Metadata MetaException(message:Got except:
> org.apache.hadoop.fs.FileAlreadyExistsException Parent Path is not a
> directory: /715 715...
>
> Like I mentioned it works for all of the other directories, except 715 and
> 716. Thoughts on troubleshooting path?
>
> Thanks
>
> Joey D'Antoni
>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: CREATE EXTERNAL TABLE Fails on Some Directories

2013-02-15 Thread Dean Wampler
You confirmed that 715 is an actual directory? It didn't become a file by
accident?

By the way, you don't need to include the file name in the LOCATION. It
will read all the files in the directory.

dean

On Fri, Feb 15, 2013 at 10:29 AM, Joseph D Antoni wrote:

> I'm trying to create a series of external tables for a time series of data
> (using the prebuilt Cloudera VM).
>
> The directory structure in HDFS is as such:
>
> /711
> /712
> /713
> /714
> /715
> /716
> /717
>
> Each directory contains the same set of files, from a different day. They
> were all put into HDFS using the following script:
>
> for i in *;do hdfs dfs -put $i in $dir;done
>
> They all show up with the same ownership/perms in HDFS.
>
> Going into Hive to build the tables, I built a set of scripts to do the
> loads--then did a sed (changing 711 to 712,713, etc) to a file for each
> day. All of my loads work, EXCEPT for 715 and 716.
>
> Script is as follows:
>
> create external table 715_table_name
> (col1 string,
> col2 string)
> row format
> delimited fields terminated by ','
> lines terminated by '\n'
> stored as textfile
> location '/715/file.csv';
>
> This is failing with:
>
> Error in Metadata MetaException(message:Got except:
> org.apache.hadoop.fs.FileAlreadyExistsException Parent Path is not a
> directory: /715 715...
>
> Like I mentioned it works for all of the other directories, except 715 and
> 716. Thoughts on troubleshooting path?
>
> Thanks
>
> Joey D'Antoni
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Generate Hive DDL

2013-02-15 Thread Dean Wampler
If you're using Hive v0.10, it already implements the new SHOW CREATE TABLE
mytable command.
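
For example (database and table names are placeholders):

USE mydatabase;
SHOW CREATE TABLE mytable;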

On Fri, Feb 15, 2013 at 12:07 AM, Anandha L Ranganathan <
analog.s...@gmail.com> wrote:

> 1) download the jar from
>   from Apache JIRA https://issues.apache.org/jira/browse/HIVE-967
>
> 2) run the following command to generate DDLs.
>
>  hive --service jar <>
> com.media6.hive2rdbms.job.ShowCreateTable -D db.name=default -D 
> table.name=<>
> 2>/dev/null
>
>
>
> On Thu, Feb 14, 2013 at 9:41 PM, Murtaza Doctor  > wrote:
>
>>  Folks,
>>
>>  I was wondering if any one is aware of a way to generate Hive DDL
>> scripts using the existing metastore which in our case is in postgres. We
>> have a few tables which have been historically created and figuring our
>> creating ways to get the schema as DDL so that we can keep it in under
>> version control.  We could replicate the entire metastore would we just
>> want 4 tables out of the 100 we already have in place.
>>
>>  Thanks,
>> murtaza
>>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re:

2013-02-14 Thread Dean Wampler
According to your stack trace, you have NullPointerException on line 137 of
your UDF.

On Thu, Feb 14, 2013 at 2:28 AM, neelesh gadhia  wrote:

> Hello,
>
> I am a Newbie to using UDF's on hive. But implemented these GenericJDF (
> https://issues.apache.org/jira/browse/HIVE-2361 ) on hive 0.9.0 and
> hadoop 1.1.1. Was able to add jar to hive
>
> hive> select * from emp;
> OK
> 1101000
> 2101200
> 3121500
> 412300
> 5121800
> 6205000
> 7207000
> 8201
> Time taken: 0.191 seconds
>
> hive> add jar
> /usr/local/Cellar/hive/0.9.0/libexec/lib/GenUDF.jar;
>
> Added /usr/local/Cellar/hive/0.9.0/libexec/lib/GenUDF.jar to class path
> Added resource: /usr/local/Cellar/hive/0.9.0/libexec/lib/GenUDF.jar
>
> hive> create temporary function nexr_sum as
> 'com.nexr.platform.analysis.udf.GenericUDFSum';
> OK
> Time taken: 0.012 seconds
>
>
> and kicked the sample sql shown below.
>
> SELECT t.empno, t.deptno, t.sal, nexr_sum(hash(t.deptno),t.sal) as sal_sum
> FROM (
> select a.empno, a.deptno, a.sal from emp a
> distribute by hash(a.deptno)
> sort BY a.deptno, a.empno
> ) t;
>
> The sql failed with errors. Any pointers or advise towards resolving this
> is much appreciated.
>
> 2013-02-13 23:30:18,925 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (REDUCE) 'attempt_201302132324_0002_r_00_3' to tip
> task_201302132324_0002_r_00, for tracker
> 'tracker_192.168.0.151:localhost/127.0.0.1:50099'
> 2013-02-13 23:30:18,925 INFO org.apache.hadoop.mapred.JobTracker: Removing
> task 'attempt_201302132324_0002_r_00_2'
> 2013-02-13 23:30:26,484 INFO org.apache.hadoop.mapred.TaskInProgress:
> Error from attempt_201302132324_0002_r_00_3:
> java.lang.RuntimeException: Error in configuring object
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:486)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 9 more
> Caused by: java.lang.RuntimeException: Reduce operator initialization
> failed
> at
> org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:157)
> ... 14 more
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:137)
> at
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:896)
> at
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:922)
> at
> org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:60)
> at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
> at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433)
> at
> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389)
> at
> org.apache.hadoop.hive.ql.exec.ExtractOperator.initializeOp(ExtractOperator.java:40)
> at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357)
> at
> org.apache.hadoop.hive.ql.exec.ExecReducer.configure(ExecReducer.java:150)
> ... 14 more
> 2013-02-13 23:30:29,819 INFO org.apache.hadoop.mapred.TaskInProgress:
> TaskInProgress task_201302132324_0002_r_00 has failed 4 times.
> 2013-02-13 23:30:29,820 INFO org.apache.hadoop.mapred.JobInProgress:
> TaskTracker at '192.168.0.151' turned 'flaky'
>  12 more lines..
>
> Tried different function "GenericUDFMax".. same error.
>
> Any pointers/advise, what could be wrong?
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: INSERT INTO table with STRUCT, SELECT FROM

2013-02-13 Thread Dean Wampler
Hmm. I tried the following hacks, but all wouldn't parse. Ideas?

I changed:

  ... select struct(x,y) ...

to

  ... select struct(x,y) as struct ...
  ... select cast(struct(x,y) as struct) ...
  ... select struct(x as a,y as b) ...

Okay, but there is a hack that does work; bypass INSERT INTO and just
write to the directory:

INSERT OVERWRITE DIRECTORY '/path/to/table/directory' SELECT ...;

Just be careful it doesn't clobber any files already there. I'm paranoid,
so I would write to a different directory and then move the files over...

dean

On Wed, Feb 13, 2013 at 1:26 PM, Michael Malak wrote:

> Is it possible to INSERT INTO TABLE t SELECT FROM where t has a column
> with a STRUCT?
>
> Based on
>
> http://grokbase.com/t/hive/user/109r87hh3e/insert-data-into-a-column-of-complex-type
>
> I thought perhaps the following would work:
>
> echo 1,2 >twovalues.csv
> hive
> CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY
> ',';
> LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE tc;
> CREATE TABLE oc (z STRUCT);
> INSERT INTO TABLE oc SELECT struct(x,y) FROM tc;
>
> but when I do the above I get:
>
> FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into
> target table because column number/types are different 'oc': Cannot convert
> column 0 from struct to struct.
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Change timestamp format in hive

2013-02-13 Thread Dean Wampler
I'll mention some workarounds, but they all add overhead:

1. Use STRING for the column, then parse it with the date functions
Alexander mentioned.
2. Use STRING, then replace the offending '-' with a space, e.g.,

select printf("%s %s", substr('2013-02-13-08:11:22', 0, 10),
substr('2013-02-13-08:11:22', 12)) as time from hadoop.stocks limit 1;

Obviously I hard coded a value for the string; you would pass in a column
name.

You could even cast the result to TIMESTAMP:

select cast(printf("%s %s", substr('2013-02-13-08:11:22', 0, 10),
substr('2013-02-13-08:11:22', 12)) as TIMESTAMP) as time from hadoop.stocks
limit 1;

3. If you just need the year-month-day, i.e., the date, the to_date
function appears to work fine with these strings. However, there isn't a
corresponding to_time function for the HMS.

Ugly and not something you would want to do for every query. However, you
could hide any of these hacks behind a view.
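
As a sketch of that idea (the table and column names are made up for
illustration):

CREATE VIEW events_with_ts AS
SELECT
  cast(printf("%s %s", substr(ts_string, 0, 10), substr(ts_string, 12))
       AS TIMESTAMP) AS ts,
  other_column
FROM raw_events;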

The best thing to do would be to have your ETL process convert these
strings while loading into HDFS, if possible.

I'm experimenting with Hive v0.10, by the way.

dean

On Wed, Feb 13, 2013 at 3:48 AM, Alexander Alten-Lorenz  wrote:

> May
> https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-DateFunctionshelp
>  you?
>
> - Alex
>
> On Feb 13, 2013, at 10:43 AM, Chunky Gupta 
> wrote:
>
> > Hi,
> >
> > I have a log file which has timestamp in format "-MM-DD-HH:MM:SS".
> But since the timestamp datatype format in hive is "-MM-DD HH:MM:SS".
> > I created a table with datatype of that column as TIMESTAMP. But when I
> load the data it is throwing error. I think it is because of difference in
> format.
> >
> > Is there any way to set the timestamp format while creating the table.
> Or is there some other solution for this issue ?
> >
> > Thanks,
> > Chunky.
>
> --
> Alexander Alten-Lorenz
> http://mapredit.blogspot.com
> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: How to load hive metadata from conf dir

2013-02-12 Thread Dean Wampler
But then you're writing Java code!!! The Horror!!!

;^P

On Tue, Feb 12, 2013 at 10:53 AM, Edward Capriolo wrote:

> If you use hive-thrift/hive-service you can get the location of a
> table through the Table API (instead of Dean's horrid bash-isms)
>
>
> http://hive.apache.org/docs/r0.7.0/api/org/apache/hadoop/hive/metastore/api/ThriftHiveMetastore.Client.html#get_table(java.lang.String
> ,
> java.lang.String)
>
> Table t = 
> t.getSd().getLocation()
>
>
> On Tue, Feb 12, 2013 at 9:41 AM, Dean Wampler
>  wrote:
> > I'll mention another bash hack that I use all the time:
> >
> > hive -e 'some_command' | grep for_what_i_want |
> > sed_command_to_remove_just_i_dont_want
> >
> > For example, the following command will print just the value of
> > hive.metastore.warehouse.dir, sending all the logging junk written to
> stderr
> > to /dev/null and stripping off the leading
> "hive.metastore.warehouse.dir="
> > from the stdout output:
> >
> > hive -e 'set hive.metastore.warehouse.dir;' 2> /dev/null | sed -e
> > 's/hive.metastore.warehouse.dir=//'
> >
> > (No grep subcommand required in this case...)
> >
> > You could do something similar with DESCRIBE EXTENDED table PARTION(...)
> > Suppose you want a script that works for any property. Put the following
> in
> > a script file, say hive-prop.sh:
> >
> > #!/bin/sh
> > hive -e "set $1;" 2> /dev/null | sed -e "s/$1=//"
> >
> > Make it executable (chmod +x /path/to/hive-prop.sh), then run it this
> way:
> >
> > /path/to/hive-prop.sh hive.metastore.warehouse.dir
> >
> > Back to asking for metadata for a table. The following script will
> > determine the location of a particular partition for an external
> > "mydatabase.stocks" table:
> >
> > #!/bin/sh
> > hive -e "describe formatted mydatabase.stocks
> partition(exchange='NASDAQ',
> > symbol='AAPL');" 2> /dev/null | grep Location | sed -e "s/Location:[
> \t]*//"
> >
> > dean
> >
> > On Mon, Feb 11, 2013 at 4:59 PM, Parag Sarda 
> wrote:
> >>
> >> Hello Hive Users,
> >>
> >> I am writing a program in java which is bundled as JAR and executed
> using
> >> hadoop jar command. I would like to access hive metadata (read
> partitions
> >> informations) in this program. I can ask user to set HIVE_CONF_DIR
> >> environment variable before calling my program or ask for any reasonable
> >> parameters to be passed. I do not want to force user to run hive
> megastore
> >> service if possible to increase reliability of program by avoiding
> >> external dependencies.
> >>
> >> What is the recommended way to get partitions information? Here is my
> >> understanding
> >> 1. Make sure my jar is bundled with hive-metastore[1] library.
> >> 2. Use HiveMetastoreClient[2]
> >>
> >> Is this correct? If yes, how to read the hive configuration[3] from
> >> HIVE_CONF_DIR?
> >>
> >> [1] http://mvnrepository.com/artifact/org.apache.hive/hive-metastore
> >> [2]
> >>
> >>
> http://hive.apache.org/docs/r0.7.1/api/org/apache/hadoop/hive/metastore/Hiv
> >> eMetaStoreClient.html
> >> [3]
> >>
> >>
> http://hive.apache.org/docs/r0.7.1/api/org/apache/hadoop/hive/conf/HiveConf
> >> .html
> >>
> >> Thanks in advance,
> >> Parag
> >>
> >
> >
> >
> > --
> > Dean Wampler, Ph.D.
> > thinkbiganalytics.com
> > +1-312-339-1330
> >
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Transfer Data to new location

2013-02-12 Thread Dean Wampler
hadoop fs -mv old_path new_path

If the new_path isn't in HDFS, use -get instead of -mv.

If you're moving Hive tables, you should then use ALTER TABLE to change the
metadata. Or, if the tables are external and have no partitions, you could
just drop them and recreate them with the new location.
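
For example (table name and path are hypothetical):

ALTER TABLE mytable SET LOCATION 'hdfs://namenode/new_path/mytable';

For a partitioned table you would do the same per partition with
ALTER TABLE mytable PARTITION (...) SET LOCATION '...'.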

On Tue, Feb 12, 2013 at 10:50 AM, Hamza Asad  wrote:

> i want to change my HIVE/HDFS directory.. how can i transfer data which
> resides in HDFS to some other directory ? either i have have to use
> sqoop-export followed by sqoop-import to new location?? or is there any
> easier solution for safely data transfer??
>
> --
> *Muhammad Hamza Asad *
>  +923457261988
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: How to load hive metadata from conf dir

2013-02-12 Thread Dean Wampler
I'll mention another bash hack that I use all the time:

hive -e 'some_command' | grep for_what_i_want |
sed_command_to_remove_just_i_dont_want

For example, the following command will print just the value of
hive.metastore.warehouse.dir, sending all the logging junk written to
stderr to /dev/null and stripping off the leading
"hive.metastore.warehouse.dir=" from the stdout output:

hive -e 'set hive.metastore.warehouse.dir;' 2> /dev/null | sed -e
's/hive.metastore.warehouse.dir=//'

(No grep subcommand required in this case...)

You could do something similar with DESCRIBE EXTENDED table PARTION(...)
Suppose you want a script that works for any property. Put the following in
a script file, say hive-prop.sh:

#!/bin/sh
hive -e "set $1;" 2> /dev/null | sed -e "s/$1=//"

Make it executable (chmod +x /path/to/hive-prop.sh), then run it this way:

/path/to/hive-prop.sh hive.metastore.warehouse.dir

Back to asking for metadata for a table. The following script will
determine the location of a particular partition for an external
"mydatabase.stocks" table:

#!/bin/sh
hive -e "describe formatted mydatabase.stocks partition(exchange='NASDAQ',
symbol='AAPL');" 2> /dev/null | grep Location | sed -e "s/Location:[ \t]*//"

dean

On Mon, Feb 11, 2013 at 4:59 PM, Parag Sarda  wrote:

> Hello Hive Users,
>
> I am writing a program in java which is bundled as JAR and executed using
> hadoop jar command. I would like to access hive metadata (read partitions
> informations) in this program. I can ask user to set HIVE_CONF_DIR
> environment variable before calling my program or ask for any reasonable
> parameters to be passed. I do not want to force user to run hive megastore
> service if possible to increase reliability of program by avoiding
> external dependencies.
>
> What is the recommended way to get partitions information? Here is my
> understanding
> 1. Make sure my jar is bundled with hive-metastore[1] library.
> 2. Use HiveMetastoreClient[2]
>
> Is this correct? If yes, how to read the hive configuration[3] from
> HIVE_CONF_DIR?
>
> [1] http://mvnrepository.com/artifact/org.apache.hive/hive-metastore
> [2]
> http://hive.apache.org/docs/r0.7.1/api/org/apache/hadoop/hive/metastore/Hiv
> eMetaStoreClient.html
> [3]
> http://hive.apache.org/docs/r0.7.1/api/org/apache/hadoop/hive/conf/HiveConf
> .html
>
> Thanks in advance,
> Parag
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Combine multiple row values based upon a condition.

2013-02-03 Thread Dean Wampler
If you really only need to consider adjacent rows, it might just be easier
to write a UDF or use streaming, where your code remembers the last record
seen and emits a new record if you want to do the join with the current
record.

On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen wrote:

> Hi all,
>
> I'm new to Apache Hive and I am doing some tests to see if it fits my needs.
> One of the questions I have is whether it is possible to "peek" at the next
> row in order to find out if the values should be combined. Let me explain by an
> example.
>
> Let say my data looks like this
>
> Id name offset
> 1 Jan 100
> 2 Janssen 104
> 3 Klaas 150
> 4 Jan 160
> 5 Janssen 164
>
> An my output to another table should be this
>
> Id fullname offsets
> 1 Jan Janssen [ 100, 160 ]
>
> I would like to combine the name values from two rows where the offset of
> the two rows are no more than 1 character apart.
>
> Is this type of data manipulation possible, and if it is, could someone
> point me in the right direction, hopefully with some explanation?
>
> Kind regards
> Martijn




-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread Dean Wampler
100 rows
>>>>> 2013-01-29 08:20:27,170 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 100 rows
>>>>> 2013-01-29 08:20:27,170 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100 rows
>>>>> 2013-01-29 08:20:27,171 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 100 rows
>>>>> 2013-01-29 08:20:27,171 INFO
>>>>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 100 rows
>>>>> 2013-01-29 08:20:27,171 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 100 rows
>>>>> 2013-01-29 08:20:27,171 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 100 rows
>>>>> 2013-01-29 08:20:27,171 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 100 rows
>>>>> 2013-01-29 08:21:16,247 INFO
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1000 rows
>>>>> 2013-01-29 08:21:16,247 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1000 rows
>>>>> 2013-01-29 08:21:16,247 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1000 rows
>>>>> 2013-01-29 08:21:16,247 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1000 rows
>>>>> 2013-01-29 08:21:16,248 INFO
>>>>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1000 rows
>>>>> 2013-01-29 08:21:16,248 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1000 rows
>>>>> 2013-01-29 08:21:16,248 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1000 rows
>>>>> 2013-01-29 08:21:16,248 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1000 rows
>>>>> 2013-01-29 08:25:47,532 INFO
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: 3 forwarding 1 rows
>>>>> 2013-01-29 08:25:47,532 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 4 forwarding 1 rows
>>>>> 2013-01-29 08:25:47,532 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 1 rows
>>>>> 2013-01-29 08:25:47,532 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
>>>>>  2013-01-29 08:25:47,532 INFO
>>>>> org.apache.hadoop.hive.ql.exec.FilterOperator: 8 forwarding 1 rows
>>>>> 2013-01-29 08:25:47,532 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 9 forwarding 1 rows
>>>>> 2013-01-29 08:25:47,532 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 10 forwarding 1 rows
>>>>> 2013-01-29 08:25:47,532 INFO
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator: 12 forwarding 1 rows
>>>>> 2013-01-29 08:27:34,276 WARN
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: Exception in
>>>>> StreamThread.run(): Java heap space
>>>>> Cause: null
>>>>> 2013-01-29 08:27:34,277 WARN
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: java.lang.OutOfMemoryError:
>>>>> Java heap space
>>>>> at java.util.Arrays.copyOfRange(Arrays.java:3209)
>>>>>  at java.lang.String.(String.java:215)
>>>>> at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
>>>>>  at java.nio.CharBuffer.toString(CharBuffer.java:1157)
>>>>> at org.apache.hadoop.io.Text.decode(Text.java:350)
>>>>>  at org.apache.hadoop.io.Text.decode(Text.java:327)
>>>>> at org.apache.hadoop.io.Text.toString(Text.java:254)
>>>>>  at java.lang.String.valueOf(String.java:2826)
>>>>> at java.lang.StringBuilder.append(StringBuilder.java:115)
>>>>>  at
>>>>> org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:873)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:181)
>>>>>  at
>>>>> org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.evaluate(ExprNodeGenericFuncEvaluator.java:163)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:76)
>>>>>  at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>>>>> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>>>>>  at
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator$OutputStreamProcessor.processLine(ScriptOperator.java:477)
>>>>> at
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator$StreamThread.run(ScriptOperator.java:563)
>>>>>
>>>>> 2013-01-29 08:27:34,306 INFO
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: ErrorStreamProcessor 
>>>>> calling
>>>>> reporter.progress()
>>>>> 2013-01-29 08:27:34,307 INFO
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: StreamThread ErrorProcessor
>>>>> done
>>>>> 2013-01-29 08:27:34,307 ERROR
>>>>> org.apache.hadoop.hive.ql.exec.ScriptOperator: Script failed with code 1
>>>>>
>>>>
>>>>
>>>
>>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: ALTER TABLE CHANGE COLUMN issue

2013-01-30 Thread Dean Wampler
Right, the very important thing to remember about ALTER TABLE is that it
only changes metadata about your table. It doesn't modify the data in any
way. You have to do that yourself.
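
As a hedged illustration with a simple column type change (table and
column names are hypothetical): after the ALTER, the files on disk still
hold the old representation, so you rewrite them yourself, e.g. into a
staging table:

CREATE TABLE events_new (id INT, amount DOUBLE, event_time STRING);
INSERT OVERWRITE TABLE events_new
SELECT id, cast(amount AS DOUBLE), event_time FROM events;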

On Wed, Jan 30, 2013 at 2:17 AM, Nitin Pawar wrote:

> after u did alter table, did you add any new data to table with new
> schema?
>
> for the old data already present in data, if you add anything new in
> columns it will be null value
>
>
> On Wed, Jan 30, 2013 at 1:44 PM, hardik doshi wrote:
>
>> Hi,
>>
>> I am running into an issue where ALTER TABLE CHANGE COLUMN does not seem
>> to be working.
>>
>> I have a table with a column data type looking like array> b:int>> and I am trying to it change to array> c:string>> based
>> on the underlying data schema change.
>>
>> The alter command succeeds and subsequent describe call shows me the
>> updated table structure. But when tried querying the table,
>> it returns null for the newly added field.
>>
>> This does not happen when a new table with updated column data type is
>> created.
>>
>> Is this a known bug?
>>
>> Thanks,
>> Hardik.
>>
>> PS:- My alter command: ALTER TABLE hardiktest CHANGE COLUMN col1 col2
>> array>.
>>
>
>
>
> --
> Nitin Pawar
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Automating the partition creation process

2013-01-29 Thread Dean Wampler
Yes, if the only problem with my script solution is the variable number of
new days, you can just add all possibilities in advance, knowing that some
will be empty of content...

On Tue, Jan 29, 2013 at 7:49 PM, Sadananda Hegde wrote:

> Thanks, Edward.  I can probably create all previous days partitions ahead
> of time and then use Dean's logic to create new partitions on a daily
> basis. I will probably end up having few empty partitions; need to make
> sure it does not cause any confusions.
>
> Thanks,
> Sadu
>
> On Tue, Jan 29, 2013 at 7:21 PM, Edward Capriolo wrote:
>
>> You can also just create all your partitions ahead of time. They will not
>> do any harm if empty. (unless you have an older version and hit this...
>> http://issues.apache.org/jira/browse/HIVE-1007 )
>>
>>
>> On Tue, Jan 29, 2013 at 8:17 PM, Mark Grover > > wrote:
>>
>>> Hi Sadananda,
>>> Sorry to hear that.
>>>
>>> It got committed, don't worry about the "ABORTED". Here is the commit on
>>> the trunk:
>>>
>>> https://github.com/apache/hive/commit/523f47c3b6e7cb7b6b7b7801c66406e116af6dbc
>>>
>>> However, there is no Apache Hive release with that patch in it.
>>>
>>> You have two options:
>>> 1. Download the patch, rebuild hive and use it
>>> 2. Find a hacky way to recover your partitions when they are empty and
>>> populate them later.
>>>
>>> Sorry for the inconvenience.
>>>
>>> Mark
>>>
>>> On Tue, Jan 29, 2013 at 5:09 PM, Sadananda Hegde wrote:
>>>
>>>> Thanks Mark,
>>>>
>>>> Recover partition feature will satisfy my needs; but MSCK Repair
>>>> Partition < tablename> option is not working for me. It does not give any
>>>> error; but does not add any partitions either.  It looks like it adds
>>>> partitions only when the sub-folder is empty; but not
>>>> when the sub-folder has the data files. I see a fix to this issue here.
>>>>
>>>> https://issues.apache.org/jira/browse/HIVE-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>>>
>>>> But probably it's not commited yet, since the final result says
>>>> 'ABORTED".
>>>>
>>>> Thanks,
>>>> Sadu
>>>>
>>>> On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover <
>>>> grover.markgro...@gmail.com> wrote:
>>>>
>>>>> Sadananda,
>>>>> See if this helps:
>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions
>>>>>
>>>>>
>>>>> On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde 
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> My hive table is partitioned by year, month and day. I have
>>>>>> defined it as external table. The M/R job correctly loads the files into
>>>>>> the daily subfolders. The hdfs files will be loaded to
>>>>>> /year=/month=mm/day=dd/ folders by the scheduled M/R jobs.
>>>>>> The M/R job has some business logic in determining the values for year,
>>>>>> month and day; so one run might create / load files into multiple sub
>>>>>> -folders (multiple days). I am able to query the tables after adding
>>>>>> partitions using ALTER TABLE ADD PARTITION statement. But how do I 
>>>>>> automate
>>>>>> the partition creation step? Basically this script needs to identify the
>>>>>> subfolders created by the M/R job and create corresponding ALTER TABLE 
>>>>>> ADD
>>>>>> PARTITION statements.
>>>>>>
>>>>>> For example, say the M/R job loads files into the following 3
>>>>>> sub-folders
>>>>>>
>>>>>> /user/hive/warehouse/sales/year=2013/month=1/day=21
>>>>>> /user/hive/warehouse/sales/year=2013/month=1/day=22
>>>>>> /user/hive/warehouse/sales/year=2013/month=1/day=23
>>>>>>
>>>>>> Then it should create 3 alter table statements
>>>>>>
>>>>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21);
>>>>>>  ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22);
>>>>>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23);
>>>>>>
>>>>>> I thought of changing M/R jobs to load all files into same folder,
>>>>>> then first load the files into non-partitioned table and then to load the
>>>>>> partitioned table from non-partitioned table (using dynamic partition); 
>>>>>> but
>>>>>> would prefer to avoid that extra step if possible (esp. since data is
>>>>>> already in the correct sub-folders).
>>>>>>
>>>>>> Any help would greately be appreciated.
>>>>>>
>>>>>> Regards,
>>>>>> Sadu
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Automating the partition creation process

2013-01-29 Thread Dean Wampler
This is very easy to do with a shell script and you can have cron run it
nightly, say just after midnight so the year, month, and day calculations
below return the correct values:

#!/bin/bash

year=$(date +%Y)
month=$(date +%m)
day=$(date +%d)

hive -e "use mydb; ALTER TABLE foo ADD IF NOT EXISTS PARTITION
(year='$year', month='$month', day='$day') location
'/path/to/mydb.db/foo/year=$year/month=$month/day=$day';"

(The last line wrapped)

If the year, month and day are integers, then omit the single quotes, i.e.,
(year=$year, month=$month, day=$day)

Including IF NOT EXISTS lets you run the script multiple times per day, if
you're paranoid... ;)

By the way, I used "use mydb; ALTER TABLE foo..." instead of "ALTER TABLE
mydb.foo..." because hive (v0.10.0) didn't like the latter.

dean

On Mon, Jan 28, 2013 at 10:47 PM, Mark Grover
wrote:

> Sadananda,
> See if this helps:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions
>
>
> On Mon, Jan 28, 2013 at 8:05 PM, Sadananda Hegde wrote:
>
>> Hello,
>>
>> My hive table is partitioned by year, month and day. I have defined it as
>> external table. The M/R job correctly loads the files into the daily
>> subfolders. The hdfs files will be loaded to
>> /year=/month=mm/day=dd/ folders by the scheduled M/R jobs.
>> The M/R job has some business logic in determining the values for year,
>> month and day; so one run might create / load files into multiple sub
>> -folders (multiple days). I am able to query the tables after adding
>> partitions using ALTER TABLE ADD PARTITION statement. But how do I automate
>> the partition creation step? Basically this script needs to identify the
>> subfolders created by the M/R job and create corresponding ALTER TABLE ADD
>> PARTITION statements.
>>
>> For example, say the M/R job loads files into the following 3 sub-folders
>>
>> /user/hive/warehouse/sales/year=2013/month=1/day=21
>> /user/hive/warehouse/sales/year=2013/month=1/day=22
>> /user/hive/warehouse/sales/year=2013/month=1/day=23
>>
>> Then it should create 3 alter table statements
>>
>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=21);
>>  ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=22);
>> ALTER TABLE sales ADD PARTITION (year=2013, month=1, day=23);
>>
>> I thought of changing M/R jobs to load all files into same folder,
>> then first load the files into non-partitioned table and then to load the
>> partitioned table from non-partitioned table (using dynamic partition); but
>> would prefer to avoid that extra step if possible (esp. since data is
>> already in the correct sub-folders).
>>
>> Any help would greately be appreciated.
>>
>> Regards,
>> Sadu
>>
>>
>>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: A few JIRAs closed in v0.10.0 that don't actually appear to be working

2013-01-29 Thread Dean Wampler
Thanks!

On Tue, Jan 29, 2013 at 5:34 AM, Navis류승우  wrote:

> HIVE-446 - Implement TRUNCATE : is on trunk (v0.11.0)
>
> HIVE-887 - Allow SELECT  without a mapreduce job : It needs "set
> hive.fetch.task.conversion=more"
>
> 2013/1/29 Dean Wampler :
> > Oh, another one is
> >
> > https://issues.apache.org/jira/browse/HIVE-446 - Implement TRUNCATE.
> >
> > The CLI doesn't recognize it.
> >
> > dean
> >
> > On Mon, Jan 28, 2013 at 11:44 AM, Dean Wampler
> >  wrote:
> >>
> >> I've noticed a few JIRA items for new features that are supposed to work
> >> in v0.10.0, but don't appear to actually work:
> >>
> >> https://issues.apache.org/jira/browse/HIVE-3066 - Add the option
> -database
> >> DATABASE in hive cli to specify a default database to use for the cli
> >> session.
> >>
> >> The option is not recognized.
> >>
> >> https://issues.apache.org/jira/browse/HIVE-3621 - Make prompt in Hive
> CLI
> >> configurable
> >>
> >> Setting the hive.cli.prompt configuration property doesn't do anything.
> >>
> >> https://issues.apache.org/jira/browse/HIVE-887 - Allow SELECT 
> >> without a mapreduce job
> >>
> >> All queries I've tried use MR.
> >>
> >> Did I misread the JIRA items?
> >>
> >> dean
> >>
> >> --
> >> Dean Wampler, Ph.D.
> >> thinkbiganalytics.com
> >> +1-312-339-1330
> >>
> >
> >
> >
> > --
> > Dean Wampler, Ph.D.
> > thinkbiganalytics.com
> > +1-312-339-1330
> >
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: A few JIRAs closed in v0.10.0 that don't actually appear to be working

2013-01-28 Thread Dean Wampler
Oh, another one is

https://issues.apache.org/jira/browse/HIVE-446 - Implement TRUNCATE.

The CLI doesn't recognize it.

dean

On Mon, Jan 28, 2013 at 11:44 AM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> I've noticed a few JIRA items for new features that are supposed to work
> in v0.10.0, but don't appear to actually work:
>
> https://issues.apache.org/jira/browse/HIVE-3066 - Add the option
> -database DATABASE in hive cli to specify a default database to use for the
> cli session.
>
> The option is not recognized.
>
> https://issues.apache.org/jira/browse/HIVE-3621 - Make prompt in Hive CLI
> configurable
>
> Setting the hive.cli.prompt configuration property doesn't do anything.
>
> https://issues.apache.org/jira/browse/HIVE-887 - Allow SELECT 
> without a mapreduce job
>
> All queries I've tried use MR.
>
> Did I misread the JIRA items?
>
> dean
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


A few JIRAs closed in v0.10.0 that don't actually appear to be working

2013-01-28 Thread Dean Wampler
I've noticed a few JIRA items for new features that are supposed to work in
v0.10.0, but don't appear to actually work:

https://issues.apache.org/jira/browse/HIVE-3066 - Add the option -database
DATABASE in hive cli to specify a default database to use for the cli
session.

The option is not recognized.

https://issues.apache.org/jira/browse/HIVE-3621 - Make prompt in Hive CLI
configurable

Setting the hive.cli.prompt configuration property doesn't do anything.

https://issues.apache.org/jira/browse/HIVE-887 - Allow SELECT  without
a mapreduce job

All queries I've tried use MR.

Did I misread the JIRA items?

dean

-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Cartesian product detection in the query plan?

2013-01-28 Thread Dean Wampler
By the way, v0.10.0 adds a new CROSS JOIN feature, when you really intended
to do one ;)

SELECT a.x, b.y FROM tablea a CROSS JOIN tableb b;



On Mon, Jan 28, 2013 at 10:58 AM, Edward Capriolo wrote:

> Currently order by is very inefficient in hive. Hopefully you can use sort
> by in most cases.
>
> It should be "visible" in the plan. The plan should be very different if
> you are using the ON clause vs not.  Can it be easily detected is another
> question.
>
>
> On Mon, Jan 28, 2013 at 11:45 AM, David Morel  wrote:
>
>> On 28 Jan 2013, at 14:29, Edward Capriolo wrote:
>>
>>  Iirc hive.mapred.mode strict should prevent this. If not we should add
>>> it.
>>>
>>
>> hi Edward,
>>
>> Yes, that's indeed what the book claims (quoting):
>>
>>   hive> SELECT * FROM fracture_act JOIN fracture_ads
>>  > WHERE fracture_act.planner_id = fracture_ads.planner_id;
>>   FAILED: Error in semantic analysis: In strict mode, cartesian product
>> is not allowed. If you really want to perform the operation,
>>   +set hive.mapred.mode=nonstrict+
>>
>> I am about to re-enable this setting on my cluster (after fixing all the
>> queries that it broke, especially all the ORDER BY ones :-) but I hoped
>> it was visible right there in the query plan, or in some other way. If
>> Hive can detect it, it should be visible somewhere, right?
>>
>> Thanks!
>>
>> david
>>
>>
>>> On Monday, January 28, 2013, David Morel  wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I had to kill some queries that were taking forever, and it turns out
>>>> they were doing cartesian products (missing ON clause on a JOIN).
>>>>
>>>> I wonder how I could see that in the EXPLAIN output (which I still find
>>>> a bit cryptic). Specifically, the stage that it was stuck in was this:
>>>>
>>>> Stage: Stage-7
>>>> Map Reduce
>>>> Alias -> Map Operator Tree:
>>>>   $INTNAME
>>>>   Reduce Output Operator
>>>> sort order:
>>>> tag: 1
>>>> value expressions:
>>>>   expr: _col1
>>>>   type: int
>>>>   $INTNAME1
>>>>   Reduce Output Operator
>>>> sort order:
>>>> tag: 0
>>>> value expressions:
>>>>   expr: _col0
>>>>   type: bigint
>>>>   expr: _col1
>>>>   type: string
>>>> Reduce Operator Tree:
>>>>   Join Operator
>>>> condition map:
>>>>  Inner Join 0 to 1
>>>> condition expressions:
>>>>   0 {VALUE._col0} {VALUE._col1}
>>>>   1 {VALUE._col1}
>>>> handleSkewJoin: false
>>>> outputColumnNames: _col0, _col1, _col3
>>>> File Output Operator
>>>>   compressed: true
>>>>   GlobalTableId: 0
>>>>   table:
>>>>   input format:
>>>>
>>> org.apache.hadoop.mapred.SequenceFileInputFormat
>>>
>>>>   output format:
>>>>
>>> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>>>
>>>>
>>>> Is there anything in there that should have alerted me?
>>>>
>>>> I found out by looking at the query, but I wonder if the query plan (if
>>>> I could read it) would have given me that information.
>>>>
>>>> Thanks a lot
>>>>
>>>> David Morel
>>>>
>>>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Whatever happened to the MACRO facility, Hive-2655

2013-01-26 Thread Dean Wampler
Great! It would be nice to have.

dean

On Sat, Jan 26, 2013 at 10:30 AM, Edward Capriolo wrote:

> That is my fault I was hoping it would get in because it seems close. Ill
> see if i can shove the ticket along. It is a cool feature.
>
>
> On Sat, Jan 26, 2013 at 8:59 AM, Dean Wampler <
> dean.wamp...@thinkbiganalytics.com> wrote:
>
>> We mentioned it in our book and now I realize it's not actually
>> implemented, even in 0.10.0. OOPS!!
>>
>> https://issues.apache.org/jira/browse/HIVE-2655
>>
>> dean
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Whatever happened to the MACRO facility, Hive-2655

2013-01-26 Thread Dean Wampler
We mentioned it in our book and now I realize it's not actually
implemented, even in 0.10.0. OOPS!!

https://issues.apache.org/jira/browse/HIVE-2655

dean

-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: querying objects and list fields

2013-01-25 Thread Dean Wampler
Use size instead of count.  Count is for counting rows, while size is for
determining the size of a collection.
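
For example, reusing the column names from your query (just a sketch):

SELECT messageId, lastmodifiedDate, size(contexts) FROM MessageData LIMIT 5;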

For the second question, I think you'll need to call explode on the array,
turning it into records first. Google  for hive's "lateral view" to see the
correct syntax for exploding and also projecting other fields.
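
Roughly along these lines (again reusing your column names; an untested
sketch):

SELECT messageId, lastmodifiedDate, ctx.conceptId
FROM MessageData LATERAL VIEW explode(contexts) ctx_table AS ctx
LIMIT 5;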

dean

On Fri, Jan 25, 2013 at 5:25 AM, Lauren Blau <
lauren.b...@digitalreasoning.com> wrote:

> I'm building up a set of classes (objectinspectors and serdes) to allow
> hive queries over some data files I have. While I'm making it work, I don't
> fully grok all the concepts involved.
> Right now I've got 2 questions.
>
> I'm able to make queries like this (this is the first syntax I tried to
> query into what I know are lists of objects, is it the best way?):
> select messageId,lastmodifiedDate,contexts[1].conceptId from MessageData
> LIMIT 5; and get the conceptId of the first context element in the first 5
> rows/ (my messagedata contexts field is a list of context objects;)
> select messageId,lastmodifiedDate,contexts.conceptId from MessageData
> LIMIT 5; and get the conceptId of all the context elements in the first 5
> rows
>
> but I can't make a query like this
>
> select messageId,lastmodifiedDate,count(contexts) LIMIT 5;
>
> Is there a different syntax to query the length of that list of objects?
>
>
> Also, currently when you query
> select messageId, lastmodifiedDate,contexts LIMIT 1; you get a fully
> expanded representation of all of the contexts for 1 row back. What I'd
> really like is for that query to just return the list of contextIds (as if
> the query had been contexts.contextId), but then to be able to query down
> into the contexts like above. Is there some way my ObjectInspector could
> respond to
>
> select messageId, lastmodifiedDate,contexts;  as if it were select
> messageId,lastmodifiedDate.contexts.contextId
> but also still respond correctly to
> select messageId. lastmodifiedDate.contexts.conceptId
> ?
>
> Thanks for the help,
> Lauren
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Real-life experience of forcing smaller input splits?

2013-01-25 Thread Dean Wampler
...

>
> That will be my approach for now, or disabling compression altogether for
> these files. The only problem I have is that compression is so efficient
> that any operation in the mapper (so on the uncompressed data) just makes
> the mapper throw an OOM exception, no matter how much memory I give it.
>
> What partly works though, is setting a low mapred.max.split.size. In a
> directory containing 34 files, I get 33 mappers (???). When setting
> hive.merge.mapfiles to false (and leaving mapred.max.split.size at its fs
> blocksize default), it doesn't seem to have any effect and I get 20 mappers
> only.
>
>
You can still use compression if you use a splittable format, like bzip2
with block compression. Gzip isn't splittable.

If you're running out of memory, you could also increase the heap size for
the client VMs. See the "Real-World Cluster Configurations" section of this
page:

http://hadoop.apache.org/docs/r1.0.3/cluster_setup.html

By the way, you could also experiment with turning on intermediate
compression; compression of the data sent between the mapper and reducer
tasks, compression of the output, etc, as discussed here:

https://cwiki.apache.org/Hive/adminmanual-configuration.html
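
For example, a couple of illustrative settings (the codec choice is just
an example):

SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;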


>  ...
>>
>


Re: Loading a Hive table simultaneously from 2 different sources

2013-01-24 Thread Dean Wampler
You'll face all the usual concurrency synchronization risks if you're
updating the same "place" concurrently. One thing to keep in mind; it's all
just HDFS under the hood. That pretty much tells you everything you need to
know. Yes, there's also the metadata. So, one way to update a partition
directory safely is to write to unique files. Hive doesn't care about their
names.

You can even write new directories for the partitions yourself, bypassing
Hive, and then tell Hive to "find" them afterwards. See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Recoverpartitions
In this case, you're updating the metadata to reflect what just happened
to the file system.
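
As a sketch (the table name, partition key, and path are made up): once
your job has written files under a new partition directory, you can
register it with either

  ALTER TABLE sales ADD IF NOT EXISTS PARTITION (dt='2013-01-24')
  LOCATION '/user/hive/warehouse/mydb.db/sales/dt=2013-01-24';

or, to pick up every directory the metastore doesn't know about yet,

  MSCK REPAIR TABLE sales;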

dean

On Thu, Jan 24, 2013 at 9:33 AM, Krishnan K  wrote:

> Hi Edward, All,
>
> Thanks for the quick reply!
>
> We are using dynamic partitions - so unable to say to which partition each
> record goes. We dont have much control here.
>
> Is there any properties that can be set ?
> I'm a bit doubtful here - is it because of the lock acquired on the table ?
>
> Regards,
> Krishnan
>
>
> On Thu, Jan 24, 2013 at 8:22 PM, Edward Capriolo wrote:
>
>> Partition the table and load the data into different partitions. That or
>> build the data outside he table and then use scripting to move the data in
>> using LOAD DATA INPATH or copying.
>>
>>
>> On Thu, Jan 24, 2013 at 9:44 AM, Krishnan K wrote:
>>
>>> Hi All,
>>>
>>> Could you please let me know what would happen if we try to load a table
>>> from 2 different sources at the same time ?
>>>
>>> I had tried this earlier and got an error for 1 load job and while the
>>> other job loaded the data successfully into the table..
>>>
>>> I guess it was because of lock acquired on the table by the first load
>>> process.
>>>
>>> Is there anyway to handle this ?
>>>
>>> Please give your insights.
>>>
>>> Regards,
>>> Krishnan
>>>
>>>
>>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: serde jar causing problems in loading other jars.

2013-01-23 Thread Dean Wampler
e.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:266)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.(HiveMetaStore.java:228)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:114)
> at
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2110)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2120)
> at
> org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1095)
> ... 18 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at javax.jdo.JDOHelper$16.run(JDOHelper.java:1958)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.jdo.JDOHelper.invoke(JDOHelper.java:1953)
> at
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1159)
> ... 35 more
> Caused by: java.lang.NullPointerException
> at
> org.datanucleus.plugin.NonManagedPluginRegistry.registerBundle(NonManagedPluginRegistry.java:443)
> at
> org.datanucleus.plugin.NonManagedPluginRegistry.registerBundle(NonManagedPluginRegistry.java:355)
> at
> org.datanucleus.plugin.NonManagedPluginRegistry.registerExtensions(NonManagedPluginRegistry.java:215)
> at
> org.datanucleus.plugin.NonManagedPluginRegistry.registerExtensionPoints(NonManagedPluginRegistry.java:156)
> at
> org.datanucleus.plugin.PluginManager.registerExtensionPoints(PluginManager.java:82)
> at org.datanucleus.OMFContext.(OMFContext.java:156)
> at org.datanucleus.OMFContext.(OMFContext.java:137)
> at
> org.datanucleus.ObjectManagerFactoryImpl.initialiseOMFContext(ObjectManagerFactoryImpl.java:132)
> at
> org.datanucleus.jdo.JDOPersistenceManagerFactory.initialiseProperties(JDOPersistenceManagerFactory.java:363)
> at
> org.datanucleus.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:307)
> at
> org.datanucleus.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:255)
> at
> org.datanucleus.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:182)
> ... 43 more
>
>
>
> On Wed, Jan 23, 2013 at 3:05 PM, Ehsan Haq  wrote:
>
>> Thanks dean, I knew about the .hiverc script, will try the other
>> alternative of renaming jar and get back. Thanks
>>
>> /Ehsan
>>
>>
>> On Wed, Jan 23, 2013 at 2:57 PM, Dean Wampler <
>> dean.wamp...@thinkbiganalytics.com> wrote:
>>
>>> Is there anything in the logs about problems loading the jar, etc.?
>>>
>>> The jar files in $HIVE_HOME are added to the CLASSPATH in alphabetical
>>> order. As an experiment, rename your jar with a name that will go last,
>>> something like zzz.jar, and see what happens when you start Hive. If it
>>> seems to be working normally, it may be that some file in your jar file has
>>> a name that's the same as a file in one of Hive's jar files and yours gets
>>> read instead. An XML config file, for example.
>>>
>>> If this experiment appears to work, run "jar tf name-of-your-file.jar"
>>> and post the listing here, if you don't mind.
>>>
>>> Note that an alternative to dropping jar files in Hive's lib directory
>>> is to put the "add jar ..." command in your $HOME/.hiverc file, which Hive
>>> will read on startup (Hive v0.7.0 and later). What Hive version are you
>>> using, by the way? I also put lots of custom configuration setting
>>> commands, e.g., set hive.exec.mode.local.auto=true; (to encourage local
>>> mode execution, when possible).
>>>
>>> dean
>>>
>>>
>>> On Wed, Jan 23, 2013 at 6:41 AM, Ehsan Haq  wrote:
>>>
>>>> Hi,
>>>>I have writen a custom serde and bundle it in a jar file, which is
>>>> working fine, when I add the jar using the CLI command add jar. However
>>>> when I put the jar in the hive/lib folder so that I dont have to explicitly
>>>> add the jar, it looks like other jars were failed to load. The outcome is
>>>> that the meta data is also not accessible due to that. The jar works just
>>>> fine if put it somewhere else and add it via add jar.
>>>>Any idea what might be wrong?
>>>>
>>>> /Ehsan
>>>>
>>>
>>>
>>>
>>> --
>>> *Dean Wampler, Ph.D.*
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>>
>>>
>>
>>
>> --
>> *Muhammad Ehsan ul Haque*
>> Klarna AB
>> Norra Stationsgatan 61
>> SE-113 43 Stockholm
>>
>> Tel: +46 (0)8- 120 120 00
>> Fax: +46 (0)8- 120 120 99
>> Web: www.klarna.com
>>
>
>
>
> --
> *Muhammad Ehsan ul Haque*
> Klarna AB
> Norra Stationsgatan 61
> SE-113 43 Stockholm
>
> Tel: +46 (0)8- 120 120 00
> Fax: +46 (0)8- 120 120 99
> Web: www.klarna.com
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Problem with using Postgres as hive meta store DB.

2013-01-23 Thread Dean Wampler
Are you able to use Hive v0.8.1 and v0.9.0 interchangeably? Since the
metadata schema changed between versions, I wouldn't expect 0.8.1 to work
if your schema is set up for 0.9.0.

dean

On Wed, Jan 23, 2013 at 6:35 AM, Ehsan Haq  wrote:

> Thanks, that solved the problem :-)
>
> /Ehsan
>
>
> On Wed, Jan 23, 2013 at 10:00 AM, wd  wrote:
>
>>
>> On Wed, Jan 23, 2013 at 4:46 PM, Ehsan Haq  wrote:
>>
>>> ERROR: invalid escape string
>>>   Hint: Escape string must be empty or one character..)
>>>
>>
>>
>> You can set standard_conforming_strings = off in postgresql.conf to avoid
>> this.
>>
>>
>
>
> --
> *Muhammad Ehsan ul Haque*
> Klarna AB
> Norra Stationsgatan 61
> SE-113 43 Stockholm
>
> Tel: +46 (0)8- 120 120 00
> Fax: +46 (0)8- 120 120 99
> Web: www.klarna.com
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: serde jar causing problems in loading other jars.

2013-01-23 Thread Dean Wampler
Is there anything in the logs about problems loading the jar, etc.?

The jar files in $HIVE_HOME are added to the CLASSPATH in alphabetical
order. As an experiment, rename your jar with a name that will go last,
something like zzz.jar, and see what happens when you start Hive. If it
seems to be working normally, it may be that some file in your jar file has
a name that's the same as a file in one of Hive's jar files and yours gets
read instead. An XML config file, for example.

If this experiment appears to work, run "jar tf name-of-your-file.jar" and
post the listing here, if you don't mind.

Note that an alternative to dropping jar files in Hive's lib directory is
to put the "add jar ..." command in your $HOME/.hiverc file, which Hive
will read on startup (Hive v0.7.0 and later). What Hive version are you
using, by the way? I also put lots of custom configuration commands in my
.hiverc, e.g., set hive.exec.mode.local.auto=true; (to encourage local-mode
execution, when possible).
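
For example, a minimal .hiverc might look something like this (the jar path
here is just a placeholder for your own serde jar):

add jar /path/to/your-custom-serde.jar;
set hive.exec.mode.local.auto=true;

Every line in it is just a regular Hive CLI command, run before your own
statements.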

dean

On Wed, Jan 23, 2013 at 6:41 AM, Ehsan Haq  wrote:

> Hi,
> I have written a custom serde and bundled it in a jar file, which is
> working fine when I add the jar using the CLI command add jar. However,
> when I put the jar in the hive/lib folder so that I don't have to explicitly
> add the jar, it looks like other jars fail to load. The outcome is
> that the metadata is also not accessible due to that. The jar works just
> fine if I put it somewhere else and add it via add jar.
>Any idea what might be wrong?
>
> /Ehsan
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Missing tables!

2013-01-22 Thread Dean Wampler
Are you using Derby? Did you change directories? By default, Derby drops
the metastore_db in the current directory, but you can specify an absolute
path for it in the hive-site.xml config file, e.g.,:


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/path/to/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>



On Tue, Jan 22, 2013 at 6:16 AM, ashish negi wrote:

> Hi,
>
> I was successfully running command
> SHOW TABLES;
> which was viewing tables created @HDFS.
> I was doing operations on hive suddenly i lost all tables. But all tables
> are showing in warehouse.
> It seems some configuration error but exact solution is yet to know.
> Any idea?
>
> Regards,
> Ashish
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: HWI use on AWS/EMR

2013-01-18 Thread Dean Wampler
Yes, this is straightforward to set up in Putty.  Ariel, thanks for
providing the details I was too lazy to mention ;)

One flaw with this approach is that only your machine will have access to
HWI. That is, the port is forwarded only to your machine. Making the ports
public on the master node means anyone on the interwebs can see it. You can
also set up security groups in AWS so that only specifically authorized
people can access the ports. That's a bit involved...

dean

On Fri, Jan 18, 2013 at 10:31 AM, Ariel Marcus wrote:

> Hey Tony,
>
> Port forwarding is the same as what Dean referred to as ssh tunneling.
>
> Here is a website explaining how to set up port forwarding in putty (use
> port ):
> http://www.cs.uu.nl/technical/services/ssh/putty/puttyfw.html
>
> You can edit either hive-default or hive-site but the general practice is
> to store changes from the default configuration in hive-site.
>
> Best,
> Ariel
>
>
> -
> Ariel Marcus, Consultant
> www.openbi.com | ariel.mar...@openbi.com
> 150 N Michigan Avenue, Suite 2800, Chicago, IL 60601
> Cell: 314-827-4356
>
>
> On Fri, Jan 18, 2013 at 11:26 AM, Tony Burton 
> wrote:
>
>> Hi Ariel,
>>
>>
>> Thanks for the speedy reply. We’ll be accessing the HWI from Windows
>> rather than Linux desktops, so can you help me out with understanding how to
>> carry out the port forwarding from a Windows environment? Currently I use
>> WinSCP to initiate a connection to the EMR instance, then spawn a PuTTY
>> session.
>>
>>
>> The changes to the hive config look identical to what I’ve changed in my
>> hive-default.xml. Does it make any difference which config file the
>> changes are applied to (hive-default.xml vs hive-site.xml)?
>>
>>
>> Tony
>>
>>
>> *From:* Ariel Marcus [mailto:ariel.mar...@openbi.com]
>> *Sent:* 18 January 2013 16:16
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: HWI use on AWS/EMR
>>
>>
>> Hey Tony,
>>
>>
>> I would recommend using ssh port forwarding instead of making your hwi
>> publicly available on the internet. When you connect to the master node
>> using ssh you should use a command like the following:
>>
>>
>> ssh -i /path/to/keyfile.pem -L :localhost: -l hadoop
>> MASTER_HOSTNAME
>>
>>
>> After you have connected, add the following to your
>> hive/conf/hive-site.xml file:
>>
>>
>> <property>
>>   <name>hive.hwi.war.file</name>
>>   <value>lib/hive-hwi-0.8.1.war</value>
>>   <description>This is the WAR file with the jsp content for Hive Web
>> Interface</description>
>> </property>
>>
>>
>> Run this command to start up hwi:
>>
>>
>> hive --service hwi
>>
>>
>> And finally point your browser to:
>>
>>
>> localhost:
>>
>>
>> That worked for me.
>>
>>
>> Best,
>>
>> Ariel
>>
>>
>>
>> 
>>
>> -----
>>
>> Ariel Marcus, Consultant
>>
>> www.openbi.com | ariel.mar...@openbi.com
>>
>> 150 N Michigan Avenue, Suite 2800, Chicago, IL 60601
>> Cell: 314-827-4356
>>
>>
>> On Fri, Jan 18, 2013 at 11:09 AM, Dean Wampler <
>> dean.wamp...@thinkbiganalytics.com> wrote:
>>
>> Oops, I overlooked that you have the public domain name in your message.
>> Can you surf to  http://ec2-54-247-61-206.eu-west-1.compute.amazonaws.com?  
>> If not, does HWI use port 80? Is whatever port it uses blocked by EC2?
>> 
>>
>>
>> If it's blocked you can use ssh to tunnel the port through.
>>
>>
>> dean
>>
>>
>> On Fri, Jan 18, 2013 at 10:06 AM, Dean Wampler <
>> dean.wamp...@thinkbiganalytics.com> wrote:
>>
>> That's the internal hostname, not visible outside. Use the name like
>> ec2-NNN-NN-NN-NNN.compute-1.amazonaws.com. It's shown in the EMR console
>> and the elastic-mapreduce script you might have used to launch the cluster.
>> 
>>
>>
>> If th

Re: HWI use on AWS/EMR

2013-01-18 Thread Dean Wampler
Oops, I overlooked that you have the public domain name in your message.
Can you surf to  http://ec2-54-247-61-206.eu-west-1.compute.amazonaws.com ?
 If not, does HWI use port 80? Is whatever port it uses blocked by EC2?

If it's blocked you can use ssh to tunnel the port through.

dean

On Fri, Jan 18, 2013 at 10:06 AM, Dean Wampler <
dean.wamp...@thinkbiganalytics.com> wrote:

> That's the internal hostname, not visible outside. Use the name like
> ec2-NNN-NN-NN-NNN.compute-1.amazonaws.com. It's shown in the EMR console
> and the elastic-mapreduce script you might have used to launch the cluster.
>
> If that doesn't work, verify that port 80 is not blocked by default.
> That's certainly true for ports 9XXX used by the JobTracker, etc.
>
> dean
>
>
> On Fri, Jan 18, 2013 at 9:54 AM, Tony Burton wrote:
>
>> Hi,
>>
>> I'm trying to get HWI running and accessible from an Amazon Web Services
>> EMR instance. I've hit a blocker early on though, and the documentation is
>> less than illuminating. Can you share any experiences you have had?
>> Specifically, here's what I'm curious about.
>>
>> - Running on AWS. I've created a Hive job flow on AWS, edited
>> hive-default.xml (in /home/hadoop/.versions/hive-0.8.1/conf) and changed
>> the hive.hwi.war.file to lib/hive-hwi-0.8.1.war.
>> - HWI starts up fine, but when I try to connect with
>> http://ip-XX-AAA-BBB-CCC.eu-west-1.compute.internal:/hwi, I get a
>> timeout message, in Firefox it's "The connection has timed out - The server
>> at ec2-54-247-61-206.eu-west-1.compute.amazonaws.com is taking too long
>> to respond".
>>
>> Has anyone successfully connected to HWI running on an AWS EMR instance?
>> From the same browser I can connect to HWI on a local Ubuntu box.
>>
>> Thanks!
>>
>> Tony
>>
>>
>> **
>> Please consider the environment before printing this email or attachments
>>
>> This email and any attachments are confidential, protected by copyright
>> and may be legally privileged.  If you are not the intended recipient, then
>> the dissemination or copying of this email is prohibited. If you have
>> received this in error, please notify the sender by replying by email and
>> then delete the email completely from your system.  Neither Sporting Index
>> nor the sender accepts responsibility for any virus, or any other defect
>> which might affect any computer or IT system into which the email is
>> received and/or opened.  It is the responsibility of the recipient to scan
>> the email and no responsibility is accepted for any loss or damage arising
>> in any way from receipt or use of this email.  Sporting Index Ltd is a
>> company registered in England and Wales with company number 2636842, whose
>> registered office is at Gateway House, Milverton Street, London, SE11 4AP.
>>  Sporting Index Ltd is authorised and regulated by the UK Financial
>> Services Authority (reg. no. 150404) and Gambling Commission (reg. no.
>> 000-027343-R-308898-001).  Any financial promotion contained herein has
>> been issued
>> and approved by Sporting Index Ltd.
>>
>> Outbound email has been scanned for viruses and SPAM
>>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: HWI use on AWS/EMR

2013-01-18 Thread Dean Wampler
That's the internal hostname, not visible outside. Use the name like
ec2-NNN-NN-NN-NNN.compute-1.amazonaws.com. It's shown in the EMR console
and the elastic-mapreduce script you might have used to launch the cluster.

If that doesn't work, verify that port 80 is not blocked by default. That's
certainly true for ports 9XXX used by the JobTracker, etc.

dean

On Fri, Jan 18, 2013 at 9:54 AM, Tony Burton wrote:

> Hi,
>
> I'm trying to get HWI running and accessible from an Amazon Web Services
> EMR instance. I've hit a blocker early on though, and the documentation is
> less than illuminating. Can you share any experiences you have had?
> Specifically, here's what I'm curious about.
>
> - Running on AWS. I've created a Hive job flow on AWS, edited
> hive-default.xml (in /home/hadoop/.versions/hive-0.8.1/conf) and changed
> the hive.hwi.war.file to lib/hive-hwi-0.8.1.war.
> - HWI starts up fine, but when I try to connect with
> http://ip-XX-AAA-BBB-CCC.eu-west-1.compute.internal:/hwi, I get a
> timeout message, in Firefox it's "The connection has timed out - The server
> at ec2-54-247-61-206.eu-west-1.compute.amazonaws.com is taking too long
> to respond".
>
> Has anyone successfully connected to HWI running on an AWS EMR instance?
> From the same browser I can connect to HWI on a local Ubuntu box.
>
> Thanks!
>
> Tony
>
>
> **
> Please consider the environment before printing this email or attachments
>
> This email and any attachments are confidential, protected by copyright
> and may be legally privileged.  If you are not the intended recipient, then
> the dissemination or copying of this email is prohibited. If you have
> received this in error, please notify the sender by replying by email and
> then delete the email completely from your system.  Neither Sporting Index
> nor the sender accepts responsibility for any virus, or any other defect
> which might affect any computer or IT system into which the email is
> received and/or opened.  It is the responsibility of the recipient to scan
> the email and no responsibility is accepted for any loss or damage arising
> in any way from receipt or use of this email.  Sporting Index Ltd is a
> company registered in England and Wales with company number 2636842, whose
> registered office is at Gateway House, Milverton Street, London, SE11 4AP.
>  Sporting Index Ltd is authorised and regulated by the UK Financial
> Services Authority (reg. no. 150404) and Gambling Commission (reg. no.
> 000-027343-R-308898-001).  Any financial promotion contained herein has
> been issued
> and approved by Sporting Index Ltd.
>
> Outbound email has been scanned for viruses and SPAM
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Execution of udf

2013-01-18 Thread Dean Wampler
That should be an identity pass through or NOOP that does nothing, but
MapReduce requires "something".
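
One quick way to check is to look at the query plan:

EXPLAIN SELECT col1, myudf(col2, col3) FROM table1;

For a simple select like this, the UDF should show up inside the map stage's
Select Operator rather than in any reducer.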

On Fri, Jan 18, 2013 at 8:57 AM, nagarjuna kanamarlapudi <
nagarjuna.kanamarlap...@gmail.com> wrote:

> No, but the query execution shows a reducer running... And in fact I feel
> that a reduce phase can be there
>
>
> On Friday, January 18, 2013, Dean Wampler wrote:
>
>> There is no reduce phase needed in this query.
>>
>> On Fri, Jan 18, 2013 at 6:59 AM, nagarjuna kanamarlapudi <
>> nagarjuna.kanamarlap...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Select col1,myudf(col2,col3) from table1;
>>>
>>>
>>> In what phase if map reduce an udf is executed.
>>>
>>> In the very beginning, I assumed that hive will be joining two tables.,
>>> getting the required columns and then applies udf on columns specified
>>> I.e., essentially on reducer phase . But later on I realised that I was
>>> wrong.
>>>
>>> Is there any specific parameter which suggests hive to call udf at
>>> reducer phase rather than at Mapper phase.
>>>
>>>
>>> Regards,
>>> Nagarjuna
>>>
>>>
>>> --
>>> Sent from iPhone
>>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>
> --
> Sent from iPhone
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Execution of udf

2013-01-18 Thread Dean Wampler
There is no reduce phase needed in this query.

On Fri, Jan 18, 2013 at 6:59 AM, nagarjuna kanamarlapudi <
nagarjuna.kanamarlap...@gmail.com> wrote:

> Hi,
>
> Select col1,myudf(col2,col3) from table1;
>
>
> In what phase of map reduce is a udf executed?
>
> In the very beginning, I assumed that hive would be joining two tables,
> getting the required columns, and then applying the udf on the specified
> columns, i.e., essentially in the reducer phase. But later on I realised that
> I was wrong.
>
> Is there any specific parameter which suggests hive to call udf at reducer
> phase rather than at Mapper phase.
>
>
> Regards,
> Nagarjuna
>
>
> --
> Sent from iPhone
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Interaction between Java and Transform Scripts on Hive

2013-01-16 Thread Dean Wampler
The transform scripts (or executables) are run as separate processes, so it
sounds like Hive itself is blowing up. That would be consistent with your
script working fine outside Hive. The Hive or Hadoop logs might have clues.

So, it happens consistently with this one file? I would check to be sure
that there isn't a subtle error in the file or the output from your script,
say an extra tab, other whitespace, or a malformed data value. If you can
find the line where it blows up, that would be good. You could have your
script dump debug data, like an index for each input and the corresponding
key-value pair. Or modify the output of the script and the query results to
return information like this to Hive. It seems more likely that the problem
is downstream from when the data passes through the query. So, you could
try changing the Hive query to just dump the script results and do nothing
else afterwards, etc.

However, I wouldn't expect those problems to cause heap exhaustion, unless
it somehow triggers an infinite loop.
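
For reference, the stripped-down query I have in mind is just something like
this (script and table names are placeholders):

ADD FILE /path/to/your_script.py;
SELECT TRANSFORM (col1, col2)
USING 'python your_script.py'
AS (out1, out2)
FROM your_table;

That way you see exactly what the script hands back to Hive, with nothing
downstream of it.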

Can you share your python script, Hive query, table schema(s), and a sample
of the file?

dean

On Wed, Jan 16, 2013 at 9:32 PM, John Omernik  wrote:

> I am perplexed. If I run a transform script on a file by itself, it runs
> fine, outputs to standard out, and life is good. If I run the transform script
> on that same file (with the path and filename being passed into the script
> via transform so that the python script is doing the exact same thing) I
> get a java heap space error. This process works on 99% of files, and I just
> can't figure out why this file is different.  How does say a python
> transform script run "in" the java process (if that is even what it is
> doing) so that it causes a heap error in a transform script but not run
> without java around?
>
> I am curious on what steps I can take to trouble shoot or eliminate this
> problem.
>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Re: create a hive table: always a tab space before each line

2013-01-14 Thread Dean Wampler
Hadoop supports Sequence Files natively. Hadoop the Definitive Guide
discusses the details.

dean

On Mon, Jan 14, 2013 at 8:58 AM, Richard  wrote:

> thanks.
> it seems that as long as I use sequencefile as the storage format, there
> will be \t before the first column. If this output is continuously used by
> hive, it is fine. The problem is that I may use a self-defined map-reduce
> job to read these files.  Does that mean I have to take care of
> this \t by myself?
>
> is there any option that I can disable this \t in hive?
>
>
>
> At 2013-01-09 22:38:11,"Dean Wampler" 
> wrote:
>
> To add to what Nitin said, there is no key output by Hive in front of the
> tab.
>
> On Wed, Jan 9, 2013 at 3:07 AM, Nitin Pawar wrote:
>
>> you may want to look at the sequencefile format
>>
>> http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/file-based-data-structures/id3555432
>>
>> that tab is to separate key from values in the record (I may be wrong but
>> this is how I interpreted it)
>>
>>
>> On Wed, Jan 9, 2013 at 12:49 AM, Richard  wrote:
>>
>>> more information:
>>>
>>> if I set the format as textfile, there is no tab space.
>>> if I set the format as sequencefile and view the content via hadoop fs
>>> -text, I saw a tab space in the head of each line.
>>>
>>>
>>> At 2013-01-09 15:44:00,Richard  wrote:
>>>
>>> hi there
>>>
>>>
>>> I have a problem with creating a hive table.
>>>
>>> no matter what field delimiter I used, I always got a tab space in the head 
>>> of each line (a line is a record).
>>>
>>> something like this:
>>>
>>> \t f1 \001 f2 \001 f3 ...
>>>
>>> where f1 , f2 , f3 denotes the field value and \001 is the field separator.
>>>
>>>
>>> **
>>>
>>> here is the clause I used
>>>
>>> create external table if not exists ${HIVETBL_my_table}
>>> (
>>> nid string,
>>> userid string,
>>> spv bigint,
>>> sipv bigint,
>>> pay bigint,
>>> spay bigint,
>>> ipv bigint,
>>> sellerid string,
>>> cate string
>>> )
>>> partitioned by(ds string)
>>> row format delimited fields terminated by '\001' lines terminated by
>>> '\n'
>>> stored as sequencefile
>>> location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';
>>>
>>>
>>> thanks for help.
>>>
>>>
>>> Richard
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Best practice for automating jobs

2013-01-10 Thread Dean Wampler
If you know make and bash, have a look at Stampede for scheduling work:

https://github.com/ThinkBigAnalytics/stampede

(Full disclosure: I wrote it)


On Thu, Jan 10, 2013 at 4:11 PM, Sean McNamara
wrote:

> > I want to know if there are any accepted patterns or best practices for
> >this?
>
> http://oozie.apache.org/
>
>
>
> With both Stampede and Oozie, you can tell them to watch for certain data
to show up, e.g., a _SUCCESS file marker in a directory getting new data
files, and then start a Hive query, etc. You can also add your partition
creation commands in the workflow, e.g., as soon as the data is present (or
even before; Hive won't care if it doesn't exist yet).


> > New partitions will be added regularly
>
> When you add a partition, that metadata goes into the metastore, so every
hive instance sharing that metastore will see it. Of course, you should
avoid scenarios where multiple processes attempt to create the same
partition, although if they are using exactly the same command, then adding
an IF NOT EXISTS clause will avoid error messages. Still, I wouldn't want
to torture test the metastore...
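
For example, the partition-creation step in such a workflow is usually just a
one-liner like this (table name and path made up):

ALTER TABLE events ADD IF NOT EXISTS
PARTITION (year = 2013, month = 1, day = 10)
LOCATION '/data/events/2013/01/10';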


> What type of partitions are you adding? Why frequently?
>
>
>
>
> Sean
>
>
> On 1/10/13 3:03 PM, "Tom Brown"  wrote:
>
> >All,
> >
> >I want to automate jobs against Hive (using an external table with
> >ever growing partitions), and I'm running into a few challenges:
> >
> >Concurrency - If I run Hive as a thrift server, I can only safely run
> >one job at a time. As such, it seems like my best bet will be to run
> >it from the command line and setup a brand new instance for each job.
> >That's quite a bit of a hassle to solve a seemingly common problem, so
> >I want to know if there are any accepted patterns or best practices
> >for this?
> >
> >Partition management - New partitions will be added regularly. If I
> >have to setup multiple instances of Hive for each (potentially)
> >overlapping job, it will be difficult to keep track of the partitions
> >that have been added. In the context of the preceding question, what
> >is the best way to add metadata about new partitions?
> >
> >Thanks in advance!
> >
> >--Tom
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Stack function in Hive : how to specify multiple aliases?

2013-01-10 Thread Dean Wampler
Try "as (alias1, alias2, ...)"

On Thu, Jan 10, 2013 at 3:42 AM, Mathieu Despriee wrote:

> Not working either :
>
> SELECT stack(2,AppWeb, ByteWeb, PacketWeb, AppP2P, ByteP2P, PacketP2P) AS
> App,Byte,Packet FROM testApp2;
> > FAILED: SemanticException 1:76 Only a single expression in the SELECT
> clause is supported with UDTF's. Error encountered near token 'Byte'
>
> I tried to quote the aliases or to use array-style with no luck.
>
> Is there any description of hive grammar somewhere ?
> I only found this doc :
> https://cwiki.apache.org/Hive/languagemanual-select.html, but "select_expr"
> is not described 
>
>
>
>
>
> 2013/1/10 Nitin Pawar 
>
>> I never ran into this kind of problem but can you try select as A,B,C
>>
>>
>> On Thu, Jan 10, 2013 at 12:58 AM, Mathieu Despriee wrote:
>>
>>> SELECT stack(2,AppWeb, ByteWeb, PacketWeb, AppP2P, ByteP2P, PacketP2P)
>>> AS A FROM testApp2;
>>>
>>>
>>> 2013/1/10 Nitin Pawar 
>>>
>>>> can you provide your query ?
>>>>
>>>>
>>>> On Thu, Jan 10, 2013 at 12:39 AM, Mathieu Despriee 
>>>> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I want to use the stack function, described here :
>>>>> https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-BuiltinTableGeneratingFunctions%2528UDTF%2529
>>>>>
>>>>> Hive asks me to provide the multiple aliases for the resulting columns
>>>>> ("The number of aliases in the AS clause does not match the number of
>>>>> colums output by the UDTF, expected 3 aliases but got 1").
>>>>>
>>>>> What's the syntax to provide multiple aliases ?
>>>>>
>>>>> Thanks,
>>>>> Mathieu
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: create a hive table: always a tab space before each line

2013-01-09 Thread Dean Wampler
To add to what Nitin said, there is no key output by Hive in front of the
tab.

On Wed, Jan 9, 2013 at 3:07 AM, Nitin Pawar  wrote:

> you may want to look at the sequencefile format
>
> http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/file-based-data-structures/id3555432
>
> that tab is to separate key from values in the record (I may be wrong but
> this is how I interpreted it)
>
>
> On Wed, Jan 9, 2013 at 12:49 AM, Richard  wrote:
>
>> more information:
>>
>> if I set the format as textfile, there is no tab space.
>> if I set the format as sequencefile and view the content via hadoop fs
>> -text, I saw a tab space in the head of each line.
>>
>>
>> At 2013-01-09 15:44:00,Richard  wrote:
>>
>> hi there
>>
>>
>> I have a problem with creating a hive table.
>>
>> no matter what field delimiter I used, I always got a tab space in the head 
>> of each line (a line is a record).
>>
>> something like this:
>>
>> \t f1 \001 f2 \001 f3 ...
>>
>> where f1 , f2 , f3 denotes the field value and \001 is the field separator.
>>
>>
>> **
>>
>> here is the clause I used
>>
>> create external table if not exists ${HIVETBL_my_table}
>> (
>> nid string,
>> userid string,
>> spv bigint,
>> sipv bigint,
>> pay bigint,
>> spay bigint,
>> ipv bigint,
>> sellerid string,
>> cate string
>> )
>> partitioned by(ds string)
>> row format delimited fields terminated by '\001' lines terminated by '\n'
>> stored as sequencefile
>> location '${HADOOP_PATH_4_MY_HIVE}/${HIVETBL_my_table}';
>>
>>
>> thanks for help.
>>
>>
>> Richard
>>
>>
>>
>>
>>
>>
>
>
> --
> Nitin Pawar
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Map Reduce Local Task

2013-01-08 Thread Dean Wampler
That setting will make Hive more aggressive about trying to convert a join
to a local task, where it bypasses the job tracker. When you're
experimenting with queries on a small data set, it can make things much
faster, but won't be useful for large data sets where you need the cluster.
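
For example (table names made up):

set hive.auto.convert.join=true;
SELECT b.*, s.name
FROM big_table b JOIN small_table s ON (b.key = s.key);

With that setting on, Hive will try to build a hash table from the small table
in a local task and hand it to the mappers as a map join, instead of doing a
full shuffle join.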

dean

On Tue, Jan 8, 2013 at 9:11 AM, Santosh Achhra wrote:

>
>
> Will setting hive.auto.convert.join to true help with setting up the mapreduce
> local task and conditional task?
>
> Good wishes,always !
> Santosh
>
>
> On Tue, Jan 8, 2013 at 4:04 PM, Santosh Achhra wrote:
>
>> Hello,
>>
>> I was reading an article on the web which talks about the MapReduce local task
>> and use of hash table files and conditional tasks to improve performance of
>> hive queries.
>>
>> Any idea how to implement this? I am aware of Map joins but I am not sure
>> how to implement Map reduce local tasks with hash tables.
>>
>> Good wishes,always !
>> Santosh
>>
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Does Hue (Hadoop User Experience) works with Apache HIVE/HADOOP

2012-12-29 Thread Dean Wampler
If you're willing to consider a commercial option, Karmasphere is already
supported on EMR.

http://aws.amazon.com/elasticmapreduce/karmasphere/

On Sat, Dec 29, 2012 at 1:29 AM, Chunky Gupta wrote:

> Hi,
>
> I have Apache Hive and Apache Hadoop on Amazon EC2 machines. If anyone can
> tell me that can HUE be used with this setup instead of CHD Hadoop cluster.
> If not, then is there any alternate UI similar to HUE.
>
> Please help.
> Thanks,
> Chunky.
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: hive regular expression

2012-12-26 Thread Dean Wampler
Hive uses Java's regex API. This tutorial provides an excellent
introduction.

http://docs.oracle.com/javase/tutorial/essential/regex/
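
If those strings live in a Hive column, regexp_extract is probably the
simplest way to apply such a pattern (column and table names made up):

SELECT regexp_extract(url, 'MovieTitle=[^&]*', 0) FROM page_views;

Index 0 returns the whole match, so you'd get back strings like
'MovieTitle=321grgrdg'.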

dean

On Wed, Dec 26, 2012 at 12:46 PM, qiaoresearcher
wrote:

> Assume there are strings like
>
> 'afewexcvgregre&MovieTitle=321grgrdg&olfll3onsl'
>
> or
>
> '?MovieTitle=949303sjkskld&sososodn'
>
>
> how to extract 'MovieTitle=321grgrdg' or 'MovieTitle=949303sjkskld' using
> Hive reg expression functions
>
>
> thanks
> and happy holidays
>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Reflect MySQL updates into Hive

2012-12-24 Thread Dean Wampler
Looks good, but a few suggestions. If you can eliminate duplicates, etc. as
you ingest the data into HDFS, that would eliminate a cleansing step. Note
that if the target directory in HDFS IS the specified location for an
external Hive table/partition, then there will be no separate step to "load
in Hive as External Table". It's already there!

Your "transform data..." is a common pattern; stage "raw" data into a
location, then use Hive (or Pig) to transform it into the final form and
INSERT INTO the final Hive table.
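
A sketch of that last step, with made-up table and column names:

INSERT INTO TABLE orders_clean
SELECT DISTINCT order_id, customer_id, status, updated_at
FROM orders_staging;

The SELECT can do whatever deduplication or cleansing you need along the way.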

dean

On Mon, Dec 24, 2012 at 9:34 AM, Ibrahim Yakti  wrote:

> Thanks Dean for the great reply. Setting up incremental import should be
> easy, but if I partition my data, how will hive get me only the updated rows,
> considering that a row may have multiple fields that will be updated over
> time? And how will I manage the tables that are based on multiple sources? And
> do you recommend importing the data to HDFS instead of Hive directly? Won't
> we have a lot of duplicated records then?
>
> Regarding automation we were thinking to use sqoop-job command or crons as
> you suggested.
>
> So, the suggested flow as follows:
>
> MySQL ---(Extract / Load)---> HDFS (Table/Year/Month/Day) ---> Load in
> Hive as External Table ---(Transform Data & Join Tables)--> Save it in Hive
> tables for reporting.
>
>
> Correct?
>
> Appreciated.
>
>
> --
> Ibrahim
>
>
> On Mon, Dec 24, 2012 at 5:51 PM, Dean Wampler <
> dean.wamp...@thinkbiganalytics.com> wrote:
>
>> This is not as hard as it sounds. The hardest part is setting up the
>> incremental query against your MySQL database. Then you can write the
>> results to new files in the HDFS directory for the table and Hive will see
>> them immediately. Yes, even though Hive doesn't support updates, it doesn't
>> care how many files are in the directory. The trick is to avoid lots of
>> little files.
>>
>> As others have suggested, you should consider partitioning the data,
>> perhaps by time. Say you import about a few HDFS blocks-worth of data each
>> day, then use year/month/day partitioning to speed up your Hive queries.
>> You'll need to add the partitions to the table as you go, but actually, you
>> can add those once a month, for example, for all partitions. Hive doesn't
>> care if the partition directories don't exist yet or the directories are
>> empty. I also recommend using an external table, which gives you more
>> flexibility on directory layout, etc.
>>
>> Sqoop might be the easiest tool for importing the data, as it will even
>> generate a Hive table schema from the original MySQL table. However, that
>> feature may not be useful in this case, as you already have the table.
>>
>> I think Oozie is horribly complex to use and overkill for this purpose. A
>> simple bash script triggered periodically by cron is all you need. If you
>> aren't using a partitioned table, you have a single sqoop command to run.
>> If you have partitioned data, you'll also need a hive statement in the
>> script to create the partition, unless you do those in batch once a month,
>> etc., etc.
>>
>> Hope this helps,
>> dean
>>
>> On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti  wrote:
>>
>>> Hi All,
>>>
>>> We are new to hadoop and hive, we are trying to use hive to
>>> run analytical queries and we are using sqoop to import data into hive, in
>>> our RDBMS the data updated very frequently and this needs to be reflected
>>> to hive. Hive does not support update/delete but there are many workarounds
>>> to do this task.
>>>
>>> What's in our mind is importing all the tables into hive as is, then we
>>> build the required tables for reporting.
>>>
>>> My questions are:
>>>
>>>1. What is the best way to reflect MySQL updates into Hive with
>>>minimal resources?
>>>2. Is sqoop the right tool to do the ETL?
>>>3. Is Hive the right tool to do this kind of queries or we should
>>>search for alternatives?
>>>
>>> Any hint will be useful, thanks in advanced.
>>>
>>> --
>>> Ibrahim
>>>
>>
>>
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330
>>
>>
>


-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


Re: Reflect MySQL updates into Hive

2012-12-24 Thread Dean Wampler
This is not as hard as it sounds. The hardest part is setting up the
incremental query against your MySQL database. Then you can write the
results to new files in the HDFS directory for the table and Hive will see
them immediately. Yes, even though Hive doesn't support updates, it doesn't
care how many files are in the directory. The trick is to avoid lots of
little files.

As others have suggested, you should consider partitioning the data,
perhaps by time. Say you import about a few HDFS blocks-worth of data each
day, then use year/month/day partitioning to speed up your Hive queries.
You'll need to add the partitions to the table as you go, but actually, you
can add those once a month, for example, for all partitions. Hive doesn't
care if the partition directories don't exist yet or the directories are
empty. I also recommend using an external table, which gives you more
flexibility on directory layout, etc.
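
For example, a staging table along those lines might look something like this
(the names and types are made up for illustration):

CREATE EXTERNAL TABLE orders_staging (
  order_id    BIGINT,
  customer_id BIGINT,
  status      STRING,
  updated_at  STRING
)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/staging/orders';

Each incremental import then just writes new files under the matching
year/month/day directory.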

Sqoop might be the easiest tool for importing the data, as it will even
generate a Hive table schema from the original MySQL table. However, that
feature may not be useful in this case, as you already have the table.

I think Oozie is horribly complex to use and overkill for this purpose. A
simple bash script triggered periodically by cron is all you need. If you
aren't using a partitioned table, you have a single sqoop command to run.
If you have partitioned data, you'll also need a hive statement in the
script to create the partition, unless you do those in batch once a month,
etc., etc.

Hope this helps,
dean

On Mon, Dec 24, 2012 at 7:08 AM, Ibrahim Yakti  wrote:

> Hi All,
>
> We are new to hadoop and hive. We are trying to use hive to
> run analytical queries, and we are using sqoop to import data into hive. In
> our RDBMS the data is updated very frequently and this needs to be reflected
> to hive. Hive does not support update/delete but there are many workarounds
> to do this task.
>
> What's in our mind is importing all the tables into hive as is, then we
> build the required tables for reporting.
>
> My questions are:
>
>1. What is the best way to reflect MySQL updates into Hive with
>minimal resources?
>2. Is sqoop the right tool to do the ETL?
>3. Is Hive the right tool to do this kind of queries or we should
>search for alternatives?
>
> Any hint will be useful, thanks in advanced.
>
> --
> Ibrahim
>



-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330


  1   2   >