Re: Can't drop table

2014-06-12 Thread Keith Wiley

On Jun 11, 2014, at 13:15 , Edward Capriolo wrote:

> "There is a lengthy discussion, but it's unclear how to actually drop the 
> darn table.  In fact, that page"
> 
> Dumb work around:
>  
> You could go directly to mysql and delete the table...
> 
> Or make a serde with that name just so you will not get the class not found 
> error :)

Okay, thanks.  Still a pretty weird bug all in all, considering that the 
conclusion of that bug report was to not fix the bug.
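
[Editor's note: for anyone else landing on this thread, Edward's second workaround can be sketched as follows; the jar path, class name, and table name are all placeholders, not values taken from this thread.]

```sql
-- Register any jar that supplies a class with the name the table's serde
-- definition points at; the class-not-found error then goes away and the
-- DROP can proceed.  Path and table name below are placeholders.
ADD JAR /path/to/stand-in-serde.jar;
DROP TABLE broken_csv_table;
```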

________
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




Re: Can't drop table

2014-06-11 Thread Keith Wiley
That's beside the point.  The question is, why can't I drop the table?  There 
is no excuse for not dropping a table just because some serde can't be found.  
It shouldn't operate that way at all.

Thanks.

On Jun 10, 2014, at 23:33 , Nitin Pawar wrote:

> if you have added a table with a serde definition 
> 
> then just keep the jar in local filesystem and then in hive console do 
> add jar ;
> 
> this should make your serde available to the table and you should be able to 
> drop the table then.
> 
> I just tried the above steps and it works for my json-based tables. 
> 
> 
> On Wed, Jun 11, 2014 at 2:24 AM, Keith Wiley  wrote:
> I tried to create a table that would use a csv serde.  There were various 
> problems with it, primarily in that I couldn't figure out how to specify the 
> path to the serde (whether I indicated the serde location locally or on 
> HDFS, there were subsequent errors about not finding the serde, even though 
> the paths were correct).
> 
> So I decided to drop the table since it wasn't working...but the serde error 
> that is causing these problems to begin with prevents me from dropping the 
> table due to a serde-doesn't-exist error.  Yeah, duh, that's the problem in 
> the first place.
> 
> This is an acknowledged problem, as indicated here:
> https://issues.apache.org/jira/browse/HIVE-3392
> ...but I don't understand from that page how to get around the problem.  
> There is a lengthy discussion, but it's unclear how to actually drop the darn 
> table.  In fact, that page suggests the problem is officially classified as 
> "won't be fixed".
> 
> Note, this problem also prevents me from renaming the table via "alter".
> 
> I don't know what to do at this point.  Any ideas?
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>    --  Mark Twain
> 
> 
> 
> 
> 
> -- 
> Nitin Pawar



Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
   --  Keith Wiley




Re: Can't drop table

2014-06-11 Thread Keith Wiley
Yeah, that doesn't work.  Hive gives clear failure-to-find-serde errors when 
running queries against such a table.  The serde unquestionably resides at the 
local file system path I specified to "add jar".  It just doesn't work.

On Jun 10, 2014, at 23:33 , Nitin Pawar wrote:

> if you have added a table with a serde definition 
> 
> then just keep the jar in local filesystem and then in hive console do 
> add jar ;
> 
> this should make your serde available to the table and you should be able to 
> drop the table then.
> 
> I just tried the above steps and it works for my json-based tables. 
> 
> 
> On Wed, Jun 11, 2014 at 2:24 AM, Keith Wiley  wrote:
> I tried to create a table that would use a csv serde.  There were various 
> problems with it, primarily in that I couldn't figure out how to specify the 
> path to the serde (whether I indicated the serde location locally or on 
> HDFS, there were subsequent errors about not finding the serde, even though 
> the paths were correct).
> 
> So I decided to drop the table since it wasn't working...but the serde error 
> that is causing these problems to begin with prevents me from dropping the 
> table due to a serde-doesn't-exist error.  Yeah, duh, that's the problem in 
> the first place.
> 
> This is an acknowledged problem, as indicated here:
> https://issues.apache.org/jira/browse/HIVE-3392
> ...but I don't understand from that page how to get around the problem.  
> There is a lengthy discussion, but it's unclear how to actually drop the darn 
> table.  In fact, that page suggests the problem is officially classified as 
> "won't be fixed".
> 
> Note, this problem also prevents me from renaming the table via "alter".
> 
> I don't know what to do at this point.  Any ideas?
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>    --  Mark Twain
> 
> 
> 
> 
> 
> -- 
> Nitin Pawar



Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
   --  Keith Wiley




Can't drop table

2014-06-10 Thread Keith Wiley
I tried to create a table that would use a csv serde.  There were various 
problems with it, primarily in that I couldn't figure out how to specify the 
path to the serde (whether I indicated the serde location locally or on 
HDFS, there were subsequent errors about not finding the serde, even though 
the paths were correct).

So I decided to drop the table since it wasn't working...but the serde error 
that is causing these problems to begin with prevents me from dropping the 
table due to a serde-doesn't-exist error.  Yeah, duh, that's the problem in the 
first place.

This is an acknowledged problem, as indicated here:
https://issues.apache.org/jira/browse/HIVE-3392
...but I don't understand from that page how to get around the problem.  There 
is a lengthy discussion, but it's unclear how to actually drop the darn table.  
In fact, that page suggests the problem is officially classified as "won't be 
fixed".

Note, this problem also prevents me from renaming the table via "alter".

I don't know what to do at this point.  Any ideas?

____
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
   --  Mark Twain




Re: bucketed table problems

2014-03-07 Thread Keith Wiley
Thanks.  :-)  It's working better for me now.

On Mar 7, 2014, at 17:37 , Stephen Sprague wrote:

> short answer: its by position.
> 


____
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




Re: bucketed table problems

2014-03-07 Thread Keith Wiley
Admittedly the problem has gone away.  I *think* I may have offered the schema 
to the bucketed table creation in the wrong order the first time...I think.  
All I know is the problem went away when I tried a second time.  Should 
inserting a table into another table match the columns by name or does it just 
write from left to right utterly ignoring any schema specification?  Does it 
write the column 'a' from foo into column 'a' in foo_bucketed, or does it write 
column 'a' from foo into the same column position in foo_bucketed without any 
regard for the names of the columns?
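
[Editor's note: Hive's INSERT ... SELECT pairs columns strictly by position, never by name, so one defensive habit is to spell out the source columns in the target table's declared order.  A sketch using the table names from this thread:]

```sql
-- Columns are matched to the target by position, not by name,
-- so make the order explicit rather than relying on SELECT *:
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE foo_bucketed
SELECT a, b, c FROM foo;
```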

On Mar 7, 2014, at 13:52 , Stephen Sprague wrote:

> yeah. that's not right. 
> 
> 1. lets see the output of "show create table foo"
> 
> 2. what version of hive are you using.
> 
> 
> On Fri, Mar 7, 2014 at 11:46 AM, Keith Wiley  wrote:
> I want to convert a table to a bucketed table, so I made a new table with the 
> same schema as the old table and specified a cluster column:
> 
> create table foo_bucketed
> (
> a string,
> b int,
> c float
> )
> clustered by (b) into 10 buckets;
> 
> Then I populate it from my original table:
> 
> set hive.enforce.bucketing = true;
> insert overwrite table foo_bucketed
> select * from foo;
> 
> All of the data goes into the first bucket, leaving the remaining 9 buckets 
> empty (in the file system, the remaining 9 files are 0 size).  Furthermore, 
> the cluster column is now NULL.  Its values have been completely erased by 
> the insertion (which might explain how they all ended up in a single bucket 
> of course).
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "Yet mark his perfect self-contentment, and hence learn his lesson, that to be
> self-contented is to be vile and ignorant, and that to aspire is better than to
> be blindly and impotently happy."
>--  Edwin A. Abbott, Flatland
> ____
> 
> 



Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




bucketed table problems

2014-03-07 Thread Keith Wiley
I want to convert a table to a bucketed table, so I made a new table with the 
same schema as the old table and specified a cluster column:

create table foo_bucketed
(
a string,
b int,
c float
)
clustered by (b) into 10 buckets;

Then I populate it from my original table:

set hive.enforce.bucketing = true;
insert overwrite table foo_bucketed
select * from foo;

All of the data goes into the first bucket, leaving the remaining 9 buckets 
empty (in the file system, the remaining 9 files are 0 size).  Furthermore, the 
cluster column is now NULL.  Its values have been completely erased by the 
insertion (which might explain how they all ended up in a single bucket of 
course).


Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




Use distribute to spread across reducers

2013-10-02 Thread Keith Wiley
I'm trying to create a subset of a large table for testing.  The following 
approach works:

create table subset_table as
select * from large_table limit 1000

...but it only uses one reducer.  I would like to speed up the process of 
creating a subset by distributing across multiple reducers.  I already tried 
explicitly setting mapred.reduce.tasks and hive.exec.reducers.max to values 
larger than 1, but in this particular case those values seem to be overridden 
by Hive's internal query-to-MapReduce conversion; it ignores those parameters.

So, I tried this:

create table subset_table as
select * from large_table limit 1000
distribute by column_name

...but that doesn't parse.  I get the following error:

OK FAILED: ParseException line 3:0 missing EOF at 'distribute' near '1000'.

I have tried NUMEROUS applications of parentheses, nested queries, etc.  For 
example, here's just one (amongst perhaps ten variations on a theme):

create table subset_table as
select * from (
from (
select * from large_table limit 1000
distribute by column_name
)) s

Like I said, I've tried all sorts of combinations of the elements shown above.  
So far I have not even gotten any syntax to parse, much less run.  Only the 
original query at the top will even pass the parsing stage of processing.

Any ideas?

Thanks.
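
[Editor's note: one hedged suggestion for later readers, untested against the poster's setup: Hive's grammar wants DISTRIBUTE BY to close inside a subquery, with LIMIT attached to the outer SELECT, so a shape like the following at least stands a chance of parsing.]

```sql
-- Untested sketch: DISTRIBUTE BY lives in the inner query,
-- LIMIT in the outer one.
CREATE TABLE subset_table AS
SELECT * FROM (
  SELECT * FROM large_table
  DISTRIBUTE BY column_name
) t
LIMIT 1000;
```

[Caveat: a global LIMIT is still enforced by a single final reducer, so even if this parses it may not spread the work the way the poster hopes.]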

________
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Re: Want query to use more reducers

2013-09-30 Thread Keith Wiley
Thanks.  mapred.reduce.tasks and hive.exec.reducers.max seem to have fixed the 
problem.  It is now saturating the cluster and running the query super fast.  
Excellent!
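
[Editor's note: for reference, the knobs discussed in this thread are ordinary session-level settings; the values below are illustrative, not the poster's actual numbers.]

```sql
-- Illustrative values only; tune to the cluster's reduce capacity.
SET mapred.reduce.tasks=40;                          -- pin the reducer count (-1 = use heuristic)
SET hive.exec.reducers.max=40;                       -- cap the heuristic's guess
SET hive.exec.reducers.bytes.per.reducer=268435456;  -- 256 MB of input per reducer
```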

On Sep 30, 2013, at 12:28 , Sean Busbey wrote:

> Hey Keith,
> 
> It sounds like you should tweak the settings for how Hive handles query 
> execution[1]:
> 
> 1) Tune the guessed number of reducers based on input size
> 
> = hive.exec.reducers.bytes.per.reducer
> 
> Defaults to 1G. Based on your description, it sounds like this is probably 
> still at default.
> 
> In this case, you should also set a max # of reducers based on your cluster 
> size.
> 
> = hive.exec.reducers.max
> 
> I usually set this to the # reduce slots, if there's a decent chance I'll get 
> to saturate the cluster. If not, don't worry about it.
> 
> 2) Hard code a number of reducers
> 
> = mapred.reduce.tasks
> 
> Setting this will cause Hive to always use that number. It defaults to -1, 
> which tells hive to use the heuristic about input size to guess.
> 
> In either of the above cases, you should look at the options to merge small 
> files (search for "merge"  in the configuration property list) to avoid 
> getting lots of little outputs.
> 
> HTH
> 
> [1]: 
> https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution
> 
> -Sean
> 
> On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley  wrote:
> I have a query that doesn't use reducers as efficiently as I would hope.  If 
> I run it on a large table, it uses more reducers, even saturating the 
> cluster, as I desire.  However, on smaller tables it uses as low as a single 
> reducer.  While I understand there is a logic in this (not using multiple 
> reducers until the data size is larger), it is nevertheless inefficient to 
> run a query for thirty minutes leaving the entire cluster vacant when the 
> query could distribute the work evenly and wrap things up in a fraction of 
> the time.  The query is shown below (abstracted to its basic form).  As you 
> can see, it is a little atypical: it is a nested query which obviously 
> implies two map-reduce jobs and it uses a script for the reducer stage that I 
> am trying to speed up.  I thought the "distribute by" clause should make it 
> use the reducers more evenly, but as I said, that is not the behavior I am 
> seeing.
> 
> Any ideas how I could improve this situation?
> 
> Thanks.
> 
> CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
> SELECT * FROM (
> FROM (
> SELECT * FROM input_table
> DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, 
> input_column_2 ASC, input_column_etc ASC) q
> SELECT TRANSFORM(*)
> USING 'python my_reducer_script.py' AS(
> output_column_1,
> output_column_2,
> output_column_etc,
> )
> ) s
> ORDER BY output_column_1;
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "Luminous beings are we, not this crude matter."
>--  Yoda
> 
> 
> 
> 
> 
> -- 
> Sean



Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Want query to use more reducers

2013-09-30 Thread Keith Wiley
I have a query that doesn't use reducers as efficiently as I would hope.  If I 
run it on a large table, it uses more reducers, even saturating the cluster, as 
I desire.  However, on smaller tables it uses as low as a single reducer.  
While I understand there is a logic in this (not using multiple reducers until 
the data size is larger), it is nevertheless inefficient to run a query for 
thirty minutes leaving the entire cluster vacant when the query could 
distribute the work evenly and wrap things up in a fraction of the time.  The 
query is shown below (abstracted to its basic form).  As you can see, it is a 
little atypical: it is a nested query which obviously implies two map-reduce 
jobs and it uses a script for the reducer stage that I am trying to speed up.  
I thought the "distribute by" clause should make it use the reducers more 
evenly, but as I said, that is not the behavior I am seeing.

Any ideas how I could improve this situation?

Thanks.

CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as 
SELECT * FROM (
FROM (
SELECT * FROM input_table
DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, 
input_column_2 ASC, input_column_etc ASC) q
SELECT TRANSFORM(*)
USING 'python my_reducer_script.py' AS(
output_column_1,
output_column_2,
output_column_etc,
)
) s
ORDER BY output_column_1;

________
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




Re: .sql vs. .hql

2013-05-31 Thread Keith Wiley
On May 31, 2013, at 13:52 , Sanjay Subramanian wrote:

> First u need to setup some files in HDFS and define some Hive tables and
> point to the location and then start your journey !
> 
> sanjay


Oh I'm miles past that.  I've been deploying Hive systems for well over a year 
now.  I'm just not used to working with input files.  I (and my end-users) have 
been one-query-at-a-time console queriers up until now (often batched through 
bash scripts and such), so all I need is to come up to speed on using .sql and 
.hql files.

Thanks.

________
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




Re: .sql vs. .hql

2013-05-31 Thread Keith Wiley
Thanks!  I'm always up for a good book.  Our case is a little open-ended right 
now, a little difficult to nail down or describe.  Basically, I'm deploying 
Hive systems for a SQL-comfortable data analyst (who is also quite versed in 
Hive) who would like to run .sql files in Hive and I need to clearly understand 
how Hive uses input files and how .sql and .hql may differ.

Cheers!

On May 31, 2013, at 13:20 , Sanjay Subramanian wrote:

> For Hive u need the other bible by Dean Wampler , Edward Capriolo et al
> Also if u tell us what use cases u have we could provide help…
> 
> sanjay
> 
> On 5/31/13 1:17 PM, "Keith Wiley"  wrote:
> 
>> I'm looking for documentation on how to use .sql and .hql files in Hive
>> and what the differences are between the two file types.  I'm not even
>> certain .hql is a real thing, but I'm looking.  I didn't see any mention
>> of this issue in White's Hadoop book, at least not clearly indicated in
>> the index.
>> 
>> Thanks.
>> 
>> ________
>> Keith Wiley kwi...@keithwiley.com keithwiley.com
>> music.keithwiley.com
>> 
>> "Luminous beings are we, not this crude matter."
>>  --  Yoda
>> ________
>> 
>> 
> 
> 
> CONFIDENTIALITY NOTICE
> ==
> This email message and any attachments are for the exclusive use of the 
> intended recipient(s) and may contain confidential and privileged 
> information. Any unauthorized review, use, disclosure or distribution is 
> prohibited. If you are not the intended recipient, please contact the sender 
> by reply email and destroy all copies of the original message along with any 
> attachments, from your computer system. If you are the intended recipient, 
> please be advised that the content of this message is subject to access, 
> review and disclosure by the sender's Email System Administrator.
> 



Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




.sql vs. .hql

2013-05-31 Thread Keith Wiley
I'm looking for documentation on how to use .sql and .hql files in Hive and 
what the differences are between the two file types.  I'm not even certain .hql 
is a real thing, but I'm looking.  I didn't see any mention of this issue in 
White's Hadoop book, at least not clearly indicated in the index.

Thanks.
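
[Editor's note: for what it's worth, Hive itself attaches no meaning to the extension; .sql and .hql are both plain text files of HiveQL statements and run identically.  A sketch, file names being placeholders:]

```sql
-- From the OS shell:        hive -f queries.hql    (or queries.sql)
-- From inside the Hive CLI:
source /path/to/queries.hql;
```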

____
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




Query doesn't filter UDF result as expected

2013-04-17 Thread Keith Wiley
Hive is not filtering on the result of a UDF in the way I expect or desire it 
to.  The example below shows a very small sample table.  The nested query that 
follows the table dump attempts to filter the results on the basis of a 
row_number UDF.  The thing to note is that in the query results, 'rn' 
increments steadily from 1 to 15.  Next, note that the last clause of the query 
intended to filter the result to only those result rows where 'rn' equals 1.  
In other words, I expected the query to only return the first result row.  What 
am I missing?  Thanks.

hive> set hive.cli.print.header=true;
hive> select * from test;
OK
foo     bar     user    category        value
foo     bar     user1   catA    1
foo     bar     user1   catB    2
foo     bar     user1   catC    3
foo     bar     user2   catA    99
foo     bar     user2   catB    98
foo     bar     user2   catC    97
foo     bar     user3   catA    100
foo     bar     user3   catB    100
foo     bar     user3   catC    100
foo     bar     user3   catA    200
foo     bar     user3   catB    200
foo     bar     user3   catC    200
foo     bar     user3   catA    300
foo     bar     user3   catB    300
foo     bar     user3   catC    300
Time taken: 0.092 seconds
hive> select * from (
> select *, row_number(foo) as rn
> from (
> select foo, bar, user, category, value
> from test
> where foo = 'foo'
> sort by foo ) a
> sort by rn ) b
> where rn = 1;
[...MAP_REDUCE...]
OK
foo     bar     user    category        value   rn
foo     bar     user1   catA    1       1
foo     bar     user1   catB    2       2
foo     bar     user1   catC    3       3
foo     bar     user2   catA    99      4
foo     bar     user2   catB    98      5
foo     bar     user2   catC    97      6
foo     bar     user3   catA    100     7
foo     bar     user3   catB    100     8
foo     bar     user3   catC    100     9
foo     bar     user3   catA    200     10
foo     bar     user3   catB    200     11
foo     bar     user3   catC    200     12
foo     bar     user3   catA    300     13
foo     bar     user3   catB    300     14
foo     bar     user3   catC    300     15
Time taken: 87.933 seconds
hive>


________
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Rank(): here's how I did it, for better or worse...

2013-04-02 Thread Keith Wiley
I agree, it's probably best to use a better engineered approach such as 
Edward's.  In the meantime, if anyone would benefit from a walk-through of my 
direct approach, here it is.  It combines Ritesh's direct ultra-simplistic 
method with Edward's correct HiveQL syntax.  As would be expected, it is 
sensitive to shell vagaries that would be better managed by a combo system like 
git and maven...but it works.



Create Rank.java:
-
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF {
    private int counter;      // position within the current key's group
    private String last_key;  // previous key, used to detect group changes

    public int evaluate(final String key) {
        // Reset on a new key; this relies on rows arriving grouped and
        // sorted (DISTRIBUTE BY key ... SORT BY key ...).
        if (!key.equalsIgnoreCase(this.last_key)) {
            this.counter = 0;
            this.last_key = key;
        }
        return this.counter++;  // ranks are 0-indexed
    }
}



Compile Rank.java to Rank.class.  Then bundle to Rank.jar.  Observe that the 
jar command is highly sensitive to the relative path to the .class file when 
naming the package inside the resulting .jar file:
-
$ mkdir ./RankTempDir
$ javac -classpath $HIVE_HOME/lib/hive-serde-0.8.1.jar:$HIVE_HOME/lib/hive-exec-0.8.1.jar:$HADOOP_HOME/hadoop-core.jar -d ./RankTempDir Rank.java
$ cd RankTempDir;
$ jar -cf ../Rank.jar ./com
$ cd ..

You will have to verify (via echo) HADOOP_HOME and HIVE_HOME and then will have 
to verify the name of the serde, exec, and core files in each directory.  The 
exact filenames are probably version specific.

Verify the package path in Rank.jar:
-
$ jar -tvf Rank.jar

You should see 'com/example/hive/udf/Rank.class'.  If you see a different path, 
the package has not been properly represented in the jar w.r.t. its designation 
in the .java file.



Run hive and prepare the session to use the UDF:

$ hive
hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';

You must either run hive from a directory containing Rank.jar or specify an 
alternate path to it in the "add" command.  Note that the Rank class's full 
package is specified in the "create" command and therefore must, logically, 
match the package in both the .java and the .jar files.



Consider a table named 'test' consisting of columns 'user', 'category', and 
'value', containing the following data:

hive> select * from test;
user1   catA    1
user1   catB    11
user1   catC    111
user2   catA    222
user2   catB    22
user2   catC    2
user3   catA    3
user3   catB    5
user3   catC    4

So the top category for user1 is catC, for user2 is catA and for user3 is catB. 
 Say we want the top N valued categories for each user.  In the example below, 
N is 2 (it is indicated in the final WHERE clause).  Here is the format of the 
corresponding ranked query, and its result:

hive> SELECT user, category, value, ranked_col
FROM (
SELECT user, category, value, rank(user) ranked_col
FROM (
SELECT user, category, value
FROM test
DISTRIBUTE BY user
SORT BY user, value desc
) a
) b
WHERE ranked_col < 2
ORDER BY user, ranked_col;
...
[wait for Hive query and MapReduce job(s) to finish]
...
user1   catC    111     0
user1   catB    11      1
user2   catA    222     0
user2   catB    22      1
user3   catB    5       0
user3   catC    4       1

Note that ranks are 0-indexed (of course I suppose that's a property of the 
specific .java we wrote above, and therefore is easily amenable to 1-indexing, 
which would more canonically connote the notion of a "rank").

Anyway, that's what I came up with.  I don't by any means claim it's the best 
approach.  Edward is surely right that the best method would be to use the 
powerful tools made available by the large developer community such as git and 
maven.

Cheers!


Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




Re: Need rank()

2013-04-02 Thread Keith Wiley
On Apr 2, 2013, at 12:59 , Edward Capriolo wrote:

> I feel your pain I do not have time to look at the project ATM. However 
> trying to do Java development these days without understanding git, or maven 
> is fairly impossible, most projects depend on N others that depend on M 
> others. So trying to even compile the code outside of maven requires a lot of 
> manually fetching jar etc. Your better off taking a day and getting good with 
> maven, IMHO. You will be dealing with it for the foreseeable future. 


I agree.  I'll cozy up with it first chance I get...but sometimes you're on a 
deadline and have to go for the low-hanging fruit.  I fully understand the 
motivation behind your approach.  I'll circle back around it when I get some 
free time.

Cheers!

____
Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Re: Need rank()

2013-04-02 Thread Keith Wiley
Yep, the original article is definitely erroneous in this regard.  I figured 
that out eventually.  I'm not sure how much I can trust that resource now.  I 
may have to look elsewhere.  I agree that Edward's description is pretty good, 
but as I said earlier, I can't actually use his code, so I'm trying to cobble a 
workable solution together from the various resources available.  Ritesh's 
article, despite the error in the Hive syntax, is still useful in that it 
enables one to quickly compile a simple rank jar without relying on git, maven, 
or other project dependencies -- problems which have plagued me with Edward's 
approach.  So, if I can use Ritesh's method to write a simple rank function, 
and Edward's accurate description of how to construct the query, then I can put 
all the pieces together into a workable solution.

I'll let you know if I get it.

On Apr 2, 2013, at 10:56 , Igor Tatarinov wrote:

> You are getting the error because you are ORDERing BY rank but rank is not in 
> the top SELECT
> 
> Also, DISTRIBUTE BY/SORT BY are done after SELECT so you have to use a 
> subquery:
> SELECT ..., rank(user)
> FROM (SELECT ... DISTRIBUTE BY ... SORT BY)
> 
> igor
> decide.com
> 
> 
> On Tue, Apr 2, 2013 at 10:03 AM, Keith Wiley  wrote:
> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> 
> > http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> 
> Is there any possibility there is a bug in Ritesh Agrawal's query statement 
> from that article?  I created a test table with the exact column names from 
> the example in the article and used a minimally altered version of the 
> command (I removed the where clause to simplify things a bit) and got an 
> error which suggests there is something slightly wrong with the command (or 
> perhaps the table has to be configured a special way).  Here's what I get 
> when I almost perfectly duplicate that example:
> 
> hive> describe test;
> OK
> user        string
> category    string
> value       int
> Time taken: 0.082 seconds
> ==
> hive> select * from test;
> OK
> user1   cat1    1
> user1   cat1    2
> user1   cat1    3
> user1   cat2    10
> user1   cat2    20
> user1   cat2    30
> user2   cat1    11
> user2   cat1    21
> user2   cat1    31
> user2   cat2    5
> user2   cat2    6
> user2   cat2    7
> Time taken: 0.202 seconds
> ==
> hive> SELECT user, category, value
> > FROM (
> > SELECT user, category, rank(user) as rank, value
> > FROM test
> > DISTRIBUTE BY user
> > SORT BY user, value desc
> > ) a
> > WHERE rank > 1
> > ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
> reference 'rank': (possible column names are: _col0, _col1, _col2)
> hive>
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio than
> when I entered."
>--  Keith Wiley
> 
> 
> 



Keith Wiley kwi...@keithwiley.com keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




Re: Need rank()

2013-04-02 Thread Keith Wiley
I'm pretty sure there are at least a few bugs in that article.  The problem 
described below goes away if "rank" is added to the outer select statement.  If 
there's one error I suppose there may be others.  Although I have gotten the 
query to run, I'm not getting properly ranked results yet.  I'm still working 
on it.

On Apr 2, 2013, at 10:03 , Keith Wiley wrote:

> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> 
>> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> 
> Is there any possibility there is a bug in Ritesh Agrawal's query statement 
> from that article?  I created a test table with the exact column names from 
> the example in the article and used a minimally altered version of the 
> command (I removed the where clause to simplify things a bit) and got an 
> error which suggests there is something slightly wrong with the command (or 
> perhaps the table has to be configured a special way).  Here's what I get 
> when I almost perfectly duplicate that example:
> 
> hive> describe test;
> OK
> user  string  
> category  string  
> value int 
> Time taken: 0.082 seconds
> ==
> hive> select * from test;
> OK
> user1   cat1    1
> user1   cat1    2
> user1   cat1    3
> user1   cat2    10
> user1   cat2    20
> user1   cat2    30
> user2   cat1    11
> user2   cat1    21
> user2   cat1    31
> user2   cat2    5
> user2   cat2    6
> user2   cat2    7
> Time taken: 0.202 seconds
> ==
> hive> SELECT user, category, value
>> FROM (
>> SELECT user, category, rank(user) as rank, value
>> FROM test
>> DISTRIBUTE BY user
>> SORT BY user, value desc
>> ) a
>> WHERE rank > 1
>> ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
> reference 'rank': (possible column names are: _col0, _col1, _col2)
> hive> 
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio than
> when I entered."
>--  Keith Wiley
> 
> 



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




Re: Need rank()

2013-04-02 Thread Keith Wiley
I suppose.  Like I said, I was just following the example on that website and 
it didn't seem to have any comments on it remarking on errors in the article.  
Thus my confusion about the whole thing.  It is presented as being accurate and 
is old enough that any errors would have been caught and labeled in the 
comments long ago.  I also thought it was a little odd to name the function the 
same thing as the alias assigned to the result.  I'll try disambiguating that.
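For reference, the behavior the thread's stateful rank UDF is expected to have can be re-stated outside Hive. The following is an illustrative Python sketch, not the actual UDF code: an incrementing counter that resets whenever the grouping key changes, which only produces correct ranks if rows arrive grouped (DISTRIBUTE BY) and ordered (SORT BY), as in the queries above. The data mirrors the thread's toy table.

```python
# Minimal, illustrative re-implementation of the stateful "rank" UDF's
# semantics in plain Python (an assumption about how the UDF works, not
# code taken from the thread).

def rank_rows(rows):
    """rows: iterable of (user, category, value), pre-sorted by
    (user, value desc). Yields (user, category, value, rank)."""
    prev_key = None
    counter = 0
    for user, category, value in rows:
        if user != prev_key:
            counter = 0          # new group: reset, like the UDF does
            prev_key = user
        counter += 1
        yield (user, category, value, counter)

# A subset of the thread's test data, "distributed" and "sorted" up front,
# standing in for DISTRIBUTE BY user SORT BY user, value desc.
data = sorted(
    [("user1", "cat1", 1), ("user1", "cat2", 10), ("user1", "cat2", 30),
     ("user2", "cat1", 11), ("user2", "cat2", 5), ("user2", "cat1", 31)],
    key=lambda r: (r[0], -r[2]),
)

# The outer query's filter, with the alias renamed to avoid the collision
# discussed above: "WHERE rnk <= 2".
top2 = [r for r in rank_rows(data) if r[3] <= 2]
for row in top2:
    print(row)
```

Running the sketch yields the top two rows per user by value, which is exactly what the DISTRIBUTE BY / SORT BY / rank pattern is meant to compute.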

On Apr 2, 2013, at 10:22 , Nitin Pawar wrote:

> I am not sure about this 
> but you defined a rank function and then aliased the column as rank itself. 
> Can this cause the issue? 
> 
> From the error it looks like it could not figure out which column is rank in 
> the where clause "WHERE rank > 1" 
> 
> from Edward's blog the correct query looks similar  but his function is named 
> as p_rank 
> 
> SELECT
>  category,country,product,sales,rank
> FROM (
>  SELECT
>category,country,product,sales,
>p_rank(category, country) rank
>  FROM (
>SELECT
>  category,country,product,
>  sales
>FROM p_rank_demo
>DISTRIBUTE BY
>  category,country
>SORT BY
>      category,country,sales desc) t1) t2
> WHERE rank <= 3
> 
> 
> On Tue, Apr 2, 2013 at 10:45 PM, Keith Wiley  wrote:
> I did all that, I just didn't paste it into the email.  That isn't the 
> problem.  Sorry for the confusion.
> 
> hive> add jar Rank.jar;
> hive> create temporary function rank as 'com.example.hive.udf.Rank';
> 
> BTW, small typo: the condition at the end uses less-than, not
> greater-than...obviously...but that isn't the problem either.
> 
> On Apr 2, 2013, at 10:06 , Nitin Pawar wrote:
> 
> > I dont see you added rank in the classpath and defined rank function
> >
> > can you try add jar and define the function and try again
> >
> >
> >
> > On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley  wrote:
> > On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> >
> > > http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> >
> > Is there any possibility there is a bug in Ritesh Agrawal's query statement 
> > from that article?  I created a test table with the exact column names from 
> > the example in the article and used a minimally altered version of the 
> > command (I removed the where clause to simplify things a bit) and got an 
> > error which suggests there is something slightly wrong with the command (or 
> > perhaps the table has to be configured a special way).  Here's what I get 
> > when I almost perfectly duplicate that example:
> >
> > hive> describe test;
> > OK
> > user        string
> > category    string
> > value       int
> > Time taken: 0.082 seconds
> > ==
> > hive> select * from test;
> > OK
> > user1   cat1    1
> > user1   cat1    2
> > user1   cat1    3
> > user1   cat2    10
> > user1   cat2    20
> > user1   cat2    30
> > user2   cat1    11
> > user2   cat1    21
> > user2   cat1    31
> > user2   cat2    5
> > user2   cat2    6
> > user2   cat2    7
> > Time taken: 0.202 seconds
> > ==
> > hive> SELECT user, category, value
> > > FROM (
> > > SELECT user, category, rank(user) as rank, value
> > > FROM test
> > > DISTRIBUTE BY user
> > > SORT BY user, value desc
> > > ) a
> > > WHERE rank > 1
> > > ORDER BY user, rank;
> > FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
> > reference 'rank': (possible column names are: _col0, _col1, _col2)
> > hive>
> >
> > ____
> > Keith Wiley kwi...@keithwiley.com keithwiley.com
> > music.keithwiley.com
> >
> > "What I primarily learned in grad school is how much I *don't* know.
> > Consequently, I left grad school with a higher ignorance to knowledge ratio 
> > than
> > when I entered."
> >--  Keith Wiley
> > ____
> >
> >
> >
> >
> > --
> > Nitin Pawar
> 
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> mu

Re: Need rank()

2013-04-02 Thread Keith Wiley
I did all that, I just didn't paste it into the email.  That isn't the problem. 
 Sorry for the confusion.

hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';

BTW, small typo: the condition at the end uses less-than, not 
greater-than...obviously...but that isn't the problem either.

On Apr 2, 2013, at 10:06 , Nitin Pawar wrote:

> I dont see you added rank in the classpath and defined rank function 
> 
> can you try add jar and define the function and try again 
> 
> 
> 
> On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley  wrote:
> On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
> 
> > http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
> 
> Is there any possibility there is a bug in Ritesh Agrawal's query statement 
> from that article?  I created a test table with the exact column names from 
> the example in the article and used a minimally altered version of the 
> command (I removed the where clause to simplify things a bit) and got an 
> error which suggests there is something slightly wrong with the command (or 
> perhaps the table has to be configured a special way).  Here's what I get 
> when I almost perfectly duplicate that example:
> 
> hive> describe test;
> OK
> user        string
> category    string
> value       int
> Time taken: 0.082 seconds
> ==
> hive> select * from test;
> OK
> user1   cat1    1
> user1   cat1    2
> user1   cat1    3
> user1   cat2    10
> user1   cat2    20
> user1   cat2    30
> user2   cat1    11
> user2   cat1    21
> user2   cat1    31
> user2   cat2    5
> user2   cat2    6
> user2   cat2    7
> Time taken: 0.202 seconds
> ==
> hive> SELECT user, category, value
> > FROM (
> > SELECT user, category, rank(user) as rank, value
> > FROM test
> > DISTRIBUTE BY user
> > SORT BY user, value desc
> > ) a
> > WHERE rank > 1
> > ORDER BY user, rank;
> FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
> reference 'rank': (possible column names are: _col0, _col1, _col2)
> hive>
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com    
> music.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio 
> than
> when I entered."
>--  Keith Wiley
> 
> 
> 
> 
> 
> -- 
> Nitin Pawar


________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
   --  Keith Wiley




Re: Need rank()

2013-04-02 Thread Keith Wiley
On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:

> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

Is there any possibility there is a bug in Ritesh Agrawal's query statement 
from that article?  I created a test table with the exact column names from the 
example in the article and used a minimally altered version of the command (I 
removed the where clause to simplify things a bit) and got an error which 
suggests there is something slightly wrong with the command (or perhaps the 
table has to be configured a special way).  Here's what I get when I almost 
perfectly duplicate that example:

hive> describe test;
OK
user        string
category    string
value       int
Time taken: 0.082 seconds
==
hive> select * from test;
OK
user1   cat1    1
user1   cat1    2
user1   cat1    3
user1   cat2    10
user1   cat2    20
user1   cat2    30
user2   cat1    11
user2   cat1    21
user2   cat1    31
user2   cat2    5
user2   cat2    6
user2   cat2    7
Time taken: 0.202 seconds
==
hive> SELECT user, category, value
> FROM (
> SELECT user, category, rank(user) as rank, value
> FROM test
> DISTRIBUTE BY user
> SORT BY user, value desc
> ) a
> WHERE rank > 1
> ORDER BY user, rank;
FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
reference 'rank': (possible column names are: _col0, _col1, _col2)
hive> 

________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
   --  Keith Wiley




Re: Need rank(), can't build m6d's version

2013-04-02 Thread Keith Wiley
Thank you Edward.  I appreciate you trying to help out.  I'll see if I can get 
that to work.

On Apr 1, 2013, at 17:09 , Edward Capriolo wrote:

> approx steps
> git clone hive-test 
> mvn install
> git clone hive-rank
> mvn install
> 
> or just
> git clone hive-test
> mvn -Dmaven.test.skip=true install

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
   --  Mark Twain




Re: Need rank(), can't build m6d's version

2013-04-01 Thread Keith Wiley
Thanks!  That looks like it has serious potential.  I'll study up on it.

On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:

> http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/


____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




Need rank(), can't build m6d's version

2013-04-01 Thread Keith Wiley
I need rank() in Hive.  I haven't had much luck with Edward Capriolo's version on git 
and it comes with no documentation.  It depends on hive-test (also by Edward) 
and I can't get maven to build the tests for hive-test which prevents me from 
building the hive-test package (the jar)...which is all pretty tangential to 
the real goal, which is rank of course.

Is there any place to download the hive-rank jar pre-built instead of having to 
build it out of git?  If not, has anyone built it recently, or namely has 
anyone built the hive-test project it depends on?  The following is just one 
example of the sort of error I see.  Remember this error is on hive-test, which 
frankly I don't even care about.  :-)

Feeling a little desperate...thanks for any help.

---
Test set: com.jointhegrid.hive_test.EmbeddedHiveExampleTest
---
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.878 sec <<< 
FAILURE!
testA(com.jointhegrid.hive_test.EmbeddedHiveExampleTest)  Time elapsed: 7.823 
sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<9>
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.failNotEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:126)
at org.junit.Assert.assertEquals(Assert.java:470)
at org.junit.Assert.assertEquals(Assert.java:454)
at 
com.jointhegrid.hive_test.EmbeddedHiveExampleTest.testA(EmbeddedHiveExampleTest.java:51)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:243)
at junit.framework.TestSuite.run(TestSuite.java:238)
at 
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
at 
org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
at $Proxy0.invoke(Unknown Source)
at 
org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
at 
org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)

________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Re: S3/EMR Hive: Load contents of a single file

2013-03-27 Thread Keith Wiley
Okay, I also saw your previous response which analyzed queries into two tables 
built around two files in the same directory.  I guess I was simply wrong in my 
understanding that a Hive table is fundamentally associated with a directory 
instead of a file.  Turns out, it can be either one.  A directory table uses 
all files in the directory while a file table uses one specific file and 
properly avoids sibling files.  My bad.

Thanks for the careful analysis and clarification.  TIL!

Cheers!

On Mar 27, 2013, at 02:58 , Tony Burton wrote:

> A bit more info - do an extended description of the table:
>  
> $ desc extended gsrc1;
>  
> And the “location” field is “location:s3://mybucket/path/to/data/src1.txt”
>  
> Do the same on a table created with a location pointing at the directory and 
> the same info gives (not surprisingly) “location:s3://mybucket/path/to/data/”
> 

________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
   --  Abe (Grandpa) Simpson




Re: S3/EMR Hive: Load contents of a single file

2013-03-26 Thread Keith Wiley
7:9000
> $ set hive.metastore.warehouse.dir;
> hive.metastore.warehouse.dir=/mnt/hive_081/warehouse
>  
> Clearly different, but which is correct? Is there an easier way to load a 
> single file into a hive table? Or should I just put each file in a directory 
> and proceed as before?
>  
> Thanks!
>  
> Tony
>  
>  
>  
>  
>  
>  
>  
> Tony Burton
> Senior Software Engineer
> e: tbur...@sportingindex.com
> 
> 
>  
>  
> 
> 
> *
> P Please consider the environment before printing this email or attachments
> 
> 
> This email and any attachments are confidential, protected by copyright and 
> may be legally privileged. If you are not the intended recipient, then the 
> dissemination or copying of this email is prohibited. If you have received 
> this in error, please notify the sender by replying by email and then delete 
> the email completely from your system. Neither Sporting Index nor the sender 
> accepts responsibility for any virus, or any other defect which might affect 
> any computer or IT system into which the email is received and/or opened. It 
> is the responsibility of the recipient to scan the email and no 
> responsibility is accepted for any loss or damage arising in any way from 
> receipt or use of this email. Sporting Index Ltd is a company registered in 
> England and Wales with company number 2636842, whose registered office is at 
> Gateway House, Milverton Street, London, SE11 4AP. Sporting Index Ltd is 
> authorised and regulated by the UK Financial Services Authority (reg. no. 
> 150404) and Gambling Commission (reg. no. 000-027343-R-308898-001). Any 
> financial promotion contained herein has been issued and approved by Sporting 
> Index Ltd.
> 
> 
> Outbound email has been scanned for viruses and SPAM
>  
> CONFIDENTIALITY NOTICE
> ==
> This email message and any attachments are for the exclusive use of the 
> intended recipient(s) and may contain confidential and privileged 
> information. Any unauthorized review, use, disclosure or distribution is 
> prohibited. If you are not the intended recipient, please contact the sender 
> by reply email and destroy all copies of the original message along with any 
> attachments, from your computer system. If you are the intended recipient, 
> please be advised that the content of this message is subject to access, 
> review and disclosure by the sender's Email System Administrator.
>  
> 
> 
> 



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




Re: Query crawls through reducer

2013-03-24 Thread Keith Wiley
Thanks.

On Mar 22, 2013, at 21:02 , Nitin Pawar wrote:

> instead of >= can you just try = if you want to limit to the top 100 (b being a 
> partition, I guess it will have more than 100 records to fit into your limit)
> 
> to improve your query performance your table file format matters as well. 
> Which one are you using?  
> how many partitions are there? 
> what's the size of the cluster?
> you can set the number of reducers but if your query just has one key then 
> only one reducer will get the data and rest will run empty 
> 
> 
> 
> On Sat, Mar 23, 2013 at 4:32 AM, Keith Wiley  wrote:
> The following query translates into a many-map-single-reduce job (which is 
> common) and also slags through the reduce stage...it's killing the overall 
> query:
> 
> select * from a where b >= 'c' order by b desc limit 100
> 
> Note that b is a partition.  What component is making the reducer heavy?  Is 
> it the order by or the limit (I'm sure it's not the partition-specific where 
> clause, right?)?  Are there ways to improve its performance?
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "You can scratch an itch, but you can't itch a scratch. Furthermore, an itch 
> can
> itch but a scratch can't scratch. Finally, a scratch can itch, but an itch 
> can't
> scratch. All together this implies: He scratched the itch from the scratch 
> that
> itched but would never itch the scratch from the itch that scratched."
>--  Keith Wiley
> ________
> 
> 
> 
> 
> -- 
> Nitin Pawar



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
   --  Keith Wiley




Query crawls through reducer

2013-03-22 Thread Keith Wiley
The following query translates into a many-map-single-reduce job (which is 
common) and also slags through the reduce stage...it's killing the overall 
query:

select * from a where b >= 'c' order by b desc limit 100

Note that b is a partition.  What component is making the reducer heavy?  Is it 
the order by or the limit (I'm sure it's not the partition-specific where 
clause, right?)?  Are there ways to improve its performance?
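The heavy component is the ORDER BY: a global sort has to funnel every row through a single ordered stream, hence the single reducer. One common mitigation (a generic sketch, not something prescribed in this thread) is to take the top N within each partition first and only merge those small partial results. In Python terms:

```python
# Sketch of the two-phase top-N pattern that avoids sorting everything in
# one place: per-partition top-N (parallelizable), then a cheap merge.
import heapq

def top_n(rows, n, key):
    """Keep only the n largest rows by key."""
    return heapq.nlargest(n, rows, key=key)

# Hypothetical per-reducer inputs; the names and data are illustrative.
partitions = [
    [("c", 3), ("d", 9), ("c", 1)],   # partition / reducer 1
    [("e", 7), ("c", 2), ("f", 8)],   # partition / reducer 2
]

n = 2
# Step 1: per-partition top-n (done in parallel in a real job).
partial = [top_n(p, n, key=lambda r: r[1]) for p in partitions]
# Step 2: merge the small partial lists and take the global top-n.
result = top_n([r for p in partial for r in p], n, key=lambda r: r[1])
print(result)  # → [('d', 9), ('f', 8)]
```

In Hive terms this corresponds to limiting within each reducer before the final global ordering, so the lone sorting reducer only ever sees a handful of rows per partition instead of the whole table.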

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
   --  Keith Wiley




Rename external table, including HDFS directory

2013-03-07 Thread Keith Wiley
My understanding is that renaming an external table doesn't rename the 
corresponding HDFS directory to match.  I would like to do this; I want to 
rename the table in Hive and also rename the HDFS directory so they match (they 
already match with the old name).  I'm not sure how to do this.  I'm worried 
that if I rename the HDFS directory through Hadoop, not Hive, that Hive will 
lose track of the directory, regardless of whether I rename the table in Hive 
as well.

I also don't want to drop and recreate the table because I don't want to notify 
Hive of all the partitions all over again.

Thoughts?

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




Re: Combine two overlapping schema?

2013-03-06 Thread Keith Wiley
Ah.  I was stuck on the requirement that the two schema match, but I see your 
point.  I'll see if that works.

On Mar 6, 2013, at 10:11 , Dean Wampler wrote:

> Of the top of my head, I think UNION ALL should work if you explicitly 
> project out the missing columns with NULL or other values, e.g. using nested 
> SELECTs, something like
> 
> SELECT * FROM (
>  SELECT a,b,c, Y, NULL AS Z FROM table1
>  UNION ALL
>  SELECT a,b,c, NULL AS Y, Z FROM table2
> ) table12; 
> 
> 
> 
> On Wed, Mar 6, 2013 at 12:03 PM, Keith Wiley  wrote:
> I have two tables which have overlapping but nonidentical schema.  I want to 
> create a new table that unions them, leaving nulls in any given row where a 
> column name doesn't occur in the other table:
> 
> SCHEMA 1: { a, b, c, Y }
> row:  { 1, 2, 3, 4 }
> 
> SCHEMA 2: { a, b, c, Z }
> row:  { 5, 6, 7, 8 }
> 
> NEW SCHEMA: { a, b, c, Y, Z }
> new row:  { a:1, b:2, c:3, Y:4, Z:null }
> new row:  { a:5, b:6, c:7, Y:null, Z:8 }
> 
> I don't think either "full outer join" or "union all" works.  I'm not sure 
> how to do this.  Any ideas?
> 
> Thanks.
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "You can scratch an itch, but you can't itch a scratch. Furthermore, an itch 
> can
> itch but a scratch can't scratch. Finally, a scratch can itch, but an itch 
> can't
> scratch. All together this implies: He scratched the itch from the scratch 
> that
> itched but would never itch the scratch from the itch that scratched."
>--  Keith Wiley
> ____
> 
> 
> 
> 
> -- 
> Dean Wampler, Ph.D.
> thinkbiganalytics.com
> +1-312-339-1330
> 



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Combine two overlapping schema?

2013-03-06 Thread Keith Wiley
I have two tables which have overlapping but nonidentical schema.  I want to 
create a new table that unions them, leaving nulls in any given row where a 
column name doesn't occur in the other table:

SCHEMA 1: { a, b, c, Y }
row:  { 1, 2, 3, 4 }

SCHEMA 2: { a, b, c, Z }
row:  { 5, 6, 7, 8 }

NEW SCHEMA: { a, b, c, Y, Z }
new row:  { a:1, b:2, c:3, Y:4, Z:null }
new row:  { a:5, b:6, c:7, Y:null, Z:8 }

I don't think either "full outer join" or "union all" works.  I'm not sure how 
to do this.  Any ideas?

Thanks.

________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
   --  Keith Wiley




Re: Can hive show external table filenames?

2013-01-31 Thread Keith Wiley
Thanks, that seems like a reasonable summary of my options.

Cheers!

On Jan 31, 2013, at 14:38 , Edward Capriolo wrote:

> You can execute dfs commands from inside hive
> 
> hive>dfs -ls
> 
> You can also use virtual columns inside a select
> 
> https://cwiki.apache.org/Hive/languagemanual-virtualcolumns.html
> 
> However there is no direct command in hive to list the files in a
> table/partition.
> 
> On Thu, Jan 31, 2013 at 4:55 PM, Keith Wiley  wrote:
>> Hmmm, that doesn't seem to list the files in the directory.
>> 
>> On Jan 31, 2013, at 12:12 , Edward Capriolo wrote:
>> 
>>> Try
>>> describe extended <table_name>
>>> or
>>> describe extended <table_name> (<partition_spec>)
>>> 
>>> the location is one of the properties
>>> 
>>> Edward
>>> 
>>> 
>>> On Thu, Jan 31, 2013 at 2:37 PM, Keith Wiley  wrote:
>>>> "show tables" shows the HDFS directories which correspond to Hive tables 
>>>> and various select queries show data from inside the files (each row of a 
>>>> file being a row of the overarching table that the entire directory 
>>>> represents)...or so I have convinced myself...but can Hive show the HDFS 
>>>> filenames?  Or is "hadoop fs -ls" the only way to do that?
>>>> 
>>>> Thanks.



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




Re: Can hive show external table filenames?

2013-01-31 Thread Keith Wiley
Hmmm, that doesn't seem to list the files in the directory.

On Jan 31, 2013, at 12:12 , Edward Capriolo wrote:

> Try
> describe extended <table_name>
> or
> describe extended <table_name> (<partition_spec>)
> 
> the location is one of the properties
> 
> Edward
> 
> 
> On Thu, Jan 31, 2013 at 2:37 PM, Keith Wiley  wrote:
>> "show tables" shows the HDFS directories which correspond to Hive tables and 
>> various select queries show data from inside the files (each row of a file 
>> being a row of the overarching table that the entire directory 
>> represents)...or so I have convinced myself...but can Hive show the HDFS 
>> filenames?  Or is "hadoop fs -ls" the only way to do that?
>> 
>> Thanks.



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
   --  Keith Wiley




Can hive show external table filenames?

2013-01-31 Thread Keith Wiley
"show tables" shows the HDFS directories which correspond to Hive tables and 
various select queries show data from inside the files (each row of a file 
being a row of the overarching table that the entire directory represents)...or 
so I have convinced myself...but can Hive show the HDFS filenames?  Or is 
"hadoop fs -ls" the only way to do that?

Thanks.

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
   --  Mark Twain




apache can't access hive

2012-11-12 Thread Keith Wiley
I have created a php-backed dynamic website which pulls data from hive (via 
exec) and presents the results through a web interface.  This was working 
fine...then I upgraded to CDH4 and now apache can't access hive anymore.  Hive 
commands work fine on the command line as a "real" user, but not through the 
web as apache.  Curiously, if I run hive as "sudo -u apache", it works.  It 
just doesn't work from php.  Remember, I had this all working a few weeks ago, 
I know the basic approach works.  It just that some aspect of the CDH4 upgrade 
has changed something, and I'm not sure what it could be.  The metastore_db 
directory's permissions are good to go (apache has read/write access to the 
directory, the derby.log, etc.).

I realize the following output is a bit eclectic, but if anyone has any 
thoughts on the error I would love some help.

Thanks.

Result: FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask
Status: 1
Output: Array
Output line: Logging initialized using configuration in 
file:/etc/hive/conf.dist/hive-log4j.properties
Output line: Hive history 
file=/tmp/apache/hive_job_log_apache_201211121323_696011741.txt
Output line: 2012-11-12 21:23:28.806 GMT Thread[main,5,main] 
java.io.FileNotFoundException: derby.log (Permission denied)
Output line: 
Output line: 2012-11-12 21:23:29.075 GMT:
Output line: Booting Derby version The Apache Software Foundation - Apache 
Derby - 10.4.2.0 - (689064): instance a816c00e-013a-f681-6fa8-025818d8
Output line: on database directory /var/lib/hive/metastore/metastore_db in READ 
ONLY mode
Output line:
Output line: Database Class Loader started - derby.database.classpath=''
Output line: FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: 
Cannot get a connection, pool error Could not create a validated object, cause: 
A read-only user or a user in a read-only database is not permitted to disable 
read-only mode on a connection.
Output line: NestedThrowables:
Output line: org.apache.commons.dbcp.SQLNestedException: Cannot get a 
connection, pool error Could not create a validated object, cause: A read-only 
user or a user in a read-only database is not permitted to disable read-only 
mode on a connection.
Output line: FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask
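When chasing this kind of "Permission denied" / read-only-database failure, a quick sanity check is to verify what the failing process can actually read and write. A generic sketch (the metastore path comes from the error output above; the check itself is not specific to Hive):

```python
# Report existence/readability/writability of the paths the error output
# complains about, as seen by whatever user runs this script.
import os

def check_access(path):
    return {
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }

for p in ["/var/lib/hive/metastore/metastore_db", os.getcwd()]:
    print(p, check_access(p))
```

Run as the apache user (e.g. via `sudo -u apache`), a False in the `writable` column for the metastore directory or for the directory where derby.log is created would explain both errors above.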

________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
   --  Keith Wiley




Re: Error in semantic analysis: Unable to fetch table

2012-10-18 Thread Keith Wiley
I never needed the -S.  I would just call it with -e, but yeah, that's the 
basic idea.  The output is gathered in the remaining optional arguments to 
exec().

On Oct 18, 2012, at 22:15 , MiaoMiao wrote:

> What php, you are using exec('hive -S -e "select * from some_table"')
> or something like that in php?
> 
> On Fri, Oct 19, 2012 at 4:46 AM, Keith Wiley  wrote:
>> If I run a select command on the command line with -e it works.  If I run 
>> the same command inside hive (without -e) it still works...but if I exec it 
>> with a -e from php (so obviously the apache user), I get "Error in semantic 
>> analysis: Unable to fetch table [tablename]".  Bear in mind that this 
>> overall system worked perfectly for weeks.  I haven't modified the code or 
>> anything.  What did change was I had to reinstall hadoop and hive and I'm 
>> sure some configuration parameter is not correct now.  I found some Google 
>> references to this error but no clear-cut solutions.
>> 
>> What does this error mean and how do I fix it?
>> 
>> Thanks.



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




Re: Error in semantic analysis: Unable to fetch table

2012-10-18 Thread Keith Wiley
Followup: If I run "show tables" then while I get a list from the command line, 
I get nothing from php.  Furthermore, if I attempt to select from a bogus 
tablename on the command line, I get that same error, "semantic error" and 
whatnot...so it looks like the real problem is that when apache/php runs hive, 
it doesn't see the table list.

Like I said, this used to work...so I'm sure I need to get something configured 
properly, I just don't remember how.  I do see that I am defining HIVE_HOME in 
my .bashrc...Do I need to do something like this for the apache account or as a 
HIVE parameter?

________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
       --  Keith Wiley




Error in semantic analysis: Unable to fetch table

2012-10-18 Thread Keith Wiley
If I run a select command on the command line with -e it works.  If I run the 
same command inside hive (without -e) it still works...but if I exec it with a 
-e from php (so obviously the apache user), I get "Error in semantic analysis: 
Unable to fetch table [tablename]".  Bear in mind that this overall system 
worked perfectly for weeks.  I haven't modified the code or anything.  What did 
change was I had to reinstall hadoop and hive and I'm sure some configuration 
parameter is not correct now.  I found some Google references to this error but 
no clear-cut solutions.

What does this error mean and how do I fix it?

Thanks.

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
   --  Mark Twain




Increase max tablename length?

2012-08-08 Thread Keith Wiley
The default seems to be 128.  Can it be increased?  I haven't found a 
configuration parameter for that yet.


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




Re: drop table: no err on nonexistent table?

2012-08-06 Thread Keith Wiley
Oye, got it.  Sorry.

RTFM: hive.exec.drop.ignorenonexistent
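
For anyone preferring the inspect-the-table-list workaround over toggling that setting, a minimal Python sketch (assuming `show tables` prints one table name per line, and that Hive table names compare case-insensitively):

```python
def table_exists(show_tables_output, name):
    """Check a table name against the raw stdout of `hive -e "show tables"`.

    `show_tables_output` is one table name per line; the comparison is
    case-insensitive to match Hive's identifier handling.
    """
    tables = {line.strip().lower() for line in show_tables_output.splitlines()}
    return name.strip().lower() in tables
```

With hive.exec.drop.ignorenonexistent=false, DROP TABLE on a missing table raises an error instead, which makes this pre-check unnecessary.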

On Aug 6, 2012, at 11:06 , Keith Wiley wrote:

> I'm wrapping hive in a web tool and would like to do some basic 
> error-checking.  If an attempt is made to drop a table that doesn't exist, I 
> would like to show an error message.  The problem is, hive doesn't seem to 
> produce any sort of error when dropping a table that doesn't exist.  
> Furthermore, if hive is run on the command line (-e) it doesn't return an 
> error status.  Therefore, I think the only way to detect such an error is to 
> actually inspect the table list (show tables) and manually see if the drop 
> attempt will find a corresponding table.
> 
> Can I get an error of some sort on this or is it not possible?
> 
> Thank you.
> 
> ____
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
>   --  Galileo Galilei
> ____
> 



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




drop table: no err on nonexistent table?

2012-08-06 Thread Keith Wiley
I'm wrapping hive in a web tool and would like to do some basic error-checking. 
 If an attempt is made to drop a table that doesn't exist, I would like to show 
an error message.  The problem is, hive doesn't seem to produce any sort of 
error when dropping a table that doesn't exist.  Furthermore, if hive is run 
on the command line (-e) it doesn't return an error status.  Therefore, I think 
the only way to detect such an error is to actually inspect the table list 
(show tables) and manually see if the drop attempt will find a corresponding 
table.

Can I get an error of some sort on this or is it not possible?

Thank you.

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




HQL vs. HiveQL

2012-07-03 Thread Keith Wiley
I have found references to both HQL and HiveQL (as abbreviations for Hive Query 
Language) as the name of the SQL-like language Hive uses.  Is there a formal 
and definitively correct name in this regard or are both considered equally 
appropriate?  I'm writing up a white paper and want to use the right term.

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




Re: LIKE Statement

2012-05-14 Thread Keith Wiley
Thanks for the followup.


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
   --  Abe (Grandpa) Simpson




Re: LIKE Statement

2012-05-14 Thread Keith Wiley
On Apr 4, 2012, at 06:40 , John Omernik wrote:

> I think the like statement should be changed to be case-insensitive to match 
> its function in other DBMSs.  Thoughts?

Out of curiosity, was there any activity on this issue?  I see John's original 
post in the archives (~5wks ago) with no followups.  I'm generally curious 
about HiveQL's similarities and differences w.r.t. other SQL environments, so 
questions like this are of interest to me as well.

Cheers!

________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Re: Hive performance vs. SQL?

2012-03-19 Thread Keith Wiley
Thanks for the response.

Cheers!

On Mar 19, 2012, at 16:42 , Maxime Brugidou wrote:

> From my experience, if you can fit data in a SQL database without sharding or 
> anything, don't ever think twice. Hive is not even comparable.


____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
       --  Keith Wiley




Hive performance vs. SQL?

2012-03-19 Thread Keith Wiley
I haven't had an opportunity to set up a huge Hive database yet because 
exporting csv files from our SQL database is, in itself, a rather laborious 
task.  I was just curious how I might expect Hive to perform vs. SQL on large 
databases and large queries?  I realize Hive is pretty "latent" since it builds 
and runs MapReduce jobs for even the simplest queries, but that is precisely 
why I think it might perform better on long queries against large (external 
CSV) databases.

Would you expect Hive to ever outperform SQL on a single machine (standalone or 
pseudo-distributed mode)?  I am entirely open to the possibility that the 
answer is no, that Hive could never compete with SQL in a single machine.  Is 
this true?

If so, how large (how parallel) do you think the underlying Hadoop cluster 
needs to be before Hive overtakes SQL?  2X?  10X?  Where is the crossover point 
where Hive actually outperforms SQL?

Along similar lines, might Hive never outperform SQL on a database small enough 
for SQL to run on a single machine, say 10s to 100s of GBs?  Must the database 
itself be so large that SQL is effectively crippled and the data must be 
distributed before Hive offers significant gains?

I am really just trying to get a basic feel for how I might anticipate Hive's 
behavior vs. SQL once I get a large system up and running.

Thanks.

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
   --  Abe (Grandpa) Simpson




Re: csv boolean type

2012-03-13 Thread Keith Wiley
I obviously intended '1', not '0' as an example of a true value.

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
   --  Mark Twain




csv boolean type

2012-03-13 Thread Keith Wiley
What string values in a csv field are parsable by Hive as booleans?  If I 
indicate that a column is of type boolean when wrapping an external table 
around a csv file, what are the legal values?  I can imagine numerous 
possibilities, for example (for the true values):
0
t
T
true
True
TRUE
y
Y
yes
Yes
YES

etc.
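
For what it's worth, my understanding is that Hive's default text SerDe parses only the case-insensitive strings TRUE and FALSE into a BOOLEAN column; other spellings come back as NULL. Under that assumption, one can normalize CSV values before loading (the truthy/falsy sets below are guesses about source-data conventions, not Hive behavior):

```python
# Common CSV truth spellings (assumed source conventions, not Hive's).
TRUTHY = {"1", "t", "true", "y", "yes"}
FALSY = {"0", "f", "false", "n", "no"}

def normalize_bool(field):
    """Map common CSV truth spellings onto Hive-parsable true/false.

    Returns the empty string (read back as NULL) for unrecognized values.
    """
    v = field.strip().lower()
    if v in TRUTHY:
        return "true"
    if v in FALSY:
        return "false"
    return ""
```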


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
   --  Abe (Grandpa) Simpson




Re: order by having no effect?!

2012-03-13 Thread Keith Wiley

On Mar 13, 2012, at 13:57 , Igor Tatarinov wrote:

> You have attributevalue in quotes which makes it a constant literal.
> 
> igor
> decide.com

Argh!  You are correct good sir!



____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




order by having no effect?!

2012-03-13 Thread Keith Wiley
Um, this is weird.  It simply isn't modifying the order of the returned rows at 
all.  I get the same result with no 'order by' clause as with one.  Adding a 
limit or specifying 'asc' has no effect.  Using 'sort by' also has no effect.  
The column used for ordering is type INT.  In the example below, I was hoping 
to "sort" or "order" the results according to the third column.  Like I said, I 
also tried adding 'limit 10' and/or 'asc' to the end of the query; they had no 
effect.  The jobtracker shows a single mapper and a single reducer being used 
for the job incidentally, so it can't be some sort of multi-reducer sort 
discrepancy.

hive> describe stringmap;
OK
objecttypecode  int 
attributename   string  
attributevalue  int 
langid  int 
value   string  
displayorderint 
Time taken: 0.074 seconds
hive>


hive> select * from stringmap where attributename='foo' order by 
'attributevalue'; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201202221500_0114, Tracking URL = 
http://localhost:50030/jobdetails.jsp?jobid=job_201202221500_0114
Kill Command = /media/sdb1/kwiley/hadoop/hadoop-0.20.2-cdh3u3/bin/hadoop job  
-Dmapred.job.tracker=localhost:9001 -kill job_201202221500_0114
2012-03-13 14:45:28,418 Stage-1 map = 0%,  reduce = 0%
2012-03-13 14:45:31,428 Stage-1 map = 100%,  reduce = 0%
2012-03-13 14:45:39,459 Stage-1 map = 100%,  reduce = 33%
2012-03-13 14:45:40,463 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202221500_0114
OK
3   foo 6   10336 - aaa 6
3   foo 3   10333 - bbb 3
3   foo 4   10334 - ccc 4
4   foo 1   1033Default Value   1
3   foo 2   10332 - ddd 2
3   foo 1   10331 - eee 1
3   foo 5   10335 - fff 5
Time taken: 17.954 seconds
hive>
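
As the follow-up in this thread points out, quoting the column name turns it into a string literal, and ordering every row by the same constant is a no-op. The effect can be reproduced in miniature in Python (a stable sort with a constant key returns the input order unchanged):

```python
# Rows as (objecttypecode, attributename, attributevalue):
rows = [(3, "foo", 6), (3, "foo", 3), (4, "foo", 1)]

# ORDER BY 'attributevalue' (quoted) keys every row on the same string
# constant; since the sort is stable, the original order is preserved.
by_literal = sorted(rows, key=lambda r: 'attributevalue')

# ORDER BY attributevalue (unquoted) keys on the column's actual value.
by_column = sorted(rows, key=lambda r: r[2])
```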


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




Re: order by date

2012-03-13 Thread Keith Wiley
Thanks for all the feedback on this.  I'll take a look at all the suggestions I 
received.

Cheers!


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




Re: order by date

2012-03-13 Thread Keith Wiley
Okay, what are the differences and similarities with using a date function for 
the same purpose?  Is there a reason to go one route or the other?

On Mar 13, 2012, at 11:29 , Mark Grover wrote:

> Hi Keith,
> You should also consider writing you own UDF that takes in the date in 
> "American" format and spits out a lexicographical string.
> That way you don't have to modify your base data, just use this newly created 
> from_american_date(String date) UDF to get your new date string in
> 
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation 
> 
> www: oanda.com www: fxtrade.com 
> e: mgro...@oanda.com 


____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
   --  Keith Wiley




Re: order by date

2012-03-13 Thread Keith Wiley
I see how I misled you, sorry.  I wasn't implying that my csv data is cleanly 
represented in yyyy-MM-dd format.  I was just asking syntactically how to use 
date functions in HiveQL because I hadn't found any examples, and I used 
yyyy-MM-dd in my example.  The dates in my csv tables are often in "American" 
format, month first without leading zeroes, e.g., "3/31/2012 7:00".  The lack 
of leading zeroes and the unsortable date format make the dates difficult to 
work with.  I was thinking I could use the date functions with some other 
format to sort them (I guess it would be "M/d/yyyy h:mm" or something like 
that).

I admit, I didn't actually go to the trouble of providing the correct pattern 
string in my earlier post, I was focused on the HiveQL syntax in that post, not 
the precise date pattern given to the date function.

So yeah, I'm still trying to determine the best way to sort queries against the 
date-time columns.  One option is to read/write the entire tables with a date 
conversion to a lexicographic format.  Another option -- my original question 
in this thread -- was how I might use hive's date functions at the time a query 
is performed.

What do you think is the best way to deal with this?

Thanks.
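
A minimal sketch of the read/write pre-conversion option for "American" dates like the sample above (the strptime pattern is inferred from that single example; adjust it to the actual data):

```python
from datetime import datetime

def american_to_iso(value, fmt="%m/%d/%Y %H:%M"):
    """Convert e.g. '3/31/2012 7:00' to '2012-03-31 07:00:00', which sorts
    correctly as a plain string despite the missing leading zeroes in the
    input (strptime accepts non-zero-padded month/day/hour)."""
    return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d %H:%M:%S")
```

The query-time alternative would be something like `unix_timestamp(col, 'M/d/yyyy H:mm')` in the ORDER BY clause, which should in principle give the same ordering without rewriting the data.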

On Mar 13, 2012, at 10:35 , Tucker, Matt wrote:

> I'm a bit confused.  It sounds like you're already storing your dates as 
> strings in a 'yyyy-mm-dd' format.  In that case, you can just sort by 
> dateColName.  There's no issue with using UNIX_TIMESTAMP() in the order by 
> clause, as it outputs integer values.
> 
> Most of the date functions in hive take arguments in string format, with a 
> few functions that will translate between unix timestamps and datetime 
> strings.
> 
> Matt Tucker



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
   --  Galileo Galilei




Re: non-equality joins

2012-03-13 Thread Keith Wiley
Sounds like Matt possesses the proper combination of expertise in both 
databases and MapReduce to assist you.  I'm bowing out as I honestly don't know 
advanced database concepts at all.  In addition, hive offers hive-specific 
tools like Matt suggested (map-side joins) to help out, which I'm too new 
to speculate on.  I'm just starting hive this week as a matter of fact.

The short answer on MapReduce algorithms is that the individual computational 
units can't communicate with each other (each mapper or each map() in fact 
cannot communicate with the others, likewise for reducers).  That's one of the 
major distinctions between MapReduce and more general parallel processing 
frameworks like MPI.  This is the wrong mailing list to go much deeper than 
that however.

Thanks Matt.

Best of luck Mahsa.

On Mar 13, 2012, at 10:13 , Tucker, Matt wrote:

> For theta joins, you’ll have to convert the query to an equi-join, and then 
> filter for non-equality in the WHERE clause.  Depending upon the size of each 
> table, you might consider looking at map-side joins, which will allow for 
> doing non-equality filters during a join before it’s passed to the reducers.
>  
> Matt Tucker
>  
> From: mahsa mofidpoor [mailto:mofidp...@gmail.com] 
> Sent: Tuesday, March 13, 2012 1:02 PM
> To: user@hive.apache.org
> Subject: Re: non-equality joins
>  
>  
> Hi Keith,
>  
> Do you know exactly how an algorithm should be in order to fit in the 
> MapReduce framework? Could you refer me to some references?
>  
> Thanks and Regards,
> Mahsa
>  


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Luminous beings are we, not this crude matter."
   --  Yoda




How to import extremely "wide" csv tables

2012-03-13 Thread Keith Wiley
Wrapping hive around existing csv files consists of manually naming and typing 
every column during the creation command.  I have several csv tables and some 
of them have a ton of columns.  I would love a way to create hive tables which 
automatically infers the column types by attempting various type conversions or 
regex matches on the data (say the first row).  What would be even cooler is if 
the first row could actually be interpreted differently from the rest of the 
table...as a set of string labels to name the columns while the types could be 
automatically inferred from, say, the *second* row.  These csv files are 
currently of this format, with the first row naming the columns.

Does this make sense?

Now, I'm sure that hive doesn't support this yet -- and I admit it is a 
somewhat esoteric desire on my part -- but I'm curious how others would suggest 
approaching it?  I'm thinking of writing a separate isolated program that reads 
the first two rows of a csv file and dumps a text string of column names and 
types in the correct syntax for a hive external table creation statement which 
I would then copy/paste into hive...I was just hoping for a simpler solution.

Thoughts?

Thanks.
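
The "separate isolated program" described above can be quite small. A hedged sketch follows: types are guessed from a single data row, the table name and HDFS location are placeholders, and the header row itself would still be read as data unless stripped first (or skipped via `skip.header.line.count`, available in later Hive versions):

```python
import csv

def guess_type(value):
    """Very rough Hive type inference from one sample value."""
    for cast, hive_type in ((int, "INT"), (float, "DOUBLE")):
        try:
            cast(value)
            return hive_type
        except ValueError:
            pass
    return "STRING"

def ddl_from_csv(path, table, location):
    """Read the header and first data row of a CSV and emit a
    CREATE EXTERNAL TABLE statement to paste into hive."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        names = next(reader)     # first row: column names
        sample = next(reader)    # second row: used for type inference
    cols = ", ".join(f"{n} {guess_type(v)}" for n, v in zip(names, sample))
    return (f"CREATE EXTERNAL TABLE {table} ({cols}) "
            f"ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
            f"STORED AS TEXTFILE LOCATION '{location}';")
```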

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
       --  Keith Wiley




Re: order by date

2012-03-13 Thread Keith Wiley
I see, you store the date-time as a lexicographically sortable string.  That's 
fine, but I'm operating on existing csv tables.  I guess I could whip up a 
hadoop job to convert all the date-time columns to lexicographic strings and 
then wrap hive around the resulting converted tables.  I was just wondering if 
there was a more direct approach, whether I could apply hive to the original 
csv tables.

*Would* the date functions work in the way I suggested in my first post or do 
you think a lexicographic conversion is practically required to perform 
date-ordering on query results?

On Mar 13, 2012, at 09:49 , Tucker, Matt wrote:

> Hi Keith,
> 
> We generally store date columns as a string in a similar format to ISO 8601 
> (yyyy-mm-dd hh:MM:ss).  This way, when we put the date column in the ORDER BY 
> clause, it will be sorted chronologically.  It also saves us the trouble of 
> whipping out a unix timestamp calculator to figure out what we're looking at.
> 
> There is supposed to be a TIMESTAMP data type in Hive 0.8, but I haven't 
> found any documentation on it yet.
> 
> Matt Tucker
> 
> -Original Message-
> From: Keith Wiley [mailto:kwi...@keithwiley.com] 
> Sent: Tuesday, March 13, 2012 12:45 PM
> To: user@hive.apache.org
> Subject: order by date
> 
> I realize that hive doesn't have a date type for the columns and I realize 
> that hive *does* have various date functions.  I just haven't found a 
> concrete example of how these two issues are brought together.  Ordering the 
> results of a sql query is done by adding "order by dateColName" to the query. 
>  Are the date functions supposed to be used in the form "order by 
> UNIX_TIMESTAMP(dateColName, 'yyyy-MM-dd')"?  Does the function just go right 
> into the "order by" clause like that or is that totally wrong?
> 
> How does one order query results by a date column?  Am I on the right track?
> 
> Thanks.



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




Re: non-equality joins

2012-03-13 Thread Keith Wiley
https://cwiki.apache.org/Hive/languagemanual-joins.html

"Hive does not support join conditions that are not equality conditions as it 
is very difficult to express such conditions as a map/reduce job."

I admit, that isn't a very detailed answer, but it gives some indication of the 
reason for the discrepancy between Hive and other databases.  Hive 
fundamentally operates on Hadoop, namely on MapReduce (we all know this, I'm 
just reiterating the train of thought).  The problem is that certain algorithms 
are exceedingly difficult to wedge into the MapReduce framework.

That is as detailed as my personal insight can get.  I've done a lot of 
MapReduce programming in Hadoop but I'm not a database expert and I don't 
really understand the steps involved in various kinds of table-joins, so I 
don't understand the particular ways in which certain database operations do or 
do not fit into MapReduce...but presumably nonequality joins (whatever those 
are :-D ) are particularly difficult to MapReduceify.

Cheers!

On Mar 13, 2012, at 09:17 , mahsa mofidpoor wrote:

> Hello,
> 
> Is there a reason behind not implementing non-equality joins in Hive? In 
> other words, is there any usage for theta-join, if implemented?
> 
> Thank you in advance for your response,
> Mahsa 


____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
   --  Keith Wiley




order by date

2012-03-13 Thread Keith Wiley
I realize that hive doesn't have a date type for the columns and I realize that 
hive *does* have various date functions.  I just haven't found a concrete 
example of how these two issues are brought together.  Ordering the results of 
a sql query is done by adding "order by dateColName" to the query.  Are the 
date functions supposed to be used in the form "order by 
UNIX_TIMESTAMP(dateColName, 'yyyy-MM-dd')"?  Does the function just go right 
into the "order by" clause like that or is that totally wrong?

How does one order query results by a date column?  Am I on the right track?

Thanks.

________
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
   --  Keith Wiley




External table point of clarification

2012-03-12 Thread Keith Wiley
It appears as if the syntax for creating an external table takes a column 
description and a directory where the table (a .csv file) should be found.  
Notably, it doesn't take the path to the file, rather it takes the path to the 
directory containing the file.

...which means that if I have multiple .csv files, each a unique table with a 
unique column description, it is not sufficient to put them in one hdfs 
directory, since the columns must be described when creating the hive table.  
Rather, each .csv file conforming to a unique column description must be placed 
in its own separate hdfs directory, right?


Just clarifying...is this correct?  Can I not put a set of .csv files 
comprising multiple tables of a heterogeneous database in a single directory and 
then have hive build a set of tables around those files?

Thank you.
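
Given that LOCATION names a directory whose files are all read as the table's data, a one-directory-per-csv layout follows. A hypothetical local staging helper (stage the tree locally, then push it to HDFS with `hdfs dfs -put`; the naming convention is an assumption, not a Hive requirement):

```python
import os
import shutil

def stage_for_hive(csv_paths, staging_root):
    """Give each CSV its own subdirectory named after the file stem,
    matching the one-directory-per-external-table convention.
    Returns a dict of table name -> staged directory."""
    table_dirs = {}
    for path in csv_paths:
        table = os.path.splitext(os.path.basename(path))[0].lower()
        dest = os.path.join(staging_root, table)
        os.makedirs(dest, exist_ok=True)
        shutil.copy(path, dest)
        table_dirs[table] = dest
    return table_dirs
```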


Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
       --  Keith Wiley




Re: Basic statement problems

2012-03-12 Thread Keith Wiley
It has started working now.  I don't know what I changed.  I dropped every 
single table from hive, explicitly created a new directory on HDFS and moved 
the .csv file to that directory, ran hive again and created the table.  This 
time it worked.  I can perform queries against the directory.  Maybe hadoop and 
hive confused each other about the other directory (it got corrupted or 
something)...or maybe I screwed something up, I dunno.

I would have expected better error-detection.  Instead of simply returning 
0-length queries, it would be nice if hive would actually produce an error 
message if I create the table in an invalid or incorrect fashion...but perhaps 
it couldn't tell; maybe the database just looked empty at hive's level of 
abstraction.

I mean, even if I did screw something up (an option I am entirely open to), I 
never really got an error about it.  Hive gladly wrapped a "hive" table around 
the directory, and the csv file in question, without an error of any kind.  
Hive could see the table, it listed and could describe it, but would then 
return empty queries against the table.  When I moved the exact same csv file 
to a brand new HDFS directory and tried again from scratch, everything started 
working.  I probably did something wrong, but some sort of error message would 
have been very helpful.

Anyway, these tools are still pretty young.  I understand that they will 
continue to evolve.  The ability to detect and report errors will almost 
certainly improve with time.


Thanks for the concerted efforts to help.

Cheers!

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
       --  Keith Wiley




Re: Basic statement problems

2012-03-12 Thread Keith Wiley
On Mar 9, 2012, at 16:46 , Steven Wong wrote:

> The LOCATION clause has to specify the directory that contains (only) your 
> data files.

I've tried it both ways:

CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, 
AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/crm/records/all/hiveTest';

CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, 
AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/crm/records/all/hiveTest/StringMap.csv';

In both cases, "show tables" lists stringmap and "describe stringmap" describes 
the columns shown above, but a basic query doesn't return any results.

What else should I try here?

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
   --  Abe (Grandpa) Simpson




Re: Error

2012-03-12 Thread Keith Wiley
Thanks for the suggestion, but my hadoop directory doesn't have a build 
directory at all.  I suppose that makes sense.  I haven't tried to rebuild 
hadoop since I installed it (CDH3).  I'm just using the executables as they 
came in the original .zip.

Gah, hive sorta works but I can't get this last bit.  It creates tables around 
an existing csv file, it can describe them, but queries against those tables 
return empty results.

On Mar 10, 2012, at 01:12 , Jagat wrote:

> Dear Keith
> 
> Please delete $HADOOP_HOME/build , in your hadoop home the build directory
> 
> And try again
> 
> Thanks
> 
> Jagat 
> 
> 
> 
> On Sat, Mar 10, 2012 at 5:07 AM, Keith Wiley  wrote:
> Considering that I don't even know what the metastore is, I doubt I did 
> anything specifically aside from the instructions.  All I did was follow the 
> startup guide.  Nothing more.  More to the point, when I followed the same 
> instructions on a Linux box, it worked.  Something is admittedly wrong with 
> my Mac setup, but I don't think there is any third-party metastore of some 
> sort.  I certainly didn't install or create anything like that.  I'm not even 
> sure what it is.
> 
> Hmmm, how do I resetup or initialize the metastore from scratch and try again?



Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




Re: Error

2012-03-10 Thread Keith Wiley
Thanks, I'll give it a shot on Monday.

On Mar 10, 2012, at 01:12 , Jagat wrote:

> Dear Keith
> 
> Please delete $HADOOP_HOME/build , in your hadoop home the build directory
> 
> And try again
> 
> Thanks
> 
> Jagat 


____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
   --  Mark Twain




Re: Basic statement problems

2012-03-09 Thread Keith Wiley
So a directory, not a specific file.  I thought I tried it both ways, but I'll 
switch it back the other way and try again.

Thanks.

On Mar 9, 2012, at 16:46 , Steven Wong wrote:

> The LOCATION clause has to specify the directory that contains (only) your 
> data files.


____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
   --  Homer Simpson




0.7.1 vs. 0.8.1

2012-03-09 Thread Keith Wiley
The only version feature list I have found so far is the JIRA logs, which are a 
little difficult to make sense of since they get pretty detailed.  I'm curious 
from a straight-forward high-level "sales-pitch" perspective what advantages 
0.8.1 offers over 0.7.1.  I ask because I generally use CDH3 and it provides 
0.7.1.  I'm curious if I should use a more recent version of Hive with the 
version of Hadoop in CDH3 (0.20.2).

Thanks.

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




Basic statement problems

2012-03-09 Thread Keith Wiley
I successfully installed and used Hive to create basic tables (on one of my two 
machines; another discussion describes the problems I'm having with the other 
machine).  However, basic queries aren't working.  I brought a typical CSV file 
into a Hive table and it seemed fine.  Here's how I did it:

CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, 
AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/crm/records/all/hiveTest/StringMap.csv';

"show tables" and "describe stringmap" return correct results.  However, if I 
run a really simple query, it returns incorrect results.  For example, a row 
count query returns a 0.  Observe:

hive> select count(*) from stringmap;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201202221500_0103, Tracking URL = 
http://localhost:50030/jobdetails.jsp?jobid=job_201202221500_0103
Kill Command = /media/sdb1/kwiley/hadoop/hadoop-0.20.2-cdh3u3/bin/hadoop job  
-Dmapred.job.tracker=localhost:9001 -kill job_201202221500_0103
2012-03-09 16:20:26,243 Stage-1 map = 0%,  reduce = 0%
2012-03-09 16:20:29,258 Stage-1 map = 0%,  reduce = 100%
2012-03-09 16:20:32,278 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202221500_0103
OK
0
Time taken: 15.969 seconds
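A zero count from a job with zero mappers is consistent with the LOCATION problem: when an external table's LOCATION names a file rather than a directory, Hive may find no data to scan. One fix, sketched with the paths from this post and `hadoop fs` syntax assumed for this Hadoop version, is to give the CSV its own directory:

```shell
# Hypothetical repair: move the lone CSV into a directory of its own,
# then point the table's LOCATION at that directory instead of the file.
hadoop fs -mkdir /crm/records/all/hiveTest/stringmap
hadoop fs -mv /crm/records/all/hiveTest/StringMap.csv \
              /crm/records/all/hiveTest/stringmap/
```

After that, either `ALTER TABLE stringmap SET LOCATION '/crm/records/all/hiveTest/stringmap'` (if this Hive version supports it) or simply drop and re-create the external table against the new directory.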

The Hadoop job runs without error, but it returns 0.  The job tracker 
indicates that Hive created a job with 0 mappers and 1 reducer.  I don't see 
any useful output in the reducer task log, however.  The following is from the 
hive log:

2012-03-09 16:20:17,057 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it 
cannot be resolved.
2012-03-09 16:20:17,057 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it 
cannot be resolved.
2012-03-09 16:20:17,059 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it 
cannot be resolved.
2012-03-09 16:20:17,059 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it 
cannot be resolved.
2012-03-09 16:20:17,060 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be 
resolved.
2012-03-09 16:20:17,060 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be 
resolved.
2012-03-09 16:20:22,781 WARN  mapred.JobClient 
(JobClient.java:copyAndConfigureFiles(649)) - Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.
2012-03-09 16:20:22,946 WARN  snappy.LoadSnappy (LoadSnappy.java:<clinit>(36)) 
- Snappy native library is available

I admit, that does look rather erroneous, but I'm not sure what to make of it.  
I looked those errors up online but didn't find much that seemed to suggest a 
cause or solution.

Any ideas?


____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
   --  Keith Wiley




Re: Error

2012-03-09 Thread Keith Wiley
Considering that I don't even know what the metastore is, I doubt I did 
anything specifically aside from the instructions.  All I did was follow the 
startup guide.  Nothing more.  More to the point, when I followed the same 
instructions on a Linux box, it worked.  Something is admittedly wrong with my 
Mac setup, but I don't think there is any third-party metastore of some sort.  
I certainly didn't install or create anything like that.  I'm not even sure 
what it is.

Hmmm, how do I re-set up or initialize the metastore from scratch and try again?
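One way to start over, assuming the default embedded Derby metastore described in the reply below (this wipes all table metadata, so it is only sensible on a scratch installation):

```shell
# Assumes the default embedded Derby metastore, which Hive keeps in a
# metastore_db/ directory (plus a derby.log file) under whatever directory
# the CLI was launched from.  Removing both deletes ALL table metadata;
# Hive builds a fresh, empty metastore the next time it starts.
rm -rf metastore_db derby.log
```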

On Mar 9, 2012, at 13:21 , Mark Grover wrote:

> Hi Keith,
> This error is typically encountered when your metastore (that stores the 
> metadata related to Hive tables e.g. name of columns, name of tables, etc. 
> etc.) is not correctly set up.
> 
> Have you set up an external metastore? By default, Hive should use an 
> embedded Derby metastore which only allows one connection at a time.
> 
> Mark
> 
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation 
> 
> www: oanda.com www: fxtrade.com 
> 
> "Best Trading Platform" - World Finance's Forex Awards 2009. 
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009. 
> 
> 
> - Original Message -
> From: "Keith Wiley" 
> To: user@hive.apache.org
> Sent: Friday, March 9, 2012 3:02:40 PM
> Subject: Error
> 
> I'm quite comfortable with Hadoop and the associated lingo; I've been 
> programming it via Java and via C++ streams for several years.  However, I 
> have just started Hive for the first time...and I'm stuck.  I was following 
> the "getting started" page on the apache site.  I got as far as this:
> 
> hive> CREATE TABLE pokes (foo INT, bar STRING);
> FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected 
> exception caught.
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask
> 
> If I look at the logs, I see the following.  I'm not sure how to proceed from 
> here.  Any help is appreciated.
> 
> Thanks.
> 
> $ cat /tmp/keithw/hive.log
> 2012-03-09 12:02:22,087 ERROR exec.DDLTask 
> (SessionState.java:printError(365)) - FAILED: Error in metadata: 
> javax.jdo.JDOFatalInternalException: Unexpected exception caught.
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> javax.jdo.JDOFatalInternalException: Unexpected exception caught.
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:476)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3176)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:213)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
>   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
> Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught.
> NestedThrowables:
> java.lang.reflect.InvocationTargetException
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1186)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:234)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:261)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:196)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:171)
>   at 
> org.apache.hadoop.util.Reflection

Error

2012-03-09 Thread Keith Wiley
at 
org.datanucleus.plugin.NonManagedPluginRegistry.registerBundle(NonManagedPluginRegistry.java:443)
at 
org.datanucleus.plugin.NonManagedPluginRegistry.registerBundle(NonManagedPluginRegistry.java:355)
at 
org.datanucleus.plugin.NonManagedPluginRegistry.registerExtensions(NonManagedPluginRegistry.java:215)
at 
org.datanucleus.plugin.NonManagedPluginRegistry.registerExtensionPoints(NonManagedPluginRegistry.java:156)
at 
org.datanucleus.plugin.PluginManager.registerExtensionPoints(PluginManager.java:82)
at org.datanucleus.OMFContext.<init>(OMFContext.java:156)
at org.datanucleus.OMFContext.<init>(OMFContext.java:137)
at 
org.datanucleus.ObjectManagerFactoryImpl.initialiseOMFContext(ObjectManagerFactoryImpl.java:132)
at 
org.datanucleus.jdo.JDOPersistenceManagerFactory.initialiseProperties(JDOPersistenceManagerFactory.java:363)
at 
org.datanucleus.jdo.JDOPersistenceManagerFactory.<init>(JDOPersistenceManagerFactory.java:307)
at 
org.datanucleus.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:255)
at 
org.datanucleus.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:182)
... 40 more

2012-03-09 12:02:22,090 ERROR ql.Driver (SessionState.java:printError(365)) - 
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask

____
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
   --  Mark Twain