Re: Why the filter push down does not reduce the read data record count

2018-02-24 Thread Sun, Keith
Hi,


My env: Hive 1.2.1 and Parquet 1.8.1


Per my search of the Hive and Parquet source code for version 1.8.1, I did not 
see the parameters from those slides, but I found these here:

https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java


For Hive (also confirmed in my test), the row-group filter is applied 
automatically; see ParquetRecordReaderWrapper:

if (filter != null) {
  filtedBlocks = RowGroupFilter.filterRowGroups(filter, splitGroup,
      fileMetaData.getSchema());
  if (filtedBlocks.isEmpty()) {
    LOG.debug("All row groups are dropped due to filter predicates");
    return null;
  }
  ...
}
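
For reference, a sketch of the Hive session settings usually involved in getting 
the predicate handed down to the Parquet reader at all (both properties exist in 
Hive 1.2.1, but defaults differ between versions, so treat this as something to 
double-check rather than a recipe):

set hive.optimize.ppd=true;            -- predicate pushdown in the optimizer
set hive.optimize.index.filter=true;   -- pass the pushed-down predicate to the storage format reader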


I will dig into more details of Parquet and do some tests later.


Thanks,

Keith


From: Furcy Pin <pin.fu...@gmail.com>
Sent: Friday, February 23, 2018 8:03:34 AM
To: user@hive.apache.org
Subject: Re: Why the filter push down does not reduce the read data record count

And if you come across comprehensive documentation of Parquet configuration, 
please share it!!!

The Parquet documentation says that it can be configured but doesn't explain 
how: http://parquet.apache.org/documentation/latest/
and apparently, both Tajo 
(http://tajo.apache.org/docs/0.8.0/table_management/parquet.html) and Drill 
(https://drill.apache.org/docs/parquet-format/) seem to have some configuration 
parameters for Parquet.
If Hive has configuration parameters for Parquet too, I couldn't find them 
documented anywhere.



On 23 February 2018 at 16:48, Sun, Keith <ai...@ebay.com> wrote:

I got your point and thanks for the nice slides info.


So the Parquet filter is not an easy thing, and I will try it according to the 
deck.


Thanks !


From: Furcy Pin <pin.fu...@gmail.com>
Sent: Friday, February 23, 2018 3:37:52 AM
To: user@hive.apache.org
Subject: Re: Why the filter push down does not reduce the read data record count

Hi,

Unless your table is partitioned or bucketed by myid, Hive generally has to 
read through all the records to find those that match your predicate.

In other words, Hive tables are generally not indexed for single-record 
retrieval the way you would expect RDBMS or Vertica tables to be.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding bloom 
filters on specific columns of a table 
(https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/), 
which could work as a kind of index.
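
(For example, with ORC the bloom filter is declared when the table is created; a 
minimal sketch with made-up table and column names, using the standard 
orc.bloom.filter.* table properties:)

CREATE TABLE events_orc (myid BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="myid",
               "orc.bloom.filter.fpp"="0.05");
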
Also, depending on the query engine you are using (Hive, Spark-SQL, Impala, 
Presto...) and its version, it may or may not be able to leverage certain 
storage optimizations.
For example, Spark still does not support the Hive bucketed-table optimization, 
but it might come in the upcoming Spark 2.3.


I'm much less familiar with Parquet, so if anyone has links to good 
documentation for Parquet fine tuning (or, even better, a comparison with ORC 
features), that would be really helpful.
By googling, I found these slides where someone at Netflix 
(https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide) 
seems to have tried the same kind of optimization as you in Parquet.
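
(For what it's worth, one common way to make Parquet's per-row-group min/max 
statistics effective for a point filter is to cluster the data on the filter 
column when it is written, so matching values land in only a few row groups; a 
rough sketch with hypothetical table names:)

INSERT OVERWRITE TABLE events_sorted
SELECT * FROM events
DISTRIBUTE BY myid
SORT BY myid;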





On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote:

Hi,


Why does Hive still read so many "records" even with filter pushdown enabled, 
when the returned dataset is a very small amount (4k out of 30 billion records)?

Re: Why the filter push down does not reduce the read data record count

2018-02-23 Thread Sun, Keith
I got your point and thanks for the nice slides info.


So the Parquet filter is not an easy thing, and I will try it according to the 
deck.


Thanks !


From: Furcy Pin <pin.fu...@gmail.com>
Sent: Friday, February 23, 2018 3:37:52 AM
To: user@hive.apache.org
Subject: Re: Why the filter push down does not reduce the read data record count

Hi,

Unless your table is partitioned or bucketed by myid, Hive generally has to 
read through all the records to find those that match your predicate.

In other words, Hive tables are generally not indexed for single-record 
retrieval the way you would expect RDBMS or Vertica tables to be.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding bloom 
filters on specific columns of a table 
(https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/), 
which could work as a kind of index.
Also, depending on the query engine you are using (Hive, Spark-SQL, Impala, 
Presto...) and its version, it may or may not be able to leverage certain 
storage optimizations.
For example, Spark still does not support the Hive bucketed-table optimization, 
but it might come in the upcoming Spark 2.3.


I'm much less familiar with Parquet, so if anyone has links to good 
documentation for Parquet fine tuning (or, even better, a comparison with ORC 
features), that would be really helpful.
By googling, I found these slides where someone at Netflix 
(https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide) 
seems to have tried the same kind of optimization as you in Parquet.





On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote:

Hi,


Why does Hive still read so many "records" even with filter pushdown enabled, 
when the returned dataset is a very small amount (4k out of 30 billion 
records)?


The "RECORDS_IN" counter of Hive which still showed the 30billion count and 
also the output in the map reduce log like this :

org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10


BTW, I am using Parquet as the storage format, and the filter pushdown did 
work, as I can see this in the log:


AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
eq(myid, 223)


Thanks,

Keith




Why the filter push down does not reduce the read data record count

2018-02-23 Thread Sun, Keith
Hi,


Why does Hive still read so many "records" even with filter pushdown enabled, 
when the returned dataset is a very small amount (4k out of 30 billion 
records)?


The "RECORDS_IN" counter of Hive which still showed the 30billion count and 
also the output in the map reduce log like this :

org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10


BTW, I am using Parquet as the storage format, and the filter pushdown did 
work, as I can see this in the log:


AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
eq(myid, 223)


Thanks,

Keith



Re: Can't drop table

2014-06-12 Thread Keith Wiley

On Jun 11, 2014, at 13:15 , Edward Capriolo wrote:

 There is a lengthy discussion, but it's unclear how to actually drop the 
 darn table.  In fact, that page
 
 Dumb work around:
  
 You could go directly to mysql and delete the table...
 
 Or make a serde with that name just so you will not get the class not found 
 error :)

Okay, thanks.  Still a pretty weird bug all in all, considering that the 
conclusion of that bug report was to not fix the bug.
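
(For reference, the add-jar workaround Nitin described in the other thread just 
makes the SerDe class resolvable before the drop is issued; a sketch with a 
hypothetical jar path and table name:)

hive> add jar /tmp/csv-serde.jar;
hive> drop table broken_csv_table;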


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

Luminous beings are we, not this crude matter.
   --  Yoda




Re: Can't drop table

2014-06-11 Thread Keith Wiley
That's beside the point.  The question is, why can't I drop the table?  There 
is no excuse for not dropping a table just because some serde can't be found.  
It shouldn't operate that way at all.

Thanks.

On Jun 10, 2014, at 23:33 , Nitin Pawar wrote:

 if you have added a table with a serde definition 
 
 then just keep the jar in the local filesystem and then in the hive console do 
 add jar <full path to jar>;
 
 this should make your serde available to table and you should be able to drop 
 the table then.
 
 I just tried above steps and it works for my json based tables. 
 
 
 On Wed, Jun 11, 2014 at 2:24 AM, Keith Wiley kwi...@keithwiley.com wrote:
 I tried to create a table that would use a csv serde.  There were various 
 problems with it, primarily in that I couldn't figure out how to specify the 
 path to the serde (whether I indicated the serde location locally or on 
 HDFS, there were subsequent errors about not finding the serde, even though 
 the paths were correct).
 
 So I decided to drop the table since it wasn't working...but the serde error 
 that is causing these problems to begin with prevents me from dropping the 
 table due to a serde-doesn't-exist error.  Yeah, duh, that's the problem in 
 the first place.
 
 This is an acknowledged problem, as indicated here:
 https://issues.apache.org/jira/browse/HIVE-3392
 ...but I don't understand from that page how to get around the problem.  
 There is a lengthy discussion, but it's unclear how to actually drop the darn 
 table.  In fact, that page suggests the problem is officially classified as 
 "won't be fixed".
 
 Note, this problem also prevents me from renaming the table via alter.
 
 I don't know what to do at this point.  Any ideas?
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 The easy confidence with which I know another man's religion is folly teaches
 me to suspect that my own is also.
--  Mark Twain
 
 
 
 
 
 -- 
 Nitin Pawar



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow.
   --  Keith Wiley




Bug in Hive Partition windowing functions?

2014-04-29 Thread Keith
Hi, 

we have an issue with a windowing function query that never completes when
running against a large dataset (> 25,000 rows). That is, the reducer
(only one) never exits and it appears stuck in an infinite loop. 
I looked at the reducer counter and it never changes over the 6 hours when it 
gets stuck in the loop.

When the data set is small (< 25K rows), it runs fine.

Is there any workaround for this issue? We tested
against Hive 0.11/0.12/0.13 and the result is the same.

create table window_function_fail as
select a.*,
       sum(case when bprice is not null then 1 else 0 end)
         over (partition by date, name
               order by otime, bprice, aprice desc
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) bidpid
from large_table a;

create table large_table(
  date   string,
  name   string,
  stime  string,
  bprice decimal,
  aprice decimal,
  otime  double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' stored as textfile;

Thanks in advance.





bucketed table problems

2014-03-07 Thread Keith Wiley
I want to convert a table to a bucketed table, so I made a new table with the 
same schema as the old table and specified a cluster column:

create table foo_bucketed
(
a string,
b int,
c float
)
clustered by (b) into 10 buckets;

Then I populate it from my original table:

set hive.enforce.bucketing = true;
insert overwrite table foo_bucketed
select * from foo;

All of the data goes into the first bucket, leaving the remaining 9 buckets 
empty (in the file system, the remaining 9 files are 0 size).  Furthermore, the 
cluster column is now NULL.  Its values have been completely erased by the 
insertion (which might explain how they all ended up in a single bucket of 
course).


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy.
   --  Edwin A. Abbott, Flatland




Re: bucketed table problems

2014-03-07 Thread Keith Wiley
Admittedly the problem has gone away.  I *think* I may have offered the schema 
to the bucketed table creation in the wrong order the first time...I think.  
All I know is the problem went away when I tried a second time.  Should 
inserting a table into another table match the columns by name or does it just 
write from left to right utterly ignoring any schema specification?  Does it 
write the column 'a' from foo into column 'a' in foo_bucketed, or does it write 
column 'a' from foo into the same column position in foo_bucketed without any 
regard for the names of the columns?
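
(For what it's worth, Hive's INSERT ... SELECT maps columns strictly by position, 
not by name, so the safest habit is to spell out the select list in the target 
table's declared order; a sketch assuming foo has the same three columns:)

set hive.enforce.bucketing = true;
insert overwrite table foo_bucketed
select a, b, c from foo;    -- a string, b int, c float: same order as foo_bucketed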

On Mar 7, 2014, at 13:52 , Stephen Sprague wrote:

 yeah. that's not right. 
 
 1. lets see the output of show create table foo
 
 2. what version of hive are you using.
 
 
 On Fri, Mar 7, 2014 at 11:46 AM, Keith Wiley kwi...@keithwiley.com wrote:
 I want to convert a table to a bucketed table, so I made a new table with the 
 same schema as the old table and specified a cluster column:
 
 create table foo_bucketed
 (
 a string,
 b int,
 c float
 )
 clustered by (b) into 10 buckets;
 
 Then I populate it from my original table:
 
 set hive.enforce.bucketing = true;
 insert overwrite table foo_bucketed
 select * from foo;
 
 All of the data goes into the first bucket, leaving the remaining 9 buckets 
 empty (in the file system, the remaining 9 files are 0 size).  Furthermore, 
 the cluster column is now NULL.  Its values have been completely erased by 
 the insertion (which might explain how they all ended up in a single bucket 
 of course).
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 Yet mark his perfect self-contentment, and hence learn his lesson, that to be
 self-contented is to be vile and ignorant, and that to aspire is better than 
 to
 be blindly and impotently happy.
--  Edwin A. Abbott, Flatland
 
 
 



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

Luminous beings are we, not this crude matter.
   --  Yoda




Re: bucketed table problems

2014-03-07 Thread Keith Wiley
Thanks.  :-)  It's working better for me now.

On Mar 7, 2014, at 17:37 , Stephen Sprague wrote:

 short answer: its by position.
 



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
   --  Homer Simpson




CDH4.4 - 0 progress on select queries

2013-11-07 Thread Keith Wright
Hi all,

   We have happily been using hive on our CDH install for some time without 
issues.  We recently upgraded from CDH4.2.0 to CDH 4.4.0 and now select queries 
against one table fails with the syslog for the job showing what's below.  
Couple of things to note:

 *   This is a partitioned table
 *   An example failure query:  select count(*) from table where partition 
conditions
 *   other tables do not appear to be affected; however, many/most of those have 
small files.  This table's directories have only 1 file, which is around 60 GB 
in size
 *   Queries using OOTB hive 0.9.0 pointing at the centralized meta store work 
without issue
 *   It looks like an infinite loop issue where this query usually completes in 
~2 min but we let it run for 45 minutes with the job tracker showing 0% progress 
on mapping (perhaps related to https://issues.apache.org/jira/browse/HIVE-5235)

Have others seen this behavior?

Thanks!!!

2013-11-07 13:58:02,102 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 
Processing alias organic_events for file 
hdfs://nameservice1/events/organic/2013/11/05
2013-11-07 13:58:02,102 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 
forwarding 1 rows
2013-11-07 13:58:02,102 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 
0 forwarding 1 rows
2013-11-07 13:58:02,102 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 
forwarding 1 rows
2013-11-07 13:58:02,103 INFO ExecMapper: ExecMapper: processing 1 rows: used 
memory = 178407888
2013-11-07 13:58:02,104 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 
forwarding 10 rows
2013-11-07 13:58:02,104 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 
0 forwarding 10 rows
2013-11-07 13:58:02,104 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 
forwarding 10 rows
2013-11-07 13:58:02,104 INFO ExecMapper: ExecMapper: processing 10 rows: used 
memory = 178407888
….

where the processed row counts and used memory keep increasing.


Use distribute to spread across reducers

2013-10-02 Thread Keith Wiley
I'm trying to create a subset of a large table for testing.  The following 
approach works:

create table subset_table as
select * from large_table limit 1000

...but it only uses one reducer.  I would like to speed up the process of 
creating a subset by distributing it across multiple reducers.  I already tried 
explicitly setting mapred.reduce.tasks and hive.exec.reducers.max to values 
larger than 1, but in this particular case, those values seem to be overridden 
by Hive's internal query-to-mapreduce conversion; it ignores those parameters.

So, I tried this:

create table subset_table as
select * from large_table limit 1000
distribute by column_name

...but that doesn't parse.  I get the following error:

OK FAILED: ParseException line 3:0 missing EOF at 'distribute' near '1000'.

I have tried NUMEROUS applications of parentheses, nested queries, etc.  For 
example, here's just one (amongst perhaps ten variations on a theme):

create table subset_table as
select * from (
from (
select * from large_table limit 1000
distribute by column_name
)) s

Like I said, I've tried all sorts of combinations of the elements shown above.  
So far I have not even gotten any syntax to parse, much less run.  Only the 
original query at the top will even pass the parsing stage of processing.

Any ideas?

Thanks.
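
(A side note on the parse error itself: in HiveQL, DISTRIBUTE BY and SORT BY 
have to come before LIMIT, not after it, so the variant below should at least 
parse; whether the LIMIT still funnels everything through one reducer is a 
separate question, so this is only a sketch, not a verified fix:)

create table subset_table as
select * from large_table
distribute by column_name
limit 1000;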


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use.
   --  Galileo Galilei




Re: Want query to use more reducers

2013-09-30 Thread Keith Wiley
Thanks.  mapred.reduce.tasks and hive.exec.reducers.max seem to have fixed the 
problem.  It is now saturating the cluster and running the query super fast.  
Excellent!

On Sep 30, 2013, at 12:28 , Sean Busbey wrote:

 Hey Keith,
 
 It sounds like you should tweak the settings for how Hive handles query 
 execution[1]:
 
 1) Tune the guessed number of reducers based on input size
 
 = hive.exec.reducers.bytes.per.reducer
 
 Defaults to 1G. Based on your description, it sounds like this is probably 
 still at default.
 
 In this case, you should also set a max # of reducers based on your cluster 
 size.
 
 = hive.exec.reducers.max
 
 I usually set this to the # reduce slots, if there's a decent chance I'll get 
 to saturate the cluster. If not, don't worry about it.
 
 2) Hard code a number of reducers
 
 = mapred.reduce.tasks
 
 Setting this will cause Hive to always use that number. It defaults to -1, 
 which tells hive to use the heuristic about input size to guess.
 
 In either of the above cases, you should look at the options to merge small 
 files (search for "merge" in the configuration property list) to avoid 
 getting lots of little outputs.
 
 HTH
 
 [1]: 
 https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution
 
 -Sean
 
 On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley kwi...@keithwiley.com wrote:
 I have a query that doesn't use reducers as efficiently as I would hope.  If 
 I run it on a large table, it uses more reducers, even saturating the 
 cluster, as I desire.  However, on smaller tables it uses as low as a single 
 reducer.  While I understand there is a logic in this (not using multiple 
 reducers until the data size is larger), it is nevertheless inefficient to 
 run a query for thirty minutes leaving the entire cluster vacant when the 
 query could distribute the work evenly and wrap things up in a fraction of 
 the time.  The query is shown below (abstracted to its basic form).  As you 
 can see, it is a little atypical: it is a nested query which obviously 
 implies two map-reduce jobs and it uses a script for the reducer stage that I 
 am trying to speed up.  I thought the distribute by clause should make it 
 use the reducers more evenly, but as I said, that is not the behavior I am 
 seeing.
 
 Any ideas how I could improve this situation?
 
 Thanks.
 
 CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
 SELECT * FROM (
 FROM (
 SELECT * FROM input_table
 DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, 
 input_column_2 ASC, input_column_etc ASC) q
 SELECT TRANSFORM(*)
 USING 'python my_reducer_script.py' AS(
 output_column_1,
 output_column_2,
 output_column_etc,
 )
 ) s
 ORDER BY output_column_1;
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 Luminous beings are we, not this crude matter.
--  Yoda
 
 
 
 
 
 -- 
 Sean
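
A compact sketch of the settings discussed above (the numbers are purely 
illustrative):

-- 256 MB per reducer instead of the 1 GB default
set hive.exec.reducers.bytes.per.reducer=268435456;
-- cap at roughly the cluster's number of reduce slots
set hive.exec.reducers.max=40;
-- or hard-code the count, which bypasses the input-size heuristic entirely
set mapred.reduce.tasks=40;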



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use.
   --  Galileo Galilei




Re: IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume

2013-07-15 Thread Keith Wright
I was able to resolve it.  Turns out we had a misconfiguration with one of our 
flume hdfs sinks, where 2 were writing as sequence file text while one was 
writing sequence file bytes.  It was due to one bad character in the flume conf.  
My manual map/reduce jobs worked fine because they automatically handled both 
cases.  Hive, on the other hand, would work for the beginning of the query until 
it hit a non-text-based file.  One of those 60-hours-of-debugging-leads-to-a-
one-character-change bugs.  Hopefully this will be a good lesson for others in 
the future!

Shreepadma Venugopalan shreepa...@cloudera.com wrote:



Hi Keith,

Were you able to resolve this? Or, is this still an issue?

Thanks.
Shreepadma


On Tue, May 28, 2013 at 6:02 AM, Keith Wright <kwri...@nanigans.com> wrote:
Hi all,

   This is my first post to the hive mailing list and I was hoping to get some 
help with the exception I am getting below.  I am using CDH4.2 (hive 0.10.0) to 
query snappy compressed, Sequence files that are built using Flume (relevant 
portion of flume conf below as well).  Note that I'm using a SequenceFile as it 
was needed for Impala integration.  Has anyone seen this error before?  Couple 
of additional points to help diagnose:

 1.  Queries seem to be able to process some mappers without issues.   In fact, 
I can do a simple select * from table limit 10 without issue. However if I 
make the limit high enough, it will eventually fail presumably as it needs to 
read in a file that has this issue.
 2.  The same query runs in Impala without errors but appears to skip some 
data.  I can confirm that the missing data is present via a custom map/reduce 
job
 3.  I am able to write a map/reduce job that reads through all of the same 
data without issue and have been unable to identify data corruption
 4.  This is a partitioned table and queries fail that touch ANY of the 
partitions (and there are hundreds) so this does not appear to be a sporadic, 
data integrity problem (table definition below)
 5.  We are using '\001' as our field separator.  We are capturing other data 
also with SequenceFile, snappy but using '|' as our delimiter and we do not 
have any issues querying there.  Although we are using a different flume source.

My next step for debugging was to disable snappy compression and see if I could 
query the data.  If not, switch from SequenceFile to simple text.

I appreciate the help!!!

CREATE EXTERNAL TABLE ORGANIC_EVENTS (
event_id BIGINT,
app_id INT,
user_id BIGINT,
type STRING,
name STRING,
value STRING,
extra STRING,
ip_address STRING,
user_agent STRING,
referrer STRING,
event_time BIGINT,
install_flag TINYINT,
first_for_user TINYINT,
cookie STRING,
year int,
month int,
day int,
hour int)  PARTITIONED BY (year int, month int, day int,hour int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE
LOCATION '/events/organic';

agent.sinks.exhaustHDFSSink3.type = HDFS
agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
agent.sinks.exhaustHDFSSink3.hdfs.path = 
hdfs://lxscdh001.nanigans.com:8020%{path}
agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
agent.sinks.exhaustHDFSSink.3hdfs.writeFormat = Text



2013-05-28 12:29:00,919 WARN org.apache.hadoop.mapred.Child: Error running 
child  java.io.IOException: java.io.IOException: 
java.lang.IndexOutOfBoundsException
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
  at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:330)
  at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:246)
  at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:216)
  at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:201)
  at 
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418

Re: .sql vs. .hql

2013-05-31 Thread Keith Wiley
Thanks!  I'm always up for a good book.  Our case is a little open-ended right 
now, a little difficult to nail down or describe.  Basically, I'm deploying 
Hive systems for a SQL-comfortable data analyst (who is also quite versed in 
Hive) who would like to run .sql files in Hive and I need to clearly understand 
how Hive uses input files and how .sql and .hql may differ.
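
(One concrete detail that may help: the Hive CLI does not care about the file 
extension; a .sql file and an .hql file are both just scripts of HiveQL 
statements and run the same way. A sketch with hypothetical file names:)

hive -f report_queries.sql
hive -f report_queries.hql
hive> source /path/to/report_queries.sql;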

Cheers!

On May 31, 2013, at 13:20 , Sanjay Subramanian wrote:

 For Hive u need the other bible by Dean Wampler , Edward Capriolo et al
 Also if u tell us what use cases u have we could provide help…
 
 sanjay
 
 On 5/31/13 1:17 PM, Keith Wiley kwi...@keithwiley.com wrote:
 
 I'm looking for documentation on how to use .sql and .hql files in Hive
 and what the differences are between the two file types.  I'm not even
 certain .hql is a real thing, but I'm looking.  I didn't see any mention
 of this issue in White's Hadoop book, at least not clearly indicated in
 the index.
 
 Thanks.
 
 __
 __
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 Luminous beings are we, not this crude matter.
  --  Yoda
 __
 __
 
 
 
 
 



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use.
   --  Galileo Galilei




IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume

2013-05-28 Thread Keith Wright
Hi all,

   This is my first post to the hive mailing list and I was hoping to get some 
help with the exception I am getting below.  I am using CDH4.2 (hive 0.10.0) to 
query snappy compressed, Sequence files that are built using Flume (relevant 
portion of flume conf below as well).  Note that I'm using a SequenceFile as it 
was needed for Impala integration.  Has anyone seen this error before?  Couple 
of additional points to help diagnose:

 1.  Queries seem to be able to process some mappers without issues.   In fact, 
I can do a simple select * from table limit 10 without issue. However if I 
make the limit high enough, it will eventually fail presumably as it needs to 
read in a file that has this issue.
 2.  The same query runs in Impala without errors but appears to skip some 
data.  I can confirm that the missing data is present via a custom map/reduce 
job
 3.  I am able to write a map/reduce job that reads through all of the same 
data without issue and have been unable to identify data corruption
 4.  This is a partitioned table and queries fail that touch ANY of the 
partitions (and there are hundreds) so this does not appear to be a sporadic, 
data integrity problem (table definition below)
 5.  We are using '\001' as our field separator.  We are capturing other data 
also with SequenceFile, snappy but using '|' as our delimiter and we do not 
have any issues querying there.  Although we are using a different flume source.

My next step for debugging was to disable snappy compression and see if I could 
query the data.  If not, switch from SequenceFile to simple text.

I appreciate the help!!!

CREATE EXTERNAL TABLE ORGANIC_EVENTS (
event_id BIGINT,
app_id INT,
user_id BIGINT,
type STRING,
name STRING,
value STRING,
extra STRING,
ip_address STRING,
user_agent STRING,
referrer STRING,
event_time BIGINT,
install_flag TINYINT,
first_for_user TINYINT,
cookie STRING,
year int,
month int,
day int,
hour int)  PARTITIONED BY (year int, month int, day int,hour int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE
LOCATION '/events/organic';

agent.sinks.exhaustHDFSSink3.type = HDFS
agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
agent.sinks.exhaustHDFSSink3.hdfs.path = 
hdfs://lxscdh001.nanigans.com:8020%{path}
agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
agent.sinks.exhaustHDFSSink.3hdfs.writeFormat = Text


2013-05-28 12:29:00,919 WARN org.apache.hadoop.mapred.Child: Error running 
child  java.io.IOException: java.io.IOException: 
java.lang.IndexOutOfBoundsException
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
  at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:330)
  at 
org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:246)
  at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:216)
  at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:201)
  at 
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
  at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
  at 
org.apache.hadoop.mapred.Child$4.run(Child.java:268)
  at 
java.security.AccessController.doPrivileged(Native Method)
  at 
javax.security.auth.Subject.doAs(Subject.java:396)
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
  at 
org.apache.hadoop.mapred.Child.main(Child.java:262)
  Caused by: java.io.IOException: 
java.lang.IndexOutOfBoundsException
  at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
  

Re: Need rank(), can't build m6d's version

2013-04-02 Thread Keith Wiley
Thank you Edward.  I appreciate you trying to help out.  I'll see if I can get 
that to work.

On Apr 1, 2013, at 17:09 , Edward Capriolo wrote:

 approx steps
 git clone hive-test 
 mvn install
 git clone hive-rank
 mvn install
 
 or just
 git clone hive-test
 mvn -Dmaven.test.skip=true install


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also.
   --  Mark Twain




Re: Need rank()

2013-04-02 Thread Keith Wiley
On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:

 http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

Is there any possibility there is a bug in Ritesh Agrawal's query statement 
from that article?  I created a test table with the exact column names from the 
example in the article and used a minimally altered version of the command (I 
removed the where clause to simplify things a bit) and got an error which 
suggests there is something slightly wrong with the command (or perhaps the 
table has to be configured a special way).  Here's what I get when I almost 
perfectly duplicate that example:

hive> describe test;
OK
user        string
category    string
value       int
Time taken: 0.082 seconds
==
hive> select * from test;
OK
user1   cat1    1
user1   cat1    2
user1   cat1    3
user1   cat2    10
user1   cat2    20
user1   cat2    30
user2   cat1    11
user2   cat1    21
user2   cat1    31
user2   cat2    5
user2   cat2    6
user2   cat2    7
Time taken: 0.202 seconds
==
hive> SELECT user, category, value
      FROM (
          SELECT user, category, rank(user) as rank, value
          FROM test
          DISTRIBUTE BY user
          SORT BY user, value desc
      ) a
      WHERE rank < 1
      ORDER BY user, rank;
FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
reference 'rank': (possible column names are: _col0, _col1, _col2)
hive>


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
   --  Keith Wiley




Re: Need rank()

2013-04-02 Thread Keith Wiley
I did all that, I just didn't paste it into the email.  That isn't the problem. 
 Sorry for the confusion.

hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';

BTW, small typo: the condition at the end uses less-than, not 
greater-than...obviously...but that isn't the problem either.

On Apr 2, 2013, at 10:06 , Nitin Pawar wrote:

 I dont see you added rank in the classpath and defined rank function 
 
 can you try add jar and define the function and try again 
 
 
 
 On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley kwi...@keithwiley.com wrote:
 On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
 
  http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
 
 Is there any possibility there is a bug in Ritesh Agrawal's query statement 
 from that article?  I created a test table with the exact column names from 
 the example in the article and used a minimally altered version of the 
 command (I removed the where clause to simplify things a bit) and got an 
 error which suggests there is something slightly wrong with the command (or 
 perhaps the table has to be configured a special way).  Here's what I get 
 when I almost perfectly duplicate that example:
 
 hive describe test;
 OK
 userstring
 categorystring
 value   int
 Time taken: 0.082 seconds
 ==
 hive select * from test;
 OK
 user1   cat11
 user1   cat12
 user1   cat13
 user1   cat210
 user1   cat220
 user1   cat230
 user2   cat111
 user2   cat121
 user2   cat131
 user2   cat25
 user2   cat26
 user2   cat27
 Time taken: 0.202 seconds
 ==
 hive SELECT user, category, value
  FROM (
  SELECT user, category, rank(user) as rank, value
  FROM test
  DISTRIBUTE BY user
  SORT BY user, value desc
  ) a
  WHERE rank  1
  ORDER BY user, rank;
 FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
 reference 'rank': (possible column names are: _col0, _col1, _col2)
 hive
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 What I primarily learned in grad school is how much I *don't* know.
 Consequently, I left grad school with a higher ignorance to knowledge ratio 
 than
 when I entered.
--  Keith Wiley
 
 
 
 
 
 -- 
 Nitin Pawar



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow.
   --  Keith Wiley




Re: Need rank()

2013-04-02 Thread Keith Wiley
I suppose.  Like I said, I was just following the example on that website and 
it didn't seem to have any comments on it remarking on errors in the article.  
Thus my confusion about the whole thing.  It is presented as being accurate and 
is old enough that any errors would have been caught and labeled in the 
comments long ago.  I also thought it was a little odd to name the function the 
same thing as the alias assigned to the result.  I'll try disambiguating that.

On Apr 2, 2013, at 10:22 , Nitin Pawar wrote:

 I am not sure about this 
 but you defined a rank function and then aliased the column as rank itself. 
 Can this cause the issue? 
 
 From the error it looks like it could not figure out which column is rank in 
 the where clause WHERE rank < 1 
 
 from Edward's blog the correct query looks similar  but his function is named 
 as p_rank 
 
 SELECT
  category,country,product,sales,rank
 FROM (
  SELECT
category,country,product,sales,
p_rank(category, country) rank
  FROM (
SELECT
  category,country,product,
  sales
FROM p_rank_demo
DISTRIBUTE BY
  category,country
SORT BY
  category,country,sales desc) t1) t2
 WHERE rank <= 3
 
 
 On Tue, Apr 2, 2013 at 10:45 PM, Keith Wiley kwi...@keithwiley.com wrote:
 I did all that, I just didn't paste it into the email.  That isn't the 
 problem.  Sorry for the confusion.
 
 hive add jar Rank.jar;
 hive create temporary function rank as 'com.example.hive.udf.Rank';
 
 BTW, small typo, the condition at the end uses less-then, not 
 greater-then...obviously...but that isn't the problem either.
 
 On Apr 2, 2013, at 10:06 , Nitin Pawar wrote:
 
  I dont see you added rank in the classpath and defined rank function
 
  can you try add jar and define the function and try again
 
 
 
  On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley kwi...@keithwiley.com wrote:
  On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
 
   http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
 
  Is there any possibility there is a bug in Ritesh Agrawal's query statement 
  from that article?  I created a test table with the exact column names from 
  the example in the article and used a minimally altered version of the 
  command (I removed the where clause to simplify things a bit) and got an 
  error which suggests there is something slightly wrong with the command (or 
  perhaps the table has to be configured a special way).  Here's what I get 
  when I almost perfectly duplicate that example:
 
  hive describe test;
  OK
  userstring
  categorystring
  value   int
  Time taken: 0.082 seconds
  ==
  hive select * from test;
  OK
  user1   cat11
  user1   cat12
  user1   cat13
  user1   cat210
  user1   cat220
  user1   cat230
  user2   cat111
  user2   cat121
  user2   cat131
  user2   cat25
  user2   cat26
  user2   cat27
  Time taken: 0.202 seconds
  ==
  hive SELECT user, category, value
   FROM (
   SELECT user, category, rank(user) as rank, value
   FROM test
   DISTRIBUTE BY user
   SORT BY user, value desc
   ) a
   WHERE rank  1
   ORDER BY user, rank;
  FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
  reference 'rank': (possible column names are: _col0, _col1, _col2)
  hive
 
  
  Keith Wiley kwi...@keithwiley.com keithwiley.com
  music.keithwiley.com
 
  What I primarily learned in grad school is how much I *don't* know.
  Consequently, I left grad school with a higher ignorance to knowledge ratio 
  than
  when I entered.
 --  Keith Wiley
  
 
 
 
 
  --
  Nitin Pawar
 
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 It's a fine line between meticulous and obsessive-compulsive and a slippery
 rope between obsessive-compulsive and debilitatingly slow.
--  Keith Wiley
 
 
 
 
 
 -- 
 Nitin Pawar



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley




Re: Need rank()

2013-04-02 Thread Keith Wiley
I'm pretty sure there are at least a few bugs in that article.  The problem 
described below goes away if rank is added to the outer select statement.  If 
there's one error I suppose there may be others.  Although I have gotten the 
query to run, I'm not getting properly ranked results yet.  I'm still working 
on it.

On Apr 2, 2013, at 10:03 , Keith Wiley wrote:

 On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
 
 http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
 
 Is there any possibility there is a bug in Ritesh Agrawal's query statement 
 from that article?  I created a test table with the exact column names from 
 the example in the article and used a minimally altered version of the 
 command (I removed the where clause to simplify things a bit) and got an 
 error which suggests there is something slightly wrong with the command (or 
 perhaps the table has to be configured a special way).  Here's what I get 
 when I almost perfectly duplicate that example:
 
 hive describe test;
 OK
 user  string  
 category  string  
 value int 
 Time taken: 0.082 seconds
 ==
 hive select * from test;
 OK
 user1 cat11
 user1 cat12
 user1 cat13
 user1 cat210
 user1 cat220
 user1 cat230
 user2 cat111
 user2 cat121
 user2 cat131
 user2 cat25
 user2 cat26
 user2 cat27
 Time taken: 0.202 seconds
 ==
 hive SELECT user, category, value
 FROM (
 SELECT user, category, rank(user) as rank, value
 FROM test
 DISTRIBUTE BY user
 SORT BY user, value desc
 ) a
 WHERE rank  1
 ORDER BY user, rank;
 FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
 reference 'rank': (possible column names are: _col0, _col1, _col2)
 hive 
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 What I primarily learned in grad school is how much I *don't* know.
 Consequently, I left grad school with a higher ignorance to knowledge ratio 
 than
 when I entered.
   --  Keith Wiley
 
 



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy.
   --  Edwin A. Abbott, Flatland




Re: Need rank()

2013-04-02 Thread Keith Wiley
Yep, the original article is definitely erroneous in this regard.  I figured 
that out eventually.  I'm not sure how much I can trust that resource now.  I 
may have to look elsewhere.  I agree that Edward's description is pretty good, 
but as I said earlier, I can't actually use his code, so I'm trying to cobble a 
workable solution together from the various resources available.  Ritesh's 
article, despite the error in the Hive syntax, is still useful in that it 
enables one to quickly compile a simple rank jar without relying on git, maven, 
or other project dependencies -- problems which have plagued me with Edward's 
approach.  So, if I can use Ritesh's method to write a simple rank function, 
and Edward's accurate description of how to construct the query, then I can put 
all the pieces together into a workable solution.

I'll let you know if I get it.

On Apr 2, 2013, at 10:56 , Igor Tatarinov wrote:

 You are getting the error because you are ORDERing BY rank but rank is not in 
 the top SELECT
 
 Also, DISTRIBUTE BY/SORT BY are done after SELECT so you have to use a 
 subquery:
 SELECT ..., rank(user)
 FROM (SELECT ... DISTRIBUTE BY ... SORT BY)
 
 igor
 decide.com
 
 
 On Tue, Apr 2, 2013 at 10:03 AM, Keith Wiley kwi...@keithwiley.com wrote:
 On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:
 
  http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
 
 Is there any possibility there is a bug in Ritesh Agrawal's query statement 
 from that article?  I created a test table with the exact column names from 
 the example in the article and used a minimally altered version of the 
 command (I removed the where clause to simplify things a bit) and got an 
 error which suggests there is something slightly wrong with the command (or 
 perhaps the table has to be configured a special way).  Here's what I get 
 when I almost perfectly duplicate that example:
 
 hive describe test;
 OK
 userstring
 categorystring
 value   int
 Time taken: 0.082 seconds
 ==
 hive select * from test;
 OK
 user1   cat11
 user1   cat12
 user1   cat13
 user1   cat210
 user1   cat220
 user1   cat230
 user2   cat111
 user2   cat121
 user2   cat131
 user2   cat25
 user2   cat26
 user2   cat27
 Time taken: 0.202 seconds
 ==
 hive SELECT user, category, value
  FROM (
  SELECT user, category, rank(user) as rank, value
  FROM test
  DISTRIBUTE BY user
  SORT BY user, value desc
  ) a
  WHERE rank  1
  ORDER BY user, rank;
 FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column 
 reference 'rank': (possible column names are: _col0, _col1, _col2)
 hive
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 What I primarily learned in grad school is how much I *don't* know.
 Consequently, I left grad school with a higher ignorance to knowledge ratio 
 than
 when I entered.
--  Keith Wiley
 
 
 



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

Luminous beings are we, not this crude matter.
   --  Yoda




Re: Need rank()

2013-04-02 Thread Keith Wiley
On Apr 2, 2013, at 12:59 , Edward Capriolo wrote:

 I feel your pain I do not have time to look at the project ATM. However 
 trying to do Java development these days without understanding git, or maven 
 is fairly impossible, most projects depend on N others that depend on M 
 others. So trying to even compile the code outside of maven requires a lot of 
 manually fetching jar etc. Your better off taking a day and getting good with 
 maven, IMHO. You will be dealing with it for the foreseeable future. 


I agree.  I'll cozy up with it first chance I get...but sometimes you're on a 
deadline and have to go for the low-hanging fruit.  I fully understand the 
motivation behind your approach.  I'll circle back around it when I get some 
free time.

Cheers!


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use.
   --  Galileo Galilei




Rank(): here's how I did it, for better or worse...

2013-04-02 Thread Keith Wiley
I agree, it's probably best to use a better engineered approach such as 
Edward's.  In the meantime, if anyone would benefit from a walk-through of my 
direct approach, here it is.  It combines Ritesh's direct ultra-simplistic 
method with Edward's correct HiveQL syntax.  As would be expected, it is 
sensitive to shell vagaries that would be better managed by a combo system like 
git and maven...but it works.



Create Rank.java:
-
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Rank extends UDF {
    private int counter;       // 0-indexed rank within the current key group
    private String last_key;   // key seen on the previous row

    public int evaluate(final String key) {
        // reset the counter whenever a new key group starts
        if (!key.equalsIgnoreCase(this.last_key)) {
            this.counter = 0;
            this.last_key = key;
        }
        return this.counter++;
    }
}



Compile Rank.java to Rank.class.  Then bundle to Rank.jar.  Observe that the 
jar command is highly sensitive to the relative path to the .class file when 
naming the package inside the resulting .jar file:
-
$ mkdir ./RankTempDir
$ javac -classpath 
$HIVE_HOME/lib/hive-serde-0.8.1.jar:$HIVE_HOME/lib/hive-exec-0.8.1.jar:$HADOOP_HOME/hadoop-core.jar
 -d ./RankTempDir Rank.java
$ cd RankTempDir;
$ jar -cf ../Rank.jar ./com
$ cd ..

You will have to verify (via echo) HADOOP_HOME and HIVE_HOME and then will have 
to verify the name of the serde, exec, and core files in each directory.  The 
exact filenames are probably version specific.

Verify the package path in Rank.jar:
-
$ jar -tvf Rank.jar

You should see 'com/example/hive/udf/Rank.class'.  If you see a different path, 
the package has not been properly represented in the jar w.r.t. its designation 
in the .java file.



Run hive and prepare the session to use the UDF:

$ hive
hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';

You must either run hive from a directory containing Rank.jar or specify an 
alternate path to it in the add command.  Note that the Rank class's full 
package is specified in the create command and therefore must, logically, 
match the package in both the .java and the .jar files.



Consider a table named 'test' consisting of columns 'user', 'category', and 
'value', containing the following data:

hive> select * from test;
user1   catA    1
user1   catB    11
user1   catC    111
user2   catA    222
user2   catB    22
user2   catC    2
user3   catA    3
user3   catB    5
user3   catC    4

So the top category for user1 is catC, for user2 is catA and for user3 is catB. 
 Say we want the top N valued categories for each user.  In the example below, 
N is 2 (it is indicated in the final WHERE clause).  Here is the format of the 
corresponding ranked query, and its result:

hive> SELECT user, category, value, ranked_col
      FROM (
          SELECT user, category, value, rank(user) ranked_col
          FROM (
              SELECT user, category, value
              FROM test
              DISTRIBUTE BY user
              SORT BY user, value desc
          ) a
      ) b
      WHERE ranked_col < 2
      ORDER BY user, ranked_col;
...
[wait for Hive query and MapReduce job(s) to finish]
...
user1   catC    111     0
user1   catB    11      1
user2   catA    222     0
user2   catB    22      1
user3   catB    5       0
user3   catC    4       1

Note that ranks are 0-indexed (of course I suppose that's a property of the 
specific .java we wrote above, and therefore is easily amenable to 1-indexing, 
which would more canonically connote the notion of a rank).
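
(A minimal, hypothetical tweak for 1-indexed ranks would be to change the return 
statement in Rank.java from "return this.counter++;" to "return ++this.counter;", 
since the pre-increment yields 1 for the first row of each key group.)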

Anyway, that's what I came up with.  I don't by any means claim it's the best 
approach.  Edward is surely right that the best method would be to use the 
powerful tools made available by the large developer community such as git and 
maven.

Cheers!


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
   --  Homer Simpson




Need rank(), can't build m6d's version

2013-04-01 Thread Keith Wiley
I need rank() in Hive.  I haven't had much luck with Edward Capriolo's on git 
and it comes with no documentation.  It depends on hive-test (also by Edward) 
and I can't get maven to build the tests for hive-test which prevents me from 
building the hive-test package (the jar)...which is all pretty tangential to 
the real goal, which is rank of course.

Is there any place to download the hive-rank jar pre-built instead of having to 
build it out of git?  If not, has anyone built it recently, or namely has 
anyone built the hive-test project it depends on?  The following is just one 
example of the sort of error I see.  Remember this error is on hive-test, which 
frankly I don't even care about.  :-)

Feeling a little desperate...thanks for any help.

---
Test set: com.jointhegrid.hive_test.EmbeddedHiveExampleTest
---
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.878 sec <<< 
FAILURE!
testA(com.jointhegrid.hive_test.EmbeddedHiveExampleTest)  Time elapsed: 7.823 
sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<9>
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.failNotEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:126)
at org.junit.Assert.assertEquals(Assert.java:470)
at org.junit.Assert.assertEquals(Assert.java:454)
at 
com.jointhegrid.hive_test.EmbeddedHiveExampleTest.testA(EmbeddedHiveExampleTest.java:51)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:243)
at junit.framework.TestSuite.run(TestSuite.java:238)
at 
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
at 
org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
at $Proxy0.invoke(Unknown Source)
at 
org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
at 
org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)


Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use.
   --  Galileo Galilei




Re: Need rank(), can't build m6d's version

2013-04-01 Thread Keith Wiley
Thanks!  That looks like it has serious potential.  I'll study up on it.

On Apr 1, 2013, at 16:12 , Alexander Pivovarov wrote:

 http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/
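For reference, a rough sketch of the pattern that post describes, assuming a user-built rank UDF has been registered (jar path, class name, and table/column names below are illustrative, not taken from the post):

ADD JAR /path/to/rank-udf.jar;
CREATE TEMPORARY FUNCTION p_rank AS 'com.example.hive.udf.Rank';

-- the UDF is expected to return an incrementing counter that resets
-- whenever its argument (the group key) changes
SELECT user_id, item, score
FROM (
    SELECT user_id, item, score, p_rank(user_id) AS r
    FROM (
        SELECT user_id, item, score
        FROM purchases
        DISTRIBUTE BY user_id
        SORT BY user_id, score DESC
    ) sorted
) ranked
WHERE r <= 10;

The nesting matters: the inner DISTRIBUTE BY / SORT BY sends each group to a single reducer in score order, so the counter produced by the UDF behaves as a per-group rank.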



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
   --  Homer Simpson




Re: S3/EMR Hive: Load contents of a single file

2013-03-27 Thread Keith Wiley
Okay, I also saw your previous response which analyzed queries into two tables 
built around two files in the same directory.  I guess I was simply wrong in my 
understanding that a Hive table is fundamentally associated with a directory 
instead of a file.  Turns out, it can be either one.  A directory table uses 
all files in the directory while a file table uses one specific file and 
properly avoids sibling files.  My bad.

Thanks for the careful analysis and clarification.  TIL!

Cheers!

On Mar 27, 2013, at 02:58 , Tony Burton wrote:

 A bit more info - do an extended description of the table:
  
 $ desc extended gsrc1;
  
 And the “location” field is “location:s3://mybucket/path/to/data/src1.txt”
  
 Do the same on a table created with a location pointing at the directory and 
 the same info gives (not surprisingly) “location:s3://mybucket/path/to/data/”
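In other words, both of these should be accepted (a sketch based on Tony's analysis on EMR/S3; bucket, schema and delimiter are illustrative), the first scanning every file under the directory and the second reading only the named file:

CREATE EXTERNAL TABLE src_all (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://mybucket/path/to/data/';

CREATE EXTERNAL TABLE src_one (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://mybucket/path/to/data/src1.txt';

Note that plain Hive on HDFS has historically insisted on a directory here, so the single-file form may be EMR/S3-specific.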
 


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson




Re: S3/EMR Hive: Load contents of a single file

2013-03-26 Thread Keith Wiley
 
 
 



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy.
   --  Edwin A. Abbott, Flatland




Re: Query crawls through reducer

2013-03-24 Thread Keith Wiley
Thanks.

On Mar 22, 2013, at 21:02 , Nitin Pawar wrote:

 instead of = can you just try =  if you want to limit top 100 (b being a 
 partition  i guess it will have more that 100 records to fit into your limit)
 
 to improve your query performance your table file format matters as well. 
 Which one are you using?  
 how many partitions are there? 
 what's the size of the cluster?
 you can set the number of reducers but if your query just has one key then 
 only one reducer will get the data and rest will run empty 
 
 
 
 On Sat, Mar 23, 2013 at 4:32 AM, Keith Wiley kwi...@keithwiley.com wrote:
 The following query translates into a many-map-single-reduce job (which is 
 common) and also slogs through the reduce stage...it's killing the overall 
 query:
 
 select * from a where b = 'c' order by b desc limit 100
 
 Note that b is a partition.  What component is making the reducer heavy?  Is 
 it the order by or the limit (I'm sure it's not the partition-specific where 
 clause, right?)?  Are there ways to improve its performance?
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 You can scratch an itch, but you can't itch a scratch. Furthermore, an itch 
 can
 itch but a scratch can't scratch. Finally, a scratch can itch, but an itch 
 can't
 scratch. All together this implies: He scratched the itch from the scratch 
 that
 itched but would never itch the scratch from the itch that scratched.
--  Keith Wiley
 
 
 
 
 
 -- 
 Nitin Pawar



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow.
   --  Keith Wiley




Query crawls through reducer

2013-03-22 Thread Keith Wiley
The following query translates into a many-map-single-reduce job (which is 
common) and also slogs through the reduce stage...it's killing the overall 
query:

select * from a where b = 'c' order by b desc limit 100

Note that b is a partition.  What component is making the reducer heavy?  Is it 
the order by or the limit (I'm sure it's not the partition-specific where 
clause, right?)?  Are there ways to improve its performance?
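Since the WHERE clause already pins b to a single value, the ORDER BY contributes nothing and can simply be dropped, which removes the need for a global single-reducer sort (a sketch; whether the remaining job still needs a reducer depends on the Hive version and plan):

select * from a where b = 'c' limit 100;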


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley




Rename external table, including HDFS directory

2013-03-07 Thread Keith Wiley
My understanding is that renaming an external table doesn't rename the 
corresponding HDFS directory to match.  I would like to do this; I want to 
rename the table in Hive and also rename the HDFS directory so they match (they 
already match with the old name).  I'm not sure how to do this.  I'm worried 
that if I rename the HDFS directory through Hadoop, not Hive, that Hive will 
lose track of the directory, regardless of whether I rename the table in Hive 
as well.

I also don't want to drop and recreate the table because I don't want to notify 
Hive of all the partitions all over again.

Thoughts?
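For the record, one sequence that should keep the two in sync (a sketch; ALTER TABLE ... SET LOCATION is assumed to be available in your Hive version, and the paths are illustrative):

ALTER TABLE old_name RENAME TO new_name;
dfs -mv /data/old_name /data/new_name;              -- from the hive prompt, or hadoop fs -mv
ALTER TABLE new_name SET LOCATION '/data/new_name';

Caveat: existing partitions keep their own stored locations, so a partitioned table may also need an ALTER TABLE ... PARTITION (...) SET LOCATION per partition, which is exactly the re-registration being avoided above.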


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

Luminous beings are we, not this crude matter.
   --  Yoda




Combine two overlapping schema?

2013-03-06 Thread Keith Wiley
I have two tables which have overlapping but nonidentical schema.  I want to 
create a new table that unions them, leaving nulls in any given row where a 
column name doesn't occur in the other table:

SCHEMA 1: { a, b, c, Y }
row:  { 1, 2, 3, 4 }

SCHEMA 2: { a, b, c, Z }
row:  { 5, 6, 7, 8 }

NEW SCHEMA: { a, b, c, Y, Z }
new row:  { a:1, b:2, c:3, Y:4, Z:null }
new row:  { a:5, b:6, c:7, Y:null, Z:8 }

I don't think either full outer join or union all works.  I'm not sure how 
to do this.  Any ideas?

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley




Re: Combine two overlapping schema?

2013-03-06 Thread Keith Wiley
Ah.  I was stuck on the requirement that the two schema match, but I see your 
point.  I'll see if that works.

On Mar 6, 2013, at 10:11 , Dean Wampler wrote:

 Off the top of my head, I think UNION ALL should work if you explicitly 
 project out the missing columns with NULL or other values, e.g. using nested 
 SELECTs, something like
 
 SELECT * FROM (
  SELECT a,b,c, Y, NULL AS Z FROM table1
  UNION ALL
  SELECT a,b,c, NULL AS Y, Z FROM table2
 ) table12; 
 
 
 
 On Wed, Mar 6, 2013 at 12:03 PM, Keith Wiley kwi...@keithwiley.com wrote:
 I have two tables which have overlapping but nonidentical schema.  I want to 
 create a new table that unions them, leaving nulls in any given row where a 
 column name doesn't occur in the other table:
 
 SCHEMA 1: { a, b, c, Y }
 row:  { 1, 2, 3, 4 }
 
 SCHEMA 2: { a, b, c, Z }
 row:  { 5, 6, 7, 8 }
 
 NEW SCHEMA: { a, b, c, Y, Z }
 new row:  { a:1, b:2, c:3, Y:4, Z:null }
 new row:  { a:5, b:6, c:7, Y:null, Z:8 }
 
 I don't think either full outer join or union all works.  I'm not sure 
 how to do this.  Any ideas?
 
 Thanks.
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 You can scratch an itch, but you can't itch a scratch. Furthermore, an itch 
 can
 itch but a scratch can't scratch. Finally, a scratch can itch, but an itch 
 can't
 scratch. All together this implies: He scratched the itch from the scratch 
 that
 itched but would never itch the scratch from the itch that scratched.
--  Keith Wiley
 
 
 
 
 
 -- 
 Dean Wampler, Ph.D.
 thinkbiganalytics.com
 +1-312-339-1330
 



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use.
   --  Galileo Galilei




Can hive show external table filenames?

2013-01-31 Thread Keith Wiley
show tables shows the HDFS directories which correspond to Hive tables and 
various select queries show data from inside the files (each row of a file 
being a row of the overarching table that the entire directory represents)...or 
so I have convinced myself...but can Hive show the HDFS filenames?  Or is 
hadoop fs -ls the only way to do that?

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also.
   --  Mark Twain




Re: Can hive show external table filenames?

2013-01-31 Thread Keith Wiley
Hmmm, that doesn't seem to list the files in the directory.

On Jan 31, 2013, at 12:12 , Edward Capriolo wrote:

 Try
 describe extended tablename
 or
 describe extended tablename (partition)
 
 the location is one of the properties
 
 Edward
 
 
 On Thu, Jan 31, 2013 at 2:37 PM, Keith Wiley kwi...@keithwiley.com wrote:
 show tables shows the HDFS directories which correspond to Hive tables and 
 various select queries show data from inside the files (each row of a file 
 being a row of the overarching table that the entire directory 
 represents)...or so I have convinced myself...but can Hive show the HDFS 
 filenames?  Or is hadoop fs -ls the only way to do that?
 
 Thanks.



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley




Re: Can hive show external table filenames?

2013-01-31 Thread Keith Wiley
Thanks, that seems like a reasonable summary of my options.

Cheers!

On Jan 31, 2013, at 14:38 , Edward Capriolo wrote:

 You can execute dfs commands from inside hive
 
 hive> dfs -ls
 
 You can also use virtual columns inside a select
 
 https://cwiki.apache.org/Hive/languagemanual-virtualcolumns.html
 
 However there is no direct command in hive to list the files in a
 table/partition.
 
 On Thu, Jan 31, 2013 at 4:55 PM, Keith Wiley kwi...@keithwiley.com wrote:
 Hmmm, that doesn't seem to list the files in the directory.
 
 On Jan 31, 2013, at 12:12 , Edward Capriolo wrote:
 
 Try
 describe extended tablename
 or
 describe extended tablename (partition)
 
 the location is one of the properties
 
 Edward
 
 
 On Thu, Jan 31, 2013 at 2:37 PM, Keith Wiley kwi...@keithwiley.com wrote:
 show tables shows the HDFS directories which correspond to Hive tables 
 and various select queries show data from inside the files (each row of a 
 file being a row of the overarching table that the entire directory 
 represents)...or so I have convinced myself...but can Hive show the HDFS 
 filenames?  Or is hadoop fs -ls the only way to do that?
 
 Thanks.
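A minimal illustration of both options Edward mentions (table, column and path are illustrative); the INPUT__FILE__NAME virtual column reports which file each returned row came from:

hive> dfs -ls /user/hive/warehouse/stringmap;
hive> select INPUT__FILE__NAME, objecttypecode from stringmap limit 10;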



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy.
   --  Edwin A. Abbott, Flatland




apache can't access hive

2012-11-12 Thread Keith Wiley
I have created a php-backed dynamic website which pulls data from hive (via 
exec) and presents the results through a web interface.  This was working 
fine...then I upgraded to CDH4 and now apache can't access hive anymore.  Hive 
commands work fine on the command line as a real user, but not through the 
web as apache.  Curiously, if I run hive as sudo -u apache, it works.  It 
just doesn't work from php.  Remember, I had this all working a few weeks ago, 
I know the basic approach works.  It just that some aspect of the CDH4 upgrade 
has changed something, and I'm not sure what it could be.  The metastore_db 
directory's permissions are good to go (apache has read/write access to the 
directory, the derby.log, etc.).

I realize the following output is a bit eclectic, but if anyone has any 
thoughts on the error I would love some help.

Thanks.

Result: FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask
Status: 1
Output: Array
Output line: Logging initialized using configuration in 
file:/etc/hive/conf.dist/hive-log4j.properties
Output line: Hive history 
file=/tmp/apache/hive_job_log_apache_201211121323_696011741.txt
Output line: 2012-11-12 21:23:28.806 GMT Thread[main,5,main] 
java.io.FileNotFoundException: derby.log (Permission denied)
Output line: 
Output line: 2012-11-12 21:23:29.075 GMT:
Output line: Booting Derby version The Apache Software Foundation - Apache 
Derby - 10.4.2.0 - (689064): instance a816c00e-013a-f681-6fa8-025818d8
Output line: on database directory /var/lib/hive/metastore/metastore_db in READ 
ONLY mode
Output line:
Output line: Database Class Loader started - derby.database.classpath=''
Output line: FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: 
Cannot get a connection, pool error Could not create a validated object, cause: 
A read-only user or a user in a read-only database is not permitted to disable 
read-only mode on a connection.
Output line: NestedThrowables:
Output line: org.apache.commons.dbcp.SQLNestedException: Cannot get a 
connection, pool error Could not create a validated object, cause: A read-only 
user or a user in a read-only database is not permitted to disable read-only 
mode on a connection.
Output line: FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask
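If it helps anyone else hitting this, the two errors above both point at file permissions rather than Hive itself (a guess; paths are illustrative): derby.log gets created in whatever working directory the hive process is started from, and the embedded metastore directory has to be writable by the apache user or Derby falls back to read-only mode.

# run as root; adjust paths to the actual installation
chown -R apache /var/lib/hive/metastore/metastore_db
# and make sure the directory php/apache launches hive from is one apache can
# write to, so derby.log can be created there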


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow.
   --  Keith Wiley




Error in semantic analysis: Unable to fetch table

2012-10-18 Thread Keith Wiley
If I run a select command on the command line with -e it works.  If I run the 
same command inside hive (without -e) it still works...but if I exec it with a 
-e from php (so obviously the apache user), I get Error in semantic analysis: 
Unable to fetch table [tablename].  Bear in mind that this overall system 
worked perfectly for weeks.  I haven't modified the code or anything.  What did 
change was I had to reinstall hadoop and hive and I'm sure some configuration 
parameter is not correct now.  I found some Google references to this error but 
no clear-cut solutions.

What does this error mean and how do I fix it?

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also.
   --  Mark Twain




Re: Error in semantic analysis: Unable to fetch table

2012-10-18 Thread Keith Wiley
Followup: If I run show tables then while I get a list from the command line, 
I get nothing from php.  Furthermore, if I attempt to select from a bogus 
tablename on the command line, I get that same error, semantic error and 
whatnot...so it looks like the real problem is that when apache/php runs hive, 
it doesn't see the table list.

Like I said, this used to work...so I'm sure I need to get something configured 
properly, I just don't remember how.  I do see that I am defining HIVE_HOME in 
my .bashrc...Do I need to do something like this for the apache account or as a 
HIVE parameter?
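A .bashrc only affects interactive shells, so nothing set there is visible to hive when php execs it. One way around that is to make everything explicit in the command string passed to exec() (a sketch; the paths are illustrative guesses, not a known-good configuration):

env HIVE_HOME=/usr/lib/hive HADOOP_HOME=/usr/lib/hadoop /usr/lib/hive/bin/hive -e "show tables" 2>&1

It is also worth checking javax.jdo.option.ConnectionURL in hive-site.xml: the embedded Derby default is a metastore_db path relative to the current working directory, so a hive launched from a different directory sees a brand-new, empty metastore, which would explain an empty table list.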


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
   --  Keith Wiley




Increase max tablename length?

2012-08-08 Thread Keith Wiley
The default seems to be 128.  Can it be increased?  I haven't found a 
configuration parameter for that yet.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

Luminous beings are we, not this crude matter.
   --  Yoda




drop table: no err on nonexistent table?

2012-08-06 Thread Keith Wiley
I'm wrapping hive in a web tool and would like to do some basic error-checking. 
 If an attempt is made to drop a table that doesn't exist, I would like to show 
an error message.  The problem is, hive doesn't seem to produce any sort of 
error when dropping a table that doesn't exist.  Furthermore, if hive is run 
on the command line (-e) it doesn't return an error status.  Therefore, I think 
the only way to detect such an error is to actually inspect the table list 
(show tables) and manually see if the drop attempt will find a corresponding 
table.

Can I get an error of some sort on this or is it not possible?

Thank you.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use.
   --  Galileo Galilei




Re: drop table: no err on nonexistent table?

2012-08-06 Thread Keith Wiley
Oye, got it.  Sorry.

RTFM: hive.exec.drop.ignorenonexistent
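That is (a minimal sketch; the table name is made up):

set hive.exec.drop.ignorenonexistent=false;
drop table no_such_table;     -- now fails with an error instead of returning silently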

On Aug 6, 2012, at 11:06 , Keith Wiley wrote:

 I'm wrapping hive in a web tool and would like to do some basic 
 error-checking.  If an attempt is made to drop a table that doesn't exist, I 
 would like to show an error message.  The problem is, hive doesn't seem to 
 produce any sort of error when dropping a table that doesn't exist.  
 Furthermore, if hive is run on the command line (-e) it doesn't return an 
 error status.  Therefore, I think the only way to detect such an error is to 
 actually inspect the table list (show tables) and manually see if the drop 
 attempt will find a corresponding table.
 
 Can I get an error of some sort on this or is it not possible?
 
 Thank you.
 
 
 Keith Wiley kwi...@keithwiley.com keithwiley.com
 music.keithwiley.com
 
 I do not feel obliged to believe that the same God who has endowed us with
 sense, reason, and intellect has intended us to forgo their use.
   --  Galileo Galilei
 
 



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
   --  Homer Simpson




HQL vs. HiveQL

2012-07-03 Thread Keith Wiley
I have found references to both HQL and HiveQL (as abbreviations for Hive Query 
Language) as the name of the SQL-like language Hive uses.  Is there a formal 
and definitively correct name in this regard or are both considered equally 
appropriate?  I'm writing up a white paper and want to use the right term.

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
   --  Homer Simpson




Re: LIKE Statement

2012-05-14 Thread Keith Wiley
Thanks for the followup.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson




Re: Hive performance vs. SQL?

2012-03-20 Thread Keith Wiley
Thanks for the response.

Cheers!

On Mar 19, 2012, at 16:42 , Maxime Brugidou wrote:

 From my experience, if you can fit data in a SQL without sharding or 
 anything, don't ever think twice. Hive is not even comparable.



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
   --  Keith Wiley




Hive performance vs. SQL?

2012-03-19 Thread Keith Wiley
I haven't had an opportunity to set up a huge Hive database yet because 
exporting csv files from our SQL database is, in itself, a rather laborious 
task.  I was just curious how I might expect Hive to perform vs. SQL on large 
databases and large queries?  I realize Hive is pretty latent since it builds 
and runs MapReduce jobs for even the simplest queries, but that is precisely 
why I think it might perform better on long queries against large (external 
CSV) databases.

Would you expect Hive to ever outperform SQL on a single machine (standalone or 
pseudo-distributed mode)?  I am entirely open to the possibility that the 
answer is no, that Hive could never compete with SQL in a single machine.  Is 
this true?

If so, how large (how parallel) do you think the underlying Hadoop cluster 
needs to be before Hive overtakes SQL?  2X?  10X?  Where is the crossover point 
where Hive actually outperforms SQL?

Along similar lines, might Hive never outperform SQL on a database small enough 
for SQL to run on a single machine, say 10s to 100s of GBs?  Must the database 
itself be so large that SQL is effectively crippled and the data must be 
distributed before Hive offer significant gains?

I am really just trying to get a basic feel for how I might anticipate Hive's 
behavior vs. SQL once I get a large system up and running.

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson




Re: order by date

2012-03-13 Thread Keith Wiley
I see, you store the date-time as a lexicographically sortable string.  That's 
fine, but I'm operating on existing csv tables.  I guess I could whip up a 
hadoop job to convert all the date-time columns to lexicographic strings and 
then wrap hive around the resulting converted tables.  I was just wondering if 
there was a more direct approach, whether I could apply hive to the original 
csv tables.

*Would* the date functions work in the way I suggested in my first post or do 
you think a lexicographic conversion is practically required to perform 
date-ordering on query results?

On Mar 13, 2012, at 09:49 , Tucker, Matt wrote:

 Hi Keith,
 
 We generally store date columns as a string in a similar format to ISO 8601 
 (yyyy-mm-dd hh:MM:ss).  This way, when we put the date column in the ORDER BY 
 clause, it will be sorted chronologically.  It also saves us the trouble of 
 whipping out a unix timestamp calculator to figure out what we're looking at.
 
 There is supposed to be a TIMESTAMP data type in Hive 0.8, but I haven't 
 found any documentation on it yet.
 
 Matt Tucker
 
 -Original Message-
 From: Keith Wiley [mailto:kwi...@keithwiley.com] 
 Sent: Tuesday, March 13, 2012 12:45 PM
 To: user@hive.apache.org
 Subject: order by date
 
 I realize that hive doesn't have a date type for the columns and I realize 
 that hive *does* have various date functions.  I just haven't found a 
 concrete example of how these two issues are brought together.  Ordering the 
 results of a sql query is done by adding order by dateColName to the query. 
  Are the date functions supposed to be used in the form order by 
 UNIX_TIMESTAMP(dateColName, 'yyyy-MM-dd')?  Does the function just go right 
 into the order by clause like that or is that totally wrong?
 
 How does one order query results by a date column?  Am I on the right track?
 
 Thanks.
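For what it's worth, unix_timestamp(col, pattern) returns a BIGINT, so ordering on it does sort chronologically without rewriting the data. If a bare expression in the ORDER BY is rejected by your Hive version, projecting it under an alias sidesteps that (a sketch; table, column and format are illustrative):

select dateColName, unix_timestamp(dateColName, 'yyyy-MM-dd') as dt_epoch
from mytable
order by dt_epoch;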



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy.
   --  Edwin A. Abbott, Flatland




How to import extremely wide csv tables

2012-03-13 Thread Keith Wiley
Wrapping hive around existing csv files consists of manually naming and typing 
every column during the creation command.  I have several csv tables and some 
of them have a ton of columns.  I would love a way to create hive tables which 
automatically infers the column types by attempting various type conversions or 
regex matches on the data (say the first row).  What would be even cooler is if 
the first row could actually be interpreted differently from the rest of the 
table...as a set of string labels to name the columns while the types could be 
automatically inferred from, say, the *second* row.  These csv files are 
currently of this format, with the first row naming the columns.

Does this make sense?

Now, I'm sure that hive doesn't support this yet -- and I admit it is a 
somewhat esoteric desire on my part -- but I'm curious how others would suggest 
approaching it?  I'm thinking of writing a separate isolated program that reads 
the first two rows of a csv file and dumps a text string of column names and 
types in the correct syntax for a hive external table creation statement which 
I would then copy/paste into hive...I was just hoping for a simpler solution.

Thoughts?

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley




order by having no effect?!

2012-03-13 Thread Keith Wiley
Um, this is weird.  It simply isn't modifying the order of the returned rows at 
all.  I get the same result with no 'order by' clause as with one.  Adding a 
limit or specifying 'asc' has no effect.  Using 'sort by' also has no effect.  
The column used for ordering is type INT.  In the example below, I was hoping 
to sort or order the results according to the third column.  Like I said, I 
also tried adding 'limit 10' and/or 'asc' to the end of the query; they had no 
effect.  The jobtracker shows a single mapper and a single reducer being used 
for the job incidentally, so it can't be some sort of multi-reducer sort 
discrepancy.

hive> describe stringmap;
OK
objecttypecode  int 
attributename   string  
attributevalue  int 
langid  int 
value   string  
displayorderint 
Time taken: 0.074 seconds
hive>


hive> select * from stringmap where attributename='foo' order by 
'attributevalue'; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=number
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=number
In order to set a constant number of reducers:
  set mapred.reduce.tasks=number
Starting Job = job_201202221500_0114, Tracking URL = 
http://localhost:50030/jobdetails.jsp?jobid=job_201202221500_0114
Kill Command = /media/sdb1/kwiley/hadoop/hadoop-0.20.2-cdh3u3/bin/hadoop job  
-Dmapred.job.tracker=localhost:9001 -kill job_201202221500_0114
2012-03-13 14:45:28,418 Stage-1 map = 0%,  reduce = 0%
2012-03-13 14:45:31,428 Stage-1 map = 100%,  reduce = 0%
2012-03-13 14:45:39,459 Stage-1 map = 100%,  reduce = 33%
2012-03-13 14:45:40,463 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202221500_0114
OK
3   foo 6   1033    6 - aaa 6
3   foo 3   1033    3 - bbb 3
3   foo 4   1033    4 - ccc 4
4   foo 1   1033    Default Value   1
3   foo 2   1033    2 - ddd 2
3   foo 1   1033    1 - eee 1
3   foo 5   1033    5 - fff 5
Time taken: 17.954 seconds
hive>


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

Luminous beings are we, not this crude matter.
   --  Yoda




Re: order by having no effect?!

2012-03-13 Thread Keith Wiley

On Mar 13, 2012, at 13:57 , Igor Tatarinov wrote:

 You have attributevalue in quotes which makes it a constant literal.
 
 igor
 decide.com

Argh!  You are correct good sir!

thanks
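For anyone skimming: the corrected statement, with the quotes dropped so the ORDER BY refers to the column rather than a string constant:

hive> select * from stringmap where attributename='foo' order by attributevalue;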


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
   --  Homer Simpson




csv boolean type

2012-03-13 Thread Keith Wiley
What string values in a csv field are parsable by Hive as booleans?  If I 
indicate that a column is of type boolean when wrapping an external table 
around a csv file, what are the legal values?  I can imagine numerous 
possibilities, for example (for the true values):
0
t
T
true
True
TRUE
y
Y
yes
Yes
YES

etc.
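Short of reading the SerDe source, one empirical way to answer this is to wrap a tiny table around a file holding the candidate spellings and see which survive; anything the text SerDe cannot parse as BOOLEAN typically comes back as NULL (a sketch; path and contents are illustrative):

-- /tmp/bool_probe/probe.csv contains lines like:  true,true  T,T  yes,yes  1,1
CREATE EXTERNAL TABLE bool_probe (raw STRING, asbool BOOLEAN)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/bool_probe';

select raw, asbool from bool_probe;   -- NULL in the second column means that spelling is not recognized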


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson




Re: csv boolean type

2012-03-13 Thread Keith Wiley
I obviously intended '1', not '0' as an example of a true value.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also.
   --  Mark Twain




Re: Basic statement problems

2012-03-12 Thread Keith Wiley
On Mar 9, 2012, at 16:46 , Steven Wong wrote:

 The LOCATION clause has to specify the directory that contains (only) your 
 data files.

I've tried it both ways:

CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, 
AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/crm/records/all/hiveTest';

CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, 
AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/crm/records/all/hiveTest/StringMap.csv';

In both cases, show tables lists stringmap and describe stringmap describes 
the columns shown above, but a basic query doesn't return any results.

What else should I try here?


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
   --  Abe (Grandpa) Simpson




Re: Basic statement problems

2012-03-12 Thread Keith Wiley
It has started working now.  I don't know what I changed.  I dropped every 
single table from hive, explicitly created a new directory on HDFS and moved 
the .csv file to that directory, ran hive again and created the table.  This 
time it worked.  I can perform queries against the directory.  Maybe hadoop and 
hive confused each other about the other directory (it got corrupted or 
something)...or maybe I screwed something up, I dunno.

I would have expected better error-detection.  Instead of simply returning 
0-length queries, it would be nice if hive would actually produce an error 
message if I create the table in an invalid or incorrect fashion...but perhaps 
it couldn't tell; maybe the database just looked empty at hive's level of 
abstraction.

I mean, even if I did screw something up (an option I am entirely open to), I 
never really got an error about it.  Hive gladly wrapped a hive table around 
the directory, and the csv file in question, without an error of any kind.  
Hive could see the table, it listed and could describe it, but would then 
return empty queries against the table.  When I moved the exact same csv file 
to a brand new HDFS directory and tried again from scratch, everything started 
working.  I probably did something wrong, but some sort of error message would 
have been very helpful.

Anyway, these tools are still pretty young.  I understand that they will 
continue to evolve.  The ability to detect and report errors will almost 
certainly improve with time.


Thanks for the concerted efforts to help.

Cheers!


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
   --  Keith Wiley




External table point of clarification

2012-03-12 Thread Keith Wiley
It appears as if the syntax for creating an external table takes a column 
description and a directory where the table (a .csv file) should be found.  
Notably, it doesn't take the path to the file, rather it takes the path to the 
directory containing the file.

...which means that if I have multiple .csv files, each a unique table with a 
unique column description, it is not sufficient to put them in one hdfs 
directory, since the columns must be described when creating the hive table.  
Rather, each .csv file conforming to a unique column description must be placed 
in its own separate hdfs directory, right?


Just clarifying...is this correct?  Can I not put a set of .csv files 
comprising multiple tables of a heterogenous database in a single directory and 
then have hive build a set of tables around those files?

Thank you.
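That is indeed the usual arrangement: one directory per table, with the CREATE statement pointing at the directory, since every file under the LOCATION is treated as data for that table (a sketch; paths and the trimmed schema are illustrative):

-- /crm/tables/stringmap/StringMap.csv
-- /crm/tables/somethingelse/SomethingElse.csv
CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, Value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/crm/tables/stringmap';

Mixing heterogeneous csv files in one directory would make them all part of whichever single table points there.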


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow.
   --  Keith Wiley




Re: Error

2012-03-10 Thread Keith Wiley
Thanks, I'll give it a shot on Monday.

On Mar 10, 2012, at 01:12 , Jagat wrote:

 Dear Keith
 
 Please delete $HADOOP_HOME/build , in your hadoop home the build directory
 
 And try again
 
 Thanks
 
 Jagat 



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also.
   --  Mark Twain




Re: Error

2012-03-09 Thread Keith Wiley
Considering that I don't even know what the metastore is, I doubt I did 
anything specifically aside from the instructions.  All I did was follow the 
startup guide.  Nothing more.  More to the point, when I followed the same 
instructions on a Linux box, it worked.  Something is admittedly wrong with my 
Mac setup, but I don't think there is any third-party metastore of some sort.  
I certainly didn't install or create anything like that.  I'm not even sure 
what it is.

Hmmm, how do I resetup or initialize the metastore from scratch and try again?

On Mar 9, 2012, at 13:21 , Mark Grover wrote:

 Hi Keith,
 This error is typically encountered when your metastore (that stores the 
 metadata related to Hive tables e.g. name of columns, name of tables, etc. 
 etc.) is not correctly set up.
 
 Have you set up an external metastore? By default, Hive should use an 
 embedded Derby metastore which only allows one connection at a time.
 
 Mark
 
 Mark Grover, Business Intelligence Analyst
 OANDA Corporation 
 
 www: oanda.com www: fxtrade.com 
 
 Best Trading Platform - World Finance's Forex Awards 2009. 
 The One to Watch - Treasury Today's Adam Smith Awards 2009. 
 
 
 - Original Message -
 From: Keith Wiley kwi...@keithwiley.com
 To: user@hive.apache.org
 Sent: Friday, March 9, 2012 3:02:40 PM
 Subject: Error
 
 I'm quite comfortable with hadoop and the associated lingo, having programmed it 
 via Java and via C++ streams for several years.  However, I have just started 
 Hive for the first time...and I'm stuck.  I was following the getting 
 started page on the apache site.  I got as far as this:
 
 hive> CREATE TABLE pokes (foo INT, bar STRING);
 FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected 
 exception caught.
 NestedThrowables:
 java.lang.reflect.InvocationTargetException
 FAILED: Execution Error, return code 1 from 
 org.apache.hadoop.hive.ql.exec.DDLTask
 
 If I look at the logs, I see the following.  I'm not sure how to proceed from 
 here.  Any help is appreciated.
 
 Thanks.
 
 $ cat /tmp/keithw/hive.log
 2012-03-09 12:02:22,087 ERROR exec.DDLTask 
 (SessionState.java:printError(365)) - FAILED: Error in metadata: 
 javax.jdo.JDOFatalInternalException: Unexpected exception caught.
 NestedThrowables:
 java.lang.reflect.InvocationTargetException
 org.apache.hadoop.hive.ql.metadata.HiveException: 
 javax.jdo.JDOFatalInternalException: Unexpected exception caught.
 NestedThrowables:
 java.lang.reflect.InvocationTargetException
   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:476)
   at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3176)
   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:213)
   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131)
   at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
 Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught.
 NestedThrowables:
 java.lang.reflect.InvocationTargetException
   at 
 javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1186)
   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803)
   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698)
   at 
 org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:234)
   at 
 org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:261)
   at 
 org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:196)
   at 
 org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:171)
   at 
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
   at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
   at 
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:354)
   at 
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.executeWithRetry(HiveMetaStore.java:306)
   at 
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:451

Basic statement problems

2012-03-09 Thread Keith Wiley
I successfully installed and used Hive to create basic tables (on one of my two 
machines; another discussion describes the problems I'm having with the other 
machine).  However, basic queries aren't working.  I brought a typical CSV file 
into a Hive table and it seemed fine.  Here's how I did it:

CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, 
AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/crm/records/all/hiveTest/StringMap.csv';

show tables and describe stringmap return correct results.  However, if I 
run a really simple query, it returns incorrect results.  For example, a row 
count query returns a 0.  Observe:

hive> select count(*) from stringmap;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=number
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=number
In order to set a constant number of reducers:
  set mapred.reduce.tasks=number
Starting Job = job_201202221500_0103, Tracking URL = 
http://localhost:50030/jobdetails.jsp?jobid=job_201202221500_0103
Kill Command = /media/sdb1/kwiley/hadoop/hadoop-0.20.2-cdh3u3/bin/hadoop job  
-Dmapred.job.tracker=localhost:9001 -kill job_201202221500_0103
2012-03-09 16:20:26,243 Stage-1 map = 0%,  reduce = 0%
2012-03-09 16:20:29,258 Stage-1 map = 0%,  reduce = 100%
2012-03-09 16:20:32,278 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202221500_0103
OK
0
Time taken: 15.969 seconds

The Hadoop job runs without error, but it returns a 0.  The job tracker 
indicates that Hive created a job with 0 mappers and 1 reducer.  I don't see any 
useful output in the reducer task log however.  The following is from the 
hive log:

2012-03-09 16:20:17,057 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it 
cannot be resolved.
2012-03-09 16:20:17,057 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it 
cannot be resolved.
2012-03-09 16:20:17,059 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it 
cannot be resolved.
2012-03-09 16:20:17,059 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it 
cannot be resolved.
2012-03-09 16:20:17,060 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be 
resolved.
2012-03-09 16:20:17,060 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) 
- Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be 
resolved.
2012-03-09 16:20:22,781 WARN  mapred.JobClient 
(JobClient.java:copyAndConfigureFiles(649)) - Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.
2012-03-09 16:20:22,946 WARN  snappy.LoadSnappy (LoadSnappy.java:clinit(36)) 
- Snappy native library is available

I admit, that does look rather erroneous, but I'm not sure what to make of it.  
I looked those errors up online but didn't find much that seemed to suggest a 
cause or solution.

Any ideas?
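Two quick checks that narrow this kind of failure down (a sketch, reusing the paths from above): confirm the file is actually visible where the table points, and run a query that should be served by a simple fetch rather than MapReduce:

hive> dfs -ls /crm/records/all/hiveTest;
hive> select * from stringmap limit 5;

If the second query comes back empty too, the problem is the table location or delimiter, not the MapReduce side.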


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched.
   --  Keith Wiley




0.7.1 vs. 0.8.1

2012-03-09 Thread Keith Wiley
The only version feature list I have found so far is the JIRA logs, which are a 
little difficult to make sense of since they get pretty detailed.  I'm curious 
from a straight-forward high-level sales-pitch perspective what advantages 
0.8.1 offers over 0.7.1.  I ask because I generally use CDH3 and it provides 
0.7.1.  I'm curious if I should use a more recent version of Hive with the 
version of Hadoop in CDH3 (0.20.2).

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy.
   --  Edwin A. Abbott, Flatland




Re: Basic statement problems

2012-03-09 Thread Keith Wiley
So a directory, not a specific file.  I thought I tried it both ways, but I'll 
switch it back the other way and try again.

Thanks.
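Concretely, the least-surprise setup is to give the file a directory of its own and point LOCATION at that (a sketch, reusing the paths from the earlier messages; drop the old table definition first if it already exists):

hive> dfs -mkdir /crm/records/all/hiveTest/stringmap;
hive> dfs -mv /crm/records/all/hiveTest/StringMap.csv /crm/records/all/hiveTest/stringmap/;

CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING,
    AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/crm/records/all/hiveTest/stringmap';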

On Mar 9, 2012, at 16:46 , Steven Wong wrote:

 The LOCATION clause has to specify the directory that contains (only) your 
 data files.



Keith Wiley kwi...@keithwiley.com keithwiley.com  music.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
   --  Homer Simpson