Re: Can hive 0.8.1 work with hadoop 0.23.0?

2012-02-21 Thread Carl Steinbach
Hi Xiaofeng,

Which mode are you running Hadoop in, e.g. local, pseudo-distributed, or
distributed?

Thanks.

Carl

2012/2/1 张晓峰 zhangxiaofe...@q.com.cn

 Hi,

 I installed hadoop 0.23.0 and it is working.

 The version of my hive is 0.8.1. A query like ‘select * from tablename’
 works, but an exception is thrown when executing a query like ‘select col1
 from tablename’:

 2012-02-01 16:32:20,296 WARN  mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(139)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
 2012-02-01 16:32:20,389 INFO  mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(388)) - Cleaning up the staging area file:/tmp/hadoop-hadoop/mapred/staging/hadoop-469936305/.staging/job_local_0001
 2012-02-01 16:32:20,392 ERROR exec.ExecDriver (SessionState.java:printError(380)) - Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /home/hadoop/hive-0.8.1/lib/hive-builtins-0.8.1.jar)'
 java.io.FileNotFoundException: File does not exist: /home/hadoop/hive-0.8.1/lib/hive-builtins-0.8.1.jar
         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:764)
         at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:208)
         at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:71)
         at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:246)
         at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:284)
         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:355)
         at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1159)
         at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1156)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:396)
         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1156)
         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:571)
         at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:452)
         at org.apache.hadoop.hive.ql.exec.ExecDriver.main(ExecDriver.java:710)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:597)
         at org.apache.hadoop.util.RunJar.main(RunJar.java:189)

 Thanks,

 xiaofeng



RE: Can hive 0.8.1 work with hadoop 0.23.0?

2012-02-21 Thread hezhiqiang (Ransom)
Hi Xiaofeng,

Back up “hive_exec.jar” in every hadoop directory, then delete “hive_exec.jar” and try 
it again.
The reason: “select *” just reads from HDFS, while “select col1” will use MapReduce.

Best regards
Ransom.





SerDe and InputFormat

2012-02-21 Thread Koert Kuipers
I make changes to the Configuration in my SerDe, expecting those changes to be
passed on to the InputFormat (and OutputFormat). Yet the InputFormat seems to
get an unchanged JobConf. Is this a known limitation?

I find it very confusing, since the Configuration is the main way to
communicate with the MapReduce process... So I assume I must be doing
something wrong and that this is in fact possible.

Thanks for your help. Koert
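
[Editor's note: a minimal, hedged sketch of the handoff described above. It does not use any
Hive API; the property name is hypothetical. It only demonstrates that a value set on one
Configuration object (as a SerDe might do in initialize()) is not visible on a separately
constructed JobConf (as an InputFormat might receive) unless something explicitly copies it
across.]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch only: "my.serde.flag" is a hypothetical property, not a Hive setting.
    public class ConfHandoffSketch {
        public static void main(String[] args) {
            // What a SerDe might do with the Configuration passed to initialize():
            Configuration serdeConf = new Configuration();
            serdeConf.set("my.serde.flag", "enabled");

            // What an InputFormat may later receive: a JobConf built from the
            // original job configuration, not from the SerDe's modified copy.
            JobConf inputFormatConf = new JobConf();

            System.out.println(serdeConf.get("my.serde.flag"));        // prints: enabled
            System.out.println(inputFormatConf.get("my.serde.flag"));  // prints: null
        }
    }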


Re: 2 questions about SerDe

2012-02-21 Thread Roberto Congiu
Have a look at the code for the LazySerDes. When you deserialize in the
SerDe, you don't actually have to deserialize all the columns. deserialize()
can return an object that is not actually deserialized yet, and you can write
an ObjectInspector that deserializes a field from that structure only when
it's needed (that is, when the ObjectInspector is called).

R.
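
[Editor's note: a toy sketch of the lazy-field idea described above. It is not Hive's actual
Lazy* classes and does not implement the ObjectInspector API; the class name and the
comma-split format are invented for illustration. The point is only that the row keeps the
raw record and decodes a column the first time that column is asked for.]

    import java.util.Arrays;

    // Toy illustration only; the per-column decode is stubbed as trim(), but in a
    // real SerDe it would be whatever per-field conversion is actually expensive.
    public class LazyRowSketch {
        private final String raw;        // serialized record, kept untouched
        private String[] split;          // structural split, done at most once
        private final Object[] decoded;  // per-column cache of decoded values

        public LazyRowSketch(String raw, int numColumns) {
            this.raw = raw;
            this.decoded = new Object[numColumns];
        }

        // Decode column i only on first access; untouched columns cost nothing.
        public Object getField(int i) {
            if (decoded[i] == null) {
                if (split == null) {
                    split = raw.split(",", -1);
                }
                decoded[i] = split[i].trim();
            }
            return decoded[i];
        }

        public static void main(String[] args) {
            LazyRowSketch row = new LazyRowSketch("1, foo , 3.14", 3);
            System.out.println(row.getField(1));               // only column 1 is decoded
            System.out.println(Arrays.toString(row.decoded));  // [null, foo, null]
        }
    }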

On Tue, Feb 21, 2012 at 7:37 AM, Koert Kuipers ko...@tresata.com wrote:

 1) Is there a way in initialize() of a SerDe to know if it is being used
 as a Serializer or a Deserializer? If not, can I define the Serializer and
 Deserializer separately instead of defining a SerDe (so I have two
 initialize methods)?

 2) Is there a way to find out which columns are being used? Say someone
 does select a,b,c from test, and my SerDe gets initialized for use in
 that query; how can I know that only a, b, c are needed? I would like to
 take advantage of this information so I don't deserialize unnecessary
 information, without having to resort to more complex lazy deserialization
 tactics.



Re: help with compression and index

2012-02-21 Thread Bejoy Ks
Hi Hamilton
    When you are doing the indexing (generating the index files), is compression enabled? 
If so, you are running into this known issue:
https://issues.apache.org/jira/browse/HIVE-2331

It is fixed in Hive 0.8. An upgrade should get it rolling for you and is 
recommended.

Regards
Bejoy.K.S





 From: Hamilton, Robert (Austin) robert.hamil...@hp.com
To: user@hive.apache.org user@hive.apache.org 
Sent: Tuesday, February 21, 2012 8:48 PM
Subject: help with compression and index
 
Hi all. I sent this to common-user@hadoop hoping there was an easy answer but 
got no response.

I have a couple of users who basically have no use case other than the need to 
extract specific rows based on some predetermined set of keys, so I would like 
to be able to just provide them with an index and show them how to join to the 
detail table using the index.  So I'm looking for a reliable compression+index 
method with Hive.  To get an idea of the data size: my files add up to about 
80TB uncompressed but are currently gzipped down to only 10 TB. I need to keep it 
small(ish) until I can get more disk space, so it has to stay compressed. 

I don't mind recompressing to LZO or bzip but need to prove that it would 
actually work first :)

I've done my testing on LZO and uncompressed test samples. If I use 
uncompressed files the indexed select works OK. If I use LZO it returns only a 
fraction of the rows I expect.  I gather that files compressed with other 
compression methods cannot be indexed at all with Hive 0.7.1?

I'm following the prescription to select buckets/offsets into a temporary file, 
set hive.index.compact.file to the temp file, set hive.input.format to 
HiveCompactIndexInputFormat and run my select.  That doesn't let me do 
subselects but I don't mind as it is only a very limited use case that I need 
to support.

This is the only method I could find documented on the net.  Is there a better 
way to do this? I don't mind upgrading Hive (currently on 0.7.1) or Hadoop 
(currently 0.20.2).

Custom SerDe -- tracking down stack trace

2012-02-21 Thread Evan Pollan
I have a custom SerDe that's initializing properly and works on one data set.  
I built it to adapt to a couple of different data formats, though, and it's 
choking on a different data set (different partitions in the same table).

A null pointer exception is being thrown on deserialize, which is being wrapped 
by an IOException somewhere up the stack.  The exception is showing up in the 
hive output (Failed with exception 
java.io.IOException:java.lang.NullPointerException), but I can't find the 
stack trace in any logs.

It's worth noting that I'm running hive via the cli on a machine external to 
the cluster, and the query doesn't get far enough to create any M/R tasks.  I 
looked in all log files in /var/log on the hive client machine, and in all 
userlogs on each cluster instance.  I also looked in derby.log (I'm using the 
embedded metastore) and in /var/lib/hive/metastore on the hive client machine.

I'm sure I'm missing something obvious…  Any ideas?


Re: help with compression and index

2012-02-21 Thread Mark Grover
Hi Robert,
As per https://issues.apache.org/jira/browse/HIVE-1644, Hive 0.8 introduces 
automatic use of indexes. That might come in handy too!

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgro...@oanda.com 

Best Trading Platform - World Finance's Forex Awards 2009. 
The One to Watch - Treasury Today's Adam Smith Awards 2009. 



Re: Custom SerDe -- tracking down stack trace

2012-02-21 Thread Evan Pollan
One more data point:  I can read data from this partition as long as I don't 
reference the partition explicitly…

E.g., my partition column is ArrivalDate, and I have several different 
partitions:  2012-02-01…, and a partition with my test data with 
ArrivalDate=test.

This works:  'select * from table where some constraint such that I only get 
results from the test partition'.

And this works:  'select * from table where ArrivalDate=2012-02-01'

But, this fails:  'select * from table where ArrivalDate=test'

Does this make sense to anybody?


