Re: Can hive 0.8.1 work with hadoop 0.23.0?
Hi Xiaofeng,

Which mode are you running Hadoop in, e.g. local, pseudo-distributed, or distributed?

Thanks.

Carl

2012/2/1 张晓峰 zhangxiaofe...@q.com.cn

Hi,

I installed Hadoop 0.23.0 and it works. My Hive version is 0.8.1. A query like 'select * from tablename' works, but an exception is thrown when executing a query like 'select col1 from tablename':

2012-02-01 16:32:20,296 WARN mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(139)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2012-02-01 16:32:20,389 INFO mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(388)) - Cleaning up the staging area file:/tmp/hadoop-hadoop/mapred/staging/hadoop-469936305/.staging/job_local_0001
2012-02-01 16:32:20,392 ERROR exec.ExecDriver (SessionState.java:printError(380)) - Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: /home/hadoop/hive-0.8.1/lib/hive-builtins-0.8.1.jar)'
java.io.FileNotFoundException: File does not exist: /home/hadoop/hive-0.8.1/lib/hive-builtins-0.8.1.jar
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:764)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:208)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:71)
    at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:246)
    at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:284)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:355)
    at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1159)
    at org.apache.hadoop.mapreduce.Job$2.run(Job.java:1156)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1156)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:571)
    at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:452)
    at org.apache.hadoop.hive.ql.exec.ExecDriver.main(ExecDriver.java:710)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:189)

Thanks,
xiaofeng
RE: Can hive 0.8.1 work with hadoop 0.23.0?
Hi Xiaofeng,

Back up "hive_exec.jar" in every Hadoop directory, then delete "hive_exec.jar" and try again. A 'select *' query just reads from HDFS, while a query like 'select col1' launches a MapReduce job.

Best regards
Ransom.

From: Carl Steinbach [mailto:c...@cloudera.com]
Sent: Tuesday, February 21, 2012 4:45 PM
To: user@hive.apache.org
Subject: Re: Can hive 0.8.1 work with hadoop 0.23.0?

Hi Xiaofeng,

Which mode are you running Hadoop in, e.g. local, pseudo-distributed, or distributed?

Thanks.

Carl

2012/2/1 张晓峰 zhangxiaofe...@q.com.cn

[original message and stack trace quoted in full; identical to the first post above]
SerDe and InputFormat
I make changes to the Configuration in my SerDe, expecting those changes to be passed on to the InputFormat (and OutputFormat). Yet the InputFormat seems to get an unchanged JobConf. Is this a known limitation? I find it very confusing, since the Configuration is the main way to communicate with the MapReduce process... so I assume I must be doing something wrong and that this is possible.

Thanks for your help.

Koert
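For reference, a minimal sketch of the pattern being described, assuming a SerDe built on Hive's org.apache.hadoop.hive.serde2.SerDe interface. The class name and the property keys ("my.serde.option", "my.table.property") are made up for illustration; only the initialize() signature comes from the Hive API:

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.hive.serde2.SerDeException;

// Sketch of a custom SerDe that tries to hand information to the
// InputFormat/OutputFormat through the Configuration it receives.
public abstract class ConfSettingSerDe implements SerDe {
  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // "my.serde.option" is a hypothetical key. The SerDe mutates the
    // Configuration it was handed, but -- as the question above observes --
    // the InputFormat may be given a JobConf copied before or independently
    // of this call, in which case the value never reaches it.
    conf.set("my.serde.option", tbl.getProperty("my.table.property", "default"));
  }
}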
Re: 2 questions about SerDe
Have a look at the code for the lazy SerDes. When you deserialize in the SerDe, you don't actually have to deserialize all the columns. deserialize() can return an object that is not actually deserialized yet, and you can write an ObjectInspector that deserializes a field from that structure only when it's needed (that is, when the ObjectInspector is called).

R.

On Tue, Feb 21, 2012 at 7:37 AM, Koert Kuipers ko...@tresata.com wrote:

1) Is there a way in initialize() of a SerDe to know whether it is being used as a Serializer or a Deserializer? If not, can I define the Serializer and Deserializer separately instead of defining a SerDe (so I have two initialize methods)?

2) Is there a way to find out which columns are being used? Say someone does 'select a,b,c from test' and my SerDe gets initialized for use in that query; how can I know that only a, b, and c are needed? I would like to take advantage of this information so I don't deserialize unnecessary information, without having to resort to more complex lazy deserialization tactics.
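To make the lazy approach concrete, here is a minimal sketch (not taken from the Hive sources; LazyRow, LazyRowInspector, and the comma-delimited record format are assumptions for illustration). deserialize() would return the LazyRow holder, and the inspector splits the record only when a field is actually requested:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// What deserialize() returns: a thin holder around the raw record.
// Nothing is parsed until a field is asked for.
class LazyRow {
  final String raw;        // the unparsed record
  private String[] fields; // populated on first access
  LazyRow(String raw) { this.raw = raw; }
  Object field(int i) {
    if (fields == null) fields = raw.split(",", -1); // parse once, on demand
    return fields[i];
  }
}

// A struct inspector over LazyRow: splitting happens in getStructFieldData,
// which Hive calls only for the columns a query actually references.
class LazyRowInspector extends StructObjectInspector {

  static class LazyField implements StructField {
    final String name;
    final int pos;
    LazyField(String name, int pos) { this.name = name; this.pos = pos; }
    public String getFieldName() { return name; }
    public ObjectInspector getFieldObjectInspector() {
      return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }
    public String getFieldComment() { return null; }
    public int getFieldID() { return pos; } // required by newer Hive versions
  }

  private final List<LazyField> fields = new ArrayList<LazyField>();

  LazyRowInspector(List<String> columnNames) {
    for (int i = 0; i < columnNames.size(); i++) {
      fields.add(new LazyField(columnNames.get(i), i));
    }
  }

  public List<? extends StructField> getAllStructFieldRefs() { return fields; }

  public StructField getStructFieldRef(String fieldName) {
    for (LazyField f : fields) {
      if (f.name.equalsIgnoreCase(fieldName)) return f;
    }
    return null;
  }

  // The lazy part: only the requested column is materialized.
  public Object getStructFieldData(Object data, StructField fieldRef) {
    return ((LazyRow) data).field(((LazyField) fieldRef).pos);
  }

  public List<Object> getStructFieldsDataAsList(Object data) {
    List<Object> all = new ArrayList<Object>();
    for (LazyField f : fields) all.add(((LazyRow) data).field(f.pos));
    return all;
  }

  public String getTypeName() { return "struct<...>"; } // simplified
  public Category getCategory() { return Category.STRUCT; }
}

Hive's own LazySimpleSerDe and LazyStruct apply the same idea at the byte level, which is also why the lazy route sidesteps question 2: unreferenced columns are simply never parsed.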
Re: help with compression and index
Hi Hamilton

When you are generating the index files, is compression enabled? If so, you are running into this known issue: https://issues.apache.org/jira/browse/HIVE-2331, which is fixed in Hive 0.8. An upgrade is recommended and should get it rolling for you.

Regards
Bejoy.K.S

From: Hamilton, Robert (Austin) robert.hamil...@hp.com
To: user@hive.apache.org
Sent: Tuesday, February 21, 2012 8:48 PM
Subject: help with compression and index

Hi all. I sent this to common-user@hadoop hoping there was an easy answer, but got no response.

I have a couple of users whose only use case is extracting specific rows based on some predetermined set of keys, so I would like to provide them with an index and show them how to join to the detail table using the index. So I'm looking for a reliable compression+index method with Hive.

To give an idea of the data size: my files add up to about 80TB uncompressed, but are currently gzipped down to only 10TB. I need to keep it small(ish) until I can get more disk space, so it has to stay compressed. I don't mind recompressing to LZO or bzip2, but I need to prove that it would actually work first :)

I've done my testing on LZO and uncompressed test samples. If I use uncompressed files, the indexed select works OK. If I use LZO, it returns only a fraction of the rows I expect. I gather that files compressed with other compression methods cannot be indexed at all with Hive 0.7.1?

I'm following the prescription to select buckets/offsets into a temporary file, set hive.index.compact.file to the temp file, set hive.input.format to HiveCompactIndexInputFormat, and run my select. That doesn't let me do subselects, but I don't mind, as it is only a very limited use case that I need to support. This is the only method I could find documented on the net. Is there a better way to do this? I don't mind upgrading Hive (currently on 0.7.1) or Hadoop (currently 0.20.2).
Custom SerDe -- tracking down stack trace
I have a custom SerDe that's initializing properly and works on one data set. I built it to adapt to a couple of different data formats, though, and it's choking on a different data set (different partitions in the same table). A null pointer exception is being thrown on deserialize, which is being wrapped by an IOException somewhere up the stack. The exception shows up in the Hive output (Failed with exception java.io.IOException:java.lang.NullPointerException), but I can't find the stack trace in any logs.

It's worth noting that I'm running Hive via the CLI on a machine external to the cluster, and the query doesn't get far enough to create any M/R tasks. I looked in all log files in /var/log on the Hive client machine, and in all userlogs on each cluster instance. I also looked in derby.log (I'm using the embedded metastore) and in /var/lib/hive/metastore on the Hive client machine.

I'm sure I'm missing something obvious… Any ideas?
Re: help with compression and index
Hi Robert,

As per https://issues.apache.org/jira/browse/HIVE-1644, Hive 0.8 introduces automatic use of indexes. That might come in handy too!

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation
www: oanda.com www: fxtrade.com e: mgro...@oanda.com
Best Trading Platform - World Finance's Forex Awards 2009.
The One to Watch - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: Bejoy Ks bejoy...@yahoo.com
To: user@hive.apache.org
Sent: Tuesday, February 21, 2012 11:47:56 AM
Subject: Re: help with compression and index

[quoted reply and original message identical to the two previous posts above]
Re: Custom SerDe -- tracking down stack trace
One more data point: I can read data from this partition as long as I don't reference the partition explicitly… E.g., my partition column is ArrivalDate, and I have several different partitions: 2012-02-01…, plus a partition holding my test data with ArrivalDate=test.

This works: 'select * from table where (some constraint such that I only get results from the test partition)'.

And this works: 'select * from table where ArrivalDate=2012-02-01'.

But this fails: 'select * from table where ArrivalDate=test'.

Does this make sense to anybody?

From: Evan Pollan evan.pol...@bazaarvoice.com
Reply-To: user@hive.apache.org
Date: Tue, 21 Feb 2012 20:56:07 +
To: user@hive.apache.org
Subject: Custom SerDe -- tracking down stack trace

[original message quoted in full; identical to the previous post above]