Re: Why the filter push down does not reduce the read data record count
Hi, My env: Hive 1.2.1 and Parquet 1.8.1. Per my search in the Hive and Parquet source code of version 1.8.1, I did not see the parameters in those slides, but found them here: https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java For Hive (also verified in my test), the row group filter is auto-applied; check ParquetRecordReaderWrapper:

    if (filter != null) {
      filtedBlocks = RowGroupFilter.filterRowGroups(filter, splitGroup, fileMetaData.getSchema());
      if (filtedBlocks.isEmpty()) {
        LOG.debug("All row groups are dropped due to filter predicates");
        return null;
      }
    }

I will dig into more detail of Parquet and do some tests later. Thanks, Keith

From: Furcy Pin <pin.fu...@gmail.com> Sent: Friday, February 23, 2018 8:03:34 AM To: user@hive.apache.org Subject: Re: Why the filter push down does not reduce the read data record count

And if you come across comprehensive documentation of Parquet configuration, please share it!!! The Parquet documentation says that it can be configured but doesn't explain how: http://parquet.apache.org/documentation/latest/ and apparently both Tajo (http://tajo.apache.org/docs/0.8.0/table_management/parquet.html) and Drill (https://drill.apache.org/docs/parquet-format/) seem to have some configuration parameters for Parquet. If Hive has configuration parameters for Parquet too, I couldn't find them documented anywhere.

On 23 February 2018 at 16:48, Sun, Keith <ai...@ebay.com> wrote: I got your point and thanks for the nice slides info. So the Parquet filter is not an easy thing, and I will try it according to the deck. Thanks!

From: Furcy Pin <pin.fu...@gmail.com> Sent: Friday, February 23, 2018 3:37:52 AM To: user@hive.apache.org Subject: Re: Why the filter push down does not reduce the read data record count

Hi, Unless your table is partitioned or bucketed by myid, Hive generally has to read through all the records to find the ones that match your predicate. In other words, Hive tables are generally not indexed for single-record retrieval the way you would expect RDBMS or Vertica tables to be.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow you to add bloom filters on specific columns of a table (https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/), which can work as a kind of index. Also, depending on the query engine you are using (Hive, Spark-SQL, Impala, Presto...) and its version, it may or may not be able to leverage certain storage optimizations. For example, Spark still does not support Hive bucketed-table optimization, but it might come in the upcoming Spark 2.3. I'm much less familiar with Parquet, so if anyone has links to good documentation for Parquet fine-tuning (or, even better, a comparison with ORC features), that would be really helpful. By googling, I found these slides where someone at Netflix (https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide) seems to have tried the same kind of optimization as you in Parquet.

On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote: Hi, Why does Hive still read so many "records" even with filter pushdown enabled, when the returned dataset is a very small amount (4k out of 30 billion records)?
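[Note for later readers of this thread: the settings below are a sketch of how predicate pushdown is typically switched on from the Hive side, not something quoted from the thread; property names and behavior vary by Hive version, so treat them as assumptions to verify against your release.]

    -- Hedged sketch: enable predicate pushdown in a Hive session.
    SET hive.optimize.ppd=true;          -- logical predicate pushdown
    SET hive.optimize.index.filter=true; -- push predicates into storage formats for row-group/stripe skipping

    -- When active, the predicate shows up in the task logs, e.g. the
    -- "Filtering using predicate: eq(myid, 223)" line reported in this thread.
    SELECT * FROM my_parquet_table WHERE myid = 223;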
Re: Why the filter push down does not reduce the read data record count
I got your point and thanks for the nice slides info. So the Parquet filter is not an easy thing, and I will try it according to the deck. Thanks!

From: Furcy Pin <pin.fu...@gmail.com> Sent: Friday, February 23, 2018 3:37:52 AM To: user@hive.apache.org Subject: Re: Why the filter push down does not reduce the read data record count

Hi, Unless your table is partitioned or bucketed by myid, Hive generally has to read through all the records to find the ones that match your predicate. In other words, Hive tables are generally not indexed for single-record retrieval the way you would expect RDBMS or Vertica tables to be. Some file formats like ORC (and maybe Parquet, I'm not sure) allow you to add bloom filters on specific columns of a table (https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/), which can work as a kind of index. Also, depending on the query engine you are using (Hive, Spark-SQL, Impala, Presto...) and its version, it may or may not be able to leverage certain storage optimizations. For example, Spark still does not support Hive bucketed-table optimization, but it might come in the upcoming Spark 2.3. I'm much less familiar with Parquet, so if anyone has links to good documentation for Parquet fine-tuning (or, even better, a comparison with ORC features), that would be really helpful. By googling, I found these slides where someone at Netflix (https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide) seems to have tried the same kind of optimization as you in Parquet.

On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote: Hi, Why does Hive still read so many "records" even with filter pushdown enabled, when the returned dataset is a very small amount (4k out of 30 billion records)? The "RECORDS_IN" counter of Hive still shows the 30 billion count, and the map reduce log shows output like this:

    org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10

BTW, I am using Parquet as the storage format, and the filter pushdown did work, as I see this in the log:

    AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(myid, 223)

Thanks, Keith
Why the filter push down does not reduce the read data record count
Hi, Why does Hive still read so many "records" even with filter pushdown enabled, when the returned dataset is a very small amount (4k out of 30 billion records)? The "RECORDS_IN" counter of Hive still shows the 30 billion count, and the map reduce log shows output like this:

    org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 10

BTW, I am using Parquet as the storage format, and the filter pushdown did work, as I see this in the log:

    AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: eq(myid, 223)

Thanks, Keith
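[Note for later readers: per the Netflix slides referenced in this thread, row-group filtering only pays off when row-group statistics are selective. As a hedged sketch of that idea, not a quote from the thread: rewriting the table clustered on the filter column narrows each row group's min/max range for myid, so eq(myid, 223) can skip most groups. The source table name is hypothetical; verify the effect against your Hive/Parquet versions.]

    -- Rewrite the data so rows with similar myid values land in the same
    -- row groups; Parquet's per-row-group min/max stats then become selective.
    CREATE TABLE events_by_myid STORED AS PARQUET AS
    SELECT *
    FROM events          -- hypothetical source table holding the 30B rows
    DISTRIBUTE BY myid
    SORT BY myid;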
Re: Can't drop table
On Jun 11, 2014, at 13:15, Edward Capriolo wrote:

Dumb workaround: you could go directly to MySQL and delete the table... or make a serde with that name just so you will not get the class-not-found error :)

Okay, thanks. Still a pretty weird bug all in all, considering that the conclusion of that bug report was to not fix the bug.

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda
Re: Can't drop table
That's beside the point. The question is, why can't I drop the table? There is no excuse for failing to drop a table just because some serde can't be found. It shouldn't operate that way at all. Thanks.

On Jun 10, 2014, at 23:33, Nitin Pawar wrote:

If you have added a table with a serde definition, then just keep the jar in the local filesystem, and in the hive console do:

    add jar /full/path/to/jar;

This should make your serde available to the table, and you should be able to drop the table then. I just tried the above steps and it works for my json-based tables.

On Wed, Jun 11, 2014 at 2:24 AM, Keith Wiley kwi...@keithwiley.com wrote:

I tried to create a table that would use a csv serde. There were various problems with it, primarily that I couldn't figure out how to specify the path to the serde (whether I indicated the serde location locally or on HDFS, there were subsequent errors about not finding the serde, even though the paths were correct). So I decided to drop the table since it wasn't working...but the serde error causing these problems in the first place prevents me from dropping the table, due to a serde-doesn't-exist error. Yeah, duh, that's the problem in the first place. This is an acknowledged problem, as indicated here: https://issues.apache.org/jira/browse/HIVE-3392 ...but I don't understand from that page how to get around the problem. There is a lengthy discussion, but it's unclear how to actually drop the darn table. In fact, that page suggests the problem is officially classified as won't-be-fixed. Note, this problem also prevents me from renaming the table via alter. I don't know what to do at this point. Any ideas?

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also. -- Mark Twain

-- Nitin Pawar

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow. -- Keith Wiley
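[Note for anyone who lands here with the same stuck table: two escape hatches commonly suggested elsewhere, sketched under the assumption of a standard MySQL-backed metastore schema. Neither is verified by this thread, and the metastore edit should only ever be run against a backup first.]

    -- Option 1: repoint the table at a built-in serde so the class resolves,
    -- then drop it (works on some Hive versions; on others ALTER fails for
    -- the same class-not-found reason).
    ALTER TABLE broken_table SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
    DROP TABLE broken_table;

    -- Option 2: Edward's "go directly to mysql" route, run against the
    -- metastore database itself (TBLS is the standard metastore table name).
    -- DELETE FROM TBLS WHERE TBL_NAME = 'broken_table';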
Bug in Hive Partition windowing functions?
Hi, we have an issue with a windowing function query never completing when run against a large dataset (>25,000 rows). That is, the (only) reducer never exits and appears stuck in an infinite loop. I looked at the reducer counters, and they never change over the 6 hours it is stuck in the loop. When the data set is small (<25K rows), it runs fine. Is there any workaround for this issue? We tested against Hive 0.11/0.12/0.13, and the result is the same.

    create table window_function_fail as
    select a.*,
      sum(case when bprice is not null then 1 else 0 end)
        over (partition by date, name
              order by otime, bprice, aprice desc
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) bidpid
    from large_table a;

    create table large_table (
      date string,
      name string,
      stime string,
      bprice decimal,
      aprice decimal,
      otime double
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' stored as textfile;

Thanks in advance.
bucketed table problems
I want to convert a table to a bucketed table, so I made a new table with the same schema as the old table and specified a cluster column:

    create table foo_bucketed (
      a string,
      b int,
      c float
    )
    clustered by (b) into 10 buckets;

Then I populate it from my original table:

    set hive.enforce.bucketing = true;
    insert overwrite table foo_bucketed
    select * from foo;

All of the data goes into the first bucket, leaving the remaining 9 buckets empty (in the file system, the remaining 9 files are 0 size). Furthermore, the cluster column is now NULL. Its values have been completely erased by the insertion (which might explain how they all ended up in a single bucket, of course).

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy. -- Edwin A. Abbott, Flatland
Re: bucketed table problems
Admittedly, the problem has gone away. I *think* I may have offered the schema to the bucketed table creation in the wrong order the first time...I think. All I know is the problem went away when I tried a second time. Should inserting a table into another table match the columns by name, or does it just write from left to right, utterly ignoring any schema specification? Does it write column 'a' from foo into column 'a' in foo_bucketed, or does it write column 'a' from foo into the same column position in foo_bucketed without any regard for the names of the columns?

On Mar 7, 2014, at 13:52, Stephen Sprague wrote:

yeah. that's not right. 1. let's see the output of "show create table foo" 2. what version of hive are you using.

On Fri, Mar 7, 2014 at 11:46 AM, Keith Wiley kwi...@keithwiley.com wrote:

I want to convert a table to a bucketed table, so I made a new table with the same schema as the old table and specified a cluster column:

    create table foo_bucketed (
      a string,
      b int,
      c float
    )
    clustered by (b) into 10 buckets;

Then I populate it from my original table:

    set hive.enforce.bucketing = true;
    insert overwrite table foo_bucketed
    select * from foo;

All of the data goes into the first bucket, leaving the remaining 9 buckets empty (in the file system, the remaining 9 files are 0 size). Furthermore, the cluster column is now NULL. Its values have been completely erased by the insertion (which might explain how they all ended up in a single bucket, of course).

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy. -- Edwin A. Abbott, Flatland

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda
Re: bucketed table problems
Thanks. :-) It's working better for me now.

On Mar 7, 2014, at 17:37, Stephen Sprague wrote:

short answer: it's by position.

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson
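[Note for later readers: since the match is by position rather than by name, the usual defense is to spell out the select list in the target table's declared column order instead of relying on select *. A minimal sketch against this thread's tables:]

    set hive.enforce.bucketing = true;
    -- Explicit column list: if foo's declared order ever differs from
    -- foo_bucketed's (a, b, c), this keeps each value in the right column
    -- instead of silently nulling out the cluster column.
    insert overwrite table foo_bucketed
    select a, b, c from foo;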
CDH4.4 - 0 progress on select queries
Hi all, We have happily been using Hive on our CDH install for some time without issues. We recently upgraded from CDH 4.2.0 to CDH 4.4.0, and now select queries against one table fail, with the syslog for the job showing what's below. A couple of things to note:

* This is a partitioned table
* An example failing query: select count(*) from table where <partition conditions>
* Other tables do not appear to be affected; however, many/most of those have small files. This table's directories have only 1 file each, which is around 60 GB in size
* Queries using OOTB Hive 0.9.0 pointing at the centralized metastore work without issue
* It looks like an infinite-loop issue: this query usually completes in ~2 min, but we let it run for 45 with the job tracker showing 0% progress on mapping (perhaps related to https://issues.apache.org/jira/browse/HIVE-5235)

Have others seen this behavior? Thanks!!!

    2013-11-07 13:58:02,102 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias organic_events for file hdfs://nameservice1/events/organic/2013/11/05
    2013-11-07 13:58:02,102 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 1 rows
    2013-11-07 13:58:02,102 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 1 rows
    2013-11-07 13:58:02,102 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 1 rows
    2013-11-07 13:58:02,103 INFO ExecMapper: ExecMapper: processing 1 rows: used memory = 178407888
    2013-11-07 13:58:02,104 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarding 10 rows
    2013-11-07 13:58:02,104 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 10 rows
    2013-11-07 13:58:02,104 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 1 forwarding 10 rows
    2013-11-07 13:58:02,104 INFO ExecMapper: ExecMapper: processing 10 rows: used memory = 178407888

... where the processed row counts and used memory keep increasing
Use distribute to spread across reducers
I'm trying to create a subset of a large table for testing. The following approach works:

    create table subset_table as
    select * from large_table
    limit 1000

...but it only uses one reducer. I would like to speed up the process of creating a subset by distributing across multiple reducers. I already tried explicitly setting mapred.reduce.tasks and hive.exec.reducers.max to values larger than 1, but in this particular case those values seem to be overridden by Hive's internal query-to-mapreduce conversion; it ignores those parameters. So, I tried this:

    create table subset_table as
    select * from large_table
    limit 1000
    distribute by column_name

...but that doesn't parse. I get the following error:

    OK
    FAILED: ParseException line 3:0 missing EOF at 'distribute' near '1000'

I have tried NUMEROUS applications of parentheses, nested queries, etc. For example, here's just one (amongst perhaps ten variations on a theme):

    create table subset_table as
    select * from (
      from (
        select * from large_table
        limit 1000
        distribute by column_name
    )) s

Like I said, I've tried all sorts of combinations of the elements shown above. So far I have not even gotten any syntax to parse, much less run. Only the original query at the top will even pass the parsing stage of processing. Any ideas? Thanks.

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
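[Note for later readers, offered as a sketch rather than a verified answer from this thread: Hive's grammar wants DISTRIBUTE BY before LIMIT, and for a pseudo-random subset TABLESAMPLE over rand() sidesteps the single-reducer LIMIT altogether. The bucket denominator below is an assumption to tune against the table's row count.]

    -- Clause order that parses: distribute by precedes limit.
    create table subset_table as
    select * from (
      select * from large_table
      distribute by column_name
    ) t
    limit 1000;

    -- Alternative: a parallel pseudo-random sample (~1/30000th of the rows;
    -- pick the denominator so the sample lands near the size you want).
    create table subset_sampled as
    select * from large_table
    tablesample(bucket 1 out of 30000 on rand()) s;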
Re: Want query to use more reducers
Thanks. mapred.reduce.tasks and hive.exec.reducers.max seem to have fixed the problem. It is now saturating the cluster and running the query super fast. Excellent!

On Sep 30, 2013, at 12:28, Sean Busbey wrote:

Hey Keith, It sounds like you should tweak the settings for how Hive handles query execution [1]:

1) Tune the guessed number of reducers based on input size => hive.exec.reducers.bytes.per.reducer. Defaults to 1G. Based on your description, it sounds like this is probably still at default. In this case, you should also set a max # of reducers based on your cluster size => hive.exec.reducers.max. I usually set this to the # of reduce slots, if there's a decent chance I'll get to saturate the cluster. If not, don't worry about it.

2) Hard-code a number of reducers => mapred.reduce.tasks. Setting this will cause Hive to always use that number. It defaults to -1, which tells Hive to use the input-size heuristic to guess.

In either of the above cases, you should look at the options to merge small files (search for "merge" in the configuration property list) to avoid getting lots of little outputs. HTH

[1]: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution

-Sean

On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley kwi...@keithwiley.com wrote:

I have a query that doesn't use reducers as efficiently as I would hope. If I run it on a large table, it uses more reducers, even saturating the cluster, as I desire. However, on smaller tables it uses as few as a single reducer. While I understand there is a logic to this (not using multiple reducers until the data size is larger), it is nevertheless inefficient to run a query for thirty minutes, leaving the entire cluster vacant, when the query could distribute the work evenly and wrap things up in a fraction of the time. The query is shown below (abstracted to its basic form). As you can see, it is a little atypical: it is a nested query (which obviously implies two map-reduce jobs), and it uses a script for the reducer stage that I am trying to speed up. I thought the distribute by clause would make it use the reducers more evenly, but as I said, that is not the behavior I am seeing. Any ideas how I could improve this situation? Thanks.

    CREATE TABLE output_table
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
    SELECT * FROM (
      FROM (
        SELECT * FROM input_table
        DISTRIBUTE BY input_column_1
        SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q
      SELECT TRANSFORM(*)
      USING 'python my_reducer_script.py'
      AS (
        output_column_1,
        output_column_2,
        output_column_etc
      )
    ) s
    ORDER BY output_column_1;

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda

-- Sean

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
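[Translated into session commands, the tuning Sean describes looks roughly like this; the values are placeholders to adapt to your own cluster, not recommendations from the thread.]

    -- Option 1: make the input-size heuristic more aggressive and cap it.
    SET hive.exec.reducers.bytes.per.reducer=268435456;  -- 256 MB per reducer (default ~1 GB)
    SET hive.exec.reducers.max=40;                       -- roughly the cluster's reduce slots

    -- Option 2: pin the reducer count outright (overrides the heuristic).
    SET mapred.reduce.tasks=40;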
Re: IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume
I was able to resolve it. Turns out we had a misconfiguration with one of our flume hdfs sinks, where 2 were writing as sequence-file text while one was sequence-file bytes. It was due to one bad character in the flume conf. My manual map/reduce jobs worked fine because they would automatically handle both cases. Hive, on the other hand, would work for the beginning of the query until it hit a non-text-based file. One of those 60-hours-of-debugging-leads-to-a-one-character-change bugs. Hopefully this will be a good lesson for others in the future!

Shreepadma Venugopalan shreepa...@cloudera.com wrote:

Hi Keith, Were you able to resolve this? Or is this still an issue? Thanks. Shreepadma

On Tue, May 28, 2013 at 6:02 AM, Keith Wright kwri...@nanigans.com wrote:

Hi all, This is my first post to the hive mailing list, and I was hoping to get some help with the exception I am getting below. I am using CDH4.2 (hive 0.10.0) to query snappy-compressed SequenceFiles that are built using Flume (the relevant portion of the flume conf is below as well). Note that I'm using a SequenceFile as it was needed for Impala integration. Has anyone seen this error before? A couple of additional points to help diagnose:

1. Queries seem to be able to process some mappers without issues. In fact, I can do a simple select * from table limit 10 without issue. However, if I make the limit high enough, it will eventually fail, presumably as it needs to read in a file that has this issue.
2. The same query runs in Impala without errors but appears to skip some data. I can confirm that the missing data is present via a custom map/reduce job.
3. I am able to write a map/reduce job that reads through all of the same data without issue, and have been unable to identify data corruption.
4. This is a partitioned table, and queries fail that touch ANY of the partitions (and there are hundreds), so this does not appear to be a sporadic data-integrity problem (table definition below).
5. We are using '\001' as our field separator. We are capturing other data, also with SequenceFile and snappy, but using '|' as our delimiter, and we do not have any issues querying there. Although we are using a different flume source.

My next step for debugging was to disable snappy compression and see if I could query the data. If not, switch from SequenceFile to simple text. I appreciate the help!!!
    CREATE EXTERNAL TABLE ORGANIC_EVENTS (
      event_id BIGINT,
      app_id INT,
      user_id BIGINT,
      type STRING,
      name STRING,
      value STRING,
      extra STRING,
      ip_address STRING,
      user_agent STRING,
      referrer STRING,
      event_time BIGINT,
      install_flag TINYINT,
      first_for_user TINYINT,
      cookie STRING,
      year int, month int, day int, hour int)
    PARTITIONED BY (year int, month int, day int, hour int)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\001'
      COLLECTION ITEMS TERMINATED BY '\002'
      MAP KEYS TERMINATED BY '\003'
    STORED AS SEQUENCEFILE
    LOCATION '/events/organic';

    agent.sinks.exhaustHDFSSink3.type = HDFS
    agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
    agent.sinks.exhaustHDFSSink3.hdfs.path = hdfs://lxscdh001.nanigans.com:8020%{path}
    agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
    agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
    agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
    agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
    agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
    agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
    agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
    agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
    agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
    agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
    agent.sinks.exhaustHDFSSink.3hdfs.writeFormat = Text

    2013-05-28 12:29:00,919 WARN org.apache.hadoop.mapred.Child: Error running child
    java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:330)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:246)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:216)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:201)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418
Re: .sql vs. .hql
Thanks! I'm always up for a good book. Our case is a little open-ended right now, a little difficult to nail down or describe. Basically, I'm deploying Hive systems for a SQL-comfortable data analyst (who is also quite versed in Hive) who would like to run .sql files in Hive, and I need to clearly understand how Hive uses input files and how .sql and .hql may differ. Cheers!

On May 31, 2013, at 13:20, Sanjay Subramanian wrote:

For Hive u need the other bible by Dean Wampler, Edward Capriolo et al. Also, if u tell us what use cases u have, we could provide help...

sanjay

On 5/31/13 1:17 PM, Keith Wiley kwi...@keithwiley.com wrote:

I'm looking for documentation on how to use .sql and .hql files in Hive and what the differences are between the two file types. I'm not even certain .hql is a real thing, but I'm looking. I didn't see any mention of this issue in White's Hadoop book, at least not clearly indicated in the index. Thanks.

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
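[Note for later readers: in practice the Hive CLI treats script files identically regardless of extension; .hql is just a naming convention. A quick illustration, with hypothetical file paths, worth verifying on your Hive version:]

    hive> source /home/analyst/report.sql;   -- runs the statements in the file
    hive> source /home/analyst/report.hql;   -- same behavior; the extension is not interpreted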
IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume
Hi all, This is my first post to the hive mailing list, and I was hoping to get some help with the exception I am getting below. I am using CDH4.2 (hive 0.10.0) to query snappy-compressed SequenceFiles that are built using Flume (the relevant portion of the flume conf is below as well). Note that I'm using a SequenceFile as it was needed for Impala integration. Has anyone seen this error before? A couple of additional points to help diagnose:

1. Queries seem to be able to process some mappers without issues. In fact, I can do a simple select * from table limit 10 without issue. However, if I make the limit high enough, it will eventually fail, presumably as it needs to read in a file that has this issue.
2. The same query runs in Impala without errors but appears to skip some data. I can confirm that the missing data is present via a custom map/reduce job.
3. I am able to write a map/reduce job that reads through all of the same data without issue, and have been unable to identify data corruption.
4. This is a partitioned table, and queries fail that touch ANY of the partitions (and there are hundreds), so this does not appear to be a sporadic data-integrity problem (table definition below).
5. We are using '\001' as our field separator. We are capturing other data, also with SequenceFile and snappy, but using '|' as our delimiter, and we do not have any issues querying there. Although we are using a different flume source.

My next step for debugging was to disable snappy compression and see if I could query the data. If not, switch from SequenceFile to simple text. I appreciate the help!!!

    CREATE EXTERNAL TABLE ORGANIC_EVENTS (
      event_id BIGINT,
      app_id INT,
      user_id BIGINT,
      type STRING,
      name STRING,
      value STRING,
      extra STRING,
      ip_address STRING,
      user_agent STRING,
      referrer STRING,
      event_time BIGINT,
      install_flag TINYINT,
      first_for_user TINYINT,
      cookie STRING,
      year int, month int, day int, hour int)
    PARTITIONED BY (year int, month int, day int, hour int)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\001'
      COLLECTION ITEMS TERMINATED BY '\002'
      MAP KEYS TERMINATED BY '\003'
    STORED AS SEQUENCEFILE
    LOCATION '/events/organic';

    agent.sinks.exhaustHDFSSink3.type = HDFS
    agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
    agent.sinks.exhaustHDFSSink3.hdfs.path = hdfs://lxscdh001.nanigans.com:8020%{path}
    agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
    agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
    agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
    agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
    agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
    agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
    agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
    agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
    agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
    agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
    agent.sinks.exhaustHDFSSink.3hdfs.writeFormat = Text

    2013-05-28 12:29:00,919 WARN org.apache.hadoop.mapred.Child: Error running child
    java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:330)
        at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:246)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:216)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:201)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
    Caused by: java.io.IOException: java.lang.IndexOutOfBoundsException
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
Re: Need rank(), can't build m6d's version
Thank you, Edward. I appreciate you trying to help out. I'll see if I can get that to work.

On Apr 1, 2013, at 17:09, Edward Capriolo wrote:

approx steps:

    git clone hive-test
    mvn install
    git clone hive-rank
    mvn install

or just:

    git clone hive-test
    mvn -Dmaven.test.skip=true install

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also. -- Mark Twain
Re: Need rank()
On Apr 1, 2013, at 16:12, Alexander Pivovarov wrote: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

Is there any possibility there is a bug in Ritesh Agrawal's query statement from that article? I created a test table with the exact column names from the example in the article and used a minimally altered version of the command (I removed the where clause to simplify things a bit) and got an error which suggests there is something slightly wrong with the command (or perhaps the table has to be configured a special way). Here's what I get when I almost perfectly duplicate that example:

    hive> describe test;
    OK
    user        string
    category    string
    value       int
    Time taken: 0.082 seconds
    ==
    hive> select * from test;
    OK
    user1   cat1    1
    user1   cat1    2
    user1   cat1    3
    user1   cat2    10
    user1   cat2    20
    user1   cat2    30
    user2   cat1    11
    user2   cat1    21
    user2   cat1    31
    user2   cat2    5
    user2   cat2    6
    user2   cat2    7
    Time taken: 0.202 seconds
    ==
    hive> SELECT user, category, value
    FROM (
      SELECT user, category, rank(user) as rank, value
      FROM test
      DISTRIBUTE BY user
      SORT BY user, value desc
    ) a
    WHERE rank < 1
    ORDER BY user, rank;
    FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column reference 'rank': (possible column names are: _col0, _col1, _col2)
    hive>

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered. -- Keith Wiley
Re: Need rank()
I did all that, I just didn't paste it into the email. That isn't the problem. Sorry for the confusion.

    hive> add jar Rank.jar;
    hive> create temporary function rank as 'com.example.hive.udf.Rank';

BTW, small typo: the condition at the end uses less-than, not greater-than...obviously...but that isn't the problem either.

On Apr 2, 2013, at 10:06, Nitin Pawar wrote:

I don't see you added rank in the classpath and defined the rank function. Can you try add jar, define the function, and try again?

On Tue, Apr 2, 2013 at 10:33 PM, Keith Wiley kwi...@keithwiley.com wrote:

On Apr 1, 2013, at 16:12, Alexander Pivovarov wrote: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

Is there any possibility there is a bug in Ritesh Agrawal's query statement from that article? I created a test table with the exact column names from the example in the article and used a minimally altered version of the command (I removed the where clause to simplify things a bit) and got an error which suggests there is something slightly wrong with the command (or perhaps the table has to be configured a special way). Here's what I get when I almost perfectly duplicate that example:

    hive> describe test;
    [same table, data, query, and "Invalid table alias or column reference 'rank'" error as quoted above]

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered. -- Keith Wiley

-- Nitin Pawar

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow. -- Keith Wiley
Re: Need rank()
I suppose. Like I said, I was just following the example on that website, and it didn't seem to have any comments remarking on errors in the article. Thus my confusion about the whole thing. It is presented as being accurate, and it is old enough that any errors would have been caught and labeled in the comments long ago. I also thought it was a little odd to name the function the same thing as the alias assigned to the result. I'll try disambiguating that.

On Apr 2, 2013, at 10:22, Nitin Pawar wrote:

I am not sure about this, but you defined a rank function and then aliased the column as rank itself. Can this cause the issue? From the error it looks like it could not figure out which column is rank in the where clause WHERE rank < 1. From Edward's blog the correct query looks similar, but his function is named p_rank:

    SELECT category, country, product, sales, rank
    FROM (
      SELECT category, country, product, sales,
             p_rank(category, country) rank
      FROM (
        SELECT category, country, product, sales
        FROM p_rank_demo
        DISTRIBUTE BY category, country
        SORT BY category, country, sales desc) t1) t2
    WHERE rank <= 3

On Tue, Apr 2, 2013 at 10:45 PM, Keith Wiley kwi...@keithwiley.com wrote:

I did all that, I just didn't paste it into the email. That isn't the problem. Sorry for the confusion.

    hive> add jar Rank.jar;
    hive> create temporary function rank as 'com.example.hive.udf.Rank';

BTW, small typo: the condition at the end uses less-than, not greater-than...obviously...but that isn't the problem either.

[...]

-- Keith Wiley

-- Nitin Pawar

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley
Re: Need rank()
I'm pretty sure there are at least a few bugs in that article. The problem described below goes away if rank is added to the outer select statement. If there's one error, I suppose there may be others. Although I have gotten the query to run, I'm not getting properly ranked results yet. I'm still working on it.

On Apr 2, 2013, at 10:03, Keith Wiley wrote:

On Apr 1, 2013, at 16:12, Alexander Pivovarov wrote: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

Is there any possibility there is a bug in Ritesh Agrawal's query statement from that article? I created a test table with the exact column names from the example in the article and used a minimally altered version of the command (I removed the where clause to simplify things a bit) and got an error which suggests there is something slightly wrong with the command (or perhaps the table has to be configured a special way). Here's what I get when I almost perfectly duplicate that example:

    hive> SELECT user, category, value
    FROM (
      SELECT user, category, rank(user) as rank, value
      FROM test
      DISTRIBUTE BY user
      SORT BY user, value desc
    ) a
    WHERE rank < 1
    ORDER BY user, rank;
    FAILED: Error in semantic analysis: Line 9:15 Invalid table alias or column reference 'rank': (possible column names are: _col0, _col1, _col2)

[...]

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy. -- Edwin A. Abbott, Flatland
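[Spelled out, the fix described above looks like this; a sketch built from this thread's test table, where the key change is that rank is selected, via its alias, at every query level that later references it:]

    SELECT user, category, value, rank
    FROM (
      SELECT user, category, rank(user) AS rank, value
      FROM (
        SELECT user, category, value
        FROM test
        DISTRIBUTE BY user
        SORT BY user, value desc
      ) a
    ) b
    WHERE rank < 1
    ORDER BY user, rank;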
Re: Need rank()
Yep, the original article is definitely erroneous in this regard. I figured that out eventually. I'm not sure how much I can trust that resource now; I may have to look elsewhere. I agree that Edward's description is pretty good, but as I said earlier, I can't actually use his code, so I'm trying to cobble a workable solution together from the various resources available. Ritesh's article, despite the error in the Hive syntax, is still useful in that it enables one to quickly compile a simple rank jar without relying on git, maven, or other project dependencies -- problems which have plagued me with Edward's approach. So, if I can use Ritesh's method to write a simple rank function, and Edward's accurate description of how to construct the query, then I can put all the pieces together into a workable solution. I'll let you know if I get it.

On Apr 2, 2013, at 10:56, Igor Tatarinov wrote:

You are getting the error because you are ORDERing BY rank, but rank is not in the top SELECT. Also, DISTRIBUTE BY/SORT BY are done after SELECT, so you have to use a subquery:

    SELECT ..., rank(user)
    FROM (SELECT ... DISTRIBUTE BY ... SORT BY ...)

igor
decide.com

On Tue, Apr 2, 2013 at 10:03 AM, Keith Wiley kwi...@keithwiley.com wrote:

[...]

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda
Re: Need rank()
On Apr 2, 2013, at 12:59, Edward Capriolo wrote:

I feel your pain. I do not have time to look at the project ATM. However, trying to do Java development these days without understanding git or maven is fairly impossible; most projects depend on N others that depend on M others, so trying to even compile the code outside of maven requires a lot of manual jar fetching, etc. You're better off taking a day and getting good with maven, IMHO. You will be dealing with it for the foreseeable future.

I agree. I'll cozy up with it first chance I get...but sometimes you're on a deadline and have to go for the low-hanging fruit. I fully understand the motivation behind your approach. I'll circle back around to it when I get some free time. Cheers!

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
Rank(): here's how I did it, for better or worse...
I agree, it's probably best to use a better-engineered approach such as Edward's. In the meantime, if anyone would benefit from a walk-through of my direct approach, here it is. It combines Ritesh's direct, ultra-simplistic method with Edward's correct HiveQL syntax. As would be expected, it is sensitive to shell vagaries that would be better managed by a combo system like git and maven...but it works.

Create Rank.java:

    package com.example.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;

    public final class Rank extends UDF {
        private int counter;
        private String last_key;

        public int evaluate(final String key) {
            if (!key.equalsIgnoreCase(this.last_key)) {
                this.counter = 0;
                this.last_key = key;
            }
            return this.counter++;
        }
    }

Compile Rank.java to Rank.class, then bundle it into Rank.jar. Observe that the jar command is highly sensitive to the relative path to the .class file when naming the package inside the resulting .jar file:

    $ mkdir ./RankTempDir
    $ javac -classpath $HIVE_HOME/lib/hive-serde-0.8.1.jar:$HIVE_HOME/lib/hive-exec-0.8.1.jar:$HADOOP_HOME/hadoop-core.jar -d ./RankTempDir Rank.java
    $ cd RankTempDir
    $ jar -cf ../Rank.jar ./com
    $ cd ..

You will have to verify (via echo) HADOOP_HOME and HIVE_HOME, and then verify the names of the serde, exec, and core jars in each directory. The exact filenames are probably version specific.

Verify the package path in Rank.jar:

    $ jar -tvf Rank.jar

You should see 'com/example/hive/udf/Rank.class'. If you see a different path, the package has not been properly represented in the jar w.r.t. its designation in the .java file.

Run hive and prepare the session to use the UDF:

    $ hive
    hive> add jar Rank.jar;
    hive> create temporary function rank as 'com.example.hive.udf.Rank';

You must either run hive from a directory containing Rank.jar or specify an alternate path to it in the add command. Note that the Rank class's full package is specified in the create command and must therefore, logically, match the package in both the .java and the .jar files.

Consider a table named 'test' consisting of columns 'user', 'category', and 'value', containing the following data:

    hive> select * from test;
    user1   catA    1
    user1   catB    11
    user1   catC    111
    user2   catA    222
    user2   catB    22
    user2   catC    2
    user3   catA    3
    user3   catB    5
    user3   catC    4

So the top category for user1 is catC, for user2 is catA, and for user3 is catB. Say we want the top N valued categories for each user. In the example below, N is 2 (it is indicated in the final WHERE clause). Here is the format of the corresponding ranked query, and its result:

    hive> SELECT user, category, value, ranked_col
    FROM (
      SELECT user, category, value, rank(user) ranked_col
      FROM (
        SELECT user, category, value
        FROM test
        DISTRIBUTE BY user
        SORT BY user, value desc
      ) a
    ) b
    WHERE ranked_col < 2
    ORDER BY user, ranked_col;
    ... [wait for Hive query and MapReduce job(s) to finish] ...
    user1   catC    111   0
    user1   catB    11    1
    user2   catA    222   0
    user2   catB    22    1
    user3   catB    5     0
    user3   catC    4     1

Note that ranks are 0-indexed (of course, I suppose that's a property of the specific .java we wrote above, and it is therefore easily amenable to 1-indexing, which would more canonically connote the notion of a rank). Anyway, that's what I came up with. I don't by any means claim it's the best approach. Edward is surely right that the best method would be to use the powerful tools made available by the large developer community, such as git and maven. Cheers!

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com And what if we picked the wrong religion?
Every week, we're just making God madder and madder! -- Homer Simpson
Need rank(), can't build m6d's version
I need rank() in Hive. I haven't had much luck with Edward Capriolo's on git, and it comes with no documentation. It depends on hive-test (also by Edward), and I can't get maven to build the tests for hive-test, which prevents me from building the hive-test package (the jar)...which is all pretty tangential to the real goal, which is rank, of course. Is there any place to download the hive-rank jar pre-built instead of having to build it out of git? If not, has anyone built it recently, or rather, has anyone built the hive-test project it depends on? The following is just one example of the sort of error I see. Remember, this error is on hive-test, which frankly I don't even care about. :-) Feeling a little desperate...thanks for any help.

    ---
    Test set: com.jointhegrid.hive_test.EmbeddedHiveExampleTest
    ---
    Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.878 sec <<< FAILURE!
    testA(com.jointhegrid.hive_test.EmbeddedHiveExampleTest)  Time elapsed: 7.823 sec  <<< FAILURE!
    java.lang.AssertionError: expected:<0> but was:<9>
        at org.junit.Assert.fail(Assert.java:91)
        at org.junit.Assert.failNotEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:126)
        at org.junit.Assert.assertEquals(Assert.java:470)
        at org.junit.Assert.assertEquals(Assert.java:454)
        at com.jointhegrid.hive_test.EmbeddedHiveExampleTest.testA(EmbeddedHiveExampleTest.java:51)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at junit.framework.TestCase.runTest(TestCase.java:168)
        at junit.framework.TestCase.runBare(TestCase.java:134)
        at junit.framework.TestResult$1.protect(TestResult.java:110)
        at junit.framework.TestResult.runProtected(TestResult.java:128)
        at junit.framework.TestResult.run(TestResult.java:113)
        at junit.framework.TestCase.run(TestCase.java:124)
        at junit.framework.TestSuite.runTest(TestSuite.java:243)
        at junit.framework.TestSuite.run(TestSuite.java:238)
        at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
        at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
        at $Proxy0.invoke(Unknown Source)
        at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
        at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:91)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
Re: Need rank(), can't build m6d's version
Thanks! That looks like it has serious potential. I'll study up on it. On Apr 1, 2013, at 16:12, Alexander Pivovarov wrote: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/ Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson
Re: S3/EMR Hive: Load contents of a single file
Okay, I also saw your previous response, which analyzed queries into two tables built around two files in the same directory. I guess I was simply wrong in my understanding that a Hive table is fundamentally associated with a directory instead of a file. Turns out, it can be either one. A directory table uses all files in the directory, while a file table uses one specific file and properly avoids sibling files. My bad. Thanks for the careful analysis and clarification. TIL! Cheers!

On Mar 27, 2013, at 02:58, Tony Burton wrote:

A bit more info - do an extended description of the table:

    $ desc extended gsrc1;

And the "location" field is "location:s3://mybucket/path/to/data/src1.txt". Do the same on a table created with a location pointing at the directory, and the same info gives (not surprisingly) "location:s3://mybucket/path/to/data/".

Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
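[In DDL terms, the difference comes down to what LOCATION points at. A sketch with hypothetical names, mirroring the S3/EMR behavior described above; plain HDFS deployments may be stricter about requiring a directory, so verify on your own setup:]

    -- Bound to one specific file; sibling files in the same directory are ignored.
    CREATE EXTERNAL TABLE gsrc1 (line string)
    LOCATION 's3://mybucket/path/to/data/src1.txt';

    -- Bound to the directory; every file beneath it is read.
    CREATE EXTERNAL TABLE gsrc_all (line string)
    LOCATION 's3://mybucket/path/to/data/';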
Re: Query crawls through reducer
Thanks. On Mar 22, 2013, at 21:02 , Nitin Pawar wrote: instead of = can you just try = if you want to limit top 100 (b being a partition i guess it will have more than 100 records to fit into your limit) to improve your query performance your table file format matters as well. Which one are you using? how many partitions are there? what's the size of the cluster? you can set the number of reducers but if your query just has one key then only one reducer will get the data and the rest will run empty On Sat, Mar 23, 2013 at 4:32 AM, Keith Wiley kwi...@keithwiley.com wrote: The following query translates into a many-map-single-reduce job (which is common) and also slags through the reduce stage...it's killing the overall query: select * from a where b = 'c' order by b desc limit 100 Note that b is a partition. What component is making the reducer heavy? Is it the order by or the limit (I'm sure it's not the partition-specific where clause, right?)? Are there ways to improve its performance? Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley -- Nitin Pawar Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow. -- Keith Wiley
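For reference, a minimal sketch of the single-reducer point above, using the placeholder names from the question (column d is hypothetical). Since the where clause pins b to a single value, ordering by b sorts a constant, so the order by contributes nothing but the single-reducer stage:

    -- equivalent result without the global sort:
    select * from a where b = 'c' limit 100;

    -- if the real goal is ordering by some other column d, sort by avoids the
    -- single-reducer funnel, but it orders within each reducer only, so the
    -- 100 rows returned are not a guaranteed global top 100:
    select * from a where b = 'c' sort by d desc limit 100;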
Query crawls through reducer
The following query translates into a many-map-single-reduce job (which is common) and also slags through the reduce stage...it's killing the overall query: select * from a where b = 'c' order by b desc limit 100 Note that b is a partition. What component is making the reducer heavy? Is it the order by or the limit (I'm sure it's not the partition-specific where clause, right?)? Are there ways to improve its performance? Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley
Rename external table, including HDFS directory
My understanding is that renaming an external table doesn't rename the corresponding HDFS directory to match. I would like to do this: I want to rename the table in Hive and also rename the HDFS directory so they match (they already match with the old name). I'm not sure how to do this. I'm worried that if I rename the HDFS directory through Hadoop, not Hive, Hive will lose track of the directory, regardless of whether I rename the table in Hive as well. I also don't want to drop and recreate the table because I don't want to notify Hive of all the partitions all over again. Thoughts? Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda
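For what it's worth, a hedged sketch of one way to do both renames without dropping the table. The paths and names here are hypothetical, and this assumes ALTER TABLE ... SET LOCATION is available in your Hive version; note also that a partitioned table records a location per partition, so each partition would need the same repointing (ALTER TABLE ... PARTITION (...) SET LOCATION):

    -- rename the table in the metastore:
    ALTER TABLE old_name RENAME TO new_name;
    -- after moving the directory yourself (e.g. hadoop fs -mv old_dir new_dir),
    -- point the renamed table at the new path so Hive doesn't lose track of it:
    ALTER TABLE new_name SET LOCATION 'hdfs:///user/hive/warehouse/new_name';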
Combine two overlapping schema?
I have two tables which have overlapping but nonidentical schema. I want to create a new table that unions them, leaving nulls in any given row where a column name doesn't occur in the other table: SCHEMA 1: { a, b, c, Y } row: { 1, 2, 3, 4 } SCHEMA 2: { a, b, c, Z } row: { 5, 6, 7, 8 } NEW SCHEMA: { a, b, c, Y, Z } new row: { a:1, b:2, c:3, Y:4, Z:null } new row: { a:5, b:6, c:7, Y:null, Z:8 } I don't think either full outer join or union all works. I'm not sure how to do this. Any ideas? Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley
Re: Combine two overlapping schema?
Ah. I was stuck on the requirement that the two schema match, but I see your point. I'll see if that works. On Mar 6, 2013, at 10:11 , Dean Wampler wrote: Off the top of my head, I think UNION ALL should work if you explicitly project out the missing columns with NULL or other values, e.g. using nested SELECTs, something like SELECT * FROM ( SELECT a,b,c, Y, NULL AS Z FROM table1 UNION ALL SELECT a,b,c, NULL AS Y, Z FROM table2 ) table12; On Wed, Mar 6, 2013 at 12:03 PM, Keith Wiley kwi...@keithwiley.com wrote: I have two tables which have overlapping but nonidentical schema. I want to create a new table that unions them, leaving nulls in any given row where a column name doesn't occur in the other table: SCHEMA 1: { a, b, c, Y } row: { 1, 2, 3, 4 } SCHEMA 2: { a, b, c, Z } row: { 5, 6, 7, 8 } NEW SCHEMA: { a, b, c, Y, Z } new row: { a:1, b:2, c:3, Y:4, Z:null } new row: { a:5, b:6, c:7, Y:null, Z:8 } I don't think either full outer join or union all works. I'm not sure how to do this. Any ideas? Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley -- Dean Wampler, Ph.D. thinkbiganalytics.com +1-312-339-1330 Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
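A follow-up sketch materializing Dean's UNION ALL as a new table via CTAS. One assumption worth flagging: a bare NULL in a CTAS can come through as an untypable void column in some Hive versions, so the casts below (which assume Y and Z are INTs) are defensive:

    CREATE TABLE table12 AS
    SELECT * FROM (
      SELECT a, b, c, Y, CAST(NULL AS INT) AS Z FROM table1
      UNION ALL
      SELECT a, b, c, CAST(NULL AS INT) AS Y, Z FROM table2
    ) t;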
Can hive show external table filenames?
show tables shows the HDFS directories which correspond to Hive tables and various select queries show data from inside the files (each row of a file being a row of the overarching table that the entire directory represents)...or so I have convinced myself...but can Hive show the HDFS filenames? Or is hadoop fs -ls the only way to do that? Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also. -- Mark Twain
Re: Can hive show external table filenames?
Hmmm, that doesn't seem to list the files in the directory. On Jan 31, 2013, at 12:12 , Edward Capriolo wrote: Try describe extended tablename or describe extended tablename (partition) the location is one of the properties Edward On Thu, Jan 31, 2013 at 2:37 PM, Keith Wiley kwi...@keithwiley.com wrote: show tables shows the HDFS directories which correspond to Hive tables and various select queries show data from inside the files (each row of a file being a row of the overarching table that the entire directory represents)...or so I have convinced myself...but can Hive show the HDFS filenames? Or is hadoop fs -ls the only way to do that? Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley
Re: Can hive show external table filenames?
Thanks, that seems like a reasonable summary of my options. Cheers! On Jan 31, 2013, at 14:38 , Edward Capriolo wrote: You can execute dfs commands from inside hive: hive> dfs -ls You can also use virtual columns inside a select https://cwiki.apache.org/Hive/languagemanual-virtualcolumns.html However there is no direct command in hive to list the files in a table/partition. On Thu, Jan 31, 2013 at 4:55 PM, Keith Wiley kwi...@keithwiley.com wrote: Hmmm, that doesn't seem to list the files in the directory. On Jan 31, 2013, at 12:12 , Edward Capriolo wrote: Try describe extended tablename or describe extended tablename (partition) the location is one of the properties Edward On Thu, Jan 31, 2013 at 2:37 PM, Keith Wiley kwi...@keithwiley.com wrote: show tables shows the HDFS directories which correspond to Hive tables and various select queries show data from inside the files (each row of a file being a row of the overarching table that the entire directory represents)...or so I have convinced myself...but can Hive show the HDFS filenames? Or is hadoop fs -ls the only way to do that? Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy. -- Edwin A. Abbott, Flatland
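For completeness, a minimal sketch of the virtual-column route Edward mentions, assuming a Hive version where INPUT__FILE__NAME is available (mytable is a stand-in name):

    -- one row per distinct backing file of the table:
    SELECT DISTINCT INPUT__FILE__NAME FROM mytable;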
apache can't access hive
I have created a php-backed dynamic website which pulls data from hive (via exec) and presents the results through a web interface. This was working fine...then I upgraded to CDH4 and now apache can't access hive anymore. Hive commands work fine on the command line as a real user, but not through the web as apache. Curiously, if I run hive as sudo -u apache, it works. It just doesn't work from php. Remember, I had this all working a few weeks ago; I know the basic approach works. It's just that some aspect of the CDH4 upgrade has changed something, and I'm not sure what it could be. The metastore_db directory's permissions are good to go (apache has read/write access to the directory, the derby.log, etc.). I realize the following output is a bit eclectic, but if anyone has any thoughts on the error I would love some help. Thanks. Result: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask Status: 1 Output: Array Output line: Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties Output line: Hive history file=/tmp/apache/hive_job_log_apache_201211121323_696011741.txt Output line: 2012-11-12 21:23:28.806 GMT Thread[main,5,main] java.io.FileNotFoundException: derby.log (Permission denied) Output line: Output line: 2012-11-12 21:23:29.075 GMT: Output line: Booting Derby version The Apache Software Foundation - Apache Derby - 10.4.2.0 - (689064): instance a816c00e-013a-f681-6fa8-025818d8 Output line: on database directory /var/lib/hive/metastore/metastore_db in READ ONLY mode Output line: Output line: Database Class Loader started - derby.database.classpath='' Output line: FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection. Output line: NestedThrowables: Output line: org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection. Output line: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow. -- Keith Wiley
Error in semantic analysis: Unable to fetch table
If I run a select command on the command line with -e it works. If I run the same command inside hive (without -e) it still works...but if I exec it with a -e from php (so obviously the apache user), I get Error in semantic analysis: Unable to fetch table [tablename]. Bear in mind that this overall system worked perfectly for weeks. I haven't modified the code or anything. What did change was that I had to reinstall hadoop and hive, and I'm sure some configuration parameter is not correct now. I found some Google references to this error but no clear-cut solutions. What does this error mean and how do I fix it? Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also. -- Mark Twain
Re: Error in semantic analysis: Unable to fetch table
Followup: If I run show tables, I get a list from the command line but nothing from php. Furthermore, if I attempt to select from a bogus tablename on the command line, I get that same semantic-analysis error...so it looks like the real problem is that when apache/php runs hive, it doesn't see the table list. Like I said, this used to work...so I'm sure I need to get something configured properly, I just don't remember how. I do see that I am defining HIVE_HOME in my .bashrc...Do I need to do something like this for the apache account or as a HIVE parameter? Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered. -- Keith Wiley
Increase max tablename length?
The default seems to be 128. Can it be increased? I haven't found a configuration parameter for that yet. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda
drop table: no err on nonexistant table?
I'm wrapping hive in a web tool and would like to do some basic error-checking. If an attempt is made to drop a table that doesn't exist, I would like to show an error message. The problem is, hive doesn't seem to produce any sort of error when dropping a table that doesn't exist. Furthermore, if hive is run on the command line (-e) it doesn't return an error status. Therefore, I think the only way to detect such an error is to actually inspect the table list (show tables) and manually see if the drop attempt will find a corresponding table. Can I get an error of some sort on this or is it not possible? Thank you. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei
Re: drop table: no err on nonexistant table?
Oye, got it. Sorry. RTFM: hive.exec.drop.ignorenonexistent On Aug 6, 2012, at 11:06 , Keith Wiley wrote: I'm wrapping hive in a web tool and would like to do some basic error-checking. If an attempt is made to drop a table that doesn't exist, I would like to show an error message. The problem is, hive doesn't seem to produce any sort of error when dropping a table that doesn't exist. Furthermore, if hive is run on the command line (-e) it doesn't return an error status. Therefore, I think the only way to detect such an error is to actually inspect the table list (show tables) and manually see if the drop attempt will find a corresponding table. Can I get an error of some sort on this or is it not possible? Thank you. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I do not feel obliged to believe that the same God who has endowed us with sense, reason, and intellect has intended us to forgo their use. -- Galileo Galilei Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson
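For anyone else landing here, a minimal sketch of both directions, using the config key named above (some_table is a stand-in):

    -- make dropping a missing table an explicit error:
    SET hive.exec.drop.ignorenonexistent=false;
    DROP TABLE some_table;  -- now fails if the table doesn't exist

    -- or keep the default and be explicit per statement:
    DROP TABLE IF EXISTS some_table;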
HQL vs. HiveQL
I have found references to both HQL and HiveQL (as abbreviations for Hive Query Language) as the name of the SQL-like language Hive uses. Is there a formal and definitively correct name in this regard or are both considered equally appropriate? I'm writing up a white paper and want to use the right term. Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson
Re: LIKE Statement
Thanks for the followup. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
Re: Hive performance vs. SQL?
Thanks for the response. Cheers! On Mar 19, 2012, at 16:42 , Maxime Brugidou wrote: From my experience, if you can fit your data in a SQL database without sharding or anything, don't ever think twice. Hive is not even comparable. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered. -- Keith Wiley
Hive performance vs. SQL?
I haven't had an opportunity to set up a huge Hive database yet because exporting csv files from our SQL database is, in itself, a rather laborious task. I was just curious how I might expect Hive to perform vs. SQL on large databases and large queries. I realize Hive is pretty latent since it builds and runs MapReduce jobs for even the simplest queries, but that is precisely why I think it might perform better on long queries against large (external CSV) databases. Would you expect Hive to ever outperform SQL on a single machine (standalone or pseudo-distributed mode)? I am entirely open to the possibility that the answer is no, that Hive could never compete with SQL on a single machine. Is this true? If so, how large (how parallel) do you think the underlying Hadoop cluster needs to be before Hive overtakes SQL? 2X? 10X? Where is the crossover point where Hive actually outperforms SQL? Along similar lines, might Hive never outperform SQL on a database small enough for SQL to run on a single machine, say 10s to 100s of GBs? Must the database itself be so large that SQL is effectively crippled and the data must be distributed before Hive offers significant gains? I am really just trying to get a basic feel for how I might anticipate Hive's behavior vs. SQL once I get a large system up and running. Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
Re: order by date
I see, you store the date-time as a lexicographically sortable string. That's fine, but I'm operating on existing csv tables. I guess I could whip up a hadoop job to convert all the date-time columns to lexicographic strings and then wrap hive around the resulting converted tables. I was just wondering if there was a more direct approach, whether I could apply hive to the original csv tables. *Would* the date functions work in the way I suggested in my first post or do you think a lexicographic conversion is practically required to perform date-ordering on query results? On Mar 13, 2012, at 09:49 , Tucker, Matt wrote: Hi Keith, We generally store date columns as a string in a similar format to ISO 8601 (yyyy-MM-dd hh:mm:ss). This way, when we put the date column in the ORDER BY clause, it will be sorted chronologically. It also saves us the trouble of whipping out a unix timestamp calculator to figure out what we're looking at. There is supposed to be a TIMESTAMP data type in Hive 0.8, but I haven't found any documentation on it yet. Matt Tucker -Original Message- From: Keith Wiley [mailto:kwi...@keithwiley.com] Sent: Tuesday, March 13, 2012 12:45 PM To: user@hive.apache.org Subject: order by date I realize that hive doesn't have a date type for the columns and I realize that hive *does* have various date functions. I just haven't found a concrete example of how these two issues are brought together. Ordering the results of a sql query is done by adding order by dateColName to the query. Are the date functions supposed to be used in the form order by UNIX_TIMESTAMP(dateColName, 'yyyy-MM-dd')? Does the function just go right into the order by clause like that or is that totally wrong? How does one order query results by a date column? Am I on the right track? Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy. -- Edwin A. Abbott, Flatland
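To make the direct approach concrete, a sketch under the assumption that the strings really are in one fixed pattern (mytable and dateColName are stand-ins; the pattern letters follow java.text.SimpleDateFormat, so MM is month and mm is minute):

    SELECT * FROM (
      SELECT t.*, unix_timestamp(dateColName, 'yyyy-MM-dd HH:mm:ss') AS ts
      FROM mytable t
    ) x
    ORDER BY ts;

Computing the timestamp in a subquery sidesteps any restrictions older Hive versions place on expressions inside the ORDER BY clause itself; rows whose strings fail to parse come back with a NULL ts.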
How to import extremely wide csv tables
Wrapping hive around existing csv files requires manually naming and typing every column in the creation command. I have several csv tables and some of them have a ton of columns. I would love a way to create hive tables which automatically infer the column types by attempting various type conversions or regex matches on the data (say the first row). What would be even cooler is if the first row could actually be interpreted differently from the rest of the table...as a set of string labels to name the columns, while the types could be automatically inferred from, say, the *second* row. These csv files are currently of this format, with the first row naming the columns. Does this make sense? Now, I'm sure that hive doesn't support this yet -- and I admit it is a somewhat esoteric desire on my part -- but I'm curious how others would suggest approaching it. I'm thinking of writing a separate isolated program that reads the first two rows of a csv file and dumps a text string of column names and types in the correct syntax for a hive external table creation statement, which I would then copy/paste into hive...I was just hoping for a simpler solution. Thoughts? Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley
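One partial answer worth noting, though it postdates the Hive versions in this thread: later releases (0.13+) can at least skip the label row via a table property, which removes the need to strip headers from the files (column list abbreviated, path hypothetical):

    CREATE EXTERNAL TABLE wide_csv (col1 STRING, col2 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/path/to/wide_csv'
    TBLPROPERTIES ('skip.header.line.count'='1');

The type inference itself still has to happen outside Hive, along the lines of the generator program described above.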
order by having no effect?!
Um, this is weird. It simply isn't modifying the order of the returned rows at all. I get the same result with no 'order by' clause as with one. Adding a limit or specifying 'asc' has no effect. Using 'sort by' also has no effect. The column used for ordering is type INT. In the example below, I was hoping to sort or order the results according to the third column. Like I said, I also tried adding 'limit 10' and/or 'asc' to the end of the query; they had no effect. The jobtracker shows a single mapper and a single reducer being used for the job incidentally, so it can't be some sort of multi-reducer sort discrepancy. hive> describe stringmap; OK objecttypecode int attributename string attributevalue int langid int value string displayorder int Time taken: 0.074 seconds hive> select * from stringmap where attributename='foo' order by 'attributevalue'; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=number In order to limit the maximum number of reducers: set hive.exec.reducers.max=number In order to set a constant number of reducers: set mapred.reduce.tasks=number Starting Job = job_201202221500_0114, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201202221500_0114 Kill Command = /media/sdb1/kwiley/hadoop/hadoop-0.20.2-cdh3u3/bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201202221500_0114 2012-03-13 14:45:28,418 Stage-1 map = 0%, reduce = 0% 2012-03-13 14:45:31,428 Stage-1 map = 100%, reduce = 0% 2012-03-13 14:45:39,459 Stage-1 map = 100%, reduce = 33% 2012-03-13 14:45:40,463 Stage-1 map = 100%, reduce = 100% Ended Job = job_201202221500_0114 OK 3 foo 6 10336 - aaa 6 3 foo 3 10333 - bbb 3 3 foo 4 10334 - ccc 4 4 foo 1 1033Default Value 1 3 foo 2 10332 - ddd 2 3 foo 1 10331 - eee 1 3 foo 5 10335 - fff 5 Time taken: 17.954 seconds hive> Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Luminous beings are we, not this crude matter. -- Yoda
Re: order by having no effect?!
On Mar 13, 2012, at 13:57 , Igor Tatarinov wrote: You have attributevalue in quotes, which makes it a constant literal. igor decide.com Argh! You are correct, good sir! Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson
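For the record, the corrected query from the exchange above, with the identifier unquoted so it refers to the column rather than a string constant:

    hive> select * from stringmap where attributename='foo' order by attributevalue;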
csv boolean type
What string values in a csv field are parsable by Hive as booleans? If I indicate that a column is of type boolean when wrapping an external table around a csv file, what are the legal values? I can imagine numerous possibilities, for example (for the true values): 0 t T true True TRUE y Y yes Yes YES etc. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
Re: csv boolean type
I obviously intended '1', not '0', as an example of a true value. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also. -- Mark Twain
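A minimal empirical check, for anyone unsure how their particular Hive/SerDe version parses these (mytable is a stand-in with a column already declared BOOLEAN): values the SerDe can't parse surface as NULL, so grouping shows exactly which inputs became true, false, or NULL:

    SELECT boolcol, COUNT(*) FROM mytable GROUP BY boolcol;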
Re: Basic statement problems
On Mar 9, 2012, at 16:46 , Steven Wong wrote: The LOCATION clause has to specify the directory that contains (only) your data files. I've tried it both ways: CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/crm/records/all/hiveTest'; CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/crm/records/all/hiveTest/StringMap.csv'; In both cases, show tables lists stringmap and describe stringmap describes the columns shown above, but a basic query doesn't return any results. What else should I try here? Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
Re: Basic statement problems
It has started working now. I don't know what I changed. I dropped every single table from hive, explicitly created a new directory on HDFS and moved the .csv file to that directory, ran hive again and created the table. This time it worked. I can perform queries against the directory. Maybe hadoop and hive confused each other about the other directory (it got corrupted or something)...or maybe I screwed something up, I dunno. I would have expected better error-detection. Instead of simply returning 0-length queries, it would be nice if hive would actually produce an error message if I create the table in an invalid or incorrect fashion...but perhaps it couldn't tell; maybe the database just looked empty at hive's level of abstraction. I mean, even if I did screw something up (an option I am entirely open to), I never really got an error about it. Hive gladly wrapped a hive table around the directory, and the csv file in question, without an error of any kind. Hive could see the table, it listed and could describe it, but would then return empty queries against the table. When I moved the exact same csv file to a brand new HDFS directory and tried again from scratch, everything started working. I probably did something wrong, but some sort of error message would have been very helpful. Anyway, these tools are still pretty young. I understand that they will continue to evolve. The ability to detect and report errors will almost certainly improve with time. Thanks for the concerted efforts to help. Cheers! Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered. -- Keith Wiley
External table point of clarification
It appears as if the syntax for creating an external table takes a column description and a directory where the table (a .csv file) should be found. Notably, it doesn't take the path to the file; rather, it takes the path to the directory containing the file. ...which means that if I have multiple .csv files, each a unique table with a unique column description, it is not sufficient to put them in one hdfs directory, since the columns must be described when creating the hive table. Rather, each .csv file conforming to a unique column description must be placed in its own separate hdfs directory, right? Just clarifying...is this correct? Can I not put a set of .csv files comprising multiple tables of a heterogeneous database in a single directory and then have hive build a set of tables around those files? Thank you. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow. -- Keith Wiley
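Assuming the one-directory-per-table reading is right (it matches the LOCATION semantics discussed elsewhere in these threads), the layout would look like the following sketch, with the second path and table purely hypothetical:

    -- /crm/records/all/stringmap/StringMap.csv   (one table per directory)
    -- /crm/records/all/accounts/Accounts.csv

    CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING,
      AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE LOCATION '/crm/records/all/stringmap';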
Re: Error
Thanks, I'll give it a shot on Monday. On Mar 10, 2012, at 01:12 , Jagat wrote: Dear Keith Please delete $HADOOP_HOME/build , in your hadoop home the build directory And try again Thanks Jagat Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also. -- Mark Twain
Re: Error
Considering that I don't even know what the metastore is, I doubt I did anything specifically aside from the instructions. All I did was follow the startup guide. Nothing more. More to the point, when I followed the same instructions on a Linux box, it worked. Something is admittedly wrong with my Mac setup, but I don't think there is any third-party metastore of some sort. I certainly didn't install or create anything like that. I'm not even sure what it is. Hmmm, how do I re-set up or initialize the metastore from scratch and try again? On Mar 9, 2012, at 13:21 , Mark Grover wrote: Hi Keith, This error is typically encountered when your metastore (that stores the metadata related to Hive tables e.g. name of columns, name of tables, etc. etc.) is not correctly set up. Have you set up an external metastore? By default, Hive should use an embedded Derby metastore which only allows one connection at a time. Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com Best Trading Platform - World Finance's Forex Awards 2009. The One to Watch - Treasury Today's Adam Smith Awards 2009. - Original Message - From: Keith Wiley kwi...@keithwiley.com To: user@hive.apache.org Sent: Friday, March 9, 2012 3:02:40 PM Subject: Error I'm quite comfortable with hadoop and the associated lingo, been programming it via Java and via C++ streams for several years. However, I have just started Hive for the first time...and I'm stuck. I was following the getting started page on the apache site. I got as far as this: hive> CREATE TABLE pokes (foo INT, bar STRING); FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask If I look at the logs, I see the following. I'm not sure how to proceed from here. Any help is appreciated. Thanks. $ cat /tmp/keithw/hive.log 2012-03-09 12:02:22,087 ERROR exec.DDLTask (SessionState.java:printError(365)) - FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException org.apache.hadoop.hive.ql.metadata.HiveException: javax.jdo.JDOFatalInternalException: Unexpected exception caught. NestedThrowables: java.lang.reflect.InvocationTargetException at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:476) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3176) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:213) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197) Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught.
NestedThrowables: java.lang.reflect.InvocationTargetException at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1186) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:803) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:698) at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:234) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:261) at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:196) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:171) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:354) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.executeWithRetry(HiveMetaStore.java:306) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:451
Basic statement problems
I successfully installed and used Hive to create basic tables (on one of my two machines; another discussion describes the problems I'm having with the other machine). However, basic queries aren't working. I brought a typical CSV file into a Hive table and it seemed fine. Here's how I did it: CREATE EXTERNAL TABLE stringmap (ObjectTypeCode INT, AttributeName STRING, AttributeValue INT, LangId INT, Value STRING, DisplayOrder INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/crm/records/all/hiveTest/StringMap.csv'; show tables and describe stringmap return correct results. However, if I run a really simple query, it returns incorrect results. For example, a row count query returns a 0. Observe: hive> select count(*) from stringmap; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=number In order to limit the maximum number of reducers: set hive.exec.reducers.max=number In order to set a constant number of reducers: set mapred.reduce.tasks=number Starting Job = job_201202221500_0103, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201202221500_0103 Kill Command = /media/sdb1/kwiley/hadoop/hadoop-0.20.2-cdh3u3/bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201202221500_0103 2012-03-09 16:20:26,243 Stage-1 map = 0%, reduce = 0% 2012-03-09 16:20:29,258 Stage-1 map = 0%, reduce = 100% 2012-03-09 16:20:32,278 Stage-1 map = 100%, reduce = 100% Ended Job = job_201202221500_0103 OK 0 Time taken: 15.969 seconds The Hadoop job runs without error, but it returns a 0. The job tracker indicates that Hive created a job with 0 mappers and 1 reducer. I don't see any useful output in the reducer task log, however. The following is from the hive log: 2012-03-09 16:20:17,057 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2012-03-09 16:20:17,057 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2012-03-09 16:20:17,059 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2012-03-09 16:20:17,059 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2012-03-09 16:20:17,060 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 2012-03-09 16:20:17,060 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved. 2012-03-09 16:20:22,781 WARN mapred.JobClient (JobClient.java:copyAndConfigureFiles(649)) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2012-03-09 16:20:22,946 WARN snappy.LoadSnappy (LoadSnappy.java:clinit(36)) - Snappy native library is available I admit, that does look rather erroneous, but I'm not sure what to make of it. I looked those errors up online but didn't find much that seemed to suggest a cause or solution. Any ideas? Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com You can scratch an itch, but you can't itch a scratch.
Furthermore, an itch can itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together this implies: He scratched the itch from the scratch that itched but would never itch the scratch from the itch that scratched. -- Keith Wiley
0.7.1 vs. 0.8.1
The only version feature list I have found so far is the JIRA logs, which are a little difficult to make sense of since they get pretty detailed. I'm curious, from a straightforward, high-level, sales-pitch perspective, what advantages 0.8.1 offers over 0.7.1. I ask because I generally use CDH3 and it provides 0.7.1. I'm curious if I should use a more recent version of Hive with the version of Hadoop in CDH3 (0.20.2). Thanks. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com Yet mark his perfect self-contentment, and hence learn his lesson, that to be self-contented is to be vile and ignorant, and that to aspire is better than to be blindly and impotently happy. -- Edwin A. Abbott, Flatland
Re: Basic statement problems
So a directory, not a specific file. I thought I tried it both ways, but I'll switch it back the other way and try again. Thanks. On Mar 9, 2012, at 16:46 , Steven Wong wrote: The LOCATION clause has to specify the directory that contains (only) your data files. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com And what if we picked the wrong religion? Every week, we're just making God madder and madder! -- Homer Simpson