For more details, see the user@hive discussion January 5 - February 8: "Is Hive Index officially not recommended? <http://apache.markmail.org/message/vkkvryovqovootms?q=hive+list:org%2Eapache%2Ehadoop%2Ehive-user+%22Is+Hive+Index+officially+not+recommended%3F%22> "
-- Lefty On Mon, Feb 15, 2016 at 11:52 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Hi, > > > > "Traditional" Indexes are not currently used in Hive. You can create them > but they are not used by the optimizer. > > You can create storage indexes in Hive using ORC file format that provides > three levels of granularity > > 1. ORC File itself > 2. Multiple stripes within the ORC file > 3. Multiple row groups (row batches) within each stripe > > Effectively: > > - Chunks of data making up ORC file stored as storage index. *Storage > index* is the term used for the combined Index and statistics. > - Each Storage Index has statistics of min, max, count, and sum for > each column in the grouping of rows in batches of 10,000 called *row > group*. Row group both *has row data* and *index data* > - Crucially, it needs the location of the start of each row group, so > that the query could jump straight to the beginning of the row group so > narrowing down the search path. > - The query should perform a SARG pushdown that limits which rows are > required for the query and can avoid reading an entire file, or at least > sections of the file which is by and large what a conventional RDBMS B-tree > index does. > - Support for new ACID features in Hive (insert, update and delete). > > > > HTH. > > > > Mich > > > > On 16/02/2016 03:17, 万修远 wrote: > > Hello, > > *When I use index in hive 1.2.1, I find the index does not work. The > details are as follows:* > > 1. After using index, the query speed does not improve. If I use manual > use of indexes, the query speed improve obviously, but when switch to > automatic use of indexes, the speed makes no difference relative to not use > index. > > 2. After rebuild index, I add a new text file which includes one record > matching my query filter in the table directory. Then, the query results > will show the record included in the new text file. (The case that append > new record in the same file but in different block is the same.) > > 3.When debug the hive source code I find that the function > generateIndexQuery of class CompactIndexHandler is't called. Finally I > find that the function compile in class TaskCompiler returns early at the > follow statements: > if (pCtx.getFetchTask() != null) { > return; > }this will result in index not working for query. But I do't know why to > set FetchTask because I know little about hive. > > -------------------------------------------------------------------------------------------------------- > > *So, My question is :*1. Does hive 1.2.1 support index normally? IF it > supports index completely, what's my issue?2. I want to know how indexes > are used to optimize queries, where can I find some references? > > -------------------------------------------------------------------------------------------------------- > > *Appendix: How do I use index in hive 1.2.1* > > 1.create table and load data: > > create table table01( id int, name string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '\t'; > load data local inpath '/home/hadoop/data/dual.txt' overwrite into table > table01; > > 2.create and rebuild index: > > create index table01_index on table table01(id) as > 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred > rebuild; > alter index table01_index on table01 rebuild; > > 3.set properties: > > set hive.optimize.index.filter.compact.minsize=0; > set hive.optimize.index.filter.compact.maxsize=-1; > set hive.index.compact.query.max.size=-1; > set hive.index.compact.query.max.entries=-1; > set Hive.optimize.index.groupby=false; > set hive.optimize.index.filter=true; > set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; > > 4.execute query statement: > > select * from table01 where id =500000; > > Thanks! > ------------------------------ > Jason > > > > > -- > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Cloud Technology Partners > Ltd, its subsidiaries or their employees, unless expressly so stated. It is > the responsibility of the recipient to ensure that this email is virus free, > therefore neither Cloud Technology partners Ltd, its subsidiaries nor their > employees accept any responsibility. > > >