For more details, see the user@hive discussion January 5 - February 8:  "Is
Hive Index officially not recommended?
<http://apache.markmail.org/message/vkkvryovqovootms?q=hive+list:org%2Eapache%2Ehadoop%2Ehive-user+%22Is+Hive+Index+officially+not+recommended%3F%22>
"

-- Lefty


On Mon, Feb 15, 2016 at 11:52 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

> Hi,
>
>
>
> "Traditional" Indexes are not currently used in Hive. You can create them
> but they are not used by the optimizer.
>
> You can create storage indexes in Hive using ORC file format that provides
> three levels of granularity
>
>    1. ORC File itself
>    2. Multiple stripes within the ORC file
>    3. Multiple row groups (row batches) within each stripe
>
> Effectively:
>
>    - Chunks of data making up ORC file stored as storage index. *Storage
>    index* is the term used for the combined Index and statistics.
>    - Each Storage Index has statistics of min, max, count, and sum for
>    each column in the grouping of rows in batches of 10,000 called *row
>    group*. Row group both *has row data* and *index data*
>    - Crucially, it needs the location of the start of each row group, so
>    that the query could jump straight to the beginning of the row group so
>    narrowing down the search path.
>    - The query should perform a SARG pushdown that limits which rows are
>    required for the query and can avoid reading an entire file, or at least
>    sections of the file which is by and large what a conventional RDBMS B-tree
>    index does.
>    - Support for new ACID features in Hive (insert, update and delete).
>
>
>
> HTH.
>
>
>
> Mich
>
>
>
> On 16/02/2016 03:17, 万修远 wrote:
>
> Hello,
>
> *When I use index in hive 1.2.1, I find the index does not work.  The
> details are as follows:*
>
> 1. After using index, the query speed does not improve.  If I use manual
> use of indexes, the query speed improve obviously, but when switch to
> automatic use of indexes, the speed makes no difference relative to not use
> index.
>
> 2. After rebuild index, I add a new text file which includes one record
> matching my query filter in the table directory. Then,  the query results
> will show the record included in the new text file. (The case that append
> new record in the same file but in different block is the same.)
>
> 3.When debug the hive source code I find that the function
> generateIndexQuery of class CompactIndexHandler is't called. Finally I
> find that the function compile in class TaskCompiler returns early at the
> follow statements:
> if (pCtx.getFetchTask() != null) {
> return;
> }this will result in index not working for query. But I do't know why to
> set FetchTask because I know little about hive.
>
> --------------------------------------------------------------------------------------------------------
>
> *So, My question is :*1. Does hive 1.2.1 support index normally? IF it
> supports index completely, what's my issue?2. I want to know  how indexes
> are used to optimize queries, where can I find some references?
>
> --------------------------------------------------------------------------------------------------------
>
> *Appendix: How do I use index in hive 1.2.1*
>
> 1.create table and load data:
>
> create table table01( id int, name string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';
> load data local inpath '/home/hadoop/data/dual.txt' overwrite into table 
> table01;
>
> 2.create and rebuild index:
>
> create index table01_index on table table01(id) as 
> 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred 
> rebuild;
> alter index table01_index on table01 rebuild;
>
> 3.set properties:
>
> set hive.optimize.index.filter.compact.minsize=0;
> set hive.optimize.index.filter.compact.maxsize=-1;
> set hive.index.compact.query.max.size=-1;
> set hive.index.compact.query.max.entries=-1;
> set Hive.optimize.index.groupby=false;
> set hive.optimize.index.filter=true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
>
> 4.execute query statement:
>
> select * from table01 where id =500000;
>
> Thanks!
> ------------------------------
> Jason
>
>
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>
>
>

Reply via email to