Re: The index for query in hive 1.2.1 does not work.

2016-02-16 Thread Lefty Leverenz
For more details, see the user@hive discussion January 5 - February 8:  "Is
Hive Index officially not recommended?

"

-- Lefty


On Mon, Feb 15, 2016 at 11:52 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

> Hi,
>
>
>
> "Traditional" Indexes are not currently used in Hive. You can create them
> but they are not used by the optimizer.
>
> You can create storage indexes in Hive using ORC file format that provides
> three levels of granularity
>
>1. ORC File itself
>2. Multiple stripes within the ORC file
>3. Multiple row groups (row batches) within each stripe
>
> Effectively:
>
>- Chunks of data making up ORC file stored as storage index. *Storage
>index* is the term used for the combined Index and statistics.
>- Each Storage Index has statistics of min, max, count, and sum for
>each column in the grouping of rows in batches of 10,000 called *row
>group*. Row group both *has row data* and *index data*
>- Crucially, it needs the location of the start of each row group, so
>that the query could jump straight to the beginning of the row group so
>narrowing down the search path.
>- The query should perform a SARG pushdown that limits which rows are
>required for the query and can avoid reading an entire file, or at least
>sections of the file which is by and large what a conventional RDBMS B-tree
>index does.
>- Support for new ACID features in Hive (insert, update and delete).
>
>
>
> HTH.
>
>
>
> Mich
>
>
>
> On 16/02/2016 03:17, 万修远 wrote:
>
> Hello,
>
> *When I use index in hive 1.2.1, I find the index does not work.  The
> details are as follows:*
>
> 1. After using index, the query speed does not improve.  If I use manual
> use of indexes, the query speed improve obviously, but when switch to
> automatic use of indexes, the speed makes no difference relative to not use
> index.
>
> 2. After rebuild index, I add a new text file which includes one record
> matching my query filter in the table directory. Then,  the query results
> will show the record included in the new text file. (The case that append
> new record in the same file but in different block is the same.)
>
> 3.When debug the hive source code I find that the function
> generateIndexQuery of class CompactIndexHandler is't called. Finally I
> find that the function compile in class TaskCompiler returns early at the
> follow statements:
> if (pCtx.getFetchTask() != null) {
> return;
> }this will result in index not working for query. But I do't know why to
> set FetchTask because I know little about hive.
>
> 
>
> *So, My question is :*1. Does hive 1.2.1 support index normally? IF it
> supports index completely, what's my issue?2. I want to know  how indexes
> are used to optimize queries, where can I find some references?
>
> 
>
> *Appendix: How do I use index in hive 1.2.1*
>
> 1.create table and load data:
>
> create table table01( id int, name string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';
> load data local inpath '/home/hadoop/data/dual.txt' overwrite into table 
> table01;
>
> 2.create and rebuild index:
>
> create index table01_index on table table01(id) as 
> 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred 
> rebuild;
> alter index table01_index on table01 rebuild;
>
> 3.set properties:
>
> set hive.optimize.index.filter.compact.minsize=0;
> set hive.optimize.index.filter.compact.maxsize=-1;
> set hive.index.compact.query.max.size=-1;
> set hive.index.compact.query.max.entries=-1;
> set Hive.optimize.index.groupby=false;
> set hive.optimize.index.filter=true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
>
> 4.execute query statement:
>
> select * from table01 where id =50;
>
> Thanks!
> --
> Jason
>
>
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>
>
>


Re: The index for query in hive 1.2.1 does not work.

2016-02-15 Thread Mich Talebzadeh
 

Hi, 

"Traditional" Indexes are not currently used in Hive. You can create
them but they are not used by the optimizer. 

You can create storage indexes in Hive using ORC file format that
provides three levels of granularity 

* ORC File itself
* Multiple stripes within the ORC file
* Multiple row groups (row batches) within each stripe

Effectively: 

* Chunks of data making up ORC file stored as storage index. _Storage
index_ is the term used for the combined Index and statistics.
* Each Storage Index has statistics of min, max, count, and sum for
each column in the grouping of rows in batches of 10,000 called _row
group_. Row group both _has row data_ and _index data_
* Crucially, it needs the location of the start of each row group, so
that the query could jump straight to the beginning of the row group so
narrowing down the search path.
* The query should perform a SARG pushdown that limits which rows are
required for the query and can avoid reading an entire file, or at least
sections of the file which is by and large what a conventional RDBMS
B-tree index does.
* Support for new ACID features in Hive (insert, update and delete).

HTH. 

Mich 

On 16/02/2016 03:17, 万修远 wrote: 

> Hello, 
> 
> WHEN I USE INDEX IN HIVE 1.2.1, I FIND THE INDEX DOES NOT WORK. THE DETAILS 
> ARE AS FOLLOWS: 
> 
> 1. After using index, the query speed does not improve. If I use manual use 
> of indexes, the query speed improve obviously, but when switch to automatic 
> use of indexes, the speed makes no difference relative to not use index. 
> 
> 2. After rebuild index, I add a new text file which includes one record 
> matching my query filter in the table directory. Then, the query results will 
> show the record included in the new text file. (The case that append new 
> record in the same file but in different block is the same.) 
> 
> 3.When debug the hive source code I find that the function generateIndexQuery 
> of class CompactIndexHandler is't called. Finally I find that the function 
> compile in class TaskCompiler returns early at the follow statements: 
> if (pCtx.getFetchTask() != null) {
> return;
> }this will result in index not working for query. But I do't know why to set 
> FetchTask because I know little about hive. 
> 
> 
> SO, MY QUESTION IS :1. Does hive 1.2.1 support index normally? IF it supports 
> index completely, what's my issue?2. I want to know how indexes are used to 
> optimize queries, where can I find some references? 
> 
> 
> APPENDIX: HOW DO I USE INDEX IN HIVE 1.2.1 
> 
> 1.create table and load data:
> 
> create table table01( id int, name string) 
> ROW FORMAT DELIMITED 
> FIELDS TERMINATED BY 't'; 
> load data local inpath '/home/hadoop/data/dual.txt' overwrite into table 
> table01;
> 
> 2.create and rebuild index:
> 
> create index table01_index on table table01(id) as 
> 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred 
> rebuild;
> alter index table01_index on table01 rebuild;
> 
> 3.set properties:
> 
> set hive.optimize.index.filter.compact.minsize=0;
> set hive.optimize.index.filter.compact.maxsize=-1;
> set hive.index.compact.query.max.size=-1;
> set hive.index.compact.query.max.entries=-1;
> set Hive.optimize.index.groupby=false;
> set hive.optimize.index.filter=true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
> 
> 4.execute query statement:
> 
> select * from table01 where id =50;
> 
> Thanks! 
> -
> 
> Jason

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

 

The index for query in hive 1.2.1 does not work.

2016-02-15 Thread 万修远
Hello,

When I use index in hive 1.2.1, I find the index does not work.  The details 
are as follows:

1. After using index, the query speed does not improve.  If I use manual use of 
indexes, the query speed improve obviously, but when switch to automatic use of 
indexes, the speed makes no difference relative to not use index.

2. After rebuild index, I add a new text file which includes one record 
matching my query filter in the table directory. Then,  the query results will 
show the record included in the new text file. (The case that append new record 
in the same file but in different block is the same.)

3.When debug the hive source code I find that the function  generateIndexQuery 
of class CompactIndexHandler is't called. Finally I find that the function 
compile in class TaskCompiler returns early at the follow statements:
if (pCtx.getFetchTask() != null) {
  return;
}this will result in index not working for query. But I do't know why to set 
FetchTask because  I know little about 
hive.So,
 My question is :1. Does hive 1.2.1 support index normally? IF it supports 
index completely, what's my issue?2. I want to know  how indexes are used to 
optimize queries, where can I find some 
references?Appendix:
 How do I use index in hive 1.2.11.create table and load data:create table 
table01( id int, name string)  
ROW FORMAT DELIMITED  
FIELDS TERMINATED BY '\t';  
load data local inpath '/home/hadoop/data/dual.txt' overwrite into table 
table01;2.create and rebuild index:create index table01_index on table 
table01(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
with deferred rebuild;
alter index table01_index on table01 rebuild;3.set properties:set 
hive.optimize.index.filter.compact.minsize=0;
set hive.optimize.index.filter.compact.maxsize=-1;
set hive.index.compact.query.max.size=-1;
set hive.index.compact.query.max.entries=-1;
set Hive.optimize.index.groupby=false;
set hive.optimize.index.filter=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;4.execute 
query statement:select * from table01 where id =50;
Thanks!


Jason