Hive optimiser seems not to use for a covered query

2015-03-27 Thread Mich Talebzadeh
Hi,

 

I am very new to hive optimiser

 

Here I have a table with 4 million rows imported from Oracle via sqoop/hive. In 
this table object_id column is unique. Oracle table has primary key constraint 
on object_id column which is basically a unique B-tree index.

 

I do a very simple query to see how many unique values are for object_id in 
table. The answer is they are as many as number of rows.

 

So query like below in Oracle

 

SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM hddtester.tdash;

 

Should return 1. Now Oracle optimiser only needs to read the index key and work 
it out WITHOUT touching the underlying table and it does that

 

--

Plan hash value: 1988751498

 

---

| Id  | Operation  | Name | Rows  | Bytes | Cost (%CPU)| Time   
  |

---

|   0 | SELECT STATEMENT   |  | 1 |26 | 13952   (1)| 
00:02:48 |

|   1 |  SORT AGGREGATE|  | 1 |26 ||
  |

|   2 |   VIEW | VW_DAG_0 |  4000K|99M| 13952   (1)| 
00:02:48 |

|   3 |SORT GROUP BY NOSORT|  |  4000K|22M| 13952   (1)| 
00:02:48 |

|   4 | INDEX FULL SCAN| TDASH_PK |  4000K|22M| 13952   (1)| 
00:02:48 |

---

 

Here it is shown as Operation Id = 4 “INDEX FULL SCAN”. Please note that the 
table itself is not touched as expected

 

Now I have the same table “tdash” in Hive with a compact index on object_id. I 
have analysed stats for table with “analyze table tdash compute statistics”. 
Now I do explain as below

 

hive> explain SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM tdash;

OK

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 depends on stages: Stage-1

 

STAGE PLANS:

  Stage: Stage-1

Map Reduce

  Map Operator Tree:

  TableScan

alias: tdash

Statistics: Num rows: 400 Data size: 32564651117 Basic stats: 
COMPLETE Column stats: NONE

Select Operator

  expressions: object_id (type: double)

  outputColumnNames: object_id

  Statistics: Num rows: 400 Data size: 32564651117 Basic stats: 
COMPLETE Column stats: NONE

  Group By Operator

aggregations: count(DISTINCT object_id), count(object_id)

keys: object_id (type: double)

mode: hash

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 400 Data size: 32564651117 Basic 
stats: COMPLETE Column stats: NONE

Reduce Output Operator

  key expressions: _col0 (type: double)

  sort order: +

  Statistics: Num rows: 400 Data size: 32564651117 Basic 
stats: COMPLETE Column stats: NONE

  value expressions: _col2 (type: bigint)

  Reduce Operator Tree:

Group By Operator

  aggregations: count(DISTINCT KEY._col0:0._col0), count(VALUE._col1)

  mode: mergepartial

  outputColumnNames: _col0, _col1

  Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column 
stats: NONE

  Select Operator

expressions: (_col0 / _col1) (type: double)

outputColumnNames: _col0

Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column 
stats: NONE

File Output Operator

  compressed: false

  Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE 
Column stats: NONE

  table:

  input format: org.apache.hadoop.mapred.TextInputFormat

  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 

  Stage: Stage-0

Fetch Operator

  limit: -1

  Processor Tree:

ListSink

 

Time taken: 0.691 seconds, Fetched: 50 row(s)

 

Trying to understand above does keys: object_id (type: double) refers to use of 
index here? I dropped that index and the same plan was produced! How Hive 
optimiser flag the index usage in the plan. Do I need to update column statists 
as well as table

 

Many thanks

 

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and 
Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries

Re: Hive optimiser seems not to use for a covered query

2015-03-27 Thread Alan Gates
Hive's optimizer never uses the indexes.  Indexes aren't fully 
implemented in Hive.


Alan.


Mich Talebzadeh 
March 27, 2015 at 1:33