Hi Marc,
Regardless of whether you rebuild an index or not, I looked into how to check
whether indexes are actually used in Hive. As far as I know, indexes are not
fully implemented in Hive and the optimiser does not use them.
See the attached emails.
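As a quick check (a sketch only: the statements below reuse the example schema from this thread, and whether the optimiser actually picks up a compact index depends on the Hive version and the query), you can build an index and then EXPLAIN a query that could use it. If the plan still shows a plain TableScan of the base table and never references the index table, the index is not being used:

```sql
-- Names follow the thread's example schema (DbTest.Tbl_test, column TEST_KEY).
CREATE INDEX ix_key ON TABLE DbTest.Tbl_test (TEST_KEY)
  AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  WITH DEFERRED REBUILD;

ALTER INDEX ix_key ON DbTest.Tbl_test REBUILD;

-- Ask the optimiser to consider indexes when compiling filter predicates.
SET hive.optimize.index.filter=true;

-- Inspect the plan: look for the index table
-- (dbtest.dbtest__tbl_test_index_ix_key__) instead of a scan of the base table.
EXPLAIN SELECT * FROM DbTest.Tbl_test WHERE TEST_KEY = 42;
```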
HTH
Mich Talebzadeh
http://talebzadehmich.wordpress.com
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
Coherence Cache
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one
out shortly
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this message
shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries
or their employees, unless expressly so stated. It is the responsibility of the
recipient to ensure that this email is virus free, therefore neither Peridale
Ltd, its subsidiaries nor their employees accept any responsibility.
From: Marc Seeger [mailto:m...@web-computing.de]
Sent: 15 May 2015 12:53
To: user@hive.apache.org
Subject: Re: Index Rebuild - DAG fails due to vertex failure
Indexing is described in the Hive wiki:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing and
https://cwiki.apache.org/confluence/display/Hive/IndexDev
When I first used it, the index was built: the index table was filled with data
and performance increased. But without any configuration change it no longer
works on any table.
Below is the result of my EXPLAIN query.
0: jdbc:hive2://localhost:10000> EXPLAIN ALTER INDEX ix_key ON DbTest.Tbl_test REBUILD;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0
  Stage-4 depends on stages: Stage-1
  Stage-5 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: hive_20150515134040_03126f31-d054-4bb0-9f58-99f72cd8d1ab:94
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: Tbl_test
                  Statistics: Num rows: 7014810 Data size: 4773111850 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: TEST_KEY (type: bigint), INPUT__FILE__NAME (type: string), BLOCK__OFFSET__INSIDE__FILE (type: bigint)
                    outputColumnNames: TEST_KEY, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
                    Statistics: Num rows: 7014810 Data size: 4773111850 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: collect_set(BLOCK__OFFSET__INSIDE__FILE)
                      keys: TEST_KEY (type: bigint), INPUT__FILE__NAME (type: string)
                      mode: hash
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 7014810 Data size: 4773111850 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint), _col1 (type: string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 7014810 Data size: 4773111850 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col2 (type: array<bigint>)
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: collect_set(VALUE._col0)
                keys: KEY._col0 (type: bigint), KEY._col1 (type: string)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 3507405 Data size: 2386555925 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: array<bigint>)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3507405 Data size: 2386555925 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3507405 Data size: 2386555925 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                        name: dbtest.dbtest__tbl_test_index_ix_key__

  Stage: Stage-2
    Dependency Collection

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: dbtest.dbtest__tbl_test_index_ix_key__

  Stage: Stage-3
    Stats-Aggr Operator

  Stage: Stage-4

  Stage: Stage-5
2015-05-15 12:32 GMT+02:00 Dr Mich Talebzadeh <m...@peridale.co.uk>:
Hi Marc,
As far as I know, indexes do not work in Hive. Have you checked it with
EXPLAIN?
Thanks,
Mich
On 15/5/2015, "Marc Seeger" <m...@web-computing.de> wrote:
>Hi,
>
>I'm using Hive 0.14 on an HDP 2.2 cluster and have a problem with indexing in
>Hive. I can create an index:
>
>CREATE INDEX ix_key ON TABLE DbTest.Tbl_test (TEST_KEY)
>AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
>WITH DEFERRED REBUILD;
>
>After that I loaded data into the table and built the index.
>
>ALTER INDEX ix_key ON DbTest.Tbl_test REBUILD;
>
>Hive built the index and it worked fine; performance increased. Now I want
>to rebuild the index, but I always get an error:
>
>INFO : Session is already open
>INFO : Tez session was closed. Reopening...
>INFO : Session re-established.
>INFO :
>
>ERROR : Status: Failed
>ERROR : Vertex failed, vertexName=Map 1,
>vertexId=vertex_1426585957958_2810_1_00, diagnostics=[Vertex
>vertex_1426585957958_2810_1_00 [Map 1] killed/failed due
>to:ROOT_INPUT_INIT_FAILURE, Vertex Input: Tbl_test initializer failed,
>vertex=vertex_1426585957958_2810_1_00 [Map 1],
>java.lang.NullPointerException
>at
>org.apache.hadoop.hive.ql.exec.tez.DynamicPartitionPruner.initialize(DynamicPartitionPruner.java:135)
>at
>org.apache.hadoop.hive.ql.exec.tez.DynamicPartitionPruner.prune(DynamicPartitionPruner.java:100)
>at
>org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:109)
>at
>org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245)
>at
>org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:415)
>at
>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>at
>org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239)
>at
>org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226)
>at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>at
>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at
>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:745)
>]
>ERROR : Vertex killed, vertexName=Reducer 2,
>vertexId=vertex_1426585957958_2810_1_01, diagnostics=[Vertex received
>Kill in INITED state., Vertex vertex_1426585957958_2810_1_01 [Reducer
>2] killed/failed due to:null]
>ERROR : DAG failed due to vertex failure. failedVertices:1 killedVertices:1
>Error: Error while processing statement: FAILED: Execution Error,
>return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
>(state=08S01,code=2)
>
>The base table exists and I can run queries against it. The index table exists
>too. If I create a new index on another table and run the REBUILD command, I
>get the same error. I tried the command from both Beeline and the CLI with the
>same result.
>Thanks for your help,
>Marc
>
>
--- Begin Message ---
Hi,
I am very new to the Hive optimiser.
I have a table with 4 million rows imported from Oracle via Sqoop/Hive. In this
table the object_id column is unique. The Oracle table has a primary key
constraint on object_id, which is backed by a unique B-tree index.
I ran a very simple query to see how many unique values object_id has. The
answer is that there are as many distinct values as there are rows.
So a query like the one below in Oracle
SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM hddtester.tdash;
should return 1. The Oracle optimiser only needs to read the index keys and can
work this out WITHOUT touching the underlying table, and it does exactly that:
Plan hash value: 1988751498
-------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |    26 | 13952   (1)| 00:02:48 |
|   1 |  SORT AGGREGATE         |          |     1 |    26 |            |          |
|   2 |   VIEW                  | VW_DAG_0 |  4000K|    99M| 13952   (1)| 00:02:48 |
|   3 |    SORT GROUP BY NOSORT |          |  4000K|    22M| 13952   (1)| 00:02:48 |
|   4 |     INDEX FULL SCAN     | TDASH_PK |  4000K|    22M| 13952   (1)| 00:02:48 |
-------------------------------------------------------------------------------------
Here it is shown as operation Id = 4, "INDEX FULL SCAN". Note that the table
itself is not touched, as expected.
Now I have the same table, "tdash", in Hive with a compact index on object_id. I
have gathered table statistics with "ANALYZE TABLE tdash COMPUTE STATISTICS".
Now I run EXPLAIN as below:
hive> explain SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM tdash;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tdash
            Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: object_id (type: double)
              outputColumnNames: object_id
              Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(DISTINCT object_id), count(object_id)
                keys: object_id (type: double)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: double)
                  sort order: +
                  Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col0:0._col0), count(VALUE._col1)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: (_col0 / _col1) (type: double)
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Time taken: 0.691 seconds, Fetched: 50 row(s)
Trying to understand the above: does "keys: object_id (type: double)" refer to
use of the index here? I dropped that index and the same plan was produced! How
does the Hive optimiser flag index usage in the plan? Do I need to update
column statistics as well as table statistics?
Many thanks
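On the column-statistics question above: in Hive, table-level and column-level statistics are gathered by separate ANALYZE commands. A minimal sketch, using the table and column from this thread (note that having column statistics is a separate question from whether the plan ever uses the index):

```sql
-- Table-level statistics (row count, raw data size).
ANALYZE TABLE tdash COMPUTE STATISTICS;

-- Column-level statistics (distinct values, nulls, min/max) for object_id.
ANALYZE TABLE tdash COMPUTE STATISTICS FOR COLUMNS object_id;
```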
Mich Talebzadeh
http://talebzadehmich.wordpress.com
From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: 26 March 2015 17:27
To: user@hive.apache.org
Subject: Understanding Hive's execution plan
Hi,
Can anyone direct me to a good explanation on understanding Hive's execution
plan?
Thanks,
Daniel
--- End Message ---
--- Begin Message ---
Hive's optimizer never uses the indexes. Indexes aren't fully implemented in
Hive.
Alan.
Mich Talebzadeh <mailto:m...@peridale.co.uk>
March 27, 2015 at 1:33
--- End Message ---