The issue with this is that HDFS lacks the ability to co-locate blocks. So if you break your columns into one file per column (the more traditional columnar route), you end up in a situation where 2/3 of the time only one of your columns is being read locally, which results in a significant performance penalty. That's why ORC, Parquet, and RCFile all use one file for their "columnar" stores.
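
For illustration, a minimal sketch with the Apache ORC core writer API (the path, schema, and row contents are made up): both columns land in one physical file, laid out column by column inside each stripe, so a single local block read serves every column of those rows.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcOneFileDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Two columns, one physical file: no block co-location needed.
        TypeDescription schema =
            TypeDescription.fromString("struct<id:bigint,name:string>");
        Writer writer = OrcFile.createWriter(new Path("/tmp/demo.orc"),
            OrcFile.writerOptions(conf).setSchema(schema));
        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector name = (BytesColumnVector) batch.cols[1];
        for (int r = 0; r < 100000; r++) {
          int row = batch.size++;
          id.vector[row] = r;
          name.setVal(row, ("row-" + r).getBytes(StandardCharsets.UTF_8));
          if (batch.size == batch.getMaxSize()) {
            writer.addRowBatch(batch);
            batch.reset();
          }
        }
        if (batch.size != 0) {
          writer.addRowBatch(batch);
        }
        // Both columns are written stripe by stripe into the same file.
        writer.close();
      }
    }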

Alan.

Mich Talebzadeh <m...@peridale.co.uk>
January 5, 2016 at 22:24
Hi,

Thinking loudly.

Ideally we should consider a totally columnar storage offering in which each
column of a table is stored as compressed values (I disregard for now how
ORC actually does this, but obviously it is not exactly columnar storage).

So each table can be considered a loose federation of columnar stores,
with each column effectively an index?

As columns are far narrower than tables, each index block will have much
higher density, and operations like aggregates can be done directly on the
index rather than the table.

This type of table offering would be in the true nature of data warehouse
storage. Of course, row operations ("get me all rows of this table") will be
slower, but that is the trade-off we need to consider.

Expecting users to write their own IndexHandler may be technically
interesting but is not commercially viable, as Hive needs to be a product on
its own merits, not a development base. Writing your own storage attributes
etc. requires skills that will put off people who would otherwise see Hive
as an attractive proposition (it requires considerable investment in skill
sets just to maintain Hive).

Thus my thinking on this is to offer true columnar storage in Hive to make
it a proper data warehouse. In addition, the development tools can be made
available for those interested in tailoring their own specific Hive
solutions.


HTH



Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
"Complex Event Processing in Heterogeneous Environments", ISBN 978-0-9563693-3-8
"Oracle and Sybase, Concepts and Contrasts", ISBN 978-0-9563693-1-4, volume one out shortly

http://talebzadehmich.wordpress.com



-----Original Message-----
From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of Gopal
Vijayaraghavan
Sent: 05 January 2016 23:55
To: user@hive.apache.org
Subject: Re: Is Hive Index officially not recommended?


> Is Hive Index officially not recommended now?

The builtin indexes - those that write data as smaller tables - are only
useful in a pre-columnar world, where the indexes offer a huge reduction in
IO.
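
For context, a hedged sketch of driving those builtin indexes over Hive
JDBC (the connection URL, table, and column names are all made up). The
'COMPACT' handler materializes the index as a separate, smaller table -
here default__web_logs_ip_idx__ - which must be rebuilt explicitly:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BuiltinIndexDemo {
      public static void main(String[] args) throws Exception {
        // Assumes the Hive JDBC driver on the classpath and an
        // unauthenticated HiveServer2 (both hypothetical here).
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
          // Materializes the index as a smaller side table
          // (default__web_logs_ip_idx__) of (ip, _bucketname, _offsets).
          stmt.execute("CREATE INDEX ip_idx ON TABLE web_logs (ip) "
              + "AS 'COMPACT' WITH DEFERRED REBUILD");
          // The index table is only populated by an explicit rebuild.
          stmt.execute("ALTER INDEX ip_idx ON web_logs REBUILD");
        }
      }
    }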

Part #1 of using Hive indexes effectively is to write your own
HiveIndexHandler, with usesIndexTable=false;

And then write an IndexPredicateAnalyzer, which lets you map arbitrary
lookups into other range conditions.
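
A minimal sketch of that decomposition step, assuming the Hive 1.x ql
classes on the classpath (the column name and constant are made up; this is
the same IndexPredicateAnalyzer pattern the HBase storage handler uses):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.hive.ql.index.IndexPredicateAnalyzer;
    import org.apache.hadoop.hive.ql.index.IndexSearchCondition;
    import org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc;
    import org.apache.hadoop.hive.ql.plan.ExprNodeConstantDesc;
    import org.apache.hadoop.hive.ql.plan.ExprNodeDesc;
    import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPEqual;
    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;

    public class PredicateSplitDemo {
      public static void main(String[] args) throws Exception {
        // Declare which operators and columns the index can answer.
        IndexPredicateAnalyzer analyzer = new IndexPredicateAnalyzer();
        analyzer.addComparisonOp(GenericUDFOPEqual.class.getName());
        analyzer.allowColumnName("user_id");

        // Build "user_id = 42" by hand; the planner normally hands this in.
        ExprNodeDesc col = new ExprNodeColumnDesc(
            TypeInfoFactory.longTypeInfo, "user_id", "t", false);
        ExprNodeDesc val =
            new ExprNodeConstantDesc(TypeInfoFactory.longTypeInfo, 42L);
        ExprNodeDesc pred = new ExprNodeGenericFuncDesc(
            TypeInfoFactory.booleanTypeInfo, new GenericUDFOPEqual(),
            Arrays.asList(col, val));

        // Split into index-servable conditions plus a residual predicate
        // that still has to be evaluated against the rows.
        List<IndexSearchCondition> conditions = new ArrayList<IndexSearchCondition>();
        ExprNodeDesc residual = analyzer.analyzePredicate(pred, conditions);
        System.out.println("index conditions: " + conditions);
        System.out.println("residual: " + residual);
      }
    }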

Not coincidentally, we're adding an "ANALYZE TABLE ... CACHE METADATA"
statement, which consolidates the "internal" index into an external store (HBase).

Some of the index data now lives in the HBase metastore, so that the
inclusion/exclusion of whole partitions can be done off the consolidated
index.

https://issues.apache.org/jira/browse/HIVE-11676
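
For illustration, a hedged sketch of invoking it over JDBC, assuming a
HiveServer2 endpoint and an HBase-backed metastore (the host and table name
are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CacheMetadataDemo {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
          // Consolidates the internal index into the HBase metastore so
          // whole partitions can be included/excluded without opening files.
          stmt.execute("ANALYZE TABLE web_logs CACHE METADATA");
        }
      }
    }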


The experience from BI workloads run by customers is that, in general,
locating the right "slice" of data is more of a problem than the actual
aggregate.

And that for a workhorse data warehouse, this has to survive even if there's
a non-stop stream of updates into it.

Cheers,
Gopal
