https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/ Maybe a compact index makes more sense if you have high cardinality columns
> On 08 Jan 2016, at 10:11, Mich Talebzadeh <m...@peridale.co.uk> wrote: > > Interesting point below: > > Well you use a text format for your data so you should not be surprised. For > text based formats, such as csv, you can always use the hive bitmap index. > > > How can one create a bitmap index in Hive please? > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > Sybase ASE 15 Gold Medal Award 2008 > A Winning Strategy: Running the most Critical Financial Data on ASE 15 > http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf > Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", > ISBN 978-0-9563693-0-7. > co-author "Sybase Transact SQL Guidelines Best Practices", ISBN > 978-0-9759693-0-4 > Publications due shortly: > Complex Event Processing in Heterogeneous Environments, ISBN: > 978-0-9563693-3-8 > Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume > one out shortly > > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Peridale Technology Ltd, its > subsidiaries or their employees, unless expressly so stated. It is the > responsibility of the recipient to ensure that this email is virus free, > therefore neither Peridale Ltd, its subsidiaries nor their employees accept > any responsibility. > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: 08 January 2016 08:49 > To: user@hive.apache.org > Subject: Re: Impact of partitioning on certain queries > > Well you use a text format for your data so you should not be surprised. For > text based formats, such as csv, you can always use the hive bitmap index. I > do not think it makes a lot of sense to compare here processing csv files and > internal tables of a relational database. > > On 08 Jan 2016, at 09:30, Mich Talebzadeh <m...@peridale.co.uk> wrote: > > Well that is debatable. > > The following table sales is partitioned in Oracle but has local bitmap > indexes that help the query. > > select * from sales where prod_id = 10; > > no rows selected > > > Execution Plan > ---------------------------------------------------------- > Plan hash value: 511273406 > > --------------------------------------------------------------------------------------------------------------------- > | Id | Operation | Name | Rows | Bytes | > Cost (%CPU)| Time | Pstart| Pstop | > --------------------------------------------------------------------------------------------------------------------- > | 0 | SELECT STATEMENT | | 347 | 10063 | > 93 (0)| 00:00:02 | | | > | 1 | PARTITION RANGE ALL | | 347 | 10063 | > 93 (0)| 00:00:02 | 1 | 28 | > | 2 | TABLE ACCESS BY LOCAL INDEX ROWID| SALES | 347 | 10063 | > 93 (0)| 00:00:02 | 1 | 28 | > | 3 | BITMAP CONVERSION TO ROWIDS | | | | > | | | | > |* 4 | BITMAP INDEX SINGLE VALUE | SALES_PROD_BIX | | | > | | 1 | 28 | > --------------------------------------------------------------------------------------------------------------------- > > Obviously at this stage we do not have local indexes in Hive. That could make > it moredefficient for search and IMO will be a great tool. > > Cheers, > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > Sybase ASE 15 Gold Medal Award 2008 > A Winning Strategy: Running the most Critical Financial Data on ASE 15 > http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf > Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", > ISBN 978-0-9563693-0-7. > co-author "Sybase Transact SQL Guidelines Best Practices", ISBN > 978-0-9759693-0-4 > Publications due shortly: > Complex Event Processing in Heterogeneous Environments, ISBN: > 978-0-9563693-3-8 > Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume > one out shortly > > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Peridale Technology Ltd, its > subsidiaries or their employees, unless expressly so stated. It is the > responsibility of the recipient to ensure that this email is virus free, > therefore neither Peridale Ltd, its subsidiaries nor their employees accept > any responsibility. > > From: Jörn Franke [mailto:jornfra...@gmail.com] > Sent: 08 January 2016 06:20 > To: user@hive.apache.org > Subject: Re: Impact of partitioning on certain queries > > This observation is correct and it is the same behavior as you see it in > other databases supporting partitions. Usually you should avoid many small > partitions. > > On 07 Jan 2016, at 23:53, Mich Talebzadeh <m...@peridale.co.uk> wrote: > > Ok we hope that partitioning improves performance where the predicate is on > partitioned columns > > I have two tables. One a basic table called smallsales defined as below > > CREATE TABLE `smallsales`( | > | `prod_id` bigint, | > | `cust_id` bigint, | > | `time_id` timestamp, | > | `channel_id` bigint, | > | `promo_id` bigint, | > | `quantity_sold` decimal(10,0), | > | `amount_sold` decimal(10,0)) | > | ROW FORMAT SERDE | > | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' | > | STORED AS INPUTFORMAT | > | 'org.apache.hadoop.mapred.TextInputFormat' | > | OUTPUTFORMAT | > | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' | > | LOCATION | > | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/smallsales' | > | TBLPROPERTIES ( | > | 'COLUMN_STATS_ACCURATE'='true', | > | 'last_modified_by'='hduser', | > | 'last_modified_time'='1451644705', | > | 'numFiles'='1', | > | 'numRows'='5000000', | > | 'rawDataSize'='193437457', | > | 'totalSize'='198437457', | > | 'transient_lastDdlTime'='1451784743') | > +-------------------------------------------------------------------------+--+ > > > So 5 million rows. > > > I then created a partitioned table called sales as below > > | createtab_stmt > | > +-------------------------------------------------------------------------------+--+ > | CREATE TABLE `sales`( > | > | `prod_id` bigint, > | > | `cust_id` bigint, > | > | `time_id` timestamp, > | > | `channel_id` bigint, > | > | `promo_id` bigint, > | > | `quantity_sold` decimal(10,0), > | > | `amount_sold` decimal(10,0)) > | > | PARTITIONED BY ( > | > | `year` int, > | > | `month` int) > | > | CLUSTERED BY ( > | > | prod_id, > | > | cust_id, > | > | time_id, > | > | channel_id, > | > | promo_id) > | > | INTO 256 BUCKETS > | > | ROW FORMAT SERDE > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > | > | STORED AS INPUTFORMAT > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > | > | OUTPUTFORMAT > | > | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > | > | LOCATION > | > | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' > | > | TBLPROPERTIES ( > | > | 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', > | > | 'orc.bloom.filter.fpp'='0.05', > | > | 'orc.compress'='SNAPPY', > | > | 'orc.create.index'='true', > | > | 'orc.row.index.stride'='10000', > | > | 'orc.stripe.size'='268435456', > | > | 'transient_lastDdlTime'='1451814921') > | > +-------------------------------------------------------------------------------+--+ > > And loaded data from smallsales to sales table > > Stats updated in both > > Now when I do the following > > 0: jdbc:hive2://rhes564:10010/default> select * from smallsales where prod_id > = 10; > +---------------------+---------------------+---------------------+------------------------+----------------------+---------------------------+-------------------------+--+ > | smallsales.prod_id | smallsales.cust_id | smallsales.time_id | > smallsales.channel_id | smallsales.promo_id | smallsales.quantity_sold | > smallsales.amount_sold | > +---------------------+---------------------+---------------------+------------------------+----------------------+---------------------------+-------------------------+--+ > +---------------------+---------------------+---------------------+------------------------+----------------------+---------------------------+-------------------------+--+ > No rows selected (2.231 seconds) > > Ok if I do the same query from partitioned bucketed table in takes > > 0: jdbc:hive2://rhes564:10010/default> select * from sales where prod_id = 10; > +----------------+----------------+----------------+-------------------+-----------------+----------------------+--------------------+-------------+--------------+--+ > | sales.prod_id | sales.cust_id | sales.time_id | sales.channel_id | > sales.promo_id | sales.quantity_sold | sales.amount_sold | sales.year | > sales.month | > +----------------+----------------+----------------+-------------------+-----------------+----------------------+--------------------+-------------+--------------+--+ > +----------------+----------------+----------------+-------------------+-----------------+----------------------+--------------------+-------------+--------------+--+ > No rows selected (26.96 seconds) > > > Note that the second query is order of magnitude slower. > > My view is that the query in partitioned table has got to go through every > partitioned file to check the existence of the value, whereas in a > non-partitioned table the operation is much faster. Adding more partition > and buckets also adds more load on NameNode as well. > > Are there other reasons? > > Thanks > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > Sybase ASE 15 Gold Medal Award 2008 > A Winning Strategy: Running the most Critical Financial Data on ASE 15 > http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf > Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", > ISBN 978-0-9563693-0-7. > co-author "Sybase Transact SQL Guidelines Best Practices", ISBN > 978-0-9759693-0-4 > Publications due shortly: > Complex Event Processing in Heterogeneous Environments, ISBN: > 978-0-9563693-3-8 > Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume > one out shortly > > http://talebzadehmich.wordpress.com > > NOTE: The information in this email is proprietary and confidential. This > message is for the designated recipient only, if you are not the intended > recipient, you should destroy it immediately. Any information in this message > shall not be understood as given or endorsed by Peridale Technology Ltd, its > subsidiaries or their employees, unless expressly so stated. It is the > responsibility of the recipient to ensure that this email is virus free, > therefore neither Peridale Ltd, its subsidiaries nor their employees accept > any responsibility. >