Re: Impact of partitioning on certain queries

Jörn Franke Fri, 08 Jan 2016 02:08:36 -0800

https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/
Maybe a compact index makes more sense if you have high cardinality columns


> On 08 Jan 2016, at 10:11, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Interesting point below:
>  
> Well you use a text format for your data so you should not be surprised. For 
> text based formats, such as csv, you can always use the hive bitmap index.
>  
>  
> How can one create a bitmap index in Hive please?
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
> ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN: 
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume 
> one out shortly
>  
> http://talebzadehmich.wordpress.com
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept 
> any responsibility.
>  
> From: Jörn Franke [mailto:jornfra...@gmail.com] 
> Sent: 08 January 2016 08:49
> To: user@hive.apache.org
> Subject: Re: Impact of partitioning on certain queries
>  
> Well you use a text format for your data so you should not be surprised. For 
> text based formats, such as csv, you can always use the hive bitmap index. I 
> do not think it makes a lot of sense to compare here processing csv files and 
> internal tables of a relational database.
> 
> On 08 Jan 2016, at 09:30, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Well that is debatable.
>  
> The following table sales is partitioned in Oracle but has local bitmap 
> indexes that help the query.
>  
> select * from sales where prod_id = 10;
>  
> no rows selected
>  
>  
> Execution Plan
> ----------------------------------------------------------
> Plan hash value: 511273406
>  
> ---------------------------------------------------------------------------------------------------------------------
> | Id  | Operation                          | Name           | Rows  | Bytes | 
> Cost (%CPU)| Time     | Pstart| Pstop |
> ---------------------------------------------------------------------------------------------------------------------
> |   0 | SELECT STATEMENT                   |                |   347 | 10063 | 
>    93   (0)| 00:00:02 |       |       |
> |   1 |  PARTITION RANGE ALL               |                |   347 | 10063 | 
>    93   (0)| 00:00:02 |     1 |    28 |
> |   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| SALES          |   347 | 10063 | 
>    93   (0)| 00:00:02 |     1 |    28 |
> |   3 |    BITMAP CONVERSION TO ROWIDS     |                |       |       | 
>            |          |       |       |
> |*  4 |     BITMAP INDEX SINGLE VALUE      | SALES_PROD_BIX |       |       | 
>            |          |     1 |    28 |
> ---------------------------------------------------------------------------------------------------------------------
>  
> Obviously at this stage we do not have local indexes in Hive. That could make 
> it moredefficient for search and IMO will be a great tool.
>  
> Cheers,
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
> ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN: 
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume 
> one out shortly
>  
> http://talebzadehmich.wordpress.com
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept 
> any responsibility.
>  
> From: Jörn Franke [mailto:jornfra...@gmail.com] 
> Sent: 08 January 2016 06:20
> To: user@hive.apache.org
> Subject: Re: Impact of partitioning on certain queries
>  
> This observation is correct and it is the same  behavior as you see it in 
> other databases supporting partitions. Usually you should avoid many small 
> partitions.
> 
> On 07 Jan 2016, at 23:53, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Ok we hope that partitioning improves performance where the predicate is on 
> partitioned columns
>  
> I have two tables. One a basic table called smallsales defined as below
>  
> CREATE TABLE `smallsales`(                                              |
> |   `prod_id` bigint,                                                     |
> |   `cust_id` bigint,                                                     |
> |   `time_id` timestamp,                                                  |
> |   `channel_id` bigint,                                                  |
> |   `promo_id` bigint,                                                    |
> |   `quantity_sold` decimal(10,0),                                        |
> |   `amount_sold` decimal(10,0))                                          |
> | ROW FORMAT SERDE                                                        |
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'                  |
> | STORED AS INPUTFORMAT                                                   |
> |   'org.apache.hadoop.mapred.TextInputFormat'                            |
> | OUTPUTFORMAT                                                            |
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'          |
> | LOCATION                                                                |
> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/smallsales'  |
> | TBLPROPERTIES (                                                         |
> |   'COLUMN_STATS_ACCURATE'='true',                                       |
> |   'last_modified_by'='hduser',                                          |
> |   'last_modified_time'='1451644705',                                    |
> |   'numFiles'='1',                                                       |
> |   'numRows'='5000000',                                                  |
> |   'rawDataSize'='193437457',                                            |
> |   'totalSize'='198437457',                                              |
> |   'transient_lastDdlTime'='1451784743')                                 |
> +-------------------------------------------------------------------------+--+
>  
>  
> So 5 million rows.
>  
>  
> I then created a partitioned table called sales as below
>  
> |                                createtab_stmt                               
>   |
> +-------------------------------------------------------------------------------+--+
> | CREATE TABLE `sales`(                                                       
>   |
> |   `prod_id` bigint,                                                         
>   |
> |   `cust_id` bigint,                                                         
>   |
> |   `time_id` timestamp,                                                      
>   |
> |   `channel_id` bigint,                                                      
>   |
> |   `promo_id` bigint,                                                        
>   |
> |   `quantity_sold` decimal(10,0),                                            
>   |
> |   `amount_sold` decimal(10,0))                                              
>   |
> | PARTITIONED BY (                                                            
>   |
> |   `year` int,                                                               
>   |
> |   `month` int)                                                              
>   |
> | CLUSTERED BY (                                                              
>   |
> |   prod_id,                                                                  
>   |
> |   cust_id,                                                                  
>   |
> |   time_id,                                                                  
>   |
> |   channel_id,                                                               
>   |
> |   promo_id)                                                                 
>   |
> | INTO 256 BUCKETS                                                            
>   |
> | ROW FORMAT SERDE                                                            
>   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                               
>   |
> | STORED AS INPUTFORMAT                                                       
>   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                         
>   |
> | OUTPUTFORMAT                                                                
>   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                        
>   |
> | LOCATION                                                                    
>   |
> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'           
>   |
> | TBLPROPERTIES (                                                             
>   |
> |   'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', 
>   |
> |   'orc.bloom.filter.fpp'='0.05',                                            
>   |
> |   'orc.compress'='SNAPPY',                                                  
>   |
> |   'orc.create.index'='true',                                                
>   |
> |   'orc.row.index.stride'='10000',                                           
>   |
> |   'orc.stripe.size'='268435456',                                            
>   |
> |   'transient_lastDdlTime'='1451814921')                                     
>   |
> +-------------------------------------------------------------------------------+--+
>  
> And loaded data from smallsales to sales table
>  
> Stats updated in both
>  
> Now when I do the following
>  
> 0: jdbc:hive2://rhes564:10010/default> select * from smallsales where prod_id 
> = 10;
> +---------------------+---------------------+---------------------+------------------------+----------------------+---------------------------+-------------------------+--+
> | smallsales.prod_id  | smallsales.cust_id  | smallsales.time_id  | 
> smallsales.channel_id  | smallsales.promo_id  | smallsales.quantity_sold  | 
> smallsales.amount_sold  |
> +---------------------+---------------------+---------------------+------------------------+----------------------+---------------------------+-------------------------+--+
> +---------------------+---------------------+---------------------+------------------------+----------------------+---------------------------+-------------------------+--+
> No rows selected (2.231 seconds)
>  
> Ok if I do the same query from partitioned bucketed table in takes
>  
> 0: jdbc:hive2://rhes564:10010/default> select * from sales where prod_id = 10;
> +----------------+----------------+----------------+-------------------+-----------------+----------------------+--------------------+-------------+--------------+--+
> | sales.prod_id  | sales.cust_id  | sales.time_id  | sales.channel_id  | 
> sales.promo_id  | sales.quantity_sold  | sales.amount_sold  | sales.year  | 
> sales.month  |
> +----------------+----------------+----------------+-------------------+-----------------+----------------------+--------------------+-------------+--------------+--+
> +----------------+----------------+----------------+-------------------+-----------------+----------------------+--------------------+-------------+--------------+--+
> No rows selected (26.96 seconds)
>  
>  
> Note that the second query is order of magnitude slower.
>  
> My view is that the query in partitioned table has got to go through every 
> partitioned file to check the existence of the value, whereas in a 
> non-partitioned table the operation is much faster.  Adding more partition 
> and buckets also adds more load on NameNode as well.
>  
> Are there other reasons?
>  
> Thanks
>  
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
> ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN: 
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume 
> one out shortly
>  
> http://talebzadehmich.wordpress.com
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept 
> any responsibility.
>

Re: Impact of partitioning on certain queries

Reply via email to