Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

Jörn Franke Wed, 30 Dec 2015 11:07:18 -0800

Hmm, I think Tez is (currently) the execution engine with the most
optimizations in Hive. What about your hardware: is it the same for both
systems? Do you also use compression on Sybase?
Alternatively, you will need to wait for Hive's interactive analytics support
(Tez 0.8 + LLAP).
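
To try the Tez suggestion, something along these lines could be run in the
same beeline session. This is only a sketch: it assumes Tez is already
installed and configured on the cluster (the thread does not confirm that),
and it reuses the table and column names from the quoted query. The ANALYZE
statements follow the earlier advice in this thread about computing
statistics; whether they help in this case is untested.

```sql
-- Assumption: Tez is installed and configured for this Hive 1.2.1 setup.
set hive.execution.engine=tez;

-- Table-level statistics for both tables.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE times COMPUTE STATISTICS;

-- Column-level statistics for the columns the query touches.
-- Adjust the column list if your Hive version rejects a column type.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS time_id, amount_sold;
ANALYZE TABLE times COMPUTE STATISTICS FOR COLUMNS time_id, calendar_month_desc;

-- Same monthly total as the quoted query, written with an explicit JOIN.
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s
JOIN times t ON s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
```

Comparing the elapsed time beeline reports for this run with the 1514.687
seconds quoted below would show whether the engine makes a difference on the
same hardware.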


> On 30 Dec 2015, at 13:47, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Hi Jorn,
>  
> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not 
> tried it on TEZ. I tried the query on the MR engine and it did not fare better.
> I also ran it without the STDDEV function and found that the function did
> not slow it down.
>  
> I tried a simple query, as follows, built on the sales FACT table (1e9 rows)
> and the dimension table times (1826 rows)
>  
> --
> -- Get the total amount sold for each calendar month
> --
> SELECT t.calendar_month_desc, SUM(s.amount_sold)
> FROM sales s, times t WHERE s.time_id = t.time_id
> GROUP BY t.calendar_month_desc;
>  
> Now Sybase IQ comes back in around 30 seconds.
>  
> Started query at Dec 30 2015 08:14:33:399AM
> (48 rows affected)
> Finished query at Dec 30 2015 08:15:04:640AM
>  
> Whereas Hive with the following setting and running the same query
>  
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
>  
> Comes back in
>  
> 48 rows selected (1514.687 seconds)
>  
> I don’t know what else can be done. Obviously this is all schema on read so I 
> am not sure I can change bucketing on FACT table based on one query alone!
>  
>  
>  
> +--------------------------------------------------------------------+--+
> |                           createtab_stmt                           |
> +--------------------------------------------------------------------+--+
> | CREATE TABLE `times`(                                              |
> |   `time_id` timestamp,                                             |
> |   `day_name` varchar(9),                                           |
> |   `day_number_in_week` int,                                        |
> |   `day_number_in_month` int,                                       |
> |   `calendar_week_number` int,                                      |
> |   `fiscal_week_number` int,                                        |
> |   `week_ending_day` timestamp,                                     |
> |   `week_ending_day_id` bigint,                                     |
> |   `calendar_month_number` int,                                     |
> |   `fiscal_month_number` int,                                       |
> |   `calendar_month_desc` varchar(8),                                |
> ----------
> |   `days_in_fis_year` bigint,                                       |
> |   `end_of_cal_year` timestamp,                                     |
> |   `end_of_fis_year` timestamp)                                     |
> | CLUSTERED BY (                                                     |
> |   time_id)                                                         |
> | INTO 256 BUCKETS                                                   |
> | ROW FORMAT SERDE                                                   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                      |
> | STORED AS INPUTFORMAT                                              |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                |
> | OUTPUTFORMAT                                                       |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'               |
> | LOCATION                                                           |
> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times'  |
> | TBLPROPERTIES (                                                    |
> |   'COLUMN_STATS_ACCURATE'='true',                                  |
> |   'numFiles'='1',                                                  |
> |   'numRows'='1826',                                                |
> |   'orc.bloom.filter.columns'='TIME_ID',                            |
> |   'orc.bloom.filter.fpp'='0.05',                                   |
> |   'orc.compress'='SNAPPY',                                         |
> |   'orc.create.index'='true',                                       |
> |   'orc.row.index.stride'='10000',                                  |
> |   'orc.stripe.size'='268435456',                                   |
> |   'rawDataSize'='0',                                               |
> |   'totalSize'='11155',                                             |
> |   'transient_lastDdlTime'='1451429900')                            |
>  
> ;
>  
>  
> http://talebzadehmich.wordpress.com
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept 
> any responsibility.
>  
> From: Jörn Franke [mailto:jornfra...@gmail.com] 
> Sent: 30 December 2015 08:28
> To: user@hive.apache.org
> Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
> Spark compared to Sybase IQ columnar database
>  
> Have you tried it with Hive on TEZ? It contains (currently) more 
> optimizations than Hive on Spark.
> I assume you use the latest Hive version.
> Additionally you may want to think about calculating statistics (depending on 
> your configuration you need to trigger it) - I am not sure if Spark can use 
> them.
> I am not sure if bloom filters on the columns you mention make sense. You may 
> also want to increase stride size (depending on your data).
> Currently you bucket by a lot of fields, which may not make sense. You also 
> may want to sort the data by customer Id in the table.
> You also seem to have a lot of reducers, which you may want to decrease.
>  
> Have you tried without "having stddev_samp" ? Is the query exactly the same 
> as in Sybase?
> 
> On 29 Dec 2015, at 11:53, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Hi,
>  
> I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion 
> rows as follows:
>  
> show create table sales;
> +-------------------------------------------------------------------------------+--+
> |                                createtab_stmt                                 |
> +-------------------------------------------------------------------------------+--+
> | CREATE TABLE `sales`(                                                         |
> |   `prod_id` bigint,                                                           |
> |   `cust_id` bigint,                                                           |
> |   `time_id` timestamp,                                                        |
> |   `channel_id` bigint,                                                        |
> |   `promo_id` bigint,                                                          |
> |   `quantity_sold` decimal(10,0),                                              |
> |   `amount_sold` decimal(10,0))                                                |
> | CLUSTERED BY (                                                                |
> |   prod_id,                                                                    |
> |   cust_id,                                                                    |
> |   time_id,                                                                    |
> |   channel_id,                                                                 |
> |   promo_id)                                                                   |
> | INTO 256 BUCKETS                                                              |
> | ROW FORMAT SERDE                                                              |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                                 |
> | STORED AS INPUTFORMAT                                                         |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                           |
> | OUTPUTFORMAT                                                                  |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                          |
> | LOCATION                                                                      |
> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'             |
> | TBLPROPERTIES (                                                               |
> |   'COLUMN_STATS_ACCURATE'='true',                                             |
> |   'last_modified_by'='hduser',                                                |
> |   'last_modified_time'='1451305626',                                          |
> |   'numFiles'='11',                                                            |
> |   'numRows'='1000000000',                                                     |
> |   'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID',   |
> |   'orc.bloom.filter.fpp'='0.05',                                              |
> |   'orc.compress'='SNAPPY',                                                    |
> |   'orc.create.index'='true',                                                  |
> |   'orc.row.index.stride'='10000',                                             |
> |   'orc.stripe.size'='268435456',                                              |
> |   'rawDataSize'='296000000000',                                               |
> |   'totalSize'='2678882153',                                                   |
> |   'transient_lastDdlTime'='1451305626')                                       |
> +-------------------------------------------------------------------------------+--+
>  
> I run the following query in Hive against the sales table only
>  
> SELECT
>           rs.Customer_ID
>         , rs.Number_of_orders
>         , rs.Total_customer_amount
>         , rs.Average_order
>         , rs.Standard_deviation
> FROM
> (
>         SELECT cust_id AS Customer_ID,
>         COUNT(amount_sold) AS Number_of_orders,
>         SUM(amount_sold) AS Total_customer_amount,
>         AVG(amount_sold) AS Average_order,
>         stddev_samp(amount_sold) AS Standard_deviation
>         FROM sales
>         GROUP BY cust_id
>         HAVING SUM(amount_sold) > 94000
>         AND AVG(amount_sold) < stddev_samp(amount_sold)
> ) rs
> ORDER BY
>           -- Total_customer_amount DESC
>           3 DESC
>  
> Hive comes back in 17 minutes with 5,948 rows
>  
> bl -f sales.hql > sales.log
> Connecting to jdbc:hive2://rhes564:10010/default
> Connected to: Apache Hive (version 1.2.1)
> Driver: Hive JDBC (version 1.2.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
> No rows affected (0.097 seconds)
> No rows affected (0.001 seconds)
> No rows affected (0.001 seconds)
> No rows affected (0.038 seconds)
> INFO  : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
> INFO  :
> Query Hive on Spark job[0] stages:
> INFO  : 0
> INFO  : 1
> INFO  : 2
> INFO  :
> Status: Running (Hive on Spark job[0])
> INFO  : Job Progress Format
> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
> INFO  : 2015-12-29 09:33:25,815 Stage-0_0: 0/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:28,829 Stage-0_0: 0/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
>  
> INFO  : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished    Stage-1_0: 0(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished    Stage-1_0: 6(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished    Stage-1_0: 30(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished    Stage-1_0: 59(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished    Stage-1_0: 90(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished    Stage-1_0: 124(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished    Stage-1_0: 160(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished    Stage-1_0: 193(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished    Stage-1_0: 228(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished    Stage-1_0: 262(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished    Stage-1_0: 298(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished    Stage-1_0: 338(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished    Stage-1_0: 376(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished    Stage-1_0: 417(+3)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished    Stage-1_0: 460(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished    Stage-1_0: 502(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished    Stage-1_0: 542(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished    Stage-1_0: 584(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished    Stage-1_0: 624(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished    Stage-1_0: 667(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished    Stage-1_0: 709(+3)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished    Stage-1_0: 754(+3)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished    Stage-1_0: 797(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished    Stage-1_0: 844(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished    Stage-1_0: 888(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished    Stage-1_0: 934(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished    Stage-1_0: 981(+2)/1009    Stage-2_0: 0/1
> INFO  : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished    Stage-1_0: 1009/1009 Finished    Stage-2_0: 0(+1)/1
> INFO  : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished    Stage-1_0: 1009/1009 Finished    Stage-2_0: 1/1 Finished
> INFO  : Status: Finished successfully in 1036.00 seconds
> 5,948 rows selected (1074.817 seconds)
>  
> So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in 23 
> seconds
>  
> Sybase IQ is a columnar database so each column is created as a fast 
> projection index by default. In addition I have created LF (bitmap) indexes 
> on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now 
> the query only touches CUST_ID.
>  
> My suspicion is that it is the Standard Deviation function stddev_samp() that 
> could be the bottleneck?
>  
> Thanks
>  
> Mich Talebzadeh
>  
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
> ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN: 
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume 
> one out shortly
>  
> http://talebzadehmich.wordpress.com
>  
>
