HDP should already have Tez on board by default.
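
If Tez is installed, pointing Hive at it is a one-line session setting; a minimal sketch, assuming a standard HDP install with tez-site.xml already on the classpath (the count(*) is just an illustrative smoke test):

set hive.execution.engine=tez;
select count(*) from sales;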

> On 30 Dec 2015, at 21:42, Marcin Tustin <mtus...@handybook.com> wrote:
> 
> I'm afraid I use the HDP distribution so I haven't yet had to compile 
> anything. (Incidentally, this isn't a recommendation of HDP over anything 
> else). 
> 
>> On Wed, Dec 30, 2015 at 3:33 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>> Thanks Marcin
>> 
>>  
>> 
>> Trying to build TEZ 0.7 in /usr/lib/apache-tez-0.7.0-src using
>> 
>> mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
>> 
>>  
>> 
>> with mvn version 3.2.5 (as opposed to 3.3), as I read that it builds OK with 
>> 3.2.5, but I still get the same error as below
>> 
>>  
>> 
>> mvn --version
>> Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
>> Maven home: /usr/local/apache-maven/apache-maven-3.2.5
>> Java version: 1.7.0_25, vendor: Oracle Corporation
>> Java home: /usr/java/jdk1.7.0_25/jre
>> 
>>  
>> 
>> I get this error:
>> 
>> [INFO] tez-ui ............................................. FAILURE [  0.411 s]
>> 
>> [DEBUG] -- end configuration --
>> [INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
>> [INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted                 "$@"
>> 
>> [ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
>> 
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
>> 
>>  
>> 
>>  
>> 
>> Any ideas? There is little info available on the net.
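>> 
>> Since the failure is in the tez-ui npm step, one possible workaround might be 
>> to skip that module (Maven 3.2.1+ supports excluding modules with "!"); a 
>> sketch only, not verified against this build, and the resulting tarball would 
>> lack the UI:
>> 
>> mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true -pl '!tez-ui'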
>> 
>>  
>> 
>>  
>> 
>> Thanks
>> 
>>  
>> 
>> Mich Talebzadeh
>> 
>>  
>> 
>> Sybase ASE 15 Gold Medal Award 2008
>> 
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>> 
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>> 
>> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
>> ISBN 978-0-9563693-0-7.
>> 
>> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
>> 978-0-9759693-0-4
>> 
>> Publications due shortly:
>> 
>> Complex Event Processing in Heterogeneous Environments, ISBN: 
>> 978-0-9563693-3-8
>> 
>> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume 
>> one out shortly
>> 
>>  
>> 
>> http://talebzadehmich.wordpress.com
>> 
>>  
>> 
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Peridale Technology 
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>> the responsibility of the recipient to ensure that this email is virus free, 
>> therefore neither Peridale Ltd, its subsidiaries nor their employees accept 
>> any responsibility.
>> 
>>  
>> 
>> From: Marcin Tustin [mailto:mtus...@handybook.com] 
>> Sent: 30 December 2015 19:27
>> 
>> 
>> To: user@hive.apache.org
>> Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
>> Spark compared to Sybase IQ columnar database
>>  
>> 
>> I'm using TEZ 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that TEZ is much 
>> faster than MR in pretty much all cases. Also, with Hive, you'll want to make 
>> sure you've performed optimizations like aligning ORC stripe sizes with HDFS 
>> block sizes, and concatenating your tables (not so much an optimization as a 
>> must for avoiding the small-files problem); an example of the latter follows.
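>> 
>> For ORC tables, concatenation is a one-liner; a sketch (stripe size, by 
>> contrast, is fixed per table via TBLPROPERTIES such as 
>> 'orc.stripe.size'='268435456', so changing it means rewriting the data):
>> 
>> -- merges an ORC table's small files in place
>> ALTER TABLE sales CONCATENATE;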
>> 
>>  
>> 
>> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>> 
>> Thanks again, Jörn.
>> 
>>  
>> 
>>  
>> 
>> Both Hive and Sybase IQ are running on the same host. Yes, for Sybase IQ I 
>> have compression enabled. The FACT table in IQ (sales) has LF (read: bitmap) 
>> indexes on the time_id column. For the dimension table (times) I have 
>> time_id defined as the primary key. Also, Sybase IQ creates FP (fast 
>> projection) indexes on every column by default.
>> 
>>  
>> 
>> Anyway, I am trying to download and build TEZ. Do we know which version of 
>> TEZ works with Hive 1.2.1, please? 0.8 seems to be in alpha.
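>> 
>> Once a build succeeds, wiring Tez into Hive is roughly the following; a 
>> sketch only, with illustrative paths (the tarball name depends on the build):
>> 
>> hdfs dfs -mkdir -p /apps/tez
>> hdfs dfs -put tez-dist/target/tez-0.7.0.tar.gz /apps/tez/
>> # in tez-site.xml: tez.lib.uris=hdfs://rhes564:9000/apps/tez/tez-0.7.0.tar.gz
>> # then, in the Hive session: set hive.execution.engine=tez;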
>> 
>>  
>> 
>> Thanks
>> 
>>  
>> 
>> Mich Talebzadeh
>> 
>>  
>> 
>> 
>>  
>> 
>> From: Jörn Franke [mailto:jornfra...@gmail.com] 
>> Sent: 30 December 2015 16:29
>> 
>> 
>> To: user@hive.apache.org
>> Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
>> Spark compared to Sybase IQ columnar database
>> 
>>  
>> 
>> 
>> Hmm, I think the TEZ execution engine (currently) has the most optimizations 
>> for Hive. What about your hardware - is it the same? Do you also have 
>> compression on Sybase?
>> 
>> Alternatively, you could wait for Hive's interactive analytics (Tez 0.8 + 
>> LLAP).
>> 
>> 
>> On 30 Dec 2015, at 13:47, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>> 
>> Hi Jörn,
>> 
>>  
>> 
>> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not 
>> tried it on TEZ. I tried the query on the MR engine and it did not fare 
>> better. I also ran it without the STDDEV function and found that the 
>> function did not slow it down.
>> 
>>  
>> 
>> I tried a simple query, as follows, joining the sales FACT table (1e9 rows) 
>> and the dimension table times (1,826 rows):
>> 
>>  
>> 
>> --
>> -- Get the total amount sold for each calendar month
>> --
>> SELECT t.calendar_month_desc, SUM(s.amount_sold)
>> FROM sales s, times t WHERE s.time_id = t.time_id
>> GROUP BY t.calendar_month_desc;
>> 
>>  
>> 
>> Now Sybase IQ comes back in around 30 seconds.
>> 
>>  
>> 
>> Started query at Dec 30 2015 08:14:33:399AM
>> (48 rows affected)
>> Finished query at Dec 30 2015 08:15:04:640AM
>> 
>>  
>> 
>> Whereas Hive, running the same query with the following settings
>> 
>> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>> set hive.optimize.bucketmapjoin=true;
>> set hive.optimize.bucketmapjoin.sortedmerge=true;
>> 
>> comes back in
>> 
>> 48 rows selected (1514.687 seconds)
>> 
>>  
>> 
>> I don’t know what else can be done. Obviously this is all schema on read, so 
>> I am not sure I can change the bucketing on the FACT table based on one 
>> query alone! (A re-bucketing sketch follows, for completeness.)
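>> 
>> If re-bucketing ever did look worthwhile, the usual route is to create a new 
>> table and reload; a sketch only, with an illustrative bucket count and table 
>> name, reusing the sales columns shown further down:
>> 
>> set hive.enforce.bucketing=true;
>> CREATE TABLE sales_by_cust (
>>   prod_id bigint, cust_id bigint, time_id timestamp,
>>   channel_id bigint, promo_id bigint,
>>   quantity_sold decimal(10,0), amount_sold decimal(10,0))
>> CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 64 BUCKETS
>> STORED AS ORC;
>> INSERT OVERWRITE TABLE sales_by_cust SELECT * FROM sales;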
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> +--------------------------------------------------------------------+--+
>> |                           createtab_stmt                           |
>> +--------------------------------------------------------------------+--+
>> | CREATE TABLE `times`(                                              |
>> |   `time_id` timestamp,                                             |
>> |   `day_name` varchar(9),                                           |
>> |   `day_number_in_week` int,                                        |
>> |   `day_number_in_month` int,                                       |
>> |   `calendar_week_number` int,                                      |
>> |   `fiscal_week_number` int,                                        |
>> |   `week_ending_day` timestamp,                                     |
>> |   `week_ending_day_id` bigint,                                     |
>> |   `calendar_month_number` int,                                     |
>> |   `fiscal_month_number` int,                                       |
>> |   `calendar_month_desc` varchar(8),                                |
>> ----------
>> |   `days_in_fis_year` bigint,                                       |
>> |   `end_of_cal_year` timestamp,                                     |
>> |   `end_of_fis_year` timestamp)                                     |
>> | CLUSTERED BY (                                                     |
>> |   time_id)                                                         |
>> | INTO 256 BUCKETS                                                   |
>> | ROW FORMAT SERDE                                                   |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                      |
>> | STORED AS INPUTFORMAT                                              |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                |
>> | OUTPUTFORMAT                                                       |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'               |
>> | LOCATION                                                           |
>> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times'  |
>> | TBLPROPERTIES (                                                    |
>> |   'COLUMN_STATS_ACCURATE'='true',                                  |
>> |   'numFiles'='1',                                                  |
>> |   'numRows'='1826',                                                |
>> |   'orc.bloom.filter.columns'='TIME_ID',                            |
>> |   'orc.bloom.filter.fpp'='0.05',                                   |
>> |   'orc.compress'='SNAPPY',                                         |
>> |   'orc.create.index'='true',                                       |
>> |   'orc.row.index.stride'='10000',                                  |
>> |   'orc.stripe.size'='268435456',                                   |
>> |   'rawDataSize'='0',                                               |
>> |   'totalSize'='11155',                                             |
>> |   'transient_lastDdlTime'='1451429900')                            |
>> +--------------------------------------------------------------------+--+
>> ;
>> 
>>  
>> 
>>  
>> 
>> http://talebzadehmich.wordpress.com
>> 
>>  
>> 
>> 
>>  
>> 
>> From: Jörn Franke [mailto:jornfra...@gmail.com] 
>> Sent: 30 December 2015 08:28
>> To: user@hive.apache.org
>> Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
>> Spark compared to Sybase IQ columnar database
>> 
>>  
>> 
>> Have you tried it with Hive on TEZ? It contains (currently) more 
>> optimizations than Hive on Spark.
>> 
>> I assume you use the latest Hive version.
>> 
>> Additionally you may want to think about calculating statistics (depending 
>> on your configuration you need to trigger it) - I am not sure if Spark can 
>> use them.
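>> 
>> For example, statistics gathering can be triggered manually with standard 
>> HiveQL; a sketch:
>> 
>> ANALYZE TABLE sales COMPUTE STATISTICS;
>> ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;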
>> 
>> I am not sure if bloom filters on the columns you mention make sense. You 
>> may also want to increase stride size (depending on your data).
>> 
>> Currently you bucket by a lot of fields, which may not make sense. You also 
>> may want to sort the data by customer Id in the table.
>> 
>> You also seem to have a lot of reducers, which you may want to decrease.
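>> 
>> Reducer parallelism is tunable per session; a sketch with illustrative 
>> values:
>> 
>> -- fewer, larger reducers:
>> set hive.exec.reducers.bytes.per.reducer=268435456;
>> set hive.exec.reducers.max=99;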
>> 
>>  
>> 
>> Have you tried without "having stddev_samp" ? Is the query exactly the same 
>> as in Sybase?
>> 
>> 
>> On 29 Dec 2015, at 11:53, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>> 
>> Hi,
>> 
>>  
>> 
>> I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion 
>> rows as follows:
>> 
>>  
>> 
>> show create table sales;
>> +-------------------------------------------------------------------------------+--+
>> |                                createtab_stmt                                  |
>> +-------------------------------------------------------------------------------+--+
>> | CREATE TABLE `sales`(                                                          |
>> |   `prod_id` bigint,                                                            |
>> |   `cust_id` bigint,                                                            |
>> |   `time_id` timestamp,                                                         |
>> |   `channel_id` bigint,                                                         |
>> |   `promo_id` bigint,                                                           |
>> |   `quantity_sold` decimal(10,0),                                               |
>> |   `amount_sold` decimal(10,0))                                                 |
>> | CLUSTERED BY (                                                                 |
>> |   prod_id,                                                                     |
>> |   cust_id,                                                                     |
>> |   time_id,                                                                     |
>> |   channel_id,                                                                  |
>> |   promo_id)                                                                    |
>> | INTO 256 BUCKETS                                                               |
>> | ROW FORMAT SERDE                                                               |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                                  |
>> | STORED AS INPUTFORMAT                                                          |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                            |
>> | OUTPUTFORMAT                                                                   |
>> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                           |
>> | LOCATION                                                                       |
>> |   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'              |
>> | TBLPROPERTIES (                                                                |
>> |   'COLUMN_STATS_ACCURATE'='true',                                              |
>> |   'last_modified_by'='hduser',                                                 |
>> |   'last_modified_time'='1451305626',                                           |
>> |   'numFiles'='11',                                                             |
>> |   'numRows'='1000000000',                                                      |
>> |   'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID',    |
>> |   'orc.bloom.filter.fpp'='0.05',                                               |
>> |   'orc.compress'='SNAPPY',                                                     |
>> |   'orc.create.index'='true',                                                   |
>> |   'orc.row.index.stride'='10000',                                              |
>> |   'orc.stripe.size'='268435456',                                               |
>> |   'rawDataSize'='296000000000',                                                |
>> |   'totalSize'='2678882153',                                                    |
>> |   'transient_lastDdlTime'='1451305626')                                        |
>> +-------------------------------------------------------------------------------+--+
>> 
>>  
>> 
>> I use the following query, running against the sales table only, in Hive:
>> 
>>  
>> 
>> SELECT
>>           rs.Customer_ID
>>         , rs.Number_of_orders
>>         , rs.Total_customer_amount
>>         , rs.Average_order
>>         , rs.Standard_deviation
>> FROM
>> (
>>         SELECT cust_id AS Customer_ID,
>>         COUNT(amount_sold) AS Number_of_orders,
>>         SUM(amount_sold) AS Total_customer_amount,
>>         AVG(amount_sold) AS Average_order,
>>         stddev_samp(amount_sold) AS Standard_deviation
>>         FROM sales
>>         GROUP BY cust_id
>>         HAVING SUM(amount_sold) > 94000
>>         AND AVG(amount_sold) < stddev_samp(amount_sold)
>> ) rs
>> ORDER BY
>>           -- Total_customer_amount DESC
>>           3 DESC
>> 
>>  
>> 
>> Hive comes back in 17 minutes with 5,948 rows
>> 
>>  
>> 
>> bl -f sales.hql > sales.log
>> Connecting to jdbc:hive2://rhes564:10010/default
>> Connected to: Apache Hive (version 1.2.1)
>> Driver: Hive JDBC (version 1.2.1)
>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
>> No rows affected (0.097 seconds)
>> No rows affected (0.001 seconds)
>> No rows affected (0.001 seconds)
>> No rows affected (0.038 seconds)
>> INFO  : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
>> INFO  :
>> Query Hive on Spark job[0] stages:
>> INFO  : 0
>> INFO  : 1
>> INFO  : 2
>> INFO  :
>> Status: Running (Hive on Spark job[0])
>> INFO  : Job Progress Format
>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>> INFO  : 2015-12-29 09:33:25,815 Stage-0_0: 0/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:28,829 Stage-0_0: 0/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> 
>> INFO  : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11  Stage-1_0: 0/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished  Stage-1_0: 0(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished  Stage-1_0: 6(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished  Stage-1_0: 30(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished  Stage-1_0: 59(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished  Stage-1_0: 90(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished  Stage-1_0: 124(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished  Stage-1_0: 160(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished  Stage-1_0: 193(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished  Stage-1_0: 228(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished  Stage-1_0: 262(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished  Stage-1_0: 298(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished  Stage-1_0: 338(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished  Stage-1_0: 376(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished  Stage-1_0: 417(+3)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished  Stage-1_0: 460(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished  Stage-1_0: 502(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished  Stage-1_0: 542(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished  Stage-1_0: 584(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished  Stage-1_0: 624(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished  Stage-1_0: 667(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished  Stage-1_0: 709(+3)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished  Stage-1_0: 754(+3)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished  Stage-1_0: 797(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished  Stage-1_0: 844(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished  Stage-1_0: 888(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished  Stage-1_0: 934(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished  Stage-1_0: 981(+2)/1009  Stage-2_0: 0/1
>> INFO  : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished  Stage-1_0: 1009/1009 Finished  Stage-2_0: 0(+1)/1
>> INFO  : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished  Stage-1_0: 1009/1009 Finished  Stage-2_0: 1/1 Finished
>> INFO  : Status: Finished successfully in 1036.00 seconds
>> 5,948 rows selected (1074.817 seconds)
>> 
>>  
>> 
>> So Hive returns 5,948 rows in 17 minutes. In contrast, IQ returns 5,947 rows 
>> in 23 seconds.
>> 
>>  
>> 
>> Sybase IQ is a columnar database so each column is created as a fast 
>> projection index by default. In addition I have created LF (bitmap) indexes 
>> on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now 
>> the query only touches CUST_ID.
>> 
>>  
>> 
>> My suspicion is that the standard deviation function stddev_samp() could be 
>> the bottleneck.
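>> 
>> A quick way to test that would be to time the inner aggregate with and 
>> without the function; a sketch, derived from the query above:
>> 
>> SELECT cust_id, COUNT(amount_sold), SUM(amount_sold), AVG(amount_sold)
>> FROM sales
>> GROUP BY cust_id
>> HAVING SUM(amount_sold) > 94000;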
>> 
>>  
>> 
>> Thanks
>> 
>>  
>> 
>> Mich Talebzadeh
>> 
>>  
>> 
>> 