RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

Mich Talebzadeh Thu, 31 Dec 2015 03:23:13 -0800

Thanks for the tip


Installed Apache drill and need to access hive :)

 

hduser@rhes564::/usr/lib/apache-drill-1.4.0> bin/drill-embedded

/work/tmp/libnetty-transport-native-epoll2451215308710744204.so: 
/lib64/libc.so.6: version `GLIBC_2.10' not found (required by 
/work/tmp/libnetty-transport-native-epoll2451215308710744204.so)

apache drill 1.4.0

"what ever the mind of man can conceive and believe, drill can query"

0: jdbc:drill:zk=local>

 

Trying to connect to Hive

 

Usage: connect <url> <username> <password> [driver]

 

0: jdbc:drill:zk=local> connect //rhes564:10010/default hduser xxxx 
org.apache.hive.jdbc.HiveDriver ;

No current connection

 

Any ideas?

 

Thanks

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Lohith Samaga M [mailto:lohith.sam...@mphasis.com] 
Sent: 31 December 2015 10:47
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on 
Spark compared to Sybase IQ columnar database

 

HI Mich,

                Did you try the Apache Drill SQL engine?

                It can access HIVE data.

 

                I have seen it to be much faster. But, I have not tested at 
your scale.

 

Best regards / Mit freundlichen Grüßen / Sincères salutations

M. Lohith Samaga

 

 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: Thursday, December 31, 2015 02.04
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: RE: Running the same query on 1 billion rows fact table in Hive on 
Spark compared to Sybase IQ columnar database

 

Thanks Marcin

 

Trying to build TEZ 0.7 in

 

/usr/lib/apache-tez-0.7.0-src

 

using

 

mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true

 

with mvn version 3.2.5 (as opposed to 3.3) as I read that I can build it OK 
with 3.2.5 following the same error ass below

 

mvn --version

Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 
2014-12-14T17:29:23+00:00)

Maven home: /usr/local/apache-maven/apache-maven-3.2.5

Java version: 1.7.0_25, vendor: Oracle Corporation

Java home: /usr/java/jdk1.7.0_25/jre

 

I get this error

 

[INFO] tez-ui ............................................. FAILURE [  0.411 s]

[

 

DEBUG] -- end configuration --

[INFO] Running 'npm install --color=false' in 
/usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp

[INFO] 
/usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: 
line 3: 23781 Aborted                 "$@"

 

 

[ERROR] Failed to execute goal 
com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project 
tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 
134) -> [Help 1]

org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project 
tez-ui: Failed to run task

 

 

any ideas as there is little info available in net.

 

 

Thanks

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Marcin Tustin [mailto:mtus...@handybook.com] 
Sent: 30 December 2015 19:27
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
Spark compared to Sybase IQ columnar database

 

I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much 
faster than MR in pretty much all cases. Also, with hive, you'll make sure 
you've performed optimizations like aligning ORC stripe sizes with HDFS block 
sizes, and concatenated your tables (not so much an optimization as a must for 
avoiding the small files problem).

 

On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Thanks again Jorn.

 

 

Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have 
compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes 
on the time_id column. For the dimension table (times) I have time_id defined 
as primary key. Also Sybase IQ creates FP (fast projection) indexes on every 
column by default.

 

Anyway I am trying to download and build TEZ. Do we know which version of TEZ 
works with Hive 1.2.1 please? 0.8 seems to be in alpha

 

Thanks

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Jörn Franke [mailto:jornfra...@gmail.com <mailto:jornfra...@gmail.com> ] 
Sent: 30 December 2015 16:29


To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
Spark compared to Sybase IQ columnar database

 


Hmm i think the execution Engine TEZ has (currently) the most optimizations on 
Hive. What about your hardware - is it the same? Do you have also compression 
on Sybase?

Alternatively you need to wait for Hive for interactive analytics (tez 0.8 + 
llap). 


On 30 Dec 2015, at 13:47, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Hi Jorn,

 

Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not 
tried it on TEZ. I tried the query on MR engine and it did nor fair better. I 
also ran it without SDDDEV function and found out that the function did not 
slow it down.

 

I tried a simple query as follows builr in sales FACT table 1e9 rows and 
dimension table times (1826 rows)

 

--

-- Get the total amount sold for each calendar month

--

SELECT t.calendar_month_desc, SUM(s.amount_sold)

FROM sales s, times t WHERE s.time_id = t.time_id

GROUP BY t.calendar_month_desc;

 

Now Sybase IQ comes back in around 30 seconds. 

 

Started query at Dec 30 2015 08:14:33:399AM

(48 rows affected)

Finished query at Dec 30 2015 08:15:04:640AM

 

Whereas Hive with the following setting and running the same query

 

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

set hive.optimize.bucketmapjoin=true;

set hive.optimize.bucketmapjoin.sortedmerge=true;

 

Comes back in 

 

48 rows selected (1514.687 seconds)

 

I don’t know what else can be done. Obviously this is all schema on read so I 
am not sure I can change bucketing on FACT table based on one query alone!

 

 

 

+--------------------------------------------------------------------+--+

|                           createtab_stmt                           |

+--------------------------------------------------------------------+--+

| CREATE TABLE `times`(                                              |

|   `time_id` timestamp,                                             |

|   `day_name` varchar(9),                                           |

|   `day_number_in_week` int,                                        |

|   `day_number_in_month` int,                                       |

|   `calendar_week_number` int,                                      |

|   `fiscal_week_number` int,                                        |

|   `week_ending_day` timestamp,                                     |

|   `week_ending_day_id` bigint,                                     |

|   `calendar_month_number` int,                                     |

|   `fiscal_month_number` int,                                       |

|   `calendar_month_desc` varchar(8),                                |

----------

|   `days_in_fis_year` bigint,                                       |

|   `end_of_cal_year` timestamp,                                     |

|   `end_of_fis_year` timestamp)                                     |

| CLUSTERED BY (                                                     |

|   time_id)                                                         |

| INTO 256 BUCKETS                                                   |

| ROW FORMAT SERDE                                                   |

|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                      |

| STORED AS INPUTFORMAT                                              |

|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                |

| OUTPUTFORMAT                                                       |

|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'               |

| LOCATION                                                           |

|   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times'  |

| TBLPROPERTIES (                                                    |

|   'COLUMN_STATS_ACCURATE'='true',                                  |

|   'numFiles'='1',                                                  |

|   'numRows'='1826',                                                |

|   'orc.bloom.filter.columns'='TIME_ID',                            |

|   'orc.bloom.filter.fpp'='0.05',                                   |

|   'orc.compress'='SNAPPY',                                         |

|   'orc.create.index'='true',                                       |

|   'orc.row.index.stride'='10000',                                  |

|   'orc.stripe.size'='268435456',                                   |

|   'rawDataSize'='0',                                               |

|   'totalSize'='11155',                                             |

|   'transient_lastDdlTime'='1451429900')                            |

 

;

 

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Jörn Franke [mailto:jornfra...@gmail.com] 
Sent: 30 December 2015 08:28
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
Spark compared to Sybase IQ columnar database

 

Have you tried it with Hive ob TEZ? It contains (currently) more optimizations 
than Hive on Spark.

I assume you use the latest Hive version.

Additionally you may want to think about calculating statistics (depending on 
your configuration you need to trigger it) - I am not sure if Spark can use 
them.

I am not sure if bloom filters on the columns you mention make sense. You may 
also want to increase stride size (depending on your data).

Currently you bucket by a lot of fields, which may not make sense. You also may 
want to sort the data by customer Id in the table.

You also seem to have a lot of reducers, which you may want to decrease.

 

Have you tried without "having stddev_samp" ? Is the query exactly the same as 
in Sybase?


On 29 Dec 2015, at 11:53, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Hi,

 

I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion 
rows as follows:

 

show create table sales;

+-------------------------------------------------------------------------------+--+

|                                createtab_stmt                                 
|

+-------------------------------------------------------------------------------+--+

| CREATE TABLE `sales`(                                                         
|

|   `prod_id` bigint,                                                           
|

|   `cust_id` bigint,                                                           
|

|   `time_id` timestamp,                                                        
|

|   `channel_id` bigint,                                                        
|

|   `promo_id` bigint,                                                          
|

|   `quantity_sold` decimal(10,0),                                              
|

|   `amount_sold` decimal(10,0))                                                
|

| CLUSTERED BY (                                                                
|

|   prod_id,                                                                    
|

|   cust_id,                                                                    
|

|   time_id,                                                                    
|

|   channel_id,                                                                 
|

|   promo_id)                                                                   
|

| INTO 256 BUCKETS                                                              
|

| ROW FORMAT SERDE                                                              
|

|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                                 
|

| STORED AS INPUTFORMAT                                                         
|

|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                           
|

| OUTPUTFORMAT                                                                  
|

|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                          
|

| LOCATION                                                                      
|

|   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'             
|

| TBLPROPERTIES (                                                               
|

|   'COLUMN_STATS_ACCURATE'='true',                                             
|

|   'last_modified_by'='hduser',                                                
|

|   'last_modified_time'='1451305626',                                          
|

|   'numFiles'='11',                                                            
|

|   'numRows'='1000000000',                                                     
|

|   'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID',   
|

|   'orc.bloom.filter.fpp'='0.05',                                              
|

|   'orc.compress'='SNAPPY',                                                    
|

|   'orc.create.index'='true',                                                  
|

|   'orc.row.index.stride'='10000',                                             
|

|   'orc.stripe.size'='268435456',                                              
|

|   'rawDataSize'='296000000000',                                               
|

|   'totalSize'='2678882153 <tel:2678882153> ',                                 
                  |

|   'transient_lastDdlTime'='1451305626')                                       
|

+-------------------------------------------------------------------------------+--+

 

I use the following query to run against sales table only against Hive

 

SELECT

          rs.Customer_ID

        , rs.Number_of_orders

        , rs.Total_customer_amount

        , rs.Average_order

        , rs.Standard_deviation

FROM

(

        SELECT cust_id AS Customer_ID,

        COUNT(amount_sold) AS Number_of_orders,

        SUM(amount_sold) AS Total_customer_amount,

        AVG(amount_sold) AS Average_order,

        stddev_samp(amount_sold) AS Standard_deviation

        FROM sales

        GROUP BY cust_id

        HAVING SUM(amount_sold) > 94000

        AND AVG(amount_sold) < stddev_samp(amount_sold)

) rs

ORDER BY

          -- Total_customer_amount DESC

          3 DESC

 

Hive comes back in 17 minutes with 5,948 rows

 

bl -f sales.hql > sales.log

Connecting to jdbc:hive2://rhes564:10010/default

Connected to: Apache Hive (version 1.2.1)

Driver: Hive JDBC (version 1.2.1)

Transaction isolation: TRANSACTION_REPEATABLE_READ

Running init script /home/hduser/dba/bin/hive_on_spark_init.hql

No rows affected (0.097 seconds)

No rows affected (0.001 seconds)

No rows affected (0.001 seconds)

No rows affected (0.038 seconds)

INFO  : Warning: Using constant number 3 in order by. If you try to use 
position alias when hive.groupby.orderby.position.alias is false, the position 
alias will be ignored.

INFO  :

Query Hive on Spark job[0] stages:

INFO  : 0

INFO  : 1

INFO  : 2

INFO  :

Status: Running (Hive on Spark job[0])

INFO  : Job Progress Format

CurrentTime StageId_StageAttemptId: 
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
[StageCost]

INFO  : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11     Stage-1_0: 0/1009       
Stage-2_0: 0/1

 

INFO  : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11    Stage-1_0: 0/1009       
Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished       Stage-1_0: 
0(+2)/1009   Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished       Stage-1_0: 
6(+2)/1009   Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished       Stage-1_0: 
30(+2)/1009  Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished       Stage-1_0: 
59(+2)/1009  Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished       Stage-1_0: 
90(+2)/1009  Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished       Stage-1_0: 
124(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished       Stage-1_0: 
160(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished       Stage-1_0: 
193(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished       Stage-1_0: 
228(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished       Stage-1_0: 
262(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished       Stage-1_0: 
298(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished       Stage-1_0: 
338(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished       Stage-1_0: 
376(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished       Stage-1_0: 
417(+3)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished       Stage-1_0: 
460(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished       Stage-1_0: 
502(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished       Stage-1_0: 
542(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished       Stage-1_0: 
584(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished       Stage-1_0: 
624(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished       Stage-1_0: 
667(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished       Stage-1_0: 
709(+3)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished       Stage-1_0: 
754(+3)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished       Stage-1_0: 
797(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished       Stage-1_0: 
844(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished       Stage-1_0: 
888(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished       Stage-1_0: 
934(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished       Stage-1_0: 
981(+2)/1009 Stage-2_0: 0/1

INFO  : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished       Stage-1_0: 
1009/1009 Finished   Stage-2_0: 0(+1)/1

INFO  : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished       Stage-1_0: 
1009/1009 Finished   Stage-2_0: 1/1 Finished

INFO  : Status: Finished successfully in 1036.00 seconds

5,948 rows selected (1074.817 seconds)

 

So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in 23 
seconds

 

Sybase IQ is a columnar database so each column is created as a fast projection 
index by default. In addition I have created LF (bitmap) indexes on dimension 
columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now the query only 
touches CUST_ID.

 

My suspicion is that it is the Standard Deviation function stddev_samp() that 
could be the bottleneck?

 

Thanks

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

 

 

Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers> 

Latest news <http://www.handy.com/press>  at Handy

Handy just raised $50m 
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
  led by Fidelity

 




Information transmitted by this e-mail is proprietary to Mphasis, its 
associated companies and/ or its customers and is intended 
for use only by the individual or entity to which it is addressed, and may 
contain information that is privileged, confidential or 
exempt from disclosure under applicable law. If you are not the intended 
recipient or it appears that this mail has been forwarded 
to you without proper authority, you are notified that any use or dissemination 
of this information in any manner is strictly 
prohibited. In such cases, please notify us immediately at 
mailmas...@mphasis.com <mailto:mailmas...@mphasis.com>  and delete this mail 
from your records.

RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

Reply via email to