Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database

Jörn Franke Thu, 31 Dec 2015 10:44:36 -0800

You are using an old version of Spark, and it cannot leverage all of Hive's 
optimizations, so I don't think the conclusion can be drawn as easily as you 
suggest.


> On 31 Dec 2015, at 19:34, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Ok guys.
>  
> I have not yet succeeded in installing TEZ so that I can try the query on 
> TEZ as well.
>  
> Just a reminder that the query used is a pretty common one: get the total 
> amount sold for each calendar month from sales (1 billion rows) and times.
>  
> SELECT t.calendar_month_desc, SUM(s.amount_sold)
> FROM sales s, times t WHERE s.time_id = t.time_id
> GROUP BY t.calendar_month_desc;
>  
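For anyone who wants to check what the query quoted above computes without a cluster, here is a minimal sketch that runs the same SQL against an in-memory SQLite database from Python. The table and column names come from the query; the column types, the five sample rows and the month labels are hypothetical toy data, and SQLite is only a stand-in for Hive or Sybase IQ, so this says nothing about performance.

```python
import sqlite3

# Hypothetical toy data: three time_ids spread over two months, five sales rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE times (time_id INTEGER PRIMARY KEY, calendar_month_desc TEXT);
    CREATE TABLE sales (time_id INTEGER, amount_sold REAL);
    INSERT INTO times VALUES (1, '1998-01'), (2, '1998-01'), (3, '1998-02');
    INSERT INTO sales VALUES (1, 10.0), (1, 5.0), (2, 20.0), (3, 7.5), (3, 2.5);
""")

# The query from the thread, unchanged apart from the trailing semicolon.
QUERY = """
    SELECT t.calendar_month_desc, SUM(s.amount_sold)
    FROM sales s, times t WHERE s.time_id = t.time_id
    GROUP BY t.calendar_month_desc
"""


def monthly_totals(connection):
    """Return {calendar_month_desc: total amount_sold} for the query above."""
    return dict(connection.execute(QUERY).fetchall())


totals = monthly_totals(conn)
print(totals)
```

On the real data the result has one row per calendar month, which is why only 48 rows come back from a 1 billion row fact table.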
> In total 48 rows are returned.
>  
> Now, having thought about it, granted TEZ is going to be faster than MR as it 
> is basically MR with DAG thrown at it. On the other hand, Spark will have both 
> DAG and in-memory calculation.
>  
>  
> The results are as follows:
>  
>  
> Optimiser  Engine       Timing (seconds)  Compression  Total table size
> Hive       MapReduce    4673.035          Snappy       totalSize=2678882153 = 2.5GB
> Hive       Spark 1.3.1  1578.817          Snappy
> Columnar   Sybase IQ      30.000          Native       5GB
>  
>  
> It is pretty obvious that Spark outperforms MapReduce by more than a factor 
> of two (nearly three on these timings), even taking into account the number 
> of rows in the FACT table, and frankly I would not have thought that TEZ is 
> going to beat Spark (to be seen). Having said that, Hive storage is twice as 
> efficient (2.5GB against 5GB), but I am not sure what one can do to improve 
> the performance. The table in Hive is stored as an ORC table, and it has 
> crossed my mind that maybe we should think about storing every column of an 
> ORC table as an index. That may improve the performance further.
>  
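To put numbers on the comparison, here is a small sketch that derives the speedup ratios from the timings in the results table quoted above. The figures are copied from that table; the dictionary labels and the `speedup` helper are my own naming, and these are single runs, so the ratios are indicative only.

```python
# Figures copied from the results table in the quoted message.
TIMINGS_SECONDS = {
    "Hive on MapReduce": 4673.035,
    "Hive on Spark 1.3.1": 1578.817,
    "Sybase IQ": 30.000,
}
HIVE_TOTAL_SIZE_BYTES = 2678882153  # totalSize reported for the Hive table


def speedup(slower, faster):
    """Return how many times faster `faster` ran than `slower`."""
    return TIMINGS_SECONDS[slower] / TIMINGS_SECONDS[faster]


spark_vs_mr = speedup("Hive on MapReduce", "Hive on Spark 1.3.1")
iq_vs_spark = speedup("Hive on Spark 1.3.1", "Sybase IQ")
iq_vs_mr = speedup("Hive on MapReduce", "Sybase IQ")
hive_size_gib = HIVE_TOTAL_SIZE_BYTES / 2**30  # bytes to GiB

print(f"Spark vs MapReduce:     {spark_vs_mr:.2f}x")
print(f"Sybase IQ vs Spark:     {iq_vs_spark:.1f}x")
print(f"Sybase IQ vs MapReduce: {iq_vs_mr:.1f}x")
print(f"Hive table size:        {hive_size_gib:.2f} GiB")
```

On these figures Spark is just under three times faster than MapReduce, while Sybase IQ is roughly 50 times faster than Spark and over 150 times faster than MapReduce.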
> HTH
>  
>  
> Mich Talebzadeh
>  
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
> ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN: 
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume 
> one out shortly
>  
> http://talebzadehmich.wordpress.com
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept 
> any responsibility.
>  
> From: Marcin Tustin [mailto:mtus...@handybook.com] 
> Sent: 30 December 2015 19:27
> To: user@hive.apache.org
> Subject: Re: Running the same query on 1 billion rows fact table in Hive on 
> Spark compared to Sybase IQ columnar database
>  
> I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much 
> faster than MR in pretty much all cases. Also, with hive, you'll want to make 
> sure you've performed optimizations like aligning ORC stripe sizes with HDFS 
> block sizes, and concatenated your tables (not so much an optimization as a 
> must for avoiding the small files problem).
>  
> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> Thanks again Jorn.
>  
>  
> Both Hive and Sybase IQ are running on the same host. Yes, for Sybase IQ I 
> have compression enabled. The FACT table in IQ (sales) has LF (read: bitmap) 
> indexes on the time_id column. For the dimension table (times) I have time_id 
> defined as the primary key. Also, Sybase IQ creates FP (fast projection) 
> indexes on every column by default.
>  
> Anyway, I am trying to download and build TEZ. Do we know which version of 
> TEZ works with Hive 1.2.1, please? 0.8 seems to be in alpha.
>  
> Thanks
>  
> Mich Talebzadeh
>  
>
