How did you create the tables? Do you have automatic statistics collection enabled in Hive?
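If statistics are missing, you can gather them by hand and see whether the timings change. A minimal HiveQL sketch (the table name `tt` is taken from the quoted message below; property names as of Hive 1.2.x):

```sql
-- Gather table-level and column-level statistics for an existing table
ANALYZE TABLE tt COMPUTE STATISTICS;
ANALYZE TABLE tt COMPUTE STATISTICS FOR COLUMNS;

-- Collect basic stats automatically on future INSERT ... SELECT loads
SET hive.stats.autogather=true;
```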
Btw, MR is outdated as a Hive execution engine. Use Tez (maybe wait for Tez 0.8 for sub-second queries) or use Spark as the execution engine in Hive.

> On 01 Dec 2015, at 17:40, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> What if we decide to use Spark with Hive? I would like to hear similar views.
>
> My test bed comprised:
>
> 1. Spark version 1.5.2
> 2. Hive version 1.2.1
> 3. Hadoop version 2.6
>
> I configured Spark to use the Hive metastore, so with spark-sql I can do
> pretty much anything one can do with HiveQL.
>
> I created and populated an ORC table in spark-sql. It took 90 seconds to
> create and populate the table with 1.7 million rows.
>
> spark-sql> select count(1) from tt;
> 1767886
> Time taken: 5.169 seconds, Fetched 1 row(s)
>
> Now let me try the same operation on the same table with the Hive CLI and MR:
>
> hive> use asehadoop;
> OK
> Time taken: 0.639 seconds
> hive> select count(1) from tt;
> Query ID = hduser_20151201162717_e3102633-f501-413b-b9cb-384ac50880ac
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Job = job_1448969636093_0001, Tracking URL =
> http://rhes564:8088/proxy/application_1448969636093_0001/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job -kill job_1448969636093_0001
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
> 2015-12-01 16:27:27,154 Stage-1 map = 0%, reduce = 0%
> 2015-12-01 16:27:35,427 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.65 sec
> 2015-12-01 16:27:41,611 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.71 sec
> MapReduce Total cumulative CPU time: 3 seconds 710 msec
> Ended Job = job_1448969636093_0001
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 3.71 sec  HDFS Read: 520151  HDFS Write: 8  SUCCESS
> Total MapReduce CPU Time Spent: 3 seconds 710 msec
> OK
> 1767886
> Time taken: 25.635 seconds, Fetched: 1 row(s)
>
> So 5 seconds in Spark versus 25 seconds in Hive.
>
> On a point query, Hive does not seem to return the correct timing:
>
> hive> select * from tt where data_object_id = 10;
> Time taken: 0.063 seconds, Fetched: 72 row(s)
>
> Whereas in Spark I get:
>
> spark-sql> select * from tt where data_object_id = 10;
> Time taken: 9.002 seconds, Fetched 72 row(s)
>
> 9 seconds looks far more plausible to me than 0.063 seconds. Or, in the unlikely
> event, Spark returns elapsed time whereas Hive returns execution time?
>
> Thanks
>
> Mich Talebzadeh
>
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
> Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15",
> ISBN 978-0-9563693-0-7
> Co-author of "Sybase Transact SQL Guidelines Best Practices",
> ISBN 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN 978-0-9563693-1-4, volume one out shortly
>
> http://talebzadehmich.wordpress.com
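To expand on the engine point above: MR, Tez, and Hive-on-Spark are all selected through the same session property, so moving off MR is a one-liner per session. A sketch (both alternatives assume the corresponding engine is actually installed on the cluster):

```sql
-- The default in Hive 1.2.1 is still "mr"
SET hive.execution.engine=tez;    -- requires a Tez install on the cluster
-- or
SET hive.execution.engine=spark;  -- requires a Spark build compatible with Hive-on-Spark
```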