Thanks Jeff,
Much like an RDBMS that caches data, I believe the same happens in Spark with
in-memory operations. I ran each job three times to reduce the impact of
physical I/O; this is noted below (three runs). I agree with you that this is
only a test with two clusters, but essentially all runs used the same hardware.
Granted, increasing the number of clusters will add to parallelism and will
improve the performance of Hive on Spark.
There is another pertinent point here: this query returned only one row.
If the data set were large, I would expect, as I have seen before, that Hive
would take over, as there would not be enough memory for Spark operations.
Additionally, spark-sql does not support certain operations, such as creating
temporary tables, as shown below:
spark-sql> CREATE TEMPORARY TABLE tmp AS
> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS TotalSales
> FROM sales s, times t, channels c
> WHERE s.time_id = t.time_id
> AND s.channel_id = c.channel_id
> GROUP BY t.calendar_month_desc, c.channel_desc
> ;
Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
Regards,
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one
out shortly
http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this message
shall not be understood as given or endorsed by Peridale Technology Ltd, its
subsidiaries or their employees, unless expressly so stated. It is the
responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees
accept any responsibility.
From: Xuefu Zhang [mailto:[email protected]]
Sent: 01 February 2016 03:05
To: [email protected]
Subject: Re: Running Spark-sql on Hive metastore
For Hive on Spark, there is a startup cost. The second run should be faster.
More importantly, it looks like you have 18 map tasks but your cluster runs
only two of them at a time. Thus, your cluster effectively has only two-way
parallelism. If you configure your cluster to give more capacity to Hive,
the speed should improve as well. Note that each of your map tasks takes only
seconds to complete.
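The parallelism point can be put into rough numbers. The sketch below is only a back-of-the-envelope model, not anything Hive or Spark reports: the function names are hypothetical, it assumes tasks of roughly equal length that run in full batches, and the 3.0 seconds per task is an illustrative guess standing in for "only seconds". The 18 tasks and 2 concurrent tasks are the figures from this thread.

```python
import math


def estimate_waves(total_tasks: int, concurrent_slots: int) -> int:
    """Return how many batches are needed to run all tasks.

    Assumes every slot is kept busy and tasks take roughly equal time.
    """
    if total_tasks < 0 or concurrent_slots <= 0:
        raise ValueError("total_tasks must be >= 0 and concurrent_slots > 0")
    return math.ceil(total_tasks / concurrent_slots)


def estimate_stage_seconds(total_tasks: int, concurrent_slots: int,
                           seconds_per_task: float) -> float:
    """Rough stage duration: number of waves times a per-task time."""
    return estimate_waves(total_tasks, concurrent_slots) * seconds_per_task


if __name__ == "__main__":
    # 18 map tasks and 2 concurrent tasks come from the thread.
    # 3.0 seconds per task is an illustrative assumption, not a measurement.
    for slots in (2, 6, 18):
        print(slots, estimate_waves(18, slots),
              estimate_stage_seconds(18, slots, 3.0))
```

Under this model, two slots mean nine sequential waves for the map stage, so adding capacity shortens the stage roughly in proportion until the slot count reaches the task count. Startup cost and the final single-task stage are not modelled.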
On Sun, Jan 31, 2016 at 3:07 PM, Mich Talebzadeh <[email protected]> wrote:
Hi,
* Spark 1.5.2 on Hive 1.2.1
* Hive 1.2.1 on Spark 1.3.1
* Oracle Release 11.2.0.1.0
* Hadoop 2.6
I am running spark-sql using the Hive metastore and I am pleasantly surprised by
the speed with which Spark performs certain queries on Hive tables.
I imported a 100-million-row table from Oracle into a Hive staging table via
Sqoop and then did an insert/select into an ORC table in Hive as defined below.
+------------------------------------------------------------+--+
| createtab_stmt |
+------------------------------------------------------------+--+
| CREATE TABLE `dummy`( |
| `id` int, |
| `clustered` int, |
| `scattered` int, |
| `randomised` int, |
| `random_string` varchar(50), |
| `small_vc` varchar(10), |
| `padding` varchar(10)) |
| CLUSTERED BY ( |
| id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='35', |
| 'numRows'='100000000', |
| 'orc.bloom.filter.columns'='ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='16777216', |
| 'rawDataSize'='33800000000', |
| 'totalSize'='5660813776', |
| 'transient_lastDdlTime'='1454234981') |
+------------------------------------------------------------+--+
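For a sense of scale, the table properties above can be turned into a few derived figures. This is a minimal sketch using only the values shown in TBLPROPERTIES; the function name and dictionary keys are hypothetical, and the statistics are taken at face value as Hive recorded them.

```python
# Values copied from the TBLPROPERTIES of the dummy table.
NUM_ROWS = 100_000_000
RAW_DATA_SIZE = 33_800_000_000   # rawDataSize, bytes
TOTAL_SIZE = 5_660_813_776       # totalSize, bytes on HDFS
NUM_FILES = 35                   # numFiles


def table_stats(num_rows: int, raw_data_size: int,
                total_size: int, num_files: int) -> dict:
    """Derive per-row sizes, compression ratio and average file size."""
    return {
        "raw_bytes_per_row": raw_data_size / num_rows,
        "stored_bytes_per_row": total_size / num_rows,
        "compression_ratio": raw_data_size / total_size,
        "avg_file_mib": total_size / num_files / 2**20,
    }


if __name__ == "__main__":
    for name, value in table_stats(NUM_ROWS, RAW_DATA_SIZE,
                                   TOTAL_SIZE, NUM_FILES).items():
        print(f"{name}: {value:.2f}")
```

The ratio of rawDataSize to totalSize works out at just under 6 to 1, which gives an idea of how much less data a scan of the ORC files has to read compared with the uncompressed size.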
I am running simple min and max functions on the columns scattered and
randomised from the above table; neither column is part of the clustering
(bucketing) key in Hive. Likewise, in Oracle there is no index on these columns.
If I use Hive 1.2.1 on Spark 1.3.1 it comes back in 50.751 seconds
select min(scattered), max(randomised) from dummy;
INFO :
Query Hive on Spark job[0] stages:
INFO : 0
INFO : 1
INFO :
Status: Running (Hive on Spark job[0])
INFO : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO : 2016-01-31 22:55:05,114 Stage-0_0: 0/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:06,122 Stage-0_0: 0(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:09,165 Stage-0_0: 0(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:12,190 Stage-0_0: 2(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:14,201 Stage-0_0: 3(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:15,209 Stage-0_0: 4(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:17,218 Stage-0_0: 6(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:20,234 Stage-0_0: 8(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:22,245 Stage-0_0: 10(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:25,257 Stage-0_0: 12(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:27,270 Stage-0_0: 14(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:30,289 Stage-0_0: 16(+2)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:31,294 Stage-0_0: 17(+1)/18 Stage-1_0: 0/1
INFO : 2016-01-31 22:55:32,302 Stage-0_0: 18/18 Finished Stage-1_0: 0(+1)/1
INFO : 2016-01-31 22:55:33,309 Stage-0_0: 18/18 Finished Stage-1_0: 1/1 Finished
INFO : Status: Finished successfully in 46.37 seconds
+------+------+--+
| _c0 | _c1 |
+------+------+--+
| 0 | 999 |
+------+------+--+
1 row selected (50.751 seconds)
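The progress lines above follow the stated "Job Progress Format", so the number of tasks running at once can be read off mechanically. The sketch below is one way to do that; the regular expression and function names are hypothetical, and how a failed-task count would be written (with or without a comma before the minus sign) is an assumption, since no failures appear in this log. The sample lines are copied from the output above.

```python
import re

# SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
# The optional comma before the failed count is an assumption.
PROGRESS = re.compile(
    r"Stage-(\d+)_(\d+):\s+(\d+)(?:\(\+(\d+)(?:,?-(\d+))?\))?/(\d+)"
)

SAMPLE_LINES = [
    "INFO : 2016-01-31 22:55:05,114 Stage-0_0: 0/18 Stage-1_0: 0/1",
    "INFO : 2016-01-31 22:55:12,190 Stage-0_0: 2(+2)/18 Stage-1_0: 0/1",
    "INFO : 2016-01-31 22:55:31,294 Stage-0_0: 17(+1)/18 Stage-1_0: 0/1",
    "INFO : 2016-01-31 22:55:32,302 Stage-0_0: 18/18 Finished Stage-1_0: 0(+1)/1",
]


def parse_progress(line: str) -> dict:
    """Return {stage_id: (succeeded, running, total)} for one progress line."""
    result = {}
    for match in PROGRESS.finditer(line):
        stage, _attempt, done, running, _failed, total = match.groups()
        # A missing "(+N)" part means no task is currently running.
        result[int(stage)] = (int(done), int(running or 0), int(total))
    return result


def max_running(lines, stage_id: int) -> int:
    """Highest number of concurrently running tasks seen for a stage."""
    return max(
        (parse_progress(line).get(stage_id, (0, 0, 0))[1] for line in lines),
        default=0,
    )


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(parse_progress(line))
    print("max running in stage 0:", max_running(SAMPLE_LINES, 0))
```

Run over the full log, this would show the map stage never exceeding two running tasks out of 18.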
If I use Spark 1.5.2 on Hive 1.2.1 it comes back in 7.37 seconds (three runs)
select min(scattered), max(randomised) from dummy;
16/01/31 22:59:30 INFO parse.ParseDriver: Parsing command: select min(scattered), max(randomised) from dummy
16/01/31 22:59:30 INFO parse.ParseDriver: Parse Completed
16/01/31 22:59:30 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/01/31 22:59:30 INFO storage.MemoryStore: ensureFreeSpace(480952) called with curMem=4732, maxMem=555684986
16/01/31 22:59:30 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 469.7 KB, free 529.5 MB)
16/01/31 22:59:31 INFO storage.MemoryStore: ensureFreeSpace(41724) called with curMem=485684, maxMem=555684986
16/01/31 22:59:31 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 40.7 KB, free 529.4 MB)
16/01/31 22:59:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 50.140.197.217:50516 (size: 40.7 KB, free: 529.9 MB)
16/01/31 22:59:31 INFO spark.SparkContext: Created broadcast 1 from processCmd at CliDriver.java:376
16/01/31 22:59:31 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
16/01/31 22:59:31 INFO log.PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
16/01/31 22:59:31 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/01/31 22:59:31 INFO orc.OrcInputFormat: FooterCacheHitRatio: 0/0
16/01/31 22:59:31 INFO log.PerfLogger: </PERFLOG method=OrcGetSplits start=1454281171262 end=1454281171330 duration=68 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
16/01/31 22:59:31 INFO scheduler.DAGScheduler: Registering RDD 6 (processCmd at CliDriver.java:376)
16/01/31 22:59:38 INFO scheduler.StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
16/01/31 22:59:38 INFO scheduler.StatsReportListener: 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms 0.0 ms
0 999
Time taken: 7.37 seconds, Fetched 1 row(s)
So for a full table scan of a 100-million-row table, Spark appears to be on
par with Oracle 11g, which returns the same result in 7.03 seconds (three
runs), doing a full table scan as expected:
[email protected]> select min(scattered), max(randomised) from dummy;
MIN(SCATTERED) MAX(RANDOMISED)
-------------- ---------------
0 999
Elapsed: 00:00:07.03
Execution Plan
----------------------------------------------------------
Plan hash value: 2937163428
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 8 | 260K (1)| 00:52:12 |
| 1 | SORT AGGREGATE | | 1 | 8 | | |
| 2 | TABLE ACCESS FULL| DUMMY | 100M| 762M| 260K (1)| 00:52:12 |
----------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
1347179 consistent gets
1347168 physical reads
0 redo size
612 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
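To compare the three timings on a common footing, they can be expressed as rows scanned per second and as ratios. This is a minimal sketch using only the elapsed times quoted in this message; the function names are hypothetical, and it assumes each figure is the end-to-end time for scanning all 100 million rows, so the Hive on Spark number includes its startup cost.

```python
ROWS = 100_000_000

# Elapsed times in seconds, as reported above.
ELAPSED_SECONDS = {
    "Hive 1.2.1 on Spark 1.3.1": 50.751,
    "Spark 1.5.2 on Hive 1.2.1": 7.37,
    "Oracle 11.2.0.1.0": 7.03,
}


def rows_per_second(rows: int, seconds: float) -> float:
    """Scan throughput in rows per second."""
    return rows / seconds


def speedup(baseline_seconds: float, seconds: float) -> float:
    """How many times faster a run is than the baseline run."""
    return baseline_seconds / seconds


if __name__ == "__main__":
    baseline = ELAPSED_SECONDS["Hive 1.2.1 on Spark 1.3.1"]
    for engine, seconds in ELAPSED_SECONDS.items():
        print(f"{engine}: {rows_per_second(ROWS, seconds):,.0f} rows/s, "
              f"{speedup(baseline, seconds):.2f}x vs Hive on Spark")
```

On these numbers spark-sql is a little under seven times faster than Hive on Spark for this query, and within about 5% of Oracle.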
Dr Mich Talebzadeh