Running Spark-sql on Hive metastore

Mich Talebzadeh Sun, 31 Jan 2016 15:08:25 -0800

Hi,


* Spark 1.5.2 on Hive 1.2.1
* Hive 1.2.1 on Spark 1.3.1
* Oracle Release 11.2.0.1.0
* Hadoop 2.6

 

I am running spark-sql against the Hive metastore, and I am pleasantly
surprised by the speed with which Spark performs certain queries on Hive tables.

 

I imported a 100-million-row table from Oracle into a Hive staging table
via Sqoop and then did an insert/select into an ORC table in Hive, defined
as below.

 

+------------------------------------------------------------+--+
|                       createtab_stmt                       |
+------------------------------------------------------------+--+
| CREATE TABLE `dummy`(                                      |
|   `id` int,                                                |
|   `clustered` int,                                         |
|   `scattered` int,                                         |
|   `randomised` int,                                        |
|   `random_string` varchar(50),                             |
|   `small_vc` varchar(10),                                  |
|   `padding` varchar(10))                                   |
| CLUSTERED BY (                                             |
|   id)                                                      |
| INTO 256 BUCKETS                                           |
| ROW FORMAT SERDE                                           |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'              |
| STORED AS INPUTFORMAT                                      |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'        |
| OUTPUTFORMAT                                               |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'       |
| LOCATION                                                   |
|   'hdfs://rhes564:9000/user/hive/warehouse/test.db/dummy'  |
| TBLPROPERTIES (                                            |
|   'COLUMN_STATS_ACCURATE'='true',                          |
|   'numFiles'='35',                                         |
|   'numRows'='100000000',                                   |
|   'orc.bloom.filter.columns'='ID',                         |
|   'orc.bloom.filter.fpp'='0.05',                           |
|   'orc.compress'='SNAPPY',                                 |
|   'orc.create.index'='true',                               |
|   'orc.row.index.stride'='10000',                          |
|   'orc.stripe.size'='16777216',                            |
|   'rawDataSize'='33800000000',                             |
|   'totalSize'='5660813776',                                |
|   'transient_lastDdlTime'='1454234981')                    |
+------------------------------------------------------------+--+
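
The size figures in the TBLPROPERTIES above can be turned into a few rough ratios. The sketch below only does arithmetic on the values quoted in the DDL. The variable names are illustrative, and it assumes that rawDataSize and totalSize are both expressed in bytes.

```python
# Values copied from the TBLPROPERTIES of the dummy table above.
num_rows = 100_000_000
num_files = 35
raw_data_size = 33_800_000_000   # 'rawDataSize', assumed to be bytes
total_size = 5_660_813_776       # 'totalSize', assumed to be bytes on HDFS

# Ratio of uncompressed (raw) size to the ORC/SNAPPY size on disk.
compression_ratio = raw_data_size / total_size

# Average on-disk footprint of one row and of one ORC file.
bytes_per_row_on_disk = total_size / num_rows
avg_file_size_mib = total_size / num_files / (1024 * 1024)

print(f"raw/on-disk ratio : {compression_ratio:.2f}")
print(f"bytes per row     : {bytes_per_row_on_disk:.1f}")
print(f"average file size : {avg_file_size_mib:.1f} MiB")
```

These are averages over the whole table and say nothing about how evenly the rows are spread across the 35 files.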

 

I am running simple min and max aggregates on the columns scattered and
randomised of the above table. Neither column is part of the clustering
(bucketing) key in Hive, and neither column is indexed in Oracle.

 

If I use Hive 1.2.1 on Spark 1.3.1, the query comes back in 50.751 seconds:

 

select min(scattered), max(randomised) from dummy;

INFO  :
Query Hive on Spark job[0] stages:
INFO  : 0
INFO  : 1
INFO  :
Status: Running (Hive on Spark job[0])
INFO  : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO  : 2016-01-31 22:55:05,114 Stage-0_0: 0/18 Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:06,122 Stage-0_0: 0(+2)/18     Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:09,165 Stage-0_0: 0(+2)/18     Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:12,190 Stage-0_0: 2(+2)/18     Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:14,201 Stage-0_0: 3(+2)/18     Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:15,209 Stage-0_0: 4(+2)/18     Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:17,218 Stage-0_0: 6(+2)/18     Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:20,234 Stage-0_0: 8(+2)/18     Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:22,245 Stage-0_0: 10(+2)/18    Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:25,257 Stage-0_0: 12(+2)/18    Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:27,270 Stage-0_0: 14(+2)/18    Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:30,289 Stage-0_0: 16(+2)/18    Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:31,294 Stage-0_0: 17(+1)/18    Stage-1_0: 0/1
INFO  : 2016-01-31 22:55:32,302 Stage-0_0: 18/18 Finished       Stage-1_0: 0(+1)/1
INFO  : 2016-01-31 22:55:33,309 Stage-0_0: 18/18 Finished       Stage-1_0: 1/1 Finished
INFO  : Status: Finished successfully in 46.37 seconds
+------+------+--+
| _c0  | _c1  |
+------+------+--+
| 0    | 999  |
+------+------+--+
1 row selected (50.751 seconds)
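
The "Job Progress Format" line in the log above describes how each progress entry is laid out. As a rough illustration, the sketch below parses such lines with a regular expression. The function name and the regex are my own, not part of Hive, and the sketch assumes (from the log shown here) that the "-FailedTasksCount" part is simply omitted when no task has failed.

```python
import re
from datetime import datetime

# One entry per stage:
#   Stage-<id>_<attempt>: <succeeded>(+<running>-<failed>)/<total>
# The parenthesised part is optional, as in "0/18" or "18/18 Finished".
STAGE_RE = re.compile(
    r"Stage-(?P<stage>\d+)_(?P<attempt>\d+):\s+"
    r"(?P<succeeded>\d+)"
    r"(?:\((?:\+(?P<running>\d+))?(?:-(?P<failed>\d+))?\))?"
    r"/(?P<total>\d+)"
)
TIME_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_progress(line):
    """Return (timestamp, {stage_id: counts}) or None if there is no timestamp."""
    ts_match = TIME_RE.search(line)
    if ts_match is None:
        return None
    ts = datetime.strptime(ts_match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    stages = {}
    for m in STAGE_RE.finditer(line):
        stages[int(m.group("stage"))] = {
            "attempt": int(m.group("attempt")),
            "succeeded": int(m.group("succeeded")),
            "running": int(m.group("running") or 0),
            "failed": int(m.group("failed") or 0),
            "total": int(m.group("total")),
        }
    return ts, stages


# Three lines copied from the log above.
first = parse_progress(
    "INFO  : 2016-01-31 22:55:05,114 Stage-0_0: 0/18 Stage-1_0: 0/1")
mid = parse_progress(
    "INFO  : 2016-01-31 22:55:22,245 Stage-0_0: 10(+2)/18    Stage-1_0: 0/1")
last = parse_progress(
    "INFO  : 2016-01-31 22:55:33,309 Stage-0_0: 18/18 Finished       "
    "Stage-1_0: 1/1 Finished")

elapsed = (last[0] - first[0]).total_seconds()
print(mid[1][0])
print(f"first to last progress line: {elapsed:.3f} s")
```

The interval between the first and last progress lines is shorter than the 46.37 seconds reported by the job status. The log excerpt does not show where the remaining time went.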

 

If I use Spark 1.5.2 on Hive 1.2.1, it comes back in 7.37 seconds (three
runs):

 

select min(scattered), max(randomised) from dummy; 

16/01/31 22:59:30 INFO parse.ParseDriver: Parsing command: select min(scattered), max(randomised) from dummy
16/01/31 22:59:30 INFO parse.ParseDriver: Parse Completed
16/01/31 22:59:30 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/01/31 22:59:30 INFO storage.MemoryStore: ensureFreeSpace(480952) called with curMem=4732, maxMem=555684986
16/01/31 22:59:30 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 469.7 KB, free 529.5 MB)
16/01/31 22:59:31 INFO storage.MemoryStore: ensureFreeSpace(41724) called with curMem=485684, maxMem=555684986
16/01/31 22:59:31 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 40.7 KB, free 529.4 MB)
16/01/31 22:59:31 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 50.140.197.217:50516 (size: 40.7 KB, free: 529.9 MB)
16/01/31 22:59:31 INFO spark.SparkContext: Created broadcast 1 from processCmd at CliDriver.java:376
16/01/31 22:59:31 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
16/01/31 22:59:31 INFO log.PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
16/01/31 22:59:31 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/01/31 22:59:31 INFO orc.OrcInputFormat: FooterCacheHitRatio: 0/0
16/01/31 22:59:31 INFO log.PerfLogger: </PERFLOG method=OrcGetSplits start=1454281171262 end=1454281171330 duration=68 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
16/01/31 22:59:31 INFO scheduler.DAGScheduler: Registering RDD 6 (processCmd at CliDriver.java:376)
16/01/31 22:59:38 INFO scheduler.StatsReportListener:   0%      5%      10%     25%     50%     75%     90%     95%     100%
16/01/31 22:59:38 INFO scheduler.StatsReportListener:   0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms
0       999
Time taken: 7.37 seconds, Fetched 1 row(s)

 

So for a full table scan of a 100-million-row table, Spark appears to be
on par with Oracle 11g, which returns the same result in 7.03 seconds
(three runs), doing a full table scan as expected:

 

scratch...@mydb.mich.LOCAL> select min(scattered), max(randomised) from dummy;

MIN(SCATTERED) MAX(RANDOMISED)
-------------- ---------------
             0             999

Elapsed: 00:00:07.03

Execution Plan
----------------------------------------------------------
Plan hash value: 2937163428

----------------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |     1 |     8 |   260K  (1)| 00:52:12 |
|   1 |  SORT AGGREGATE    |       |     1 |     8 |            |          |
|   2 |   TABLE ACCESS FULL| DUMMY |   100M|   762M|   260K  (1)| 00:52:12 |
----------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
    1347179  consistent gets
    1347168  physical reads
          0  redo size
        612  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
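
To put the three timings quoted in this post side by side, here is a small sketch that converts them into rows scanned per second and relative ratios. It uses only the elapsed times and the physical reads figure shown above. The 8 KiB Oracle block size is an assumption (it is a common default but is not stated in this post), and the variable names are illustrative.

```python
rows = 100_000_000

# Elapsed times quoted in this post, in seconds.
hive_on_spark = 50.751   # Hive 1.2.1 on Spark 1.3.1
spark_sql = 7.37         # Spark 1.5.2 on Hive 1.2.1 (spark-sql)
oracle = 7.03            # Oracle 11g full table scan

for name, secs in [("Hive on Spark", hive_on_spark),
                   ("spark-sql", spark_sql),
                   ("Oracle", oracle)]:
    print(f"{name:14s} {rows / secs / 1e6:6.2f} million rows/s")

# How much faster spark-sql was than Hive on Spark, and how close to Oracle.
speedup_vs_hive_on_spark = hive_on_spark / spark_sql
spark_vs_oracle = spark_sql / oracle
print(f"spark-sql vs Hive on Spark: {speedup_vs_hive_on_spark:.2f}x faster")
print(f"spark-sql / Oracle elapsed: {spark_vs_oracle:.2f}")

# ASSUMPTION: 8 KiB Oracle block size; the post does not state it.
block_size = 8192
physical_reads = 1_347_168          # from the autotrace statistics above
oracle_read_gib = physical_reads * block_size / 2**30
print(f"Oracle data read (assuming 8 KiB blocks): {oracle_read_gib:.2f} GiB")
```

The figures compare single elapsed times on one setup, so they indicate a rough order of magnitude rather than a benchmark result.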

 

 

 

Dr Mich Talebzadeh

 


 

