> Status: Finished successfully in 14.12 seconds
> OK
> 100000000
> Time taken: 14.38 seconds, Fetched: 1 row(s)
That might be an improvement over MR, but that still feels far too slow.
Parquet numbers are generally bad in Hive, but that's because the Parquet
reader gets no real love from the devs. If the community wants to keep
using Parquet heavily, it needs a Hive dev to go over to Parquet-mr and
cut a significant number of memory copies out of the reader.
The Spark 2.0 build, for instance, has a custom Parquet reader for SparkSQL
which does this. SPARK-12854 does for Spark+Parquet what Hive 2.0 does for
ORC (actually, it looks more like Hive's VectorizedRowBatch than
Tungsten's flat layouts).
But that reader cannot be used in Hive-on-Spark, because it is not a
public reader impl.
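For anyone unfamiliar with the batch-oriented layout mentioned above, here is
a rough sketch of the idea in Python. It is a toy illustration only, not
Hive's VectorizedRowBatch or Spark's reader; the function names, the batch
size and the sample data are all made up for the example.

```python
from array import array

BATCH_SIZE = 1024  # assumed batch size, chosen only for this sketch


def max_row_at_a_time(rows, field_index):
    """Row-oriented path: one materialized tuple per row, one field read each."""
    best = None
    for row in rows:
        value = row[field_index]
        if best is None or value > best:
            best = value
    return best


def column_batches(column, size=BATCH_SIZE):
    """Yield zero-copy slices of one primitive column, a batch at a time."""
    view = memoryview(column)
    for start in range(0, len(view), size):
        yield view[start:start + size]


def max_vectorized(column):
    """Batch-oriented path: a tight loop over a primitive array per batch."""
    best = None
    for batch in column_batches(column):
        batch_max = max(batch)
        if best is None or batch_max > best:
            best = batch_max
    return best


# Toy data standing in for (l_orderkey, l_discount); not real TPC-H rows.
rows = [(i, (i % 11) / 100.0) for i in range(5000)]
discounts = array("d", (discount for _, discount in rows))
```

Both paths return the same answer; the difference is that the batch path
walks one contiguous primitive column through memoryview slices, without
building or copying a per-row object.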
Rather than pick an arbitrary dataset, my workhorse example is TPC-H
lineitem at 10GB scale on a single 16 core box.
hive(tpch_flat_orc_10)> select max(l_discount) from lineitem;
Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1466700718395_0256)
----------------------------------------------------------------------------------------
VERTICES          MODE     STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------
Map 1 ..........  llap     SUCCEEDED     13         13        0        0       0       0
Reducer 2 ......  llap     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 0.71 s
----------------------------------------------------------------------------------------
Status: DAG finished successfully in 0.71 seconds
Query Execution Summary
----------------------------------------------------------------------------------------
OPERATION        DURATION
----------------------------------------------------------------------------------------
Compile Query       0.21s
Prepare Plan        0.13s
Submit Plan         0.34s
Start DAG           0.23s
Run DAG             0.71s
----------------------------------------------------------------------------------------
Task Execution Summary
----------------------------------------------------------------------------------------
VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------
Map 1            604.00             0            0     59,957,438              13
Reducer 2        105.00             0            0             13               0
----------------------------------------------------------------------------------------
LLAP IO Summary
----------------------------------------------------------------------------------------
VERTICES   ROWGROUPS  META_HIT  META_MISS  DATA_HIT  DATA_MISS  ALLOCATION      USED  TOTAL_IO
----------------------------------------------------------------------------------------
Map 1           6036         0        146        0B    68.86MB    491.00MB  479.89MB     7.94s
----------------------------------------------------------------------------------------
OK
0.1
Time taken: 1.669 seconds, Fetched: 1 row(s)
hive(tpch_flat_orc_10)>
This is running against a single 16 core box, and I would assume it would
take <1.4s to read twice as much (13 tasks is barely touching the load
factors).
It would probably be a bit faster if the cache had hits, but in general
14s to read 100M rows is nearly an order of magnitude off where Hive 2.2.0
is.
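As a back-of-the-envelope check on that gap, the figures from the two runs
in this thread can be turned into rows per second. This is only rough
arithmetic: the two runs hit different tables and queries, and the variable
names are mine.

```python
def rows_per_second(rows, seconds):
    """Plain scan throughput: rows read divided by run time."""
    return rows / seconds


# Quoted run at the top of this mail: 100M rows, finished in 14.12 seconds.
quoted_rows, quoted_secs = 100_000_000, 14.12
# LLAP run above: Map 1 INPUT_RECORDS, DAG finished in 0.71 seconds.
llap_rows, llap_secs = 59_957_438, 0.71

quoted_rate = rows_per_second(quoted_rows, quoted_secs)
llap_rate = rows_per_second(llap_rows, llap_secs)
speedup = llap_rate / quoted_rate

print(f"quoted run: {quoted_rate / 1e6:.1f}M rows/s")
print(f"LLAP run:   {llap_rate / 1e6:.1f}M rows/s")
print(f"ratio:      {speedup:.1f}x")
```

That works out to roughly 7M rows/s against roughly 84M rows/s on DAG time
alone, a gap of about 12x.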
Cheers,
Gopal