Hi,
I have an ORC table with around 9 million rows. It has an ID column. We are
running a query to make sure that are no duplicate IDs. This is the query
*SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;*
This is the output from the presto shell
presto:test> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
_col0 | _col1
---------+---------
8588761 | 8588761
(1 row)
Query 20170404_170205_02519_3js42, FINISHED, 5 nodes
Splits: 209 total, 209 done (100.00%)
0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]
It took 0.01 seconds. Sometimes it takes 0.1 seconds.
Here is the output from the tez shell
Status: Running (Executing on YARN cluster with App id
application_1490832463187_16626)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING
PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 22 22 0
0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0
0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 633.29
s
----------------------------------------------------------------------------------------------
OK
8588761 8588761
Time taken: 644.118 seconds, Fetched: 1 row(s)
It took over 10 minutes on an underutilized cluster.
I got similar performance with Hive + LLAP too.
These are the files in HDFS
-rwxr-xr-x 3 hadoop supergroup 10779244 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000000_0
-rwxr-xr-x 3 hadoop supergroup 28062422 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000001_0
-rwxr-xr-x 3 hadoop supergroup 10789792 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000002_0
-rwxr-xr-x 3 hadoop supergroup 28133772 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000003_0
-rwxr-xr-x 3 hadoop supergroup 10741196 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000004_0
-rwxr-xr-x 3 hadoop supergroup 28041034 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000005_0
-rwxr-xr-x 3 hadoop supergroup 10759340 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000006_0
-rwxr-xr-x 3 hadoop supergroup 28027747 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000007_0
-rwxr-xr-x 3 hadoop supergroup 10806057 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000008_0
-rwxr-xr-x 3 hadoop supergroup 28029477 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000009_0
-rwxr-xr-x 3 hadoop supergroup 10870175 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000010_0
-rwxr-xr-x 3 hadoop supergroup 28059399 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000011_0
-rwxr-xr-x 3 hadoop supergroup 10803552 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000012_0
-rwxr-xr-x 3 hadoop supergroup 28018137 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000013_0
-rwxr-xr-x 3 hadoop supergroup 10778503 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000014_0
-rwxr-xr-x 3 hadoop supergroup 28024777 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000015_0
-rwxr-xr-x 3 hadoop supergroup 10798882 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000016_0
-rwxr-xr-x 3 hadoop supergroup 28061968 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000017_0
-rwxr-xr-x 3 hadoop supergroup 10781884 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000018_0
-rwxr-xr-x 3 hadoop supergroup 28030023 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000019_0
-rwxr-xr-x 3 hadoop supergroup 10754706 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000020_0
-rwxr-xr-x 3 hadoop supergroup 28136379 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000021_0
-rwxr-xr-x 3 hadoop supergroup 10783418 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000022_0
-rwxr-xr-x 3 hadoop supergroup 28014191 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000023_0
-rwxr-xr-x 3 hadoop supergroup 10794727 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000024_0
-rwxr-xr-x 3 hadoop supergroup 27992826 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000025_0
-rwxr-xr-x 3 hadoop supergroup 10776184 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000026_0
-rwxr-xr-x 3 hadoop supergroup 28068345 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000027_0
-rwxr-xr-x 3 hadoop supergroup 10829188 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000028_0
-rwxr-xr-x 3 hadoop supergroup 28131416 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000029_0
-rwxr-xr-x 3 hadoop supergroup 10785466 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000030_0
-rwxr-xr-x 3 hadoop supergroup 28031752 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000031_0
What can be done to get the hive query to run faster in hive?
--
Regards,
Premal Shah.