Hi,
I have an ORC table with around 9 million rows. It has an ID column. We are
running a query to make sure that are no duplicate IDs. This is the query

*SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;*

This is the output from the presto shell

presto:test> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
  _col0  |  _col1
---------+---------
 8588761 | 8588761
(1 row)

Query 20170404_170205_02519_3js42, FINISHED, 5 nodes
Splits: 209 total, 209 done (100.00%)
0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]

It took 0.01 seconds. Sometimes it takes 0.1 seconds.

Here is the output from the tez shell

Status: Running (Executing on YARN cluster with App id
application_1490832463187_16626)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING
 PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     22         22        0
 0       0       0
Reducer 2 ...... container     SUCCEEDED      1          1        0
 0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 633.29
s
----------------------------------------------------------------------------------------------
OK
8588761 8588761
Time taken: 644.118 seconds, Fetched: 1 row(s)

It took over 10 minutes on an underutilized cluster.

I got similar performance with Hive + LLAP too.


These are the files in HDFS

-rwxr-xr-x   3 hadoop supergroup   10779244 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000000_0
-rwxr-xr-x   3 hadoop supergroup   28062422 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000001_0
-rwxr-xr-x   3 hadoop supergroup   10789792 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000002_0
-rwxr-xr-x   3 hadoop supergroup   28133772 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000003_0
-rwxr-xr-x   3 hadoop supergroup   10741196 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000004_0
-rwxr-xr-x   3 hadoop supergroup   28041034 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000005_0
-rwxr-xr-x   3 hadoop supergroup   10759340 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000006_0
-rwxr-xr-x   3 hadoop supergroup   28027747 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000007_0
-rwxr-xr-x   3 hadoop supergroup   10806057 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000008_0
-rwxr-xr-x   3 hadoop supergroup   28029477 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000009_0
-rwxr-xr-x   3 hadoop supergroup   10870175 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000010_0
-rwxr-xr-x   3 hadoop supergroup   28059399 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000011_0
-rwxr-xr-x   3 hadoop supergroup   10803552 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000012_0
-rwxr-xr-x   3 hadoop supergroup   28018137 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000013_0
-rwxr-xr-x   3 hadoop supergroup   10778503 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000014_0
-rwxr-xr-x   3 hadoop supergroup   28024777 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000015_0
-rwxr-xr-x   3 hadoop supergroup   10798882 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000016_0
-rwxr-xr-x   3 hadoop supergroup   28061968 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000017_0
-rwxr-xr-x   3 hadoop supergroup   10781884 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000018_0
-rwxr-xr-x   3 hadoop supergroup   28030023 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000019_0
-rwxr-xr-x   3 hadoop supergroup   10754706 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000020_0
-rwxr-xr-x   3 hadoop supergroup   28136379 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000021_0
-rwxr-xr-x   3 hadoop supergroup   10783418 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000022_0
-rwxr-xr-x   3 hadoop supergroup   28014191 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000023_0
-rwxr-xr-x   3 hadoop supergroup   10794727 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000024_0
-rwxr-xr-x   3 hadoop supergroup   27992826 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000025_0
-rwxr-xr-x   3 hadoop supergroup   10776184 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000026_0
-rwxr-xr-x   3 hadoop supergroup   28068345 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000027_0
-rwxr-xr-x   3 hadoop supergroup   10829188 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000028_0
-rwxr-xr-x   3 hadoop supergroup   28131416 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000029_0
-rwxr-xr-x   3 hadoop supergroup   10785466 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000030_0
-rwxr-xr-x   3 hadoop supergroup   28031752 2017-04-04 15:05
/user/hive/warehouse/test.db/accounts/000031_0


What can be done to get the hive query to run faster in hive?


-- 
Regards,
Premal Shah.

Reply via email to