Hi, I have an ORC table with around 9 million rows. It has an ID column. We are running a query to make sure that are no duplicate IDs. This is the query
*SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;* This is the output from the presto shell presto:test> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts; _col0 | _col1 ---------+--------- 8588761 | 8588761 (1 row) Query 20170404_170205_02519_3js42, FINISHED, 5 nodes Splits: 209 total, 209 done (100.00%) 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s] It took 0.01 seconds. Sometimes it takes 0.1 seconds. Here is the output from the tez shell Status: Running (Executing on YARN cluster with App id application_1490832463187_16626) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 22 22 0 0 0 0 Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 633.29 s ---------------------------------------------------------------------------------------------- OK 8588761 8588761 Time taken: 644.118 seconds, Fetched: 1 row(s) It took over 10 minutes on an underutilized cluster. I got similar performance with Hive + LLAP too. These are the files in HDFS -rwxr-xr-x 3 hadoop supergroup 10779244 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000000_0 -rwxr-xr-x 3 hadoop supergroup 28062422 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000001_0 -rwxr-xr-x 3 hadoop supergroup 10789792 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000002_0 -rwxr-xr-x 3 hadoop supergroup 28133772 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000003_0 -rwxr-xr-x 3 hadoop supergroup 10741196 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000004_0 -rwxr-xr-x 3 hadoop supergroup 28041034 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000005_0 -rwxr-xr-x 3 hadoop supergroup 10759340 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000006_0 -rwxr-xr-x 3 hadoop supergroup 28027747 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000007_0 -rwxr-xr-x 3 hadoop supergroup 10806057 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000008_0 -rwxr-xr-x 3 hadoop supergroup 28029477 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000009_0 -rwxr-xr-x 3 hadoop supergroup 10870175 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000010_0 -rwxr-xr-x 3 hadoop supergroup 28059399 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000011_0 -rwxr-xr-x 3 hadoop supergroup 10803552 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000012_0 -rwxr-xr-x 3 hadoop supergroup 28018137 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000013_0 -rwxr-xr-x 3 hadoop supergroup 10778503 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000014_0 -rwxr-xr-x 3 hadoop supergroup 28024777 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000015_0 -rwxr-xr-x 3 hadoop supergroup 10798882 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000016_0 -rwxr-xr-x 3 hadoop supergroup 28061968 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000017_0 -rwxr-xr-x 3 hadoop supergroup 10781884 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000018_0 -rwxr-xr-x 3 hadoop supergroup 28030023 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000019_0 -rwxr-xr-x 3 hadoop supergroup 10754706 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000020_0 -rwxr-xr-x 3 hadoop supergroup 28136379 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000021_0 -rwxr-xr-x 3 hadoop supergroup 10783418 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000022_0 -rwxr-xr-x 3 hadoop supergroup 28014191 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000023_0 -rwxr-xr-x 3 hadoop supergroup 10794727 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000024_0 -rwxr-xr-x 3 hadoop supergroup 27992826 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000025_0 -rwxr-xr-x 3 hadoop supergroup 10776184 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000026_0 -rwxr-xr-x 3 hadoop supergroup 28068345 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000027_0 -rwxr-xr-x 3 hadoop supergroup 10829188 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000028_0 -rwxr-xr-x 3 hadoop supergroup 28131416 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000029_0 -rwxr-xr-x 3 hadoop supergroup 10785466 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000030_0 -rwxr-xr-x 3 hadoop supergroup 28031752 2017-04-04 15:05 /user/hive/warehouse/test.db/accounts/000031_0 What can be done to get the hive query to run faster in hive? -- Regards, Premal Shah.