Rajesh Balamohan created HIVE-26185:
---------------------------------------
Summary: Need support for metadataonly operations with iceberg
(e.g select distinct on partition column)
Key: HIVE-26185
URL: https://issues.apache.org/jira/browse/HIVE-26185
Project: Hive
Issue Type: Bug
Components: HiveServer2
Reporter: Rajesh Balamohan
{noformat}
select distinct ss_sold_date_sk from store_sales
{noformat}
This query scans 1800+ rows in hive acid. But takes ages to process with
NullScanOptimiser during compilation phase
(https://issues.apache.org/jira/browse/HIVE-24262)
{noformat}
Hive ACID
INFO : Executing
command(queryId=hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14):
select distinct ss_sold_date_sk from store_sales
INFO : Compute 'ndembla-test2' is active.
INFO : Query ID = hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId:
hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id
application_1651102345385_0000)
INFO : Status: DAG finished successfully in 1.81 seconds
INFO : DAG ID: dag_1651102345385_0000_5
INFO :
INFO : Query Execution Summary
INFO :
----------------------------------------------------------------------------------------------
INFO : OPERATION DURATION
INFO :
----------------------------------------------------------------------------------------------
INFO : Compile Query 55.47s
INFO : Prepare Plan 2.32s
INFO : Get Query Coordinator (AM) 0.13s
INFO : Submit Plan 0.03s
INFO : Start DAG 0.09s
INFO : Run DAG 1.80s
INFO :
----------------------------------------------------------------------------------------------
INFO :
INFO : Task Execution Summary
INFO :
----------------------------------------------------------------------------------------------
INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms)
INPUT_RECORDS OUTPUT_RECORDS
INFO :
----------------------------------------------------------------------------------------------
INFO : Map 1 1009.00 0 0
1,824 1,824
INFO : Reducer 2 0.00 0 0
1,824 0
INFO :
----------------------------------------------------------------------------------------------
INFO :
{noformat}
However, same query scans *2.8 Billion records.* in iceberg format. This can be
fixed.
{noformat}
INFO : Executing
command(queryId=hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72):
select distinct ss_sold_date_sk from store_sales
INFO : Compute 'ndembla-test2' is active.
INFO : Query ID = hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId:
hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO : Status: Running (Executing on YARN cluster with App id
application_1651102345385_0000)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING
FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... llap SUCCEEDED 7141 7141 0 0
0 0
Reducer 2 ...... llap SUCCEEDED 2 2 0 0
0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 18.48 s
----------------------------------------------------------------------------------------------
INFO : Status: DAG finished successfully in 17.97 seconds
INFO : DAG ID: dag_1651102345385_0000_4
INFO :
INFO : Query Execution Summary
INFO :
----------------------------------------------------------------------------------------------
INFO : OPERATION DURATION
INFO :
----------------------------------------------------------------------------------------------
INFO : Compile Query 1.81s
INFO : Prepare Plan 0.04s
INFO : Get Query Coordinator (AM) 0.14s
INFO : Submit Plan 0.02s
INFO : Start DAG 0.03s
INFO : Run DAG 17.97s
INFO :
----------------------------------------------------------------------------------------------
INFO :
INFO : Task Execution Summary
INFO :
----------------------------------------------------------------------------------------------
INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms)
INPUT_RECORDS OUTPUT_RECORDS
INFO :
----------------------------------------------------------------------------------------------
INFO : Map 1 15142.00 0 0
28,800,426,268 8,611
INFO : Reducer 2 956.00 0 0
8,611 0
INFO :
----------------------------------------------------------------------------------------------
INFO :
{noformat}
It will be beneficial to make use of MetadataOnlyOptimizer for iceberg tables
as well.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)