Rajesh Balamohan created HIVE-26185: ---------------------------------------
Summary: Need support for metadataonly operations with iceberg (e.g select distinct on partition column) Key: HIVE-26185 URL: https://issues.apache.org/jira/browse/HIVE-26185 Project: Hive Issue Type: Bug Components: HiveServer2 Reporter: Rajesh Balamohan {noformat} select distinct ss_sold_date_sk from store_sales {noformat} This query scans 1800+ rows in hive acid. But takes ages to process with NullScanOptimiser during compilation phase (https://issues.apache.org/jira/browse/HIVE-24262) {noformat} Hive ACID INFO : Executing command(queryId=hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14): select distinct ss_sold_date_sk from store_sales INFO : Compute 'ndembla-test2' is active. INFO : Query ID = hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14 INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in serial mode INFO : Subscribed to counters: [] for queryId: hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14 INFO : Tez session hasn't been created yet. Opening session INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1651102345385_0000) INFO : Status: DAG finished successfully in 1.81 seconds INFO : DAG ID: dag_1651102345385_0000_5 INFO : INFO : Query Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : OPERATION DURATION INFO : ---------------------------------------------------------------------------------------------- INFO : Compile Query 55.47s INFO : Prepare Plan 2.32s INFO : Get Query Coordinator (AM) 0.13s INFO : Submit Plan 0.03s INFO : Start DAG 0.09s INFO : Run DAG 1.80s INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : Task Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS INFO : ---------------------------------------------------------------------------------------------- INFO : Map 1 1009.00 0 0 1,824 1,824 INFO : Reducer 2 0.00 0 0 1,824 0 INFO : ---------------------------------------------------------------------------------------------- INFO : {noformat} However, same query scans *2.8 Billion records.* in iceberg format. This can be fixed. {noformat} INFO : Executing command(queryId=hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72): select distinct ss_sold_date_sk from store_sales INFO : Compute 'ndembla-test2' is active. INFO : Query ID = hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72 INFO : Total jobs = 1 INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-1:MAPRED] in serial mode INFO : Subscribed to counters: [] for queryId: hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72 INFO : Tez session hasn't been created yet. Opening session INFO : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1) INFO : Status: Running (Executing on YARN cluster with App id application_1651102345385_0000) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... llap SUCCEEDED 7141 7141 0 0 0 0 Reducer 2 ...... llap SUCCEEDED 2 2 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 18.48 s ---------------------------------------------------------------------------------------------- INFO : Status: DAG finished successfully in 17.97 seconds INFO : DAG ID: dag_1651102345385_0000_4 INFO : INFO : Query Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : OPERATION DURATION INFO : ---------------------------------------------------------------------------------------------- INFO : Compile Query 1.81s INFO : Prepare Plan 0.04s INFO : Get Query Coordinator (AM) 0.14s INFO : Submit Plan 0.02s INFO : Start DAG 0.03s INFO : Run DAG 17.97s INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : Task Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS INFO : ---------------------------------------------------------------------------------------------- INFO : Map 1 15142.00 0 0 28,800,426,268 8,611 INFO : Reducer 2 956.00 0 0 8,611 0 INFO : ---------------------------------------------------------------------------------------------- INFO : {noformat} It will be beneficial to make use of MetadataOnlyOptimizer for iceberg tables as well. -- This message was sent by Atlassian Jira (v8.20.7#820007)