Rajesh Balamohan created HIVE-26185:
---------------------------------------

             Summary: Need support for metadataonly operations with iceberg 
(e.g select distinct on partition column)
                 Key: HIVE-26185
                 URL: https://issues.apache.org/jira/browse/HIVE-26185
             Project: Hive
          Issue Type: Bug
          Components: HiveServer2
            Reporter: Rajesh Balamohan


{noformat}
select distinct ss_sold_date_sk from store_sales
{noformat}

This query scans 1800+ rows in hive acid. But takes ages to process with 
NullScanOptimiser during compilation phase 
(https://issues.apache.org/jira/browse/HIVE-24262)

{noformat}
Hive ACID

INFO  : Executing 
command(queryId=hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14): 
select distinct ss_sold_date_sk from store_sales
INFO  : Compute 'ndembla-test2' is active.
INFO  : Query ID = hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: 
hive_20220427233926_282bc9d8-220c-4a09-928d-411601c2ef14
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id 
application_1651102345385_0000)

INFO  : Status: DAG finished successfully in 1.81 seconds
INFO  : DAG ID: dag_1651102345385_0000_5
INFO  :
INFO  : Query Execution Summary
INFO  : 
----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : 
----------------------------------------------------------------------------------------------
INFO  : Compile Query                          55.47s
INFO  : Prepare Plan                            2.32s
INFO  : Get Query Coordinator (AM)              0.13s
INFO  : Submit Plan                             0.03s
INFO  : Start DAG                               0.09s
INFO  : Run DAG                                 1.80s
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :
INFO  : Task Execution Summary
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   
INPUT_RECORDS   OUTPUT_RECORDS
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :      Map 1           1009.00              0              0           
1,824            1,824
INFO  :  Reducer 2              0.00              0              0           
1,824                0
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :

{noformat}




However, same query scans *2.8 Billion records.* in iceberg format. This can be 
fixed.

{noformat}

INFO  : Executing 
command(queryId=hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72): 
select distinct ss_sold_date_sk from store_sales
INFO  : Compute 'ndembla-test2' is active.
INFO  : Query ID = hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Subscribed to counters: [] for queryId: 
hive_20220427233519_cddc6dd1-95a3-4f0e-afa5-e11e9dc5fa72
INFO  : Tez session hasn't been created yet. Opening session
INFO  : Dag name: select distinct ss_sold_date_s...store_sales (Stage-1)
INFO  : Status: Running (Executing on YARN cluster with App id 
application_1651102345385_0000)

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  
FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      llap     SUCCEEDED   7141       7141        0        0    
   0       0
Reducer 2 ......      llap     SUCCEEDED      2          2        0        0    
   0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 18.48 s
----------------------------------------------------------------------------------------------
INFO  : Status: DAG finished successfully in 17.97 seconds
INFO  : DAG ID: dag_1651102345385_0000_4
INFO  :
INFO  : Query Execution Summary
INFO  : 
----------------------------------------------------------------------------------------------
INFO  : OPERATION                            DURATION
INFO  : 
----------------------------------------------------------------------------------------------
INFO  : Compile Query                           1.81s
INFO  : Prepare Plan                            0.04s
INFO  : Get Query Coordinator (AM)              0.14s
INFO  : Submit Plan                             0.02s
INFO  : Start DAG                               0.03s
INFO  : Run DAG                                17.97s
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :
INFO  : Task Execution Summary
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   
INPUT_RECORDS   OUTPUT_RECORDS
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :      Map 1          15142.00              0              0  
28,800,426,268            8,611
INFO  :  Reducer 2            956.00              0              0           
8,611                0
INFO  : 
----------------------------------------------------------------------------------------------
INFO  :
{noformat} 


It will be beneficial to make use of MetadataOnlyOptimizer for iceberg tables 
as well. 





--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to