[jira] [Commented] (HIVE-27099) Iceberg: select count(*) from table queries all data

Butao Zhang (Jira) Thu, 20 Jun 2024 20:15:04 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-27099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856649#comment-17856649
 ]


Butao Zhang commented on HIVE-27099:
------------------------------------

For iceberg table without delete file, we can optimize the count(*) by using 
iceberg metadata file. Like HIVE-28268 wants to do.

> Iceberg: select count(*) from table queries all data
> ----------------------------------------------------
>
>                 Key: HIVE-27099
>                 URL: https://issues.apache.org/jira/browse/HIVE-27099
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance
>
> select count is scanning all data. Though it has complete basic stats, it 
> launched tez job which wasn't needed. Second issue is, it ended up scanning 
> ENTIRE 148 GB dataset which is completely not required. It should have got 
> the data from parq files itself. Ideal situation is getting entire records 
> from manifest itself.
> Data is stored in parquet format in external tables. This may be broken for 
> parquet, as for ORC it is able to read less data (footer info). 
> 1. Consider fixing count( * ) for parq
> 2. Check if it is possible to read stats from iceberg manifests after #1.
> {noformat}
> explain select count(*) from store_sales;
> Explain
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5
>       Edges:
>         Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
>       DagName: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   Statistics: Num rows: 2879966589 Data size: 195666988943 
> Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     Statistics: Num rows: 2879966589 Data size: 195666988943 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       aggregations: count()
>                       minReductionHashAggr: 0.5
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 1 Data size: 8 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         null sort order:
>                         sort order:
>                         Statistics: Num rows: 1 Data size: 8 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                         value expressions: _col0 (type: bigint)
>             Execution mode: vectorized
>         Reducer 2
>             Execution mode: vectorized
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                   table:
>                       input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> 53 rows selected (1.454 seconds)
> 0: jdbc:hive2://ve0:218> select count(*) from store_sales;
> INFO  : Query ID = hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
> INFO  : Subscribed to counters: [] for queryId: 
> hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8
> INFO  : Session is already open
> INFO  : Dag name: select count(*) from store_sales (Stage-1)
> INFO  : Status: Running (Executing on YARN cluster with App id 
> application_1676286357243_0061)
> ----------------------------------------------------------------------------------------------
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container     SUCCEEDED    767        767        0        0  
>      0       0
> Reducer 2 ...... container     SUCCEEDED      1          1        0        0  
>      0       0
> ----------------------------------------------------------------------------------------------
> VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 54.94 s
> ----------------------------------------------------------------------------------------------
> INFO  : Status: DAG finished successfully in 54.85 seconds
> INFO  :
> INFO  : Query Execution Summary
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  : OPERATION                            DURATION
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  : Compile Query                           1.42s
> INFO  : Prepare Plan                            0.18s
> INFO  : Get Query Coordinator (AM)              0.01s
> INFO  : Submit Plan                             0.37s
> INFO  : Start DAG                               0.08s
> INFO  : Run DAG                                54.84s
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :
> INFO  : Task Execution Summary
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :   VERTICES      DURATION(ms)   CPU_TIME(ms)    GC_TIME(ms)   
> INPUT_RECORDS   OUTPUT_RECORDS
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :      Map 1          41326.00      4,280,860         28,890   
> 2,879,966,589            1,332
> INFO  :  Reducer 2          26123.00          4,750             34            
>  767                0
> INFO  : 
> ----------------------------------------------------------------------------------------------
> INFO  :
> INFO  : org.apache.tez.common.counters.DAGCounter:
> INFO  :    NUM_SUCCEEDED_TASKS: 768
> INFO  :    TOTAL_LAUNCHED_TASKS: 768
> INFO  :    DATA_LOCAL_TASKS: 677
> INFO  :    RACK_LOCAL_TASKS: 90
> INFO  :    AM_CPU_MILLISECONDS: 101240
> INFO  :    AM_GC_TIME_MILLIS: 156
> INFO  : File System Counters:
> INFO  :    FILE_BYTES_READ: 2820
> INFO  :    FILE_BYTES_WRITTEN: 46020
> INFO  :    HDFS_BYTES_WRITTEN: 110
> INFO  :    HDFS_READ_OPS: 2
> INFO  :    HDFS_WRITE_OPS: 2
> INFO  :    HDFS_OP_CREATE: 1
> INFO  :    HDFS_OP_GET_FILE_STATUS: 2
> INFO  :    HDFS_OP_RENAME: 1
> INFO  :    OFS_BYTES_READ: 148270746936
> INFO  :    OFS_READ_OPS: 11058
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-27099) Iceberg: select count(*) from table queries all data

Reply via email to