[ https://issues.apache.org/jira/browse/HIVE-27099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856649#comment-17856649 ]
Butao Zhang commented on HIVE-27099: ------------------------------------ For iceberg table without delete file, we can optimize the count(*) by using iceberg metadata file. Like HIVE-28268 wants to do. > Iceberg: select count(*) from table queries all data > ---------------------------------------------------- > > Key: HIVE-27099 > URL: https://issues.apache.org/jira/browse/HIVE-27099 > Project: Hive > Issue Type: Improvement > Reporter: Rajesh Balamohan > Priority: Major > Labels: performance > > select count is scanning all data. Though it has complete basic stats, it > launched tez job which wasn't needed. Second issue is, it ended up scanning > ENTIRE 148 GB dataset which is completely not required. It should have got > the data from parq files itself. Ideal situation is getting entire records > from manifest itself. > Data is stored in parquet format in external tables. This may be broken for > parquet, as for ORC it is able to read less data (footer info). > 1. Consider fixing count( * ) for parq > 2. Check if it is possible to read stats from iceberg manifests after #1. > {noformat} > explain select count(*) from store_sales; > Explain > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > DagId: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5 > Edges: > Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) > DagName: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: store_sales > Statistics: Num rows: 2879966589 Data size: 195666988943 > Basic stats: COMPLETE Column stats: COMPLETE > Select Operator > Statistics: Num rows: 2879966589 Data size: 195666988943 > Basic stats: COMPLETE Column stats: COMPLETE > Group By Operator > aggregations: count() > minReductionHashAggr: 0.5 > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > null sort order: > sort order: > Statistics: Num rows: 1 Data size: 8 Basic stats: > COMPLETE Column stats: COMPLETE > value expressions: _col0 (type: bigint) > Execution mode: vectorized > Reducer 2 > Execution mode: vectorized > Reduce Operator Tree: > Group By Operator > aggregations: count(VALUE._col0) > mode: mergepartial > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE > Column stats: COMPLETE > File Output Operator > compressed: false > Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE > Column stats: COMPLETE > table: > input format: > org.apache.hadoop.mapred.SequenceFileInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > 53 rows selected (1.454 seconds) > 0: jdbc:hive2://ve0:218> select count(*) from store_sales; > INFO : Query ID = hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8 > INFO : Total jobs = 1 > INFO : Launching Job 1 out of 1 > INFO : Starting task [Stage-1:MAPRED] in serial mode > INFO : Subscribed to counters: [] for queryId: > hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8 > INFO : Session is already open > INFO : Dag name: select count(*) from store_sales (Stage-1) > INFO : Status: Running (Executing on YARN cluster with App id > application_1676286357243_0061) > ---------------------------------------------------------------------------------------------- > VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING > FAILED KILLED > ---------------------------------------------------------------------------------------------- > Map 1 .......... container SUCCEEDED 767 767 0 0 > 0 0 > Reducer 2 ...... container SUCCEEDED 1 1 0 0 > 0 0 > ---------------------------------------------------------------------------------------------- > VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 54.94 s > ---------------------------------------------------------------------------------------------- > INFO : Status: DAG finished successfully in 54.85 seconds > INFO : > INFO : Query Execution Summary > INFO : > ---------------------------------------------------------------------------------------------- > INFO : OPERATION DURATION > INFO : > ---------------------------------------------------------------------------------------------- > INFO : Compile Query 1.42s > INFO : Prepare Plan 0.18s > INFO : Get Query Coordinator (AM) 0.01s > INFO : Submit Plan 0.37s > INFO : Start DAG 0.08s > INFO : Run DAG 54.84s > INFO : > ---------------------------------------------------------------------------------------------- > INFO : > INFO : Task Execution Summary > INFO : > ---------------------------------------------------------------------------------------------- > INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) > INPUT_RECORDS OUTPUT_RECORDS > INFO : > ---------------------------------------------------------------------------------------------- > INFO : Map 1 41326.00 4,280,860 28,890 > 2,879,966,589 1,332 > INFO : Reducer 2 26123.00 4,750 34 > 767 0 > INFO : > ---------------------------------------------------------------------------------------------- > INFO : > INFO : org.apache.tez.common.counters.DAGCounter: > INFO : NUM_SUCCEEDED_TASKS: 768 > INFO : TOTAL_LAUNCHED_TASKS: 768 > INFO : DATA_LOCAL_TASKS: 677 > INFO : RACK_LOCAL_TASKS: 90 > INFO : AM_CPU_MILLISECONDS: 101240 > INFO : AM_GC_TIME_MILLIS: 156 > INFO : File System Counters: > INFO : FILE_BYTES_READ: 2820 > INFO : FILE_BYTES_WRITTEN: 46020 > INFO : HDFS_BYTES_WRITTEN: 110 > INFO : HDFS_READ_OPS: 2 > INFO : HDFS_WRITE_OPS: 2 > INFO : HDFS_OP_CREATE: 1 > INFO : HDFS_OP_GET_FILE_STATUS: 2 > INFO : HDFS_OP_RENAME: 1 > INFO : OFS_BYTES_READ: 148270746936 > INFO : OFS_READ_OPS: 11058 > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)