[
https://issues.apache.org/jira/browse/HIVE-27099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955391#comment-17955391
]
Denys Kuzmenko commented on HIVE-27099:
---------------------------------------
I think HIVE-27347 is addressing the problem mentioned here
> Iceberg: select count(*) from table queries all data
> ----------------------------------------------------
>
> Key: HIVE-27099
> URL: https://issues.apache.org/jira/browse/HIVE-27099
> Project: Hive
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance
>
> select count is scanning all data. Though it has complete basic stats, it
> launched tez job which wasn't needed. Second issue is, it ended up scanning
> ENTIRE 148 GB dataset which is completely not required. It should have got
> the data from parq files itself. Ideal situation is getting entire records
> from manifest itself.
> Data is stored in parquet format in external tables. This may be broken for
> parquet, as for ORC it is able to read less data (footer info).
> 1. Consider fixing count( * ) for parq
> 2. Check if it is possible to read stats from iceberg manifests after #1.
> {noformat}
> explain select count(*) from store_sales;
> Explain
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5
> Edges:
> Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
> DagName: hive_20230223031934_2abeb3b9-8c18-4ff7-a8f9-df7368010189:5
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> Statistics: Num rows: 2879966589 Data size: 195666988943
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> Statistics: Num rows: 2879966589 Data size: 195666988943
> Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: count()
> minReductionHashAggr: 0.5
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> null sort order:
> sort order:
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint)
> Execution mode: vectorized
> Reducer 2
> Execution mode: vectorized
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> 53 rows selected (1.454 seconds)
> 0: jdbc:hive2://ve0:218> select count(*) from store_sales;
> INFO : Query ID = hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8
> INFO : Total jobs = 1
> INFO : Launching Job 1 out of 1
> INFO : Starting task [Stage-1:MAPRED] in serial mode
> INFO : Subscribed to counters: [] for queryId:
> hive_20230223031940_9ff5d61d-1fe2-4476-a561-7820e4a3a5f8
> INFO : Session is already open
> INFO : Dag name: select count(*) from store_sales (Stage-1)
> INFO : Status: Running (Executing on YARN cluster with App id
> application_1676286357243_0061)
> ----------------------------------------------------------------------------------------------
> VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING
> FAILED KILLED
> ----------------------------------------------------------------------------------------------
> Map 1 .......... container SUCCEEDED 767 767 0 0
> 0 0
> Reducer 2 ...... container SUCCEEDED 1 1 0 0
> 0 0
> ----------------------------------------------------------------------------------------------
> VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 54.94 s
> ----------------------------------------------------------------------------------------------
> INFO : Status: DAG finished successfully in 54.85 seconds
> INFO :
> INFO : Query Execution Summary
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : OPERATION DURATION
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : Compile Query 1.42s
> INFO : Prepare Plan 0.18s
> INFO : Get Query Coordinator (AM) 0.01s
> INFO : Submit Plan 0.37s
> INFO : Start DAG 0.08s
> INFO : Run DAG 54.84s
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO :
> INFO : Task Execution Summary
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms)
> INPUT_RECORDS OUTPUT_RECORDS
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO : Map 1 41326.00 4,280,860 28,890
> 2,879,966,589 1,332
> INFO : Reducer 2 26123.00 4,750 34
> 767 0
> INFO :
> ----------------------------------------------------------------------------------------------
> INFO :
> INFO : org.apache.tez.common.counters.DAGCounter:
> INFO : NUM_SUCCEEDED_TASKS: 768
> INFO : TOTAL_LAUNCHED_TASKS: 768
> INFO : DATA_LOCAL_TASKS: 677
> INFO : RACK_LOCAL_TASKS: 90
> INFO : AM_CPU_MILLISECONDS: 101240
> INFO : AM_GC_TIME_MILLIS: 156
> INFO : File System Counters:
> INFO : FILE_BYTES_READ: 2820
> INFO : FILE_BYTES_WRITTEN: 46020
> INFO : HDFS_BYTES_WRITTEN: 110
> INFO : HDFS_READ_OPS: 2
> INFO : HDFS_WRITE_OPS: 2
> INFO : HDFS_OP_CREATE: 1
> INFO : HDFS_OP_GET_FILE_STATUS: 2
> INFO : HDFS_OP_RENAME: 1
> INFO : OFS_BYTES_READ: 148270746936
> INFO : OFS_READ_OPS: 11058
> ...
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)