Rajesh Balamohan created HIVE-24313:
---------------------------------------
             Summary: Optimise stats collection for file sizes on cloud storage
                 Key: HIVE-24313
                 URL: https://issues.apache.org/jira/browse/HIVE-24313
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Rajesh Balamohan


When stats information is not present (e.g. for an external table), RelOptHiveTable computes basic stats at runtime. The code path is as follows.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L598]
{code:java}
Statistics stats = StatsUtils.collectStatistics(hiveConf, partitionList,
    hiveTblMetadata, hiveNonPartitionCols, nonPartColNamesThatRqrStats,
    colStatsCached, nonPartColNamesThatRqrStats, true);
{code}

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L322]
{code:java}
for (Partition p : partList.getNotDeniedPartns()) {
  BasicStats basicStats = basicStatsFactory.build(Partish.buildFor(table, p));
  partStats.add(basicStats);
}
{code}

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L205]
{code:java}
try {
  ds = getFileSizeForPath(path);
} catch (IOException e) {
  ds = 0L;
}
{code}

The loop above builds the basic stats for one partition at a time, and each build looks up the file size for that partition's path. For a table and query with a large number of partitions, computing these statistics takes a long time and increases compilation time.

It would be good to parallelise this with a "ForkJoinPool", e.g. partList.getNotDeniedPartns().parallelStream().forEach((p) -> ...).
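The suggestion above could look roughly like the following. This is a minimal sketch and not Hive code: the class `ParallelFileSizes`, its methods, and the `parallelism` parameter are hypothetical, and `java.nio.file.Files.size` stands in for the Hadoop file system call behind `getFileSizeForPath`. It also relies on the widely used, but not formally specified, behaviour that a parallel stream started from inside a `ForkJoinPool` task runs on that pool rather than on the common pool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;

// Hypothetical stand-in for the per-partition loop quoted from StatsUtils.
class ParallelFileSizes {

    // Same fallback as the quoted BasicStats snippet: a failed lookup counts as 0 bytes.
    static long sizeOrZero(Path path) {
        try {
            return Files.size(path);
        } catch (IOException e) {
            return 0L;
        }
    }

    // Looks up every path's size from a task submitted to a dedicated pool,
    // so the blocking I/O is kept off the shared common pool (assumed behaviour, see above).
    static Map<Path, Long> collectSizes(List<Path> paths, int parallelism)
            throws InterruptedException, ExecutionException {
        Map<Path, Long> sizes = new ConcurrentHashMap<>();
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // get() blocks until every element of the stream has been processed.
            pool.submit(() ->
                paths.parallelStream().forEach(p -> sizes.put(p, sizeOrZero(p)))
            ).get();
        } finally {
            pool.shutdown();
        }
        return sizes;
    }

    public static void main(String[] args) throws Exception {
        // Two temporary files play the role of partition directories.
        Path a = Files.createTempFile("part-a", ".dat");
        Path b = Files.createTempFile("part-b", ".dat");
        Files.write(a, new byte[128]);
        Files.write(b, new byte[64]);
        // A path that does not exist exercises the 0-byte fallback.
        Path missing = a.resolveSibling("no-such-partition-file");

        Map<Path, Long> sizes = collectSizes(List.of(a, b, missing), 4);
        long total = sizes.values().stream().mapToLong(Long::longValue).sum();
        System.out.println("total bytes = " + total);

        Files.delete(a);
        Files.delete(b);
    }
}
```

Results are written to a `ConcurrentHashMap` because the lambda runs on several threads at once. A plain `ArrayList` such as `partStats` in the quoted loop would need the same care if that loop were converted to `parallelStream().forEach(...)`.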