[ https://issues.apache.org/jira/browse/HIVE-24313?focusedWorklogId=814357&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-814357 ]
ASF GitHub Bot logged work on HIVE-24313:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Oct/22 13:36
            Start Date: 06/Oct/22 13:36
    Worklog Time Spent: 10m
      Work Description: difin commented on PR #3639:
URL: https://github.com/apache/hive/pull/3639#issuecomment-1270063919

   Thanks, @kasakrisz !


Issue Time Tracking
-------------------

    Worklog Id:     (was: 814357)
    Time Spent: 2.5h  (was: 2h 20m)

> Optimise stats collection for file sizes on cloud storage
> ---------------------------------------------------------
>
>                 Key: HIVE-24313
>                 URL: https://issues.apache.org/jira/browse/HIVE-24313
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Rajesh Balamohan
>            Assignee: Dmitriy Fingerman
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> When stats information is not present (e.g. for an external table), RelOptHiveTable
> computes basic stats at runtime.
> The code path is as follows.
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/RelOptHiveTable.java#L598]
> {code:java}
> Statistics stats = StatsUtils.collectStatistics(hiveConf, partitionList,
>     hiveTblMetadata, hiveNonPartitionCols,
>     nonPartColNamesThatRqrStats, colStatsCached,
>     nonPartColNamesThatRqrStats, true);
> {code}
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L322]
> {code:java}
> for (Partition p : partList.getNotDeniedPartns()) {
>   BasicStats basicStats =
>       basicStatsFactory.build(Partish.buildFor(table, p));
>   partStats.add(basicStats);
> }
> {code}
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStats.java#L205]
>
> {code:java}
> try {
>   ds = getFileSizeForPath(path);
> } catch (IOException e) {
>   ds = 0L;
> }
> {code}
>
> For a table and query with a large number of partitions, computing these
> statistics takes a long time and increases compilation time.
> It would be good to fix it with a "ForkJoinPool", e.g.
> partList.getNotDeniedPartns().parallelStream().forEach((p) -> ...).
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
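
The parallel lookup suggested in the issue could look roughly like the sketch below. It is not the change made in PR #3639 and uses no Hive or Hadoop classes. `ParallelSizeSketch`, `SizeLookup`, `sizeOrZero` and `collectSizes` are hypothetical names introduced only for illustration, with `SizeLookup` standing in for a per-path call such as `getFileSizeForPath`, and partitions represented as plain path strings. The sketch also assumes the widely used but undocumented behaviour that a parallel stream started from inside a `ForkJoinPool` task runs on that pool's workers rather than on the common pool.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names exist in Hive.
final class ParallelSizeSketch {

    /** Hypothetical stand-in for a per-path size lookup that may fail with IOException. */
    @FunctionalInterface
    interface SizeLookup {
        long sizeOf(String path) throws IOException;
    }

    /** Same fallback as the quoted BasicStats snippet: a failed lookup counts as 0. */
    static long sizeOrZero(SizeLookup lookup, String path) {
        try {
            return lookup.sizeOf(path);
        } catch (IOException e) {
            return 0L;
        }
    }

    /** Looks up every path's size in parallel and returns the sizes in input order. */
    static List<Long> collectSizes(List<String> paths, SizeLookup lookup, int parallelism)
            throws InterruptedException, ExecutionException {
        // Dedicated pool so slow storage calls do not occupy the JVM-wide common pool.
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // The pipeline is submitted as a single task; map/collect avoids
            // mutating a shared list from several threads.
            return pool.submit(() ->
                    paths.parallelStream()
                         .map(p -> sizeOrZero(lookup, p))
                         .collect(Collectors.toList())
            ).get();
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `ParallelSizeSketch.collectSizes(paths, lookup, 8)` returns one size per path. Collecting through `map(...).collect(...)` instead of calling `partStats.add(...)` inside `forEach` sidesteps the need for a thread-safe list, which the `forEach` form in the issue would otherwise require.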