Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/20434 )
Change subject: IMPALA-12408: Optimize HdfsScanNode.computeScanRangeLocations()
......................................................................

IMPALA-12408: Optimize HdfsScanNode.computeScanRangeLocations()

computeScanRangeLocations() could be very slow for tables with
large number of partitions. This patch tries to minimize the use
of two expensive function calls:
1. HdfsPartition.getLocation()
  - This looks like a simple property but actually decompresses the
    location string.
  - Was often called indirectly through getFsType().
  - After the patch it is only called once per partition.
2. hadoop.fs.FileSystem.getFileSystem()
  - Hadoop caches the FileSystem object but the key contains
    UserGroupInformation which is obtained with
    UserGroupInformation.getCurrentUser(), making the call costly.
  - As the user is always the same during Impala planning we can
    cache it simply by scheme + authority part of the location URI.
    After the patch getFileSystem() is called if scheme/authority
    is different than in the previous partition, leading to a
    single call for most tables.

Note that caching these values in HdfsPartition could also help but
preferred to avoid increasing the size of that class.

The patch also changes the implementation of how we count the number
of partitions per file system (to avoid the extra calls to getFsType()).
This made class SampledPartitionMetadata unnecessary and reverted
some of the changes in https://gerrit.cloudera.org/#/c/12282/

Benchmarks:
Measured using tpcds.store_sales (1824 partitions) union all'd 256
times:
explain select * from tpcds_parquet.store_sales256;
Before patch: 8.8s
After patch:  1.1s

The improvement is also visible on full tpcds benchmark:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCDS(2) | parquet / none / none | 0.53    | -8.99%     | 0.29       | -10.78%        |
+----------+-----------------------+---------+------------+------------+----------------+
The effect is less significant on higher scale factors.

Testing:
- ran core tests

Change-Id: Icf3e9c169d65c15df6a6762cc68fbb477fe64a7c
Reviewed-on: http://gerrit.cloudera.org:8080/20434
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
6 files changed, 83 insertions(+), 86 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/20434
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Icf3e9c169d65c15df6a6762cc68fbb477fe64a7c
Gerrit-Change-Number: 20434
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
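
For readers skimming the notification, the FileSystem caching idea from
item 2 of the commit message can be sketched roughly as follows. This is
not the code from the patch: the class name LastFsCache, the generic
resolver function (a hypothetical stand-in for the costly file system
lookup) and the example locations are all assumptions made for
illustration, and the sketch deliberately avoids the Hadoop API. It only
shows the "reuse the previous result while scheme + authority are
unchanged" pattern.

```java
import java.net.URI;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

/** Sketch only: remembers the last resolved file system, keyed by scheme + authority. */
final class LastFsCache<F> {
  // Hypothetical stand-in for the expensive lookup (not a Hadoop or Impala API).
  private final Function<URI, F> resolver;
  private String lastScheme;
  private String lastAuthority;
  private F lastFs;
  private boolean hasLast = false;
  private int resolveCalls = 0;

  LastFsCache(Function<URI, F> resolver) { this.resolver = resolver; }

  /** Returns the file system for 'location', resolving only if scheme/authority changed. */
  F get(String location) {
    URI uri = URI.create(location);
    String scheme = uri.getScheme();
    String authority = uri.getAuthority();
    if (!hasLast || !Objects.equals(scheme, lastScheme)
        || !Objects.equals(authority, lastAuthority)) {
      lastFs = resolver.apply(uri);
      ++resolveCalls;
      lastScheme = scheme;
      lastAuthority = authority;
      hasLast = true;
    }
    return lastFs;
  }

  /** Number of times the expensive resolver was actually invoked. */
  int resolveCalls() { return resolveCalls; }

  public static void main(String[] args) {
    LastFsCache<String> cache =
        new LastFsCache<>(u -> u.getScheme() + "://" + u.getAuthority());
    // Made-up partition locations: two on one file system, one on another.
    for (String loc : List.of(
        "hdfs://nn1:8020/warehouse/t/p=1",
        "hdfs://nn1:8020/warehouse/t/p=2",
        "s3a://bucket/warehouse/t/p=3")) {
      cache.get(loc);
    }
    System.out.println(cache.resolveCalls());  // prints 2
  }
}
```

Remembering only the previous entry, rather than keeping a map, is enough
when consecutive partitions usually share a file system, which matches the
commit message's observation that most tables end up with a single call.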