Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/20434 )

Change subject: IMPALA-12408: Optimize HdfsScanNode.computeScanRangeLocations()
......................................................................

IMPALA-12408: Optimize HdfsScanNode.computeScanRangeLocations()

computeScanRangeLocations() could be very slow for tables
with a large number of partitions. This patch tries to minimize
the use of two expensive function calls:
1. HdfsPartition.getLocation()
  - This looks like a simple property but actually decompresses
    the location string.
  - Was often called indirectly through getFsType().
  - After the patch it is only called once per partition.
2. hadoop.fs.FileSystem.getFileSystem()
  - Hadoop caches the FileSystem object but the key contains
    UserGroupInformation which is obtained with
    UserGroupInformation.getCurrentUser(), making the call costly.
  - As the user is always the same during Impala planning we can cache
    it simply by the scheme + authority part of the location URI. After
    the patch getFileSystem() is called only if the scheme/authority
    differs from that of the previous partition, leading to a single
    call for most tables.
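
The "remember the previous partition's scheme/authority" idea in item 2
can be sketched roughly as below. This is only an illustration, not the
code in the patch: the class LastFsCache, its method names, and the
expensiveLookup function (a stand-in for the costly FileSystem lookup)
are all hypothetical, and the sketch uses java.net.URI instead of Hadoop
types so that it is self-contained.

```java
import java.net.URI;
import java.util.Objects;
import java.util.function.Function;

/**
 * Hypothetical sketch: reuse the previous lookup result for as long as the
 * scheme and authority of consecutive locations stay the same.
 */
final class LastFsCache<T> {
  // Stand-in for the expensive call (assumption: not the real Hadoop API).
  private final Function<URI, T> expensiveLookup;
  private String lastScheme;
  private String lastAuthority;
  private T lastValue;
  private boolean hasValue = false;
  private int lookupCount = 0;

  LastFsCache(Function<URI, T> expensiveLookup) {
    this.expensiveLookup = expensiveLookup;
  }

  /** Returns the cached value unless scheme or authority changed. */
  T get(String location) {
    URI uri = URI.create(location);
    boolean sameFs = hasValue
        && Objects.equals(uri.getScheme(), lastScheme)
        && Objects.equals(uri.getAuthority(), lastAuthority);
    if (!sameFs) {
      // Only here do we pay for the expensive lookup.
      lastValue = expensiveLookup.apply(uri);
      lastScheme = uri.getScheme();
      lastAuthority = uri.getAuthority();
      hasValue = true;
      lookupCount++;
    }
    return lastValue;
  }

  /** Number of times the expensive lookup was actually invoked. */
  int getLookupCount() {
    return lookupCount;
  }
}
```

With a table whose partitions all live under the same scheme and
authority, a cache like this performs one lookup in total. A table that
alternates between file systems on every partition would get no benefit,
which is why the sketch only remembers the last value rather than
keeping a map.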

Note that caching these values in HdfsPartition could also help,
but we preferred to avoid increasing the size of that class.

The patch also changes the implementation of how we count the number
of partitions per file system (to avoid the extra calls to
getFsType()). This made class SampledPartitionMetadata unnecessary, and
the patch reverts some of the changes in https://gerrit.cloudera.org/#/c/12282/
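
As a rough illustration of counting partitions per file system in a
single pass, one possible shape is shown below. The class
PartitionFsCounter and its method are hypothetical, and the URI scheme
is used here as a stand-in for the file system type; the actual patch
works on Impala's own partition and file system types.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: tally partition locations per URI scheme in one pass. */
final class PartitionFsCounter {
  private PartitionFsCounter() {}

  static Map<String, Long> countByScheme(List<String> locations) {
    Map<String, Long> counts = new TreeMap<>();
    for (String location : locations) {
      // Each location is parsed once; the scheme is reused for the tally.
      String scheme = URI.create(location).getScheme();
      if (scheme == null) scheme = "unknown"; // location without a scheme
      counts.merge(scheme, 1L, Long::sum);
    }
    return counts;
  }
}
```

The point of the sketch is that the count is accumulated inside the same
loop that already resolves each partition's location, so no separate
per-partition call is needed just to learn the file system type.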

Benchmarks:
Measured using tpcds.store_sales (1824 partitions)
union all'd 256 times:
explain select * from tpcds_parquet.store_sales256;
Before patch: 8.8s
After patch: 1.1s

The improvement is also visible on full tpcds benchmark:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCDS(2) | parquet / none / none | 0.53    | -8.99%     | 0.29       | -10.78%        |
+----------+-----------------------+---------+------------+------------+----------------+
The effect is less significant on higher scale factors.

Testing:
- ran core tests

Change-Id: Icf3e9c169d65c15df6a6762cc68fbb477fe64a7c
Reviewed-on: http://gerrit.cloudera.org:8080/20434
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
6 files changed, 83 insertions(+), 86 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/20434
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Icf3e9c169d65c15df6a6762cc68fbb477fe64a7c
Gerrit-Change-Number: 20434
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
