Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )
Change subject: IMPALA-6636: Use async IO in ORC scanner ...................................................................... Patch Set 25: (8 comments) > Patch Set 25: > > (1 comment) > > Patch set 25 add 7 counters and move 3 common counters between HdfsOrcScanner > and HdfsParquetScanner into HdfsColumnarScanner. > > This patch set also add async io if requested range falls in the footer > stream range (initial scan range issued by IssueInitialRanges). Thanks for adding the optimization on footer scan range! I think it really boost the performance on small files, e.g. tpcds_orc_def.store_sales has many files smaller than 100KB. Finish reviewing the reservation part. I can give my +1 after the comments are resolved. http://gerrit.cloudera.org:8080/#/c/15370/25//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15370/25//COMMIT_MSG@24 PS25, Line 24: relies on the backend to divide them as : needed. I think we can do this in the orc-scanner as well. There are some APIs like orc::Reader::getMemoryUseByTypeId which can be used after the scanner parses the footer: https://github.com/apache/orc/blob/rel/release-1.7.0/c%2B%2B/include/orc/Reader.hh#L503 Could you create a follow-up JIRA for this? http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2106 PS25, Line 2106: || format == HdfsFileFormat.ORC Should we check the orc_async_read flag here? I think this causes the estimated reservation changes in resource-requirements.test http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2143 PS25, Line 2143: We may need to adjust this : * logic for nested types in non-shredded columnar formats (e.g. IMPALA-6503 - ORC) : * if/when that is added. nit: Could you update this? http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2161 PS25, Line 2161: Long nit: long http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2181 PS25, Line 2181: Long nit: long http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2199 PS25, Line 2199: the current ORC scanner does not have the select : // count(*) optimization yet like in Parquet. Isn't this the optimization for count(*)? https://github.com/apache/impala/blob/c127b6b1a72d83778378dc181d9328b4cb7dea9e/be/src/exec/hdfs-orc-scanner.cc#L610-L630 http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2274 PS25, Line 2274: , boolean orcAsyncRead not used? http://gerrit.cloudera.org:8080/#/c/15370/25/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2287 PS25, Line 2287: // Estimate the data size with dictionary compression, assuming all distinct : // values occur in the scan range with the largest number of rows and that each : // value can be represented with approximately log2(ndv) bits. Even if Parquet : // dictionary compression does not kick in, general-purpose compression : // algorithms like Snappy can often find redundancy when there are repeated : // values. : long dictBytes = (long)(stats.getAvgSize() * stats.getNumDistinctValues()); : long bitsPerVal = BitUtil.log2Ceiling(stats.getNumDistinctValues()); : long encodedDataBytes = bitsPerVal * maxScanRangeNumRows_ / 8; : reservationBytes = Math.min(reservationBytes, dictBytes + encodedDataBytes); We probably need to review this for ORC. We can create a follow up JIRA if this estimation is still ok (i.e. if it doesn't break anything). -- To view, visit http://gerrit.cloudera.org:8080/15370 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074 Gerrit-Change-Number: 15370 Gerrit-PatchSet: 25 Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Comment-Date: Sun, 30 Jan 2022 08:39:54 +0000 Gerrit-HasComments: Yes