Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )
Change subject: WIP IMPALA-6636: Use async IO in ORC scanner ...................................................................... Patch Set 13: (4 comments) Hi All, David has ran several benchmark run. Perf number seems to improve from this async IO prototype. I will proceed cleaning up the code and add proper commit message. Here are some that I plan to address next. http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.h@123 PS13, Line 123: // ExecEnv::GetInstance()->disk_io_mgr()->max_buffer_size(); Can be removed? http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc@300 PS13, Line 300: // stream_->ReleaseCompletedResources(true); : stream_->ReleaseCompletedResources(false); Calling 'ReleaseCompletedResources(true)' seems to be OK here? http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc@821 PS13, Line 821: // DCHECK_LE(offset + len, GetFileDesc(metadata->partition_id, file)->file_length) : // << "Scan range beyond end of file (offset=" << offset << ", len=" << len << ")"; Can be removed? http://gerrit.cloudera.org:8080/#/c/15370/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/15370/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2134 PS13, Line 2134: for (SlotDescriptor slot: desc_.getSlots()) { Just for our note, we found a corner case here for "select count(*)" kind of query over ORC. Somehow, desc._getSlots() is empty in this corner case, but HdfsOrcScanner::StartColumnReading actually see couple streams that is eligible for async read. Patch set 12 already adds a workaround within HdfsOrcScanner::StartColumnReading to TryIncreaseReservation 8KB (min_buffer_size) for each eligible stream. If it can't increase, then the rest of the stream will be read synchronously. I will file a follow up JIRA to document this situation. -- To view, visit http://gerrit.cloudera.org:8080/15370 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074 Gerrit-Change-Number: 15370 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Comment-Date: Tue, 30 Nov 2021 23:17:22 +0000 Gerrit-HasComments: Yes