Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15370 )

Change subject: WIP IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 13:

(4 comments)

Hi All,
David has ran several benchmark run. Perf number seems to improve from this 
async IO prototype.
I will proceed cleaning up the code and add proper commit message. Here are 
some that I plan to address next.

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.h@123
PS13, Line 123:  // ExecEnv::GetInstance()->disk_io_mgr()->max_buffer_size();
Can be removed?


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-orc-scanner.cc@300
PS13, Line 300:     // stream_->ReleaseCompletedResources(true);
              :     stream_->ReleaseCompletedResources(false);
Calling 'ReleaseCompletedResources(true)' seems to be OK here?


http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/15370/13/be/src/exec/hdfs-scan-node-base.cc@821
PS13, Line 821:   // DCHECK_LE(offset + len, 
GetFileDesc(metadata->partition_id, file)->file_length)
              :   //    << "Scan range beyond end of file (offset=" << offset 
<< ", len=" << len << ")";
Can be removed?


http://gerrit.cloudera.org:8080/#/c/15370/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15370/13/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@2134
PS13, Line 2134:     for (SlotDescriptor slot: desc_.getSlots()) {
Just for our note, we found a corner case here for "select count(*)" kind of 
query over ORC.
Somehow, desc._getSlots() is empty in this corner case, but 
HdfsOrcScanner::StartColumnReading actually see couple streams that is eligible 
for async read.

Patch set 12 already adds a workaround within 
HdfsOrcScanner::StartColumnReading to TryIncreaseReservation 8KB 
(min_buffer_size) for each eligible stream. If it can't increase, then the rest 
of the stream will be read synchronously. I will file a follow up JIRA to 
document this situation.



--
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Nov 2021 23:17:22 +0000
Gerrit-HasComments: Yes

Reply via email to