Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h@127
PS19, Line 127: lib.
> My memories are a bit hazy, but using anything but 0 caused some issues, pr
Ah, thanks Csaba! I agree with you after reading the code. We should let ORC 
lib to decide the appropriate size.


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: ReadFooterStream
> I don't get the logic of this function - do we assume that the footer range
Each subclass of HdfsScanner will have their initial stream_ assigned in 
HdfsScanner::Open()
https://github.com/apache/impala/blob/9ed4b3689784670532e840c5cb0389bdd9d5c0e8/be/src/exec/hdfs-scanner.cc#L78

This initial stream_ came from IssueInitialScanRange, which in the case of 
HdfsOrcScanner came from IssueFooterRanges here:
https://github.com/apache/impala/blob/9ed4b3689784670532e840c5cb0389bdd9d5c0e8/be/src/exec/hdfs-orc-scanner.cc#L55

This implies that the footer range (the last 100KB, not the first) is already 
loaded in memory, within the initial steam_.
ORC lib will try to read the file footer first to figure out the file layout. 
So I add logic in HdfsOrcScanner::ScanRangeInputStream::read() to check if the 
requested [offset,offset+length) IsInFooterRange or not.
If true, serve the read from stream_.


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1377
PS25, Line 1377:   if (offset > stream_->file_offset()) {
> Are we prepared for jumping backward instead of forward? Or this is impossi
We found out there is a corner case where ORC lib might want to seek backward a 
bit (it is a bit complicated case of overlapping range during decompression).
We're looking to improve this in OrcLib by eliminating the "seek backward a 
bit" piece. But in the meantime, we will always fallback to readRandom whenever 
the range ORC lib ask is behind the stream's current position.
Similarly for this footer stream, IsInFooterRange will return false if the 
asked range is behind the current stream_->file_offset(), and it will fallback 
to readRandom.
https://gerrit.cloudera.org/c/15370/25/be/src/exec/hdfs-orc-scanner.h#397



--
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Comment-Date: Mon, 31 Jan 2022 19:00:05 +0000
Gerrit-HasComments: Yes

Reply via email to