Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )
Change subject: IMPALA-6636: Use async IO in ORC scanner ...................................................................... Patch Set 25: (3 comments) http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h@127 PS19, Line 127: lib. > My memories are a bit hazy, but using anything but 0 caused some issues, pr Ah, thanks Csaba! I agree with you after reading the code. We should let ORC lib to decide the appropriate size. http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375 PS25, Line 1375: ReadFooterStream > I don't get the logic of this function - do we assume that the footer range Each subclass of HdfsScanner will have their initial stream_ assigned in HdfsScanner::Open() https://github.com/apache/impala/blob/9ed4b3689784670532e840c5cb0389bdd9d5c0e8/be/src/exec/hdfs-scanner.cc#L78 This initial stream_ came from IssueInitialScanRange, which in the case of HdfsOrcScanner came from IssueFooterRanges here: https://github.com/apache/impala/blob/9ed4b3689784670532e840c5cb0389bdd9d5c0e8/be/src/exec/hdfs-orc-scanner.cc#L55 This implies that the footer range (the last 100KB, not the first) is already loaded in memory, within the initial steam_. ORC lib will try to read the file footer first to figure out the file layout. So I add logic in HdfsOrcScanner::ScanRangeInputStream::read() to check if the requested [offset,offset+length) IsInFooterRange or not. If true, serve the read from stream_. http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1377 PS25, Line 1377: if (offset > stream_->file_offset()) { > Are we prepared for jumping backward instead of forward? Or this is impossi We found out there is a corner case where ORC lib might want to seek backward a bit (it is a bit complicated case of overlapping range during decompression). We're looking to improve this in OrcLib by eliminating the "seek backward a bit" piece. But in the meantime, we will always fallback to readRandom whenever the range ORC lib ask is behind the stream's current position. Similarly for this footer stream, IsInFooterRange will return false if the asked range is behind the current stream_->file_offset(), and it will fallback to readRandom. https://gerrit.cloudera.org/c/15370/25/be/src/exec/hdfs-orc-scanner.h#397 -- To view, visit http://gerrit.cloudera.org:8080/15370 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074 Gerrit-Change-Number: 15370 Gerrit-PatchSet: 25 Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Comment-Date: Mon, 31 Jan 2022 19:00:05 +0000 Gerrit-HasComments: Yes