[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Csaba Ringhofer (Code Review) Mon, 31 Jan 2022 03:14:58 -0800

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15370 )


Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(5 comments)

Thanks a lot for continuing the work on this!

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15370/19/be/src/exec/hdfs-orc-scanner.h@127
PS19, Line 127: lib.
> Done
My memories are a bit hazy, but using anything but 0 caused some issues, 
probably in performance. 0 means that the ORC lib should decide the size, and 
using anything else can lead to extra copying of memory. Should have added a 
comment about this :(

The value returned will be used here:
https://github.com/apache/orc/blob/f381f2dd78e6b305b8a4e0b07124034de72ef1a1/c%2B%2B/src/StripeStream.cc#L97
and will be passed here:
https://github.com/apache/orc/blob/d1755250c4a9841a332093bf8ef03107776eecca/c%2B%2B/src/Compression.cc#L401
and here:
https://github.com/apache/orc/blob/d1755250c4a9841a332093bf8ef03107776eecca/c%2B%2B/src/Compression.cc#L758

So the effect may actually depend on the type of compression used, namely is it 
blocked or not.

I would prefer to keep it 0 unless we have the capacity to benchmark its effect 
with different compressions.


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@109
PS25, Line 109:   return split_range->expected_local() && (offset >= 
split_range->offset())
The same logic is also implemented here:
https://github.com/apache/impala/blob/b28da054f3595bb92873433211438306fc22fbc7/be/src/exec/parquet/parquet-page-reader.cc#L89

An idea to merge the two is to move this function to ScanRange as a member 
function.


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: ReadFooterStream
I don't get the logic of this function - do we assume that the footer range 
(first 100KB of the file AFAIK 
https://github.com/apache/impala/blob/9ed4b3689784670532e840c5cb0389bdd9d5c0e8/be/src/exec/hdfs-scanner.h#L386)
will be in memory? But in that case why do we need the streaming logic, 
shouldn't it be random reads to a buffer?


http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1377
PS25, Line 1377:   if (offset > stream_->file_offset()) {
Are we prepared for jumping backward instead of forward? Or this is impossible 
during footer processing?


http://gerrit.cloudera.org:8080/#/c/15370/25/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/15370/25/common/thrift/Query.thrift@572
PS25, Line 572:   141: optional bool orc_async_read = false;
I disagree with turning off async read by default, as this means that it is 
totally untested at the moment.



--
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Comment-Date: Mon, 31 Jan 2022 11:10:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6636: Use async IO in ORC scanner

Reply via email to