Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15370 )

Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................


Patch Set 25:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: ReadFooterStream
> It is still not 100% clear to me - so until now we were reading the last 10
You are correct. Until now, we were reading the last 100KB but didn't actually 
use it to read metadata. Patch set 25 reuses that range by serving it through 
HdfsOrcScanner::ReadFooterStream() when possible.

You are also correct that the ORC lib might seek backward very early, right 
after reading the postscript, to read the rest of the footer region. The way 
the ORC lib behaves is described here:
https://github.com/apache/orc/blob/89af2cb/c%2B%2B/src/Reader.cc#L1352-L1376

The ORC lib guesses that the footer and postscript are both contained within 
the last 16KB (DIRECTORY_SIZE_GUESS) of the file, which it reads in a single 
stream read at line 1353.
If we're lucky and that 16KB tail already contains the footer as well, the ORC 
lib will not issue the second stream read at line 1369 (so far this is true 
for the small ORC files in tpcds_orc_def).
ReadFooterStream will not be used again after this 16KB read, since stream_ 
is considered fully exhausted.
If we're unlucky and the footer is not contained, or only partially contained, 
within that 16KB, the next footer read will need to be served by readRandom.
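
As a rough illustration of the behavior described above (a simplified sketch, 
not the actual ORC reader code), the tail reads can be modeled like this. The 
names ReadRequest, PlanTailReads and kDirectorySizeGuess are made up for the 
example, and the postscript and footer sizes are passed in directly, whereas 
the real reader only learns them after parsing the first read.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical type: one read issued against the file (offset + length).
struct ReadRequest {
  uint64_t offset;
  uint64_t length;
};

// Assumed to match DIRECTORY_SIZE_GUESS in the ORC C++ reader (16KB).
constexpr uint64_t kDirectorySizeGuess = 16 * 1024;

// Returns the reads needed to fetch postscript + footer.
// One entry: the 16KB guess was enough. Two entries: the footer did not
// fit in the guessed tail, so a second read has to go back further.
std::vector<ReadRequest> PlanTailReads(uint64_t file_length,
    uint64_t postscript_size, uint64_t footer_size) {
  std::vector<ReadRequest> reads;

  // First read: the last min(file_length, 16KB) bytes of the file.
  uint64_t read_size = std::min(file_length, kDirectorySizeGuess);
  reads.push_back({file_length - read_size, read_size});

  // The last byte of the file stores the postscript length, hence the +1.
  uint64_t tail_size = 1 + postscript_size + footer_size;

  // Sketch-only guard: a tail larger than the file means corrupt input,
  // so no second read is planned here.
  if (tail_size > file_length) return reads;

  // Second read: only when the footer is not fully inside the first read.
  if (tail_size > read_size) {
    reads.push_back({file_length - tail_size, footer_size});
  }
  return reads;
}
```

For example, PlanTailReads(1000000, 25, 1000) yields a single 16KB read, while 
PlanTailReads(1000000, 25, 50000) yields a second read that starts before the 
16KB tail. In this patch, that second read is the one that would fall through 
to readRandom.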

I suppose a 16KB tail is a good guess for most ORC files. If not, we should 
consider increasing DIRECTORY_SIZE_GUESS in the ORC lib. On the other hand, we 
can also make the initial scan range more efficient by lowering FOOTER_SIZE 
from 100KB to 16KB for ORC.



--
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Comment-Date: Mon, 31 Jan 2022 23:31:58 +0000
Gerrit-HasComments: Yes
