Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15370 )
Change subject: IMPALA-6636: Use async IO in ORC scanner
......................................................................

Patch Set 25:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15370/25/be/src/exec/hdfs-orc-scanner.cc@1375
PS25, Line 1375: ReadFooterStream
> It is still not 100% clear to me - so until now we were reading the last 10

You are correct. Until now, we were reading the last 100KB of the file but didn't actually use it to read metadata. Patch set 25 reuses it by serving the range through HdfsOrcScanner::ReadFooterStream() when possible.

You are also correct that the ORC lib might move backward very early, right after reading the postscript, to read the rest of the footer region. The way the ORC lib behaves is described here:
https://github.com/apache/orc/blob/89af2cb/c%2B%2B/src/Reader.cc#L1352-L1376

The ORC lib guesses that the footer and postscript are both contained within the last 16KB (DIRECTORY_SIZE_GUESS) of the file, which it reads in a single stream read at line 1353.

If we're lucky and the 16KB tail already contains the footer as well, the ORC lib will not issue the second stream read at line 1369 (so far this is true for the small ORC files in tpcds_orc_def). ReadFooterStream will not be used anymore after this 16KB read, since stream_ is deemed fully exhausted.

If we're unlucky and the footer is not contained, or only partially contained, within that 16KB, the next footer read will need to be served by readRandom.

I suppose a 16KB tail is a good guess for most ORC files. If not, then we should consider increasing DIRECTORY_SIZE_GUESS in the ORC lib. On the other hand, we can also make the initial scan range more efficient by lowering FOOTER_SIZE from 100KB to 16KB for ORC.
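To make the lucky/unlucky cases above concrete, here is a minimal sketch of the tail-read decision as I understand it from the linked Reader.cc lines. It is not the ORC lib's actual code: TailReadPlan, PlanTailReads and kDirectorySizeGuess are hypothetical names for illustration, and the sketch assumes the tail is laid out as footer, then postscript, then one trailing byte holding the postscript length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed to match ORC's DIRECTORY_SIZE_GUESS (16KB) mentioned above.
constexpr uint64_t kDirectorySizeGuess = 16 * 1024;

// Hypothetical description of the stream reads needed to load the file tail.
struct TailReadPlan {
  uint64_t first_read_offset;   // where the initial tail read starts
  uint64_t first_read_size;     // min(file length, 16KB)
  bool needs_second_read;       // true if the footer did not fit in the first read
  uint64_t second_read_offset;  // start of the footer (only if needs_second_read)
  uint64_t second_read_size;    // footer size (only if needs_second_read)
};

// Decides whether one read of the last 16KB covers footer + postscript,
// or whether a second, backward read is needed for the footer.
TailReadPlan PlanTailReads(uint64_t file_length, uint64_t postscript_size,
                           uint64_t footer_size) {
  TailReadPlan plan{};
  plan.first_read_size = std::min(file_length, kDirectorySizeGuess);
  plan.first_read_offset = file_length - plan.first_read_size;

  // Footer + postscript + the single byte storing the postscript length.
  const uint64_t tail_size = 1 + postscript_size + footer_size;
  assert(tail_size <= file_length);  // precondition of this sketch

  if (tail_size > plan.first_read_size) {
    // Unlucky case: the footer starts before the 16KB window.
    plan.needs_second_read = true;
    plan.second_read_size = footer_size;
    plan.second_read_offset = file_length - tail_size;
  }
  return plan;
}
```

With a 1MB file, a 25-byte postscript and a 1000-byte footer, the plan needs only the first read. With a 20000-byte footer, the second read starts before the 16KB window, which is the case that would fall through to readRandom.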
--
To view, visit http://gerrit.cloudera.org:8080/15370
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I348ad9e55f0cae7dff0d74d941b026dcbf5e4074
Gerrit-Change-Number: 15370
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Comment-Date: Mon, 31 Jan 2022 23:31:58 +0000
Gerrit-HasComments: Yes