Re: [PR] ORC-262: [C++] Support async io prefetch for orc c++ lib [orc]

via GitHub Thu, 24 Oct 2024 18:57:21 -0700


taiyang-li commented on code in PR #2048:
URL: https://github.com/apache/orc/pull/2048#discussion_r1815888206



##########
c++/include/orc/Reader.hh:
##########
@@ -605,6 +612,26 @@ namespace orc {
      */
     virtual std::map<uint32_t, BloomFilterIndex> getBloomFilters(
         uint32_t stripeIndex, const std::set<uint32_t>& included) const = 0;
+
+    /**
+     * Get the input stream for the ORC file.
+     */
+    virtual InputStream* getStream() const = 0;
+
+    /**
+     * Get the footer of the ORC file.
+     */
+    virtual const proto::Footer* getFooter() const = 0;
+
+    /**
+     * Get the schema of the ORC file.
+     */
+    virtual const proto::Metadata* getMetadata() const = 0;
+
+    virtual void preBuffer(const std::vector<int>& stripes, const std::list<uint64_t>& includeTypes,

Review Comment:
   I don't think `preBuffer` burdens users, because it is an optional 
operation. Users who don't need to hide I/O latency can ignore it, and the 
behavior is exactly the same as in previous versions. Users who need higher 
performance and know which stripes and columns they will read can prefetch 
the next stripe with `preBuffer` while processing the current one. In short, 
I think calling `preBuffer` manually is a necessary cognitive cost for users 
who want higher performance.
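
A minimal sketch of the call pattern described above, so the ordering is concrete. It does not use the real ORC API: `MockReader`, `readStripe`, and `scanWithPrefetch` are hypothetical names, and the two-argument `preBuffer` here is an assumption, since the signature in the diff is cut off after `includeTypes` and may take more parameters. The mock only records call order; a real implementation would issue the I/O asynchronously.

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <vector>

// Hypothetical stand-in for the reader; NOT the real orc::Reader.
// It records the order of calls instead of doing any I/O.
struct MockReader {
  int stripeCount = 0;
  std::vector<std::string> log;

  // Assumed two-argument form; the real signature is truncated in the diff.
  void preBuffer(const std::vector<int>& stripes,
                 const std::list<uint64_t>& includeTypes) {
    (void)includeTypes;  // a real reader would use this to select column streams
    for (int s : stripes) {
      log.push_back("prebuffer " + std::to_string(s));
    }
  }

  // Hypothetical helper standing in for decoding one stripe.
  void readStripe(int stripe) {
    log.push_back("read " + std::to_string(stripe));
  }
};

// Request stripe i+1 before processing stripe i, so that (with a real
// asynchronous reader) the I/O for the next stripe overlaps with the
// decoding of the current one.
std::vector<std::string> scanWithPrefetch(MockReader& reader,
                                          const std::list<uint64_t>& includeTypes) {
  if (reader.stripeCount > 0) {
    reader.preBuffer({0}, includeTypes);  // warm up the first stripe
  }
  for (int i = 0; i < reader.stripeCount; ++i) {
    if (i + 1 < reader.stripeCount) {
      reader.preBuffer({i + 1}, includeTypes);  // prefetch the next stripe
    }
    reader.readStripe(i);  // process the current stripe
  }
  return reader.log;
}
```

A caller that never invokes `preBuffer` simply runs the `readStripe` loop, which is the sense in which the operation is optional.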
   
   Please refer to my performance comparison:
https://github.com/ClickHouse/ClickHouse/pull/70534#issuecomment-2403754651



