etseidl commented on code in PR #8376:
URL: https://github.com/apache/arrow-rs/pull/8376#discussion_r2373121740


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -732,8 +737,12 @@ impl SerializedPageReaderContext {
         _page_index: usize,
         _dictionary_page: bool,
     ) -> Result<PageHeader> {
-        let mut prot = TCompactInputProtocol::new(input);
-        Ok(PageHeader::read_from_in_protocol(&mut prot)?)
+        let mut prot = ThriftReadInputProtocol::new(input);

Review Comment:
   I was thinking we could short-circuit the footer parsing and exit right 
after decoding the schema. With the schema in hand, we could then jump back in, 
skip past the schema, and skip over any row groups or columns that we don't 
want. This would still incur some of the thrift overhead, but skipping 
objects is quite a bit faster than decoding them.
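
   To illustrate why skipping is cheap, here is a minimal sketch of walking past 
a Thrift compact-protocol struct without materializing it. It is not code from 
this PR: `read_varint`, `skip_value`, and `skip_struct` are hypothetical names, 
and maps are left unhandled to keep it short.

```rust
/// Read an unsigned LEB128 varint, advancing `pos`.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed varint
        }
    }
}

/// Skip one value of the given compact-protocol type code.
fn skip_value(buf: &[u8], pos: &mut usize, ty: u8) -> Option<()> {
    match ty {
        1 | 2 => {} // bool field: the value lives in the field header
        3 => *pos = pos.checked_add(1)?, // i8
        4 | 5 | 6 => {
            read_varint(buf, pos)?; // i16 / i32 / i64 are zigzag varints
        }
        7 => *pos = pos.checked_add(8)?, // double
        8 => {
            // binary / string: varint length followed by that many bytes
            let len = usize::try_from(read_varint(buf, pos)?).ok()?;
            *pos = pos.checked_add(len)?;
        }
        9 | 10 => {
            // list / set: size in the high nibble, element type in the low one
            let header = *buf.get(*pos)?;
            *pos += 1;
            let elem_ty = header & 0x0f;
            let mut len = u64::from(header >> 4);
            if len == 15 {
                len = read_varint(buf, pos)?;
            }
            for _ in 0..len {
                if elem_ty == 1 || elem_ty == 2 {
                    *pos = pos.checked_add(1)?; // bools take one byte inside a list
                } else {
                    skip_value(buf, pos, elem_ty)?;
                }
            }
        }
        12 => skip_struct(buf, pos)?,
        _ => return None, // maps and unknown types: not handled in this sketch
    }
    // Reject values that claim to extend past the end of the buffer.
    (*pos <= buf.len()).then_some(())
}

/// Skip a whole struct: field headers until the STOP byte (0x00).
fn skip_struct(buf: &[u8], pos: &mut usize) -> Option<()> {
    loop {
        let header = *buf.get(*pos)?;
        *pos += 1;
        if header == 0 {
            return Some(());
        }
        if header >> 4 == 0 {
            read_varint(buf, pos)?; // long form: explicit field id follows
        }
        skip_value(buf, pos, header & 0x0f)?;
    }
}
```

   The skipper never allocates and never builds a value; it only advances 
`pos`, which is where the savings over a full decode would come from.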
   
   I know I've seen this idea kicked around before, but we could also do a fast 
indexing pass over the metadata where we save the starting offsets of each row 
group and column chunk. We could then just do random access into the footer and 
decode only those structs we need.
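
   A toy sketch of the index-then-random-access idea follows. The 
`[len: u8][payload]` record layout is an assumption made for illustration and 
is not the Thrift footer format. Real Thrift structs carry no length prefix, so 
an actual indexing pass would need a skip routine to find each boundary. 
`index_records` and `decode_at` are hypothetical names.

```rust
/// Indexing pass: record the starting offset of every record without
/// decoding any payload. Records are `[len: u8][payload...]` (toy layout).
fn index_records(buf: &[u8]) -> Option<Vec<usize>> {
    let mut offsets = Vec::new();
    let mut pos = 0usize;
    while pos < buf.len() {
        offsets.push(pos);
        let len = usize::from(buf[pos]);
        pos = pos.checked_add(1 + len)?;
        if pos > buf.len() {
            return None; // record claims more bytes than the buffer holds
        }
    }
    Some(offsets)
}

/// Random access: decode only the record that starts at `offset`.
/// "Decoding" here just means viewing the payload as UTF-8.
fn decode_at(buf: &[u8], offset: usize) -> Option<&str> {
    let len = usize::from(*buf.get(offset)?);
    let payload = buf.get(offset + 1..offset + 1 + len)?;
    std::str::from_utf8(payload).ok()
}
```

   With the offsets saved, fetching the second record is 
`decode_at(buf, offsets[1])`, and the other records are never touched.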


