etseidl commented on code in PR #8376:
URL: https://github.com/apache/arrow-rs/pull/8376#discussion_r2373121740
##########
parquet/src/file/serialized_reader.rs:
##########
@@ -732,8 +737,12 @@ impl SerializedPageReaderContext {
_page_index: usize,
_dictionary_page: bool,
) -> Result<PageHeader> {
- let mut prot = TCompactInputProtocol::new(input);
- Ok(PageHeader::read_from_in_protocol(&mut prot)?)
+ let mut prot = ThriftReadInputProtocol::new(input);
Review Comment:
I was thinking we could short-circuit the footer parsing and exit right
after decoding the schema. With the schema in hand, we could then jump back in,
skip past the schema, and skip over any row groups or columns
that we don't want. This would still incur some of the Thrift overhead, but
skipping objects is quite a bit faster than decoding them.
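
To illustrate why skipping is cheaper than decoding, here is a minimal sketch for a single varint-length-prefixed string field, which is how the Thrift compact protocol lays out strings. The function names (`read_varint`, `decode_string`, `skip_string`) are hypothetical and are not the API in this PR or in the `parquet` crate; skipping a whole struct would additionally have to walk its field headers.

```rust
/// Read a ULEB128 varint at `*pos`, advancing `*pos` past it.
/// Returns `None` on truncated input or a varint longer than 64 bits.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        if shift >= 64 {
            return None;
        }
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Decode path: bounds check, UTF-8 validation, and a heap allocation.
fn decode_string(buf: &[u8], pos: &mut usize) -> Option<String> {
    let len = read_varint(buf, pos)? as usize;
    let end = (*pos).checked_add(len)?;
    let bytes = buf.get(*pos..end)?;
    let s = std::str::from_utf8(bytes).ok()?.to_owned();
    *pos = end;
    Some(s)
}

/// Skip path: read the length prefix and advance the cursor.
/// No validation of the payload and no allocation.
fn skip_string(buf: &[u8], pos: &mut usize) -> Option<()> {
    let len = read_varint(buf, pos)? as usize;
    let end = (*pos).checked_add(len)?;
    if end > buf.len() {
        return None;
    }
    *pos = end;
    Some(())
}
```

Calling `skip_string` on the fields we don't need and `decode_string` only on the ones we do leaves the cursor in the same place either way, which is what makes the "jump back in and skip" approach workable.
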
I know I've seen this idea kicked around before, but we could also do a fast
indexing pass over the metadata where we save the starting offsets of each row
group and column chunk. We could then just do random access into the footer and
decode only those structs we need.
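
A rough sketch of the indexing idea, using a toy layout rather than real Thrift: each record here is a 4-byte little-endian length followed by its payload. That layout and all names (`build_index`, `decode_record`, `encode_records`) are assumptions for illustration only. Thrift structs are not length-prefixed, so a real indexing pass would have to skip field by field to find each struct's end.

```rust
/// Read the 4-byte little-endian length prefix at `offset` (toy layout).
fn read_len(buf: &[u8], offset: usize) -> Option<usize> {
    let b = buf.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize)
}

/// Indexing pass: walk the buffer once and save (payload start, length)
/// for every record, without decoding any payload.
fn build_index(buf: &[u8]) -> Option<Vec<(usize, usize)>> {
    let mut index = Vec::new();
    let mut pos = 0usize;
    while pos < buf.len() {
        let len = read_len(buf, pos)?;
        let start = pos + 4;
        let end = start.checked_add(len)?;
        if end > buf.len() {
            return None; // truncated record
        }
        index.push((start, len));
        pos = end;
    }
    Some(index)
}

/// Random access: decode only record `i`, using the saved offsets.
fn decode_record<'a>(buf: &'a [u8], index: &[(usize, usize)], i: usize) -> Option<&'a str> {
    let &(start, len) = index.get(i)?;
    std::str::from_utf8(buf.get(start..start + len)?).ok()
}

/// Helper that builds a buffer in the toy layout, for the usage example.
fn encode_records(records: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    for r in records {
        buf.extend_from_slice(&(r.len() as u32).to_le_bytes());
        buf.extend_from_slice(r.as_bytes());
    }
    buf
}
```

With a buffer built by `encode_records(&["rg0", "rg1-col0", "rg1-col1"])`, one call to `build_index` is enough to then fetch record 2 directly via `decode_record` without touching records 0 and 1.
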