alamb commented on PR #18112:
URL: https://github.com/apache/datafusion/pull/18112#issuecomment-3418442831

   From what I can tell, to read ParquetMetadata with the default configuration 
of ListingTable, DataFusion will issue 3 object store requests:
   * 8 bytes for the footer (which has the length of the metadata)
   * N bytes for the metadata (has offsets to the page index, but not the page 
index structures itself)
   * M bytes for the page index structures (which are typically right before 
the metadata in the file, but not required to be)
   
   The first 8-byte request could be avoided by changing the default 
prefetch_hint, aka https://github.com/apache/datafusion/issues/18118
   
   I think you could potentially avoid the third request if you extended the 
prefetch_hint code to use the page index if it was fetched in the initial 
request
   
   So the flow would be: DataFusion makes an initial request for the last 
`prefetch_hint` bytes in the file. If that happens to contain enough bytes for 
the metadata and page index, no more requests would be made. If additional data 
was needed, additional requests would be made
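
   A minimal sketch of that flow, using only the Rust standard library. 
`CountingStore`, `read_cached_or_fetch` and `fetch_metadata_bytes` are 
hypothetical names made up for illustration, not DataFusion or parquet APIs, 
and the metadata bytes are returned undecoded. It relies only on the footer 
layout described above (a 4-byte little-endian metadata length followed by the 
`PAR1` magic):

```rust
use std::ops::Range;

/// Parquet footer: 4-byte little-endian metadata length + 4-byte "PAR1" magic.
const FOOTER_LEN: u64 = 8;

/// Hypothetical stand-in for an object store: an in-memory file that
/// counts how many range requests were issued against it.
struct CountingStore {
    data: Vec<u8>,
    requests: usize,
}

impl CountingStore {
    fn new(data: Vec<u8>) -> Self {
        Self { data, requests: 0 }
    }

    fn get_range(&mut self, range: Range<u64>) -> Vec<u8> {
        self.requests += 1;
        self.data[range.start as usize..range.end as usize].to_vec()
    }
}

/// Serve `range` from the prefetched suffix when it is fully covered,
/// otherwise issue one more request (hypothetical helper).
fn read_cached_or_fetch(
    store: &mut CountingStore,
    suffix_start: u64,
    suffix: &[u8],
    range: Range<u64>,
) -> Vec<u8> {
    let suffix_end = suffix_start + suffix.len() as u64;
    if range.start >= suffix_start && range.end <= suffix_end {
        let lo = (range.start - suffix_start) as usize;
        let hi = (range.end - suffix_start) as usize;
        suffix[lo..hi].to_vec()
    } else {
        store.get_range(range)
    }
}

/// Fetch the raw (undecoded) metadata bytes, issuing a single request when
/// the last `prefetch_hint` bytes already contain them.
fn fetch_metadata_bytes(store: &mut CountingStore, prefetch_hint: u64) -> Option<Vec<u8>> {
    let file_len = store.data.len() as u64;
    if file_len < FOOTER_LEN {
        return None;
    }
    // Always read at least the footer, and never more than the whole file.
    let suffix_len = prefetch_hint.max(FOOTER_LEN).min(file_len);
    let suffix_start = file_len - suffix_len;
    let suffix = store.get_range(suffix_start..file_len);

    let footer = &suffix[suffix.len() - FOOTER_LEN as usize..];
    if footer[4..] != b"PAR1"[..] {
        return None;
    }
    let metadata_len =
        u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as u64;
    let metadata_end = file_len - FOOTER_LEN;
    let metadata_start = metadata_end.checked_sub(metadata_len)?;

    Some(read_cached_or_fetch(
        store,
        suffix_start,
        &suffix,
        metadata_start..metadata_end,
    ))
}
```

   The page index ranges only become known after the metadata is decoded, 
which this sketch does not do. Once they are known, the same 
`read_cached_or_fetch` check could be applied to them, so the third request is 
skipped whenever the initial suffix already covers the page index.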
   
   The newly added push metadata decoder likely makes this easier to implement: 
   - 
https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataPushDecoder.html
   
   (as it will tell you what ranges are needed)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

