alamb commented on PR #18112: URL: https://github.com/apache/datafusion/pull/18112#issuecomment-3425800587

> This PR is still needed, because currently we split two part of loading metadata if we enable page index, and the second one loading page index will not use reminder, so even we setting prefetch_hint now, we can't reduce the page index request without this PR.

As I understand it, your goal is to reduce the number of object store requests when loading parquet metadata (which is a good goal 👍).

However, I am not sure this PR achieves this goal. Instead, what I think this PR does is change **when** the requests are made to object store (a bit earlier in the processing pipeline), but not the actual number of them.

I think the best way to ensure we minimize the object store requests is:
1. Use `prefetch_hint` to make a single request and read a portion of the end of the file
2. Try to parse both the metadata and page index from the result (if the prefetch was big enough it will have both structures)

If the prefetch fetches enough bytes, this strategy will result in a single object store request to read all required metadata.

I realize setting prefetch today may not actually also parse the page index, but I think that is what we should be working towards (rather than adding another flag, unless there is some need I am missing).

I personally suggest starting with an end-to-end test (perhaps in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_config.rs) that illustrates what is happening:
1. Runs a SQL query from a parquet file
2. Uses an instrumented object store (e.g. something like https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/core/src/datasource/file_format/parquet.rs#L304-L303 or https://github.com/apache/datafusion/blob/93f136c06dcb6d4cb362110ae5a4b2b3b8571bb7/datafusion-cli/src/object_storage/instrumented.rs#L253-L252) to verify what requests are made

Then we can configure various prefetch settings and ensure that only the expected number of requests are made.
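
To make the request-counting idea concrete, here is a rough std-only sketch. It is not DataFusion or `object_store` code: `CountingStore`, `toy_file`, `load`, and `requests_for` are hypothetical names, and the file layout is a simplification (only the trailing 4-byte length plus `PAR1` magic matches the real Parquet footer). It shows how a large enough prefetch turns three range requests into one.

```rust
use std::cell::Cell;

/// Hypothetical in-memory stand-in for an instrumented object store:
/// serves byte ranges and counts how many range requests were made.
struct CountingStore {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl CountingStore {
    fn new(data: Vec<u8>) -> Self {
        Self { data, requests: Cell::new(0) }
    }

    /// Return the bytes in `start..end`, recording one request.
    fn get_range(&self, start: usize, end: usize) -> Vec<u8> {
        self.requests.set(self.requests.get() + 1);
        self.data[start..end].to_vec()
    }
}

/// Toy layout: [data][page index][metadata][metadata len: u32 LE]["PAR1"].
fn toy_file(data_len: usize, index_len: usize, meta_len: usize) -> Vec<u8> {
    let mut f = vec![0u8; data_len];
    f.extend_from_slice(&vec![1u8; index_len]);
    f.extend_from_slice(&vec![2u8; meta_len]);
    f.extend_from_slice(&(meta_len as u32).to_le_bytes());
    f.extend_from_slice(b"PAR1");
    f
}

/// Read metadata and page index, reusing the prefetched suffix when it
/// already covers them. `index_len` is passed in for simplicity; a real
/// reader learns the page index location from the metadata itself.
fn load(
    store: &CountingStore,
    file_len: usize,
    prefetch: usize,
    index_len: usize,
) -> (Vec<u8>, Vec<u8>) {
    // Always read at least the 8-byte footer, never more than the file.
    let prefetch = prefetch.clamp(8, file_len);
    let suffix_start = file_len - prefetch;
    let suffix = store.get_range(suffix_start, file_len);

    let footer = &suffix[suffix.len() - 8..];
    assert_eq!(&footer[4..], &b"PAR1"[..]);
    let meta_len =
        u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as usize;
    let meta_start = file_len - 8 - meta_len;
    let index_start = meta_start - index_len;

    // Slice from the suffix if it covers the range, otherwise fetch again.
    let slice_or_fetch = |start: usize, end: usize| -> Vec<u8> {
        if start >= suffix_start {
            suffix[start - suffix_start..end - suffix_start].to_vec()
        } else {
            store.get_range(start, end)
        }
    };

    let meta = slice_or_fetch(meta_start, file_len - 8);
    let index = slice_or_fetch(index_start, meta_start);
    (meta, index)
}

/// Number of range requests needed for a given prefetch size on a toy file
/// with 1000 data bytes, a 50-byte page index, and 100 bytes of metadata.
fn requests_for(prefetch: usize) -> usize {
    let file = toy_file(1000, 50, 100);
    let file_len = file.len();
    let store = CountingStore::new(file);
    let (meta, index) = load(&store, file_len, prefetch, 50);
    assert_eq!((meta.len(), index.len()), (100, 50));
    store.requests.get()
}
```

With this toy file, a prefetch that only covers the footer needs three requests, one that also covers the metadata needs two, and one that covers the page index as well needs a single request. An end-to-end test against a real instrumented object store could assert on the request count in the same way.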
