BlakeOrth commented on PR #18160: URL: https://github.com/apache/datafusion/pull/18160#issuecomment-3429171331
@alamb Yes, agreed. This should be a positive performance improvement on most datasets when using high-latency storage, especially since fetching the parquet footer and then the parquet metadata is a strictly sequential operation for each file. The benchmark results here are a bit curious and look inconsistent (perhaps for reasons outside everyone's control). However, I wouldn't be too surprised to see minor performance improvements for some queries backed by local disk. The 8-byte fetch for the parquet footer is smaller than the block size of pretty much any reasonable storage device or file system, so the local disk and file system are probably doing the same amount of work in either case, and this PR eliminates one extra call to disk along with any internal runtime scheduling around managing that call.
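For anyone following along, here is a rough sketch of the two read patterns being compared. It is not DataFusion's or the parquet crate's actual code: `CountingReader`, `make_fake_file`, `metadata_two_reads`, and `metadata_with_hint` are hypothetical names, the in-memory buffer stands in for a storage backend, and the hint sizes are arbitrary. The only part taken from the Parquet format itself is the trailing 8 bytes: a 4-byte little-endian metadata length followed by the `PAR1` magic.

```python
import struct

MAGIC = b"PAR1"
FOOTER_LEN = 8  # 4-byte little-endian metadata length + 4-byte magic


class CountingReader:
    """Hypothetical stand-in for storage: serves byte ranges and counts requests."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0

    def read_range(self, start: int, end: int) -> bytes:
        self.reads += 1
        return self.data[start:end]


def make_fake_file(metadata: bytes) -> bytes:
    """Build a toy file with a Parquet-style tail (the body is just filler)."""
    body = MAGIC + b"x" * 100
    return body + metadata + struct.pack("<I", len(metadata)) + MAGIC


def metadata_two_reads(reader: CountingReader) -> bytes:
    """Read the 8-byte footer first, then the metadata: two sequential requests."""
    size = len(reader.data)
    footer = reader.read_range(size - FOOTER_LEN, size)
    (meta_len,) = struct.unpack("<I", footer[:4])
    if footer[4:] != MAGIC:
        raise ValueError("not a Parquet-style footer")
    meta_end = size - FOOTER_LEN
    return reader.read_range(meta_end - meta_len, meta_end)


def metadata_with_hint(reader: CountingReader, hint: int) -> bytes:
    """Speculatively read the last `hint` bytes; fall back to a second read if too small."""
    size = len(reader.data)
    hint = max(hint, FOOTER_LEN)  # always fetch at least the footer
    tail = reader.read_range(max(0, size - hint), size)
    (meta_len,) = struct.unpack("<I", tail[-8:-4])
    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet-style footer")
    needed = meta_len + FOOTER_LEN
    if needed <= len(tail):
        # Metadata was already covered by the speculative read.
        return tail[len(tail) - needed : len(tail) - FOOTER_LEN]
    # Hint was too small: one more request, same cost as the two-read path.
    return reader.read_range(size - needed, size - FOOTER_LEN)
```

With a large enough hint the metadata arrives in a single request, and with a too-small hint the cost falls back to the same two requests as before, which is why the change should rarely be worse on high-latency storage.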
