Re: [I] Improve performance of `datafusion-cli` when reading from remote storage [datafusion]

via GitHub Tue, 12 Aug 2025 13:26:02 -0700


alamb commented on issue #16365:
URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3180916288

> At the end of the day I'm going to be working on some way to get listing
results cached, and I'd much rather make those changes here to contribute back
to open source than keep it in our proprietary code. I'm happy to help out to
move this forward wherever I can.

@BlakeOrth

I think we should make a new issue. I think we can take the same approach
for listing results as we took for parquet metadata caching (basically follow
the path that @nuno-faria blazed):
- https://github.com/apache/datafusion/issues/17000

Basically
1. Provide a default implementation for the (already existing)
[ListFilesCache](https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/struct.CacheManager.html#method.get_list_files_cache)
2. Implement some reasonable default value for refresh along with a config
setting to change that default
3. Implement some way to see the contents of the cache
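
To make the three points above concrete, here is a rough, std-only sketch of what such a cache could look like. It is not the existing `ListFilesCache` API and not a proposed design: `TtlListingCache`, `ListedFile`, `DEFAULT_TTL` (60 seconds is an arbitrary placeholder), and `contents` are all hypothetical names, and the caller passes in `now` only so the expiry logic is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the per-file metadata a listing returns.
#[derive(Debug, Clone, PartialEq)]
pub struct ListedFile {
    pub path: String,
    pub size: u64,
}

/// One cached listing plus the time it was stored.
struct Entry {
    files: Vec<ListedFile>,
    inserted_at: Instant,
}

/// Hypothetical listing cache keyed by table prefix, with a refresh interval.
pub struct TtlListingCache {
    ttl: Duration,
    entries: HashMap<String, Entry>,
}

impl TtlListingCache {
    /// Assumed default refresh interval; a config setting would override it.
    pub const DEFAULT_TTL: Duration = Duration::from_secs(60);

    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Store (or replace) the listing for `prefix`, stamped with `now`.
    pub fn put(&mut self, prefix: &str, files: Vec<ListedFile>, now: Instant) {
        self.entries
            .insert(prefix.to_string(), Entry { files, inserted_at: now });
    }

    /// Return the cached listing only if it is younger than the TTL.
    /// A `None` means the caller should list the object store again.
    pub fn get(&self, prefix: &str, now: Instant) -> Option<&[ListedFile]> {
        let entry = self.entries.get(prefix)?;
        if now.duration_since(entry.inserted_at) < self.ttl {
            Some(entry.files.as_slice())
        } else {
            None
        }
    }

    /// Introspection: every stored prefix and its file count, sorted by
    /// prefix. Expired entries are included because nothing evicts them here.
    pub fn contents(&self) -> Vec<(String, usize)> {
        let mut out: Vec<(String, usize)> = self
            .entries
            .iter()
            .map(|(prefix, entry)| (prefix.clone(), entry.files.len()))
            .collect();
        out.sort();
        out
    }
}

impl Default for TtlListingCache {
    fn default() -> Self {
        Self::new(Self::DEFAULT_TTL)
    }
}
```

A real implementation would also need thread safety, a size limit, and eviction, all of which this sketch leaves out.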

If you are willing to potentially help with this work, I can spec it out in
a ticket / epic.

> In my mind the work to normalize performance between flat and hive
partitioned datasets is separate, but related, to any work that would actually
cache the listing results from either of those workflows. Should discussions on
approach happen here or in separate issue(s) more aligned with the work
directly?

Since they all use the ListingTable implementation, I think the code will be
the same.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

