Yuvraj-cyborg opened a new pull request, #19298:
URL: https://github.com/apache/datafusion/pull/19298
## Which issue does this PR close?
Closes #19273
## Rationale for this change
```DefaultListFilesCache``` currently uses the exact listing path as the
cache key. When partition pruning narrows queries to specific partition
prefixes (e.g., s3://bucket/table/year=2024/), the cache lookup fails even if a
full table listing (s3://bucket/table/) was previously cached. This leads to
redundant object store calls and duplicate cache entries for different
partition filters on the same table.
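
To make the failure mode concrete, here is a minimal sketch of an exact-path cache. It is not the actual ```DefaultListFilesCache``` code: the type ```ExactKeyCache``` and its methods are hypothetical, and plain strings stand in for object store paths.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a listing cache keyed by the exact listing path.
pub struct ExactKeyCache {
    entries: HashMap<String, Vec<String>>,
}

impl ExactKeyCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Store the files returned by listing `listing_path`.
    pub fn put(&mut self, listing_path: &str, files: Vec<String>) {
        self.entries.insert(listing_path.to_string(), files);
    }

    /// Only an identical path hits. A narrower partition prefix misses,
    /// even though the cached full-table listing already contains its files.
    pub fn get(&self, listing_path: &str) -> Option<&Vec<String>> {
        self.entries.get(listing_path)
    }
}
```

With a listing cached under `table/`, a lookup for `table/year=2024/` returns `None` in this sketch, so the caller would have to list the object store again.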
## What changes are included in this PR?
- Updated ```ListFilesCache``` trait to use ```Extra = Option<Path>```
  (partition prefix) instead of ```ObjectMeta```
- Added ```get_with_prefix(table_base, prefix, now)``` method to ```DefaultListFilesCache``` that:
  - Uses the table base path as a stable cache key
  - Optionally filters cached results by a partition prefix
  - Handles TTL expiration checks
- Updated ```list_with_cache``` in ```url.rs``` to:
  - Always use the table base path as the cache key
  - Compute the relative prefix between the listing URL and table base
  - Always cache full table listings to ensure complete data is available
    for subsequent partition queries
- Added ```compute_relative_prefix``` helper function
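
The lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: ```PrefixAwareCache```, ```Entry```, and ```put``` are hypothetical, plain strings replace ```Path``` and ```ObjectMeta```, and the signatures of ```get_with_prefix``` and ```compute_relative_prefix``` are guesses based only on the names in the list.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cached listing: file paths relative to the table base,
/// plus the time the listing was stored.
struct Entry {
    files: Vec<String>,
    inserted_at: Instant,
}

/// Hypothetical cache keyed by table base path rather than listing path.
pub struct PrefixAwareCache {
    ttl: Option<Duration>,
    entries: HashMap<String, Entry>,
}

impl PrefixAwareCache {
    pub fn new(ttl: Option<Duration>) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Store the full table listing under the table base path.
    pub fn put(&mut self, table_base: &str, files: Vec<String>, now: Instant) {
        self.entries
            .insert(table_base.to_string(), Entry { files, inserted_at: now });
    }

    /// Look up by table base, reject expired entries, then optionally
    /// keep only the files under `prefix`.
    pub fn get_with_prefix(
        &self,
        table_base: &str,
        prefix: Option<&str>,
        now: Instant,
    ) -> Option<Vec<String>> {
        let entry = self.entries.get(table_base)?;
        if let Some(ttl) = self.ttl {
            if now.saturating_duration_since(entry.inserted_at) >= ttl {
                return None; // treat an expired entry as a miss
            }
        }
        let files = match prefix {
            None => entry.files.clone(),
            Some(p) => entry
                .files
                .iter()
                .filter(|f| f.starts_with(p))
                .cloned()
                .collect(),
        };
        Some(files)
    }
}

/// Return the part of `listing` below `table_base` with a trailing '/',
/// or `None` when the two are equal or `listing` is not under `table_base`.
pub fn compute_relative_prefix(table_base: &str, listing: &str) -> Option<String> {
    let base = table_base.trim_end_matches('/');
    let rest = listing.strip_prefix(base)?;
    // Guard against "table2" being treated as a child of "table".
    if !rest.is_empty() && !rest.starts_with('/') {
        return None;
    }
    let rest = rest.trim_matches('/');
    if rest.is_empty() {
        None
    } else {
        Some(format!("{rest}/"))
    }
}
```

In this sketch the trailing `/` on the computed prefix keeps `year=2024/` from matching a sibling such as `year=20245/`. A prefix that matches no files yields an empty list rather than a miss, because the cached full listing is assumed to be complete.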
## Are these changes tested?
Yes. Six dedicated unit tests validate prefix-aware cache behavior:
- ```test_prefix_aware_cache_hit```: filters cached results by prefix
- ```test_prefix_aware_cache_no_filter_returns_all```: returns all files when no
  prefix is specified
- ```test_prefix_aware_cache_miss_no_entry```: handles cache misses correctly
- ```test_prefix_aware_cache_no_matching_files```: returns empty when no files
  match the prefix
- ```test_prefix_aware_nested_partitions```: handles nested partition paths (e.g.,
  year=2024/month=01/)
- ```test_prefix_aware_different_tables```: ensures different tables have isolated
  cache entries
## Are there any user-facing changes?
No direct API changes. Users will see improved cache efficiency when
querying partitioned tables - partition-pruned queries can now be served from
cached full-table listings, reducing object store calls.