[PR] Feat: DefaultListFilesCache prefix-aware for partition pruning optimization [datafusion]

via GitHub Fri, 12 Dec 2025 07:52:36 -0800


Yuvraj-cyborg opened a new pull request, #19298:
URL: https://github.com/apache/datafusion/pull/19298


   ## Which issue does this PR close?
   Closes #19273 
   
   ## Rationale for this change
   ```DefaultListFilesCache``` currently uses the exact listing path as the 
cache key. When partition pruning narrows queries to specific partition 
prefixes (e.g., s3://bucket/table/year=2024/), the cache lookup fails even if a 
full table listing (s3://bucket/table/) was previously cached. This leads to 
redundant object store calls and duplicate cache entries for different 
partition filters on the same table.
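
The miss described above can be illustrated with a minimal sketch. It assumes a cache that is just a map keyed by the exact listing path; `ExactKeyCache` and its methods are hypothetical stand-ins, not the actual `DefaultListFilesCache` API, and plain strings replace the real path and metadata types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a cache keyed by the exact listing path.
/// File entries are plain strings here, not real object metadata.
pub struct ExactKeyCache {
    entries: HashMap<String, Vec<String>>,
}

impl ExactKeyCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Stores a listing under the exact path that was listed.
    pub fn put(&mut self, listing_path: &str, files: Vec<String>) {
        self.entries.insert(listing_path.to_string(), files);
    }

    /// Hits only when `listing_path` equals a stored key exactly, so a
    /// partition prefix never matches a cached full-table listing.
    pub fn get(&self, listing_path: &str) -> Option<&Vec<String>> {
        self.entries.get(listing_path)
    }
}
```

With a full listing cached under `table/`, a lookup for `table/year=2024/` returns `None` even though every file it needs is already in the cached entry.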
   
   ## What changes are included in this PR?
   - Updated ```ListFilesCache``` trait to use ```Extra = Option<Path>``` 
(partition prefix) instead of ```ObjectMeta```
   - Added ```get_with_prefix(table_base, prefix, now)``` method to 
```DefaultListFilesCache``` that:
     - Uses the table base path as a stable cache key
     - Optionally filters cached results by a partition prefix
     - Handles TTL expiration checks
   - Updated ```list_with_cache``` in ```url.rs``` to:
     - Always use the table base path as the cache key
     - Compute the relative prefix between the listing URL and table base
     - Always cache full table listings to ensure complete data is available 
for subsequent partition queries
   - Added ```compute_relative_prefix``` helper function
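
The lookup and helper described in this list can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `PrefixAwareCache`, `Entry`, and `put` are hypothetical names, paths are plain strings rather than `Path`/`ObjectMeta`, and the signatures of `get_with_prefix` and `compute_relative_prefix` are guesses based only on the description above.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache entry: the full table listing plus an optional expiry.
struct Entry {
    files: Vec<String>,
    expires_at: Option<Instant>,
}

/// Hypothetical prefix-aware cache keyed by the table base path.
pub struct PrefixAwareCache {
    ttl: Option<Duration>,
    entries: HashMap<String, Entry>,
}

impl PrefixAwareCache {
    pub fn new(ttl: Option<Duration>) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Always stores the full table listing under the table base path.
    pub fn put(&mut self, table_base: &str, files: Vec<String>, now: Instant) {
        let expires_at = self.ttl.map(|ttl| now + ttl);
        self.entries
            .insert(table_base.to_string(), Entry { files, expires_at });
    }

    /// Looks up by table base, checks the TTL, then optionally filters the
    /// cached files by a partition prefix relative to the table base.
    pub fn get_with_prefix(
        &self,
        table_base: &str,
        prefix: Option<&str>,
        now: Instant,
    ) -> Option<Vec<String>> {
        let entry = self.entries.get(table_base)?;
        // An expired entry is treated as a miss.
        if entry.expires_at.is_some_and(|t| now >= t) {
            return None;
        }
        let Some(prefix) = prefix else {
            return Some(entry.files.clone());
        };
        // Match on whole path segments: "<base>/<prefix>/".
        let full_prefix = format!(
            "{}/{}/",
            table_base.trim_end_matches('/'),
            prefix.trim_matches('/')
        );
        Some(
            entry
                .files
                .iter()
                .filter(|f| f.starts_with(&full_prefix))
                .cloned()
                .collect(),
        )
    }
}

/// Returns the part of `listing` below `table_base`, or `None` when the
/// two are equal or `listing` is not under `table_base`.
pub fn compute_relative_prefix(table_base: &str, listing: &str) -> Option<String> {
    let base = table_base.trim_end_matches('/');
    let rest = listing.strip_prefix(base)?;
    // Reject sibling paths such as "table2" for base "table".
    if !rest.is_empty() && !rest.starts_with('/') {
        return None;
    }
    let rest = rest.trim_matches('/');
    if rest.is_empty() {
        None
    } else {
        Some(rest.to_string())
    }
}
```

In this sketch, a prefix that matches no files still counts as a hit and returns an empty list, while an unknown table base or an expired entry returns `None`. That mirrors the distinction drawn by the `test_prefix_aware_cache_no_matching_files` and `test_prefix_aware_cache_miss_no_entry` test names.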
     
   ## Are these changes tested?
   Yes. Six dedicated unit tests validate prefix-aware cache behavior:
   
   - ```test_prefix_aware_cache_hit```: filters cached results by prefix
   - ```test_prefix_aware_cache_no_filter_returns_all```: returns all files when no prefix is specified
   - ```test_prefix_aware_cache_miss_no_entry```: handles cache misses correctly
   - ```test_prefix_aware_cache_no_matching_files```: returns empty when no files match the prefix
   - ```test_prefix_aware_nested_partitions```: handles nested partition paths (e.g., year=2024/month=01/)
   - ```test_prefix_aware_different_tables```: ensures different tables have isolated cache entries
   
   ## Are there any user-facing changes?
   No direct API changes. Users will see improved cache efficiency when 
querying partitioned tables - partition-pruned queries can now be served from 
cached full-table listings, reducing object store calls.


