[I] new feature: Efficient Client-Side Glob Implementation via Guided Traversal [opendal]

via GitHub Wed, 27 Aug 2025 22:49:23 -0700


chitralverma opened a new issue, #6535:
URL: https://github.com/apache/opendal/issues/6535


   ### Feature Description
   
   The glob_support RFC 
([apache/opendal#6209](https://github.com/apache/opendal/pull/6209)) proposes a 
client-side fallback for services that don't support native globbing. The 
default implementation for this would be to list all entries recursively and 
then filter them, which can be inefficient for large directories. This issue 
proposes a more advanced client-side implementation that avoids these 
inefficiencies.
   
   **Solution Description**
   I propose implementing the client-side glob functionality using a "Guided 
Traversal" algorithm. This approach is significantly more efficient than a 
simple "list-then-filter" strategy, especially for remote object stores where 
API calls are expensive.
   
   The core idea is to parse the glob pattern into segments and use these 
segments to intelligently guide the directory traversal step-by-step, pruning 
the search space at each level.
   
   
   
   ### Problem and Solution
   
   **How it Works**
   For a pattern like `**/*.jpg` starting from a base_path `s3://bucket/media/`:
   
   1. **Parse Pattern:** The pattern is broken down into segments: `['media', 
'**', '*.jpg']`.
   2. **Stateful Traversal:** A queue manages the search state, holding 
`(path_to_search, remaining_segments)` tuples.
   3. **Guided API Calls:** Instead of one large recursive list, the algorithm 
performs a series of targeted, single-level `list` calls.
       - It first lists the `base_path` to find a directory named `media`.
       - Once inside `media/`, it handles the `**` by recursively exploring 
subdirectories, looking for entries that could satisfy the next segment 
(`*.jpg`).
       - At each level of the `**` traversal, it attempts to match `*.jpg`, 
effectively pruning branches that don't contain matching files.
   
   This turns the glob operation from a brute-force listing into an 
intelligent, stateful search that minimizes API calls and data transfer.
   
   **Complexity Analysis**
     - **Time Complexity:** `O(N * L)`, where `N` is the number of relevant 
entries actually visited and L is the average path length. N is kept to a 
minimum through aggressive pruning.
   
     - **Space Complexity:** `O(D)`, where `D` is the maximum depth of the 
search. Memory usage is low as it processes entries as a stream and does not 
hold the full list in memory.
   
   
   
   ### Additional Context
   
   I have a proof-of-concept implementation in Rust that demonstrates this 
logic, producing a lazy Stream of entries, which aligns perfectly with the 
proposed `lister_with().glob()` API mentioned in the RFC. This can serve as a 
strong foundation for a PR.
   
   ### Are you willing to contribute to the development of this feature?
   
   - [x] Yes, I am willing to contribute to the development of this feature.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] new feature: Efficient Client-Side Glob Implementation via Guided Traversal [opendal]

Reply via email to