chitralverma opened a new issue, #6535: URL: https://github.com/apache/opendal/issues/6535
### Feature Description The glob_support RFC ([apache/opendal#6209](https://github.com/apache/opendal/pull/6209)) proposes a client-side fallback for services that don't support native globbing. The default implementation for this would be to list all entries recursively and then filter them, which can be inefficient for large directories. This issue proposes a more advanced client-side implementation that avoids these inefficiencies. **Solution Description** I propose implementing the client-side glob functionality using a "Guided Traversal" algorithm. This approach is significantly more efficient than a simple "list-then-filter" strategy, especially for remote object stores where API calls are expensive. The core idea is to parse the glob pattern into segments and use these segments to intelligently guide the directory traversal step-by-step, pruning the search space at each level. ### Problem and Solution **How it Works** For a pattern like `**/*.jpg` starting from a base_path `s3://bucket/media/`: 1. **Parse Pattern:** The pattern is broken down into segments: `['media', '**', '*.jpg']`. 2. **Stateful Traversal:** A queue manages the search state, holding `(path_to_search, remaining_segments)` tuples. 3. **Guided API Calls:** Instead of one large recursive list, the algorithm performs a series of targeted, single-level `list` calls. - It first lists the `base_path` to find a directory named `media`. - Once inside `media/`, it handles the `**` by recursively exploring subdirectories, looking for entries that could satisfy the next segment (`*.jpg`). - At each level of the `**` traversal, it attempts to match `*.jpg`, effectively pruning branches that don't contain matching files. This turns the glob operation from a brute-force listing into an intelligent, stateful search that minimizes API calls and data transfer. **Complexity Analysis** - **Time Complexity:** `O(N * L)`, where `N` is the number of relevant entries actually visited and L is the average path length. N is kept to a minimum through aggressive pruning. - **Space Complexity:** `O(D)`, where `D` is the maximum depth of the search. Memory usage is low as it processes entries as a stream and does not hold the full list in memory. ### Additional Context I have a proof-of-concept implementation in Rust that demonstrates this logic, producing a lazy Stream of entries, which aligns perfectly with the proposed `lister_with().glob()` API mentioned in the RFC. This can serve as a strong foundation for a PR. ### Are you willing to contribute to the development of this feature? - [x] Yes, I am willing to contribute to the development of this feature. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
