tustvold opened a new issue, #7171:
URL: https://github.com/apache/arrow-rs/issues/7171

   **Problem**
   
   Initially the ObjectStore API was relatively simple, consisting of a few 
methods to interact with object stores. As such, many systems took this 
abstraction and used it as a generic IO abstraction. This is good, and is what 
the crate was designed for.
   
   As people wanted additional functionality, such as instrumentation, caching 
or concurrency limiting, this was implemented by creating ObjectStore 
implementations that wrap the existing ones. Again this worked well.
   
   However, over time the ObjectStore API has grown, and now has 8 required 
methods and a further 10 methods with default implementations. This creates a 
number of challenges for this wrapper based approach for composition.
   
   ***API Surface***
   
   As a wrapper must avoid "despecializing" methods (falling back to a trait 
default where the wrapped store provides an optimized override), it must 
implement all 18 methods. Not only is this burdensome, it also creates upgrade 
hazards as new methods are added, potentially in non-breaking versions.
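
To illustrate the hazard, here is a minimal sketch using a hypothetical, simplified `Store` trait (not the real `ObjectStore` trait, and synchronous for brevity). `Backend`, `Wrapper` and the `round_trips` counter are all made up for the example. The wrapper forwards only the required method, so the backend's batched `get_ranges` is silently replaced by the trait default:

```rust
use std::cell::Cell;
use std::ops::Range;

/// Hypothetical stand-in for a store trait: one required method plus one
/// method with a default implementation.
trait Store {
    fn get_range(&self, range: Range<usize>) -> Vec<u8>;

    /// Default: one `get_range` call per requested range.
    fn get_ranges(&self, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        ranges.iter().map(|r| self.get_range(r.clone())).collect()
    }
}

/// A backend that specializes `get_ranges` to serve every range in a
/// single (simulated) round trip.
struct Backend {
    data: Vec<u8>,
    round_trips: Cell<usize>,
}

impl Store for Backend {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.round_trips.set(self.round_trips.get() + 1);
        self.data[range].to_vec()
    }

    fn get_ranges(&self, ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
        self.round_trips.set(self.round_trips.get() + 1);
        ranges.iter().map(|r| self.data[r.clone()].to_vec()).collect()
    }
}

/// A wrapper that forwards only the required method.
struct Wrapper<S>(S);

impl<S: Store> Store for Wrapper<S> {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.0.get_range(range)
    }
    // `get_ranges` is not forwarded, so the trait default runs and the
    // backend's batched override is never called.
}

/// Returns the round trips observed by the backend when called directly
/// and when called through the wrapper.
fn round_trips_direct_vs_wrapped() -> (usize, usize) {
    let ranges: [Range<usize>; 3] = [0..2, 2..4, 4..6];

    let direct = Backend { data: vec![0; 8], round_trips: Cell::new(0) };
    let _ = direct.get_ranges(&ranges);

    let wrapped = Wrapper(Backend { data: vec![0; 8], round_trips: Cell::new(0) });
    let _ = wrapped.get_ranges(&ranges);

    (direct.round_trips.get(), wrapped.0.round_trips.get())
}
```

In this sketch the direct call costs one round trip and the wrapped call costs three. The same thing happens to any wrapper compiled against an older trait when a new defaulted method is added later.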
   
   ***Additional Context***
   
   As the logic within these wrappers has grown more complex, there comes the 
need to pass additional information through to this logic. This motivates 
requests like #7155
   
   ***Interface Creep***
   
   In many places the ObjectStore interface gets used as the abstraction for 
components that don't actually require the full breadth of ObjectStore 
functionality. There is no need, for example, for a parquet reader to depend on 
more than the ability to fetch ranges of bytes. 
   
   This leads to perverse "ObjectStore" implementations that actually only 
implement, say, get functionality. Similarly, in contexts like 
https://github.com/apache/datafusion/pull/14286 it creates complexities around 
how to shim the full ObjectStore interface, despite the actual operators in 
question only using a very small subset of this functionality.
   
   ***Request Correlation***
   
   As the ObjectStore logic has gotten more sophisticated, incorporating 
automatic retries, request batching, etc., the relationship between an 
ObjectStore method call and the requests it issues has gotten rather fuzzy. 
This makes implementing instrumentation, concurrency limiting, tokio task 
dispatch, etc. at this API boundary increasingly inaccurate/problematic.
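
A minimal sketch of the mismatch, using entirely hypothetical types (`FlakyTransport`, `RetryingStore`, `CountingWrapper`) rather than anything from the crate: a store that retries internally issues several requests for one method call, so a wrapper that counts calls reports a different number from a counter placed in the request path.

```rust
use std::cell::Cell;

/// Hypothetical transport that fails a fixed number of times before
/// succeeding, and counts every request it is asked to send.
struct FlakyTransport {
    failures_left: Cell<u32>,
    requests_sent: Cell<u32>,
}

impl FlakyTransport {
    fn send(&self) -> Result<&'static str, &'static str> {
        self.requests_sent.set(self.requests_sent.get() + 1);
        if self.failures_left.get() > 0 {
            self.failures_left.set(self.failures_left.get() - 1);
            Err("transient error")
        } else {
            Ok("payload")
        }
    }
}

/// Hypothetical store whose `get` retries internally.
struct RetryingStore {
    transport: FlakyTransport,
    max_retries: u32,
}

impl RetryingStore {
    fn get(&self) -> Result<&'static str, &'static str> {
        let mut last = self.transport.send();
        for _ in 0..self.max_retries {
            if last.is_ok() {
                break;
            }
            last = self.transport.send();
        }
        last
    }
}

/// Wrapper-level instrumentation: it can only count method calls.
struct CountingWrapper {
    inner: RetryingStore,
    calls: Cell<u32>,
}

impl CountingWrapper {
    fn get(&self) -> Result<&'static str, &'static str> {
        self.calls.set(self.calls.get() + 1);
        self.inner.get()
    }
}

/// Returns (calls seen by the wrapper, requests seen by the transport).
fn calls_vs_requests() -> (u32, u32) {
    let store = CountingWrapper {
        inner: RetryingStore {
            transport: FlakyTransport {
                failures_left: Cell::new(2),
                requests_sent: Cell::new(0),
            },
            max_retries: 3,
        },
        calls: Cell::new(0),
    };
    let result = store.get();
    assert!(result.is_ok());
    (store.calls.get(), store.inner.transport.requests_sent.get())
}
```

Here the wrapper records one call while the transport sends three requests. A concurrency limit or latency metric applied at the wrapper would therefore be measuring something other than actual request traffic.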
   
   **Thoughts**
   
   I personally think we should encourage a move away from this wrapper based 
form of composition and instead do the following:
   
   * Encourage use of specialized traits like parquet's 
[AsyncFileReader](https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html)
 that reflect what a given component actually needs, and can evolve 
independently of ObjectStore
   * Add additional functionality for injecting logic into the HTTP request 
path (#6056) allowing
       * More accurate instrumentation
       * More accurate concurrency limiting
       * Potentially more sophisticated tokio runtime dispatch
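
As a rough sketch of the first bullet, a component can depend on a narrow trait that states only what it needs. The `RangeReader` trait, `read_tail` function and `InMemory` type below are hypothetical and synchronous for brevity; they do not reproduce the signature of parquet's `AsyncFileReader`, which is async.

```rust
use std::ops::Range;

/// Hypothetical narrow trait: the only capabilities this reader needs.
trait RangeReader {
    /// Total size of the underlying object in bytes.
    fn len(&self) -> usize;
    /// Fetch the bytes in `range`.
    fn read_range(&self, range: Range<usize>) -> Vec<u8>;
}

/// Reads the trailing `n` bytes (for example a file footer), clamped to
/// the object size. It depends on `RangeReader` only, not on a full
/// store interface.
fn read_tail<R: RangeReader>(reader: &R, n: usize) -> Vec<u8> {
    let len = reader.len();
    let start = len.saturating_sub(n);
    reader.read_range(start..len)
}

/// Trivial in-memory implementation, e.g. for tests.
struct InMemory(Vec<u8>);

impl RangeReader for InMemory {
    fn len(&self) -> usize {
        self.0.len()
    }

    fn read_range(&self, range: Range<usize>) -> Vec<u8> {
        self.0[range].to_vec()
    }
}
```

An implementation backed by an object store, a local file or a cache would each need to supply just these two methods, and the trait could evolve without tracking changes to ObjectStore.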
   
   I can't help feeling that right now ObjectStore is stuck between trying to 
expose the functionality of object stores in a portable and ergonomic fashion, 
whilst also trying to provide some sort of generic all-purpose IO subsystem 
abstraction, and I suspect these are incompatible goals.
   
   Tagging @alamb @crepererum @Xuanwo @waynr @kylebarron 

