> Currently, parquet.rs only supports local disk files. Potentially, this can
> be done using the rusoto crate that provides a s3 client. What would be a
> good way to do this?
> 1. create a remote parquet reader (potentially duplicate lots of code)
> 2. create an interface to abstract away reading from local/remote files (not
> sure about performance if the reader blocks on every operation)
This is a great question. I think approach (2) is superior, although designing an interface that works well across multiple file stores with different performance characteristics takes more work than approach (1). To accommodate storage-specific performance optimizations, I expect the common interface will have to be more elaborate than the current reader API.

Is it possible for the Rust reader to use the C++ implementation (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)? If reusing that implementation is feasible, we could focus our efforts on improving the C++ implementation and get the benefits in Python, Rust, etc.

In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses the Hadoop FileSystem abstraction. This abstraction is complex, leaky, and not well specialized for the read patterns that are typical for Parquet files. We can learn from these mistakes to create a superior reader interface in the Arrow/Parquet project.

Steve
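
P.S. To make approach (2) a bit more concrete, here is a rough sketch of what a storage-agnostic interface could look like. None of this is existing parquet-rs API: the names RangeRead, LocalFile, InMemoryObject, and footer_metadata_len are all made up for illustration, and InMemoryObject only stands in for a remote store such as S3. The idea is that the reader asks for explicit byte ranges instead of seeking, which maps naturally onto ranged GET requests and matches how a Parquet file is read (the file ends with a 4-byte little-endian footer length followed by the magic bytes "PAR1").

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;

/// Hypothetical storage-agnostic interface: positional range reads, no cursor.
pub trait RangeRead {
    /// Total size of the file or object in bytes.
    fn size(&self) -> io::Result<u64>;
    /// Read exactly `length` bytes starting at byte `offset`.
    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// Local-disk implementation (illustrative; opens the file on every call).
pub struct LocalFile {
    pub path: PathBuf,
}

impl RangeRead for LocalFile {
    fn size(&self) -> io::Result<u64> {
        Ok(std::fs::metadata(&self.path)?.len())
    }

    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>> {
        let mut file = File::open(&self.path)?;
        file.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; length];
        file.read_exact(&mut buf)?;
        Ok(buf)
    }
}

/// In-memory stand-in for a remote object; a real S3 implementation would
/// issue one ranged GET per `read_range` call instead.
pub struct InMemoryObject {
    pub bytes: Vec<u8>,
}

impl RangeRead for InMemoryObject {
    fn size(&self) -> io::Result<u64> {
        Ok(self.bytes.len() as u64)
    }

    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>> {
        let total = self.bytes.len();
        if offset > total as u64 || length > total - offset as usize {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "requested range extends past end of object",
            ));
        }
        let start = offset as usize;
        Ok(self.bytes[start..start + length].to_vec())
    }
}

/// Read the footer metadata length from the 8-byte Parquet trailer with a
/// single range request, independent of where the bytes are stored.
pub fn footer_metadata_len<R: RangeRead>(source: &R) -> io::Result<u32> {
    let size = source.size()?;
    if size < 8 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "too small to contain a Parquet trailer",
        ));
    }
    let tail = source.read_range(size - 8, 8)?;
    if &tail[4..] != b"PAR1" {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing PAR1 magic bytes",
        ));
    }
    Ok(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}
```

A real interface would probably need more than this (for example, a way to request several ranges at once so a remote store can coalesce or parallelize them, and likely an async variant so the reader does not block on every operation), which is what I meant by the common interface being more elaborate than the current reader API.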