Hello all,

We've been following the Arrow Datasets project with great interest, especially as we have an in-house library with similar goals built on top of PyArrow. Recently, we noticed some discussion around optimizing I/O for such use cases (e.g. PARQUET-1698), which is also where we had focused our efforts.
Our long-term goal has been to open-source our library. However, while our code is in Python, it would be most useful to everyone in the C++ core, so that R, Python, Ruby, etc. could all benefit. Thus, we'd like to share our high-level design and offer to work with the community on the implementation - at the very least, to avoid duplicating work. We've summarized our approach in the document below, and hope this can start a discussion on how to integrate such optimizations into Datasets:

https://docs.google.com/document/d/1tZsT3dC7UXbLTkqxgVeFGWm9piXScUDujsa0ncvK_Fs/edit#

At a high level, we have three main optimizations:

- Given a set of columns to read, and potentially a filter on a partition key, we can use Parquet metadata to compute the exact byte ranges to read from remote storage, and coalesce or split up reads as necessary based on the characteristics of the storage platform.
- Given byte ranges to read, we can fetch them in parallel, using a global thread pool and concurrency manager to limit parallelism and resource consumption.
- By working at the level of a dataset, we can parallelize these operations across files, and pipeline steps like reading Parquet metadata with reading and deserialization.

We focus on Parquet and S3/object storage here, but these concepts apply to other file formats and storage systems as well.

The main questions are whether these optimizations would be useful for Arrow Datasets, and if so, how the API design and implementation should proceed. I'd appreciate any feedback on the approach and the potential API.

David
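P.S. For concreteness, the coalescing step from the first optimization could be sketched roughly as below. The function, parameter names, and thresholds here are illustrative assumptions for discussion, not our library's actual API; the real values would come from the characteristics of the storage platform (e.g. per-request latency and throughput on S3).

```python
def coalesce_ranges(ranges, max_gap=4096, max_request=8 * 1024 * 1024):
    """Merge (offset, length) byte ranges whose gap is <= max_gap bytes.

    Nearby column-chunk reads are combined into a single larger request,
    amortizing per-request latency on object stores; merging stops once a
    combined request would exceed max_request, to keep reads parallelizable.
    Thresholds are hypothetical placeholders, tunable per storage backend.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_off, prev_len = merged[-1]
            gap = offset - (prev_off + prev_len)  # negative if overlapping
            end = max(prev_off + prev_len, offset + length)
            if gap <= max_gap and end - prev_off <= max_request:
                merged[-1] = (prev_off, end - prev_off)
                continue
        merged.append((offset, length))
    return merged

# e.g. two column chunks 50 bytes apart merge into one request, while a
# chunk 9 KB away stays a separate read:
# coalesce_ranges([(0, 100), (150, 100), (10000, 100)], max_gap=100)
# -> [(0, 250), (10000, 100)]
```

The ranges themselves would be computed from the Parquet footer (row-group and column-chunk offsets/sizes for the projected columns), and each merged range then becomes one unit of work for the parallel reader.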