Hello all,

We've been following the Arrow Datasets project with great interest,
especially as we have an in-house library with similar goals built on
top of PyArrow. Recently, we noticed some discussion around optimizing
I/O for such use cases (e.g. PARQUET-1698), which is also where we have
focused our own efforts.

Our long-term goal has been to open-source our library. However, our
code is in Python, while it would be most useful to everyone in the C++
core, where R, Python, Ruby, etc. could all benefit from it. Thus, we'd
like to share our high-level design and offer to work with the
community on the implementation - at the very least, to avoid
duplicating work.
We've summarized our approach, and hope this can start a discussion on
how to integrate such optimizations into Datasets:
https://docs.google.com/document/d/1tZsT3dC7UXbLTkqxgVeFGWm9piXScUDujsa0ncvK_Fs/edit#

At a high level, we have three main optimizations:
- Given a set of columns to read, and potentially a filter on a
partition key, we can use Parquet metadata to compute the exact byte
ranges to read from remote storage, and coalesce/split up reads as
necessary based on the characteristics of the storage platform (see
the first sketch below).
- Given those byte ranges, we can read them in parallel, using a
global thread pool and concurrency manager to limit parallelism and
resource consumption (see the second sketch below).
- By working at the level of a dataset, we can parallelize these
operations across files, and pipeline steps such as reading Parquet
metadata with reading and deserializing the data.
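
To make the first point a bit more concrete, here is a rough sketch of
the idea using PyArrow's Parquet metadata APIs. The helper names, the
gap threshold, and the file name are illustrative placeholders (and it
only handles top-level column names) - it is not our actual API; the
design doc has the real details:

import pyarrow.parquet as pq

def column_chunk_ranges(metadata, columns):
    """Yield (offset, length) for each column chunk of the requested columns."""
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            if chunk.path_in_schema not in columns:
                continue
            # A column chunk starts at its dictionary page, if it has one.
            start = (chunk.dictionary_page_offset
                     if chunk.has_dictionary_page else chunk.data_page_offset)
            yield (start, chunk.total_compressed_size)

def coalesce(ranges, max_gap=4 * 1024 * 1024):
    """Merge ranges whose gap is below max_gap (a per-backend knob)."""
    merged = []
    for start, length in sorted(ranges):
        if merged:
            prev_start, prev_length = merged[-1]
            prev_end = prev_start + prev_length
            if start - prev_end <= max_gap:
                new_end = max(prev_end, start + length)
                merged[-1] = (prev_start, new_end - prev_start)
                continue
        merged.append((start, length))
    return merged

# e.g. read only columns "a" and "b" from one file of the dataset
metadata = pq.ParquetFile("part-0.parquet").metadata
ranges = coalesce(column_chunk_ranges(metadata, {"a", "b"}))

The right gap threshold depends heavily on the backend - for object
stores, per-request latency usually dominates, so merging nearby
ranges into fewer, larger requests tends to win.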

We focus on Parquet and S3/object storage here, but these concepts
apply to other file formats and storage systems.
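
In the same spirit, here is a rough sketch of the second and third
points: two shared, bounded thread pools stand in for the global pool
and concurrency manager, and processing files concurrently pipelines
footer/metadata reads with data reads. Again, the pool sizes,
filesystem, paths, and helper names are placeholders, and
column_chunk_ranges/coalesce are the helpers from the previous sketch:

from concurrent.futures import ThreadPoolExecutor
from pyarrow import fs as pafs
import pyarrow.parquet as pq

# Placeholder filesystem and paths; any pyarrow FileSystem would do.
fs = pafs.S3FileSystem(region="us-east-1")
paths = ["bucket/dataset/part-0.parquet", "bucket/dataset/part-1.parquet"]

# Shared, bounded pools: one for per-file work (footer reads, planning),
# one for the actual byte-range reads. Bounding them is a crude stand-in
# for the concurrency manager described in the doc.
FILE_POOL = ThreadPoolExecutor(max_workers=8)
IO_POOL = ThreadPoolExecutor(max_workers=32)

def read_range(path, offset, length):
    # One remote read per coalesced range.
    with fs.open_input_file(path) as f:
        f.seek(offset)
        return f.read(length)

def read_file(path, columns):
    # Stage 1: fetch the footer and plan the coalesced ranges.
    with fs.open_input_file(path) as f:
        metadata = pq.ParquetFile(f).metadata
    ranges = coalesce(column_chunk_ranges(metadata, columns))
    # Stage 2: fan the ranges out across the shared I/O pool.
    futures = [IO_POOL.submit(read_range, path, off, length)
               for off, length in ranges]
    return [fut.result() for fut in futures]

# Files are processed concurrently, so footer reads for later files are
# naturally pipelined with data reads for earlier ones.
buffers = [fut.result()
           for fut in [FILE_POOL.submit(read_file, p, {"a", "b"})
                       for p in paths]]

Splitting file-level work and range reads into separate pools is just
one simple way to keep a task from blocking on subtasks in the same
pool; the actual concurrency manager would be more sophisticated than a
fixed max_workers cap.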

The main questions here are whether these optimizations would be
useful for Arrow Datasets, and if so, how the API design and
implementation should proceed. I'd appreciate any feedback on the
approach and on a potential API.

David
