Hello,

Thanks for the write-up.

Have you considered sharing this document with the Apache Iceberg community?

My feeling is that there are some shared goals here between the two
projects.
And while their implementation is in Java, their spec is language agnostic.

Regards, Joel


On Sun, Feb 24, 2019 at 6:56 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi folks,
>
> We've spent a good amount of energy up until now implementing
> interfaces for reading different kinds of file formats in C++, like
> Parquet, ORC, CSV, and JSON. There's some higher level layers missing,
> though, which are necessary if we want to make use of these file
> formats in the context of an in-memory query engine. This includes:
>
> * Scanning multiple files as a single logical dataset
> * Schema normalization and evolution
> * Handling partitioned datasets, and datasets consisting of
> heterogeneous storage (a mix of file formats)
> * Predicate pushdown: taking row filtering and column selection into
> account while reading a file
>
> We have implemented some parts of this already in limited form for
> Python users in the pyarrow.parquet module. This is problematic since
> a) it is implemented in Python and cannot be used by Ruby or R, for
> example, and b) it is specific to a single file format.
>
> Since this is a large topic, I tried to write up a summary of what I
> believe to be the important problems that need to be solved:
>
>
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
>
> This project will also allow for "user-defined" data sources, so that
> other people in the Arrow ecosystem can contribute new data interfaces
> to interact with different kinds of storage systems using a common
> API, so if they want to "plug in" to any computation layers available
> in Apache Arrow then there is a reasonably straightforward path to do
> that.
>
> Your comments and ideas on this project would be appreciated.
>
> Thank you,
> Wes
>
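
For illustration, the multi-file scanning and predicate-pushdown ideas in Wes's list can be sketched in plain Python. All of the class and method names here (`Fragment`, `Dataset`, `scan`) are hypothetical placeholders, not the actual pyarrow or proposed C++ API:

```python
class Fragment:
    """One 'file' of the logical dataset, stored as a list of row dicts."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self, columns=None, row_filter=None):
        # Predicate pushdown: apply the row filter and column projection
        # while reading the fragment, before rows reach the caller.
        for row in self.rows:
            if row_filter is not None and not row_filter(row):
                continue
            yield {k: row[k] for k in (columns or row)}


class Dataset:
    """Multiple fragments scanned as a single logical dataset."""
    def __init__(self, fragments):
        self.fragments = fragments

    def scan(self, columns=None, row_filter=None):
        for fragment in self.fragments:
            yield from fragment.scan(columns, row_filter)


# Two "files" with the same schema, scanned as one dataset.
ds = Dataset([
    Fragment([{"year": 2018, "n": 1}, {"year": 2019, "n": 2}]),
    Fragment([{"year": 2019, "n": 3}]),
])
rows = list(ds.scan(columns=["n"], row_filter=lambda r: r["year"] == 2019))
# rows == [{"n": 2}, {"n": 3}]
```

A real implementation would of course push the filter and projection into the file-format readers themselves (e.g. Parquet row-group statistics) rather than filtering materialized rows, but the shape of the API is the same.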