Hello, this should definitely be shared with the Apache Iceberg community (cc'ed). The title of the document may be a bit confusing. What is proposed in there is actually constructing the building blocks in C++ that are required for supporting Python/C++/.. implementations for things like Iceberg.
While there are things proposed in the document that may overlap a bit with Iceberg, Iceberg's main goal is to define a table format, whereas the things in the document should support the underlying I/O capabilities of the table format but don't specify a table format.

Cheers

Uwe

On Mon, Feb 25, 2019, at 10:20 AM, Joel Pfaff wrote:
> Hello,
>
> Thanks for the write-up.
>
> Have you considered sharing this document with the Apache Iceberg community?
>
> My feeling is that there are some shared goals here between the two
> projects.
> And while their implementation is in Java, their spec is language agnostic.
>
> Regards, Joel
>
>
> On Sun, Feb 24, 2019 at 6:56 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > We've spent a good amount of energy up until now implementing
> > interfaces for reading different kinds of file formats in C++, like
> > Parquet, ORC, CSV, and JSON. There are some higher-level layers missing,
> > though, which are necessary if we want to make use of these file
> > formats in the context of an in-memory query engine. This includes:
> >
> > * Scanning multiple files as a single logical dataset
> > * Schema normalization and evolution
> > * Handling partitioned datasets, and datasets consisting of
> >   heterogeneous storage (a mix of file formats)
> > * Predicate pushdown: taking row filtering and column selection into
> >   account while reading a file
> >
> > We have implemented some parts of this already in limited form for
> > Python users in the pyarrow.parquet module. This is problematic since
> > a) it is implemented in Python and cannot be used by Ruby or R, for
> > example, and b) it is specific to a single file format.
> >
> > Since this is a large topic, I tried to write up a summary of what I
> > believe to be the important problems that need to be solved:
> >
> > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
> >
> > This project will also allow for "user-defined" data sources, so that
> > other people in the Arrow ecosystem can contribute new data interfaces
> > to interact with different kinds of storage systems using a common
> > API, so if they want to "plug in" to any computation layers available
> > in Apache Arrow then there is a reasonably straightforward path to do
> > that.
> >
> > Your comments and ideas on this project would be appreciated.
> >
> > Thank you,
> > Wes
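[Editor's note: the building blocks Wes lists above (one logical dataset over many files, schema normalization across files, and column/filter pushdown into each reader) can be illustrated with a small stdlib-only Python toy. All names here — `Dataset`, `Fragment`, `scan` — are hypothetical stand-ins for illustration, not the actual pyarrow or proposed C++ API.]

```python
# Toy sketch of the proposed dataset abstraction: several "files"
# (Fragments) exposed as one logical Dataset, with column projection
# and row filtering pushed down into each per-file reader.
# These class/method names are illustrative only.

class Fragment:
    """One unit of data, standing in for a single CSV/Parquet/ORC file."""
    def __init__(self, rows):
        self.rows = rows  # list of dicts in place of real file contents

    def scan(self, columns, predicate):
        # "Pushdown": apply the row filter and column selection while
        # reading, instead of materializing the whole file first.
        for row in self.rows:
            if predicate(row):
                yield {c: row.get(c) for c in columns}

class Dataset:
    """Multiple heterogeneous fragments scanned as one logical dataset."""
    def __init__(self, fragments):
        self.fragments = fragments

    def scan(self, columns, predicate=lambda row: True):
        for fragment in self.fragments:
            yield from fragment.scan(columns, predicate)

# Two "files"; the second lacks the "comment" column, so a naive form of
# schema normalization surfaces missing columns as None via row.get().
ds = Dataset([
    Fragment([{"id": 1, "value": 10, "comment": "a"},
              {"id": 2, "value": 20, "comment": "b"}]),
    Fragment([{"id": 3, "value": 30}]),
])

result = list(ds.scan(columns=["id", "value"],
                      predicate=lambda r: r["value"] > 15))
print(result)  # [{'id': 2, 'value': 20}, {'id': 3, 'value': 30}]
```

In the real design the per-format readers (Parquet, ORC, CSV, JSON) would sit behind the `Fragment` interface, which is what makes a mix of file formats and user-defined data sources scannable through one common API.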