hi Joel and Uwe,

yes, feedback from the Iceberg community about what kinds of APIs are
required to interact well with table formats like Iceberg would be
useful. As Uwe says, the objective of the C++ code I am
proposing to develop is to have appropriate C++ APIs for interacting
with different kinds of stored datasets, where Iceberg is one way to
store and track the contents of a dataset. Defining a new "table
format" or some kind of "metastore" is not in scope -- this is now
listed in the "non-goals" section of the document.

I think that having an Arrow C++ (and therefore Python, R, Ruby)
interface for Iceberg will be valuable for Iceberg adoption.
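
To make the discussion a bit more concrete, here is a very rough sketch
of one possible shape such a C++ API could take. The names and
signatures below are purely illustrative placeholders -- they are not
part of any existing Arrow API and not a final design:

// Purely hypothetical sketch -- none of these names exist in Arrow today.
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>  // arrow::Schema, arrow::RecordBatch, arrow::Status

// What a scan should read: a column subset plus a filter expression
// that formats like Parquet could push down while reading.
struct ScanOptions {
  std::vector<std::string> columns;
  std::string filter_expression;  // e.g. "year = 2018 AND value > 0"
};

// One unit of scannable data, e.g. a single Parquet/ORC/CSV/JSON file.
class DataFragment {
 public:
  virtual ~DataFragment() = default;
  virtual arrow::Status Scan(const ScanOptions& options,
                             std::shared_ptr<arrow::RecordBatch>* out) = 0;
};

// A logical dataset: possibly many files, possibly a mix of formats,
// exposed under one normalized schema. An Iceberg-backed source would
// implement this by reading Iceberg's metadata to enumerate fragments.
class DataSource {
 public:
  virtual ~DataSource() = default;
  virtual std::shared_ptr<arrow::Schema> schema() const = 0;
  virtual arrow::Status GetFragments(
      const ScanOptions& options,
      std::vector<std::shared_ptr<DataFragment>>* out) = 0;
};

The point is that an Iceberg table would be one implementation of such
a source, alongside plain directories of Parquet or CSV files or other
storage systems, and the computation layers above would not need to
care which one they are reading from.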

- Wes

On Mon, Feb 25, 2019 at 3:50 AM Uwe L. Korn <m...@uwekorn.com> wrote:
>
> Hello,
>
> this should definitely be shared with the Apache Iceberg community (cc'ed). 
> The title of the document may be a bit confusing. What is proposed in there 
> is actually constructing the building blocks in C++ that are required for 
> supporting Python/C++/.. implementations for things like Iceberg.
>
> While there are things proposed in the document that may overlap a bit with
> Iceberg, Iceberg's main goal is to define a table format, whereas the things in
> the document should support the underlying I/O capabilities of a table
> format but do not specify a table format themselves.
>
> Cheers
>
> Uwe
>
> On Mon, Feb 25, 2019, at 10:20 AM, Joel Pfaff wrote:
> > Hello,
> >
> > Thanks for the write-up.
> >
> > Have you considered sharing this document with the Apache Iceberg community?
> >
> > My feeling is that there are some shared goals here between the two
> > projects. And while their implementation is in Java, their spec is
> > language-agnostic.
> >
> > Regards, Joel
> >
> >
> > On Sun, Feb 24, 2019 at 6:56 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi folks,
> > >
> > > We've spent a good amount of energy up until now implementing
> > > interfaces for reading different kinds of file formats in C++, like
> > > Parquet, ORC, CSV, and JSON. There are some higher-level layers missing,
> > > though, which are necessary if we want to make use of these file
> > > formats in the context of an in-memory query engine. These include:
> > >
> > > * Scanning multiple files as a single logical dataset
> > > * Schema normalization and evolution
> > > * Handling partitioned datasets, and datasets consisting of
> > > heterogeneous storage (a mix of file formats)
> > > * Predicate pushdown: taking row filtering and column selection into
> > > account while reading a file
> > >
> > > We have implemented some parts of this already in limited form for
> > > Python users in the pyarrow.parquet module. This is problematic since
> > > a) it is implemented in Python and so cannot be used by Ruby or R, for
> > > example, and b) it is specific to a single file format.
> > >
> > > Since this is a large topic, I tried to write up a summary of what I
> > > believe to be the important problems that need to be solved:
> > >
> > >
> > > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
> > >
> > > This project will also allow for "user-defined" data sources, so that
> > > others in the Arrow ecosystem can contribute new data interfaces for
> > > interacting with different kinds of storage systems through a common
> > > API. If they want to "plug in" to any computation layers available in
> > > Apache Arrow, there will then be a reasonably straightforward path to
> > > do that.
> > >
> > > Your comments and ideas on this project would be appreciated.
> > >
> > > Thank you,
> > > Wes
> > >
> >
