Hello, this should definitely be shared with the Apache Iceberg community (cc'ed). The title of the document may be a bit confusing. What is proposed in there is actually constructing the building blocks in C++ that are required for supporting Python/C++/.. implementations for things like Iceberg.
While there are things proposed in the document that may overlap a bit with Iceberg, Iceberg's main goal is to define a table format, whereas the things in the document should support the underlying I/O capabilities of the table format but don't specify a table format.

Cheers

Uwe

On Mon, Feb 25, 2019, at 10:20 AM, Joel Pfaff wrote:
> Hello,
>
> Thanks for the write-up.
>
> Have you considered sharing this document with the Apache Iceberg community?
>
> My feeling is that there are some shared goals here between the two
> projects.
> And while their implementation is in Java, their spec is language agnostic.
>
> Regards, Joel
>
>
> On Sun, Feb 24, 2019 at 6:56 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > We've spent a good amount of energy up until now implementing
> > interfaces for reading different kinds of file formats in C++, like
> > Parquet, ORC, CSV, and JSON. There are some higher-level layers missing,
> > though, which are necessary if we want to make use of these file
> > formats in the context of an in-memory query engine. This includes:
> >
> > * Scanning multiple files as a single logical dataset
> > * Schema normalization and evolution
> > * Handling partitioned datasets, and datasets consisting of
> >   heterogeneous storage (a mix of file formats)
> > * Predicate pushdown: taking row filtering and column selection into
> >   account while reading a file
> >
> > We have implemented some parts of this already in limited form for
> > Python users in the pyarrow.parquet module. This is problematic since
> > a) it is implemented in Python and cannot be used by Ruby or R, for
> > example, and b) it is specific to a single file format.
> >
> > Since this is a large topic, I tried to write up a summary of what I
> > believe to be the important problems that need to be solved:
> >
> > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing
> >
> > This project will also allow for "user-defined" data sources, so that
> > other people in the Arrow ecosystem can contribute new data interfaces
> > to interact with different kinds of storage systems using a common
> > API, so if they want to "plug in" to any computation layers available
> > in Apache Arrow then there is a reasonably straightforward path to do
> > that.
> >
> > Your comments and ideas on this project would be appreciated.
> >
> > Thank you,
> > Wes
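[Editor's note: the building blocks Wes lists above (one logical dataset over many files, schema normalization across files, and column/filter pushdown into each reader) can be illustrated with a small stdlib-only Python toy. All names here — `Dataset`, `Fragment`, `scan` — are hypothetical stand-ins for illustration, not the actual pyarrow or proposed C++ API.]

```python
# Toy sketch of the proposed dataset abstraction: several "files"
# (Fragments) exposed as one logical Dataset, with column projection
# and row filtering pushed down into each per-file reader.
# These class/method names are illustrative only.

class Fragment:
    """One unit of data, standing in for a single CSV/Parquet/ORC file."""
    def __init__(self, rows):
        self.rows = rows  # list of dicts in place of real file contents

    def scan(self, columns, predicate):
        # "Pushdown": apply the row filter and column selection while
        # reading, instead of materializing the whole file first.
        for row in self.rows:
            if predicate(row):
                yield {c: row.get(c) for c in columns}

class Dataset:
    """Multiple heterogeneous fragments scanned as one logical dataset."""
    def __init__(self, fragments):
        self.fragments = fragments

    def scan(self, columns, predicate=lambda row: True):
        for fragment in self.fragments:
            yield from fragment.scan(columns, predicate)

# Two "files"; the second lacks the "comment" column, so a naive form of
# schema normalization surfaces missing columns as None via row.get().
ds = Dataset([
    Fragment([{"id": 1, "value": 10, "comment": "a"},
              {"id": 2, "value": 20, "comment": "b"}]),
    Fragment([{"id": 3, "value": 30}]),
])

result = list(ds.scan(columns=["id", "value"],
                      predicate=lambda r: r["value"] > 15))
print(result)  # [{'id': 2, 'value': 20}, {'id': 3, 'value': 30}]
```

In the real design the per-format readers (Parquet, ORC, CSV, JSON) would sit behind the `Fragment` interface, which is what makes a mix of file formats and user-defined data sources scannable through one common API.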