Hi Wes,

At what level do you imagine the "opt-in" happening? Right now it
seems like it would be fairly straightforward at build time. However,
when we start packaging pyarrow for distribution, how do you imagine
it will work? (If [1] already answers this, please let me know; I've
been meaning to take a look at it.)
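
If it ends up being purely a build-time switch, I could imagine
something shaped like the sketch below on the C++ side (every macro
and function name here is invented for illustration; none of it is
existing pyarrow code):

// Hypothetical build-time opt-in: the HDFS code path only exists when
// the package was compiled with something like -DPYARROW_HAS_HDFS, so
// a default wheel/conda package simply omits it.
#include <memory>
#include <stdexcept>
#include <string>

struct RandomAccessSource {};  // stand-in for a shared reader interface

std::shared_ptr<RandomAccessSource> OpenLocalFile(const std::string&) {
  return std::make_shared<RandomAccessSource>();
}

#ifdef PYARROW_HAS_HDFS
std::shared_ptr<RandomAccessSource> OpenHdfsFile(const std::string&) {
  return std::make_shared<RandomAccessSource>();  // would wrap libhdfs3
}
#endif

std::shared_ptr<RandomAccessSource> OpenSource(const std::string& uri) {
  if (uri.rfind("hdfs://", 0) == 0) {
#ifdef PYARROW_HAS_HDFS
    return OpenHdfsFile(uri);
#else
    // Built without the opt-in: fail with a clear message at call time.
    throw std::runtime_error("pyarrow was built without HDFS support");
#endif
  }
  return OpenLocalFile(uri);
}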

I need to grok the Python code base a little bit more to understand
the implications of the scope creep and the pain of taking a more
fine-grained component approach. But in general, my experience has
been that packaging things together while maintaining clear internal
code boundaries for later separation is a good pragmatic approach.

As a side note, hopefully we'll be able to re-use some existing
projects to do the heavy lifting for blob store integration. libhdfs3
[2] is one option, and the AWS SDK for C++ [3] might be worth
investigating as well (both appear to be Apache 2.0 licensed).
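
For example, with libhdfs3 a read can stay entirely at the C/C++
level. A rough sketch using the libhdfs-style C API it implements
(header path varies by install; error handling trimmed for brevity):

// Reads an entire HDFS file into memory through libhdfs3's
// libhdfs-compatible C API. Connection settings come from the
// cluster's default configuration ("default", port 0).
#include <hdfs/hdfs.h>

#include <fcntl.h>

#include <string>
#include <vector>

std::vector<char> ReadHdfsFile(const std::string& path) {
  hdfsFS fs = hdfsConnect("default", 0);
  hdfsFile file = hdfsOpenFile(fs, path.c_str(), O_RDONLY,
                               /*bufferSize=*/0, /*replication=*/0,
                               /*blocksize=*/0);

  std::vector<char> out;
  std::vector<char> buffer(1 << 16);
  tSize nread;
  while ((nread = hdfsRead(fs, file, buffer.data(),
                           static_cast<tSize>(buffer.size()))) > 0) {
    out.insert(out.end(), buffer.data(), buffer.data() + nread);
  }

  hdfsCloseFile(fs, file);
  hdfsDisconnect(fs);
  return out;
}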

Thanks,
-Micah

[1] https://github.com/apache/arrow/pull/79/files
[2] https://github.com/apache/incubator-hawq/tree/master/depends/libhdfs3
[3] https://github.com/aws/aws-sdk-cpp



On Tue, May 31, 2016 at 5:44 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> hi folks,
>
> I wanted to bring up what is likely to become an issue very soon in
> the context of our work to provide an Arrow-based Parquet interface
> for Python Arrow users.
>
> https://github.com/apache/arrow/pull/83
>
> At the moment, parquet-cpp features an API that enables reading a file
> from local disk (using C standard library calls):
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader.h#L111
>
> This is fine for now, however we will quickly need to deal with a few
> additional sources of data:
>
> 1) File-like Python objects (i.e. an object that has `seek`, `tell`,
> and `read` methods)
> 2) Remote blob stores: HDFS and S3
>
> Implementing #1 at present is a routine exercise in using the Python C
> API. #2 is less so -- one of the approaches that has been taken by
> others is to create separate Python file-like wrapper classes for
> remote storage to make them seem file-like. This has multiple
> downsides:
>
> - read/seek/tell calls must cross up into the Python interpreter and
> back down into the C++ layer
> - bytes buffered by read calls get copied into Python bytes objects
> (see PyBytes_FromStringAndSize)
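
For concreteness, that wrapper pattern looks roughly like the sketch
below (made-up names, using only standard CPython calls; the copy out
of the bytes object is exactly the overhead being described):

// Every Read() crosses into the interpreter, allocates a Python bytes
// object, and then copies its contents back out into C++ memory.
#include <Python.h>

#include <stdexcept>
#include <string>

class PyFileWrapper {
 public:
  explicit PyFileWrapper(PyObject* file) : file_(file) { Py_INCREF(file_); }
  ~PyFileWrapper() { Py_DECREF(file_); }

  // Calls file.read(nbytes); seek/tell would be wrapped the same way.
  std::string Read(Py_ssize_t nbytes) {
    PyGILState_STATE gil = PyGILState_Ensure();
    PyObject* result = PyObject_CallMethod(file_, "read", "n", nbytes);
    if (result == nullptr) {
      PyGILState_Release(gil);
      throw std::runtime_error("read() raised an exception");
    }
    char* data = nullptr;
    Py_ssize_t length = 0;
    if (PyBytes_AsStringAndSize(result, &data, &length) == -1) {
      Py_DECREF(result);
      PyGILState_Release(gil);
      throw std::runtime_error("read() did not return bytes");
    }
    std::string out(data, length);  // the extra copy
    Py_DECREF(result);
    PyGILState_Release(gil);
    return out;
  }

 private:
  PyObject* file_;
};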
>
> Outside of the GIL / concurrency issues, there's efficiency loss that
> can be remedied by implementing instead:
>
> - Direct C/C++-level interface (independent of Python interpreter)
> with remote blob stores. These can then buffer bytes directly in the
> form requested by other C++ consumer libraries (like parquet-cpp)
>
> - Implement a Python file-like interface, so that users can still get
> at the bytes in pure Python if they want (for example: some functions,
> like pandas.read_csv, primarily deal with file-like things)
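
Sketched out, the C++-level interface side of that could look
something like the following (names chosen for illustration, not
parquet-cpp's or pyarrow's actual classes):

// A shared random-access interface that local-file, HDFS, and S3
// implementations can all satisfy, so C++ consumers such as
// parquet-cpp never touch the interpreter. A thin Cython layer can
// expose read/seek/tell on top of it for pure-Python callers.
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

class RandomAccessSource {
 public:
  virtual ~RandomAccessSource() = default;
  virtual int64_t Size() const = 0;
  virtual int64_t Tell() const = 0;
  virtual void Seek(int64_t position) = 0;
  // Reads up to nbytes into out; returns the number of bytes read.
  virtual int64_t Read(int64_t nbytes, uint8_t* out) = 0;
};

// Trivial in-memory implementation; an HdfsSource or S3Source would
// implement the same interface on top of libhdfs3 or the AWS SDK.
class BufferSource : public RandomAccessSource {
 public:
  explicit BufferSource(std::string data) : data_(std::move(data)) {}

  int64_t Size() const override { return static_cast<int64_t>(data_.size()); }
  int64_t Tell() const override { return position_; }
  void Seek(int64_t position) override { position_ = position; }

  int64_t Read(int64_t nbytes, uint8_t* out) override {
    int64_t available = Size() - position_;
    int64_t to_read = nbytes < available ? nbytes : available;
    if (to_read > 0) {
      std::memcpy(out, data_.data() + position_,
                  static_cast<size_t>(to_read));
      position_ += to_read;
    }
    return to_read;
  }

 private:
  std::string data_;
  int64_t position_ = 0;
};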
>
> This is a clearly superior solution, and has been notably pursued in
> recent times by Dato's SFrame library (BSD 3-clause):
>
> https://github.com/dato-code/SFrame/tree/master/oss_src/fileio
>
> The problem, however, is the inevitable scope creep for the PyArrow
> Python package. Unlike some other programming languages, Python
> programmers face a substantial development complexity burden if they
> choose to break libraries containing C extensions into smaller
> components, as libraries must define "internal" C APIs for each other
> to connect together. A notable example is NumPy
> (http://docs.scipy.org/doc/numpy-1.10.1/reference/c-api.html), whose C
> API is already being used in PyArrow.
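
As a concrete illustration of that coupling, any extension consuming
NumPy's C API has to include NumPy's headers and initialize its
exported function table before calling anything, along these lines
(module and function names invented for illustration):

// Minimal extension module using NumPy's C API: import_array() must
// run in module init before any PyArray_* call will work.
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include <Python.h>
#include <numpy/arrayobject.h>

// Returns a new, uninitialized one-dimensional int64 array.
static PyObject* make_empty_array(PyObject* self, PyObject* args) {
  Py_ssize_t length = 0;
  if (!PyArg_ParseTuple(args, "n", &length)) {
    return nullptr;
  }
  npy_intp dims[1] = {static_cast<npy_intp>(length)};
  return PyArray_SimpleNew(1, dims, NPY_INT64);
}

static PyMethodDef methods[] = {
    {"make_empty_array", make_empty_array, METH_VARARGS, nullptr},
    {nullptr, nullptr, 0, nullptr}};

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT, "example_ext", nullptr, -1, methods};

PyMODINIT_FUNC PyInit_example_ext(void) {
  import_array();  // loads NumPy's function table; returns NULL on failure
  return PyModule_Create(&moduledef);
}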
>
> I've been thinking about this problem for several weeks, and my net
> recommendation is that we embrace the scope creep in PyArrow (as long
> as we try to make optional features, e.g. low-level S3 / libhdfs
> integration, "opt-in" versus required for all users). I'd like to hear
> from some others, though (e.g. Uwe, Micah, etc.).
>
> thanks,
> Wes
