Re: Python Parquet package

2016-09-21 Thread Wes McKinney
Sure, I'm happy to do that. Do you want me to take care of refactoring
to account for the arrow::io API changes I just made? Then we can go
ahead and remove arrow/parquet from the Arrow project.

On Wed, Sep 21, 2016 at 3:47 PM, Uwe Korn wrote:
> Sounds reasonable to me. I will then continue to implement the missing
> interfaces for Parquet in pyarrow.parquet.
>
> @wesm Can you make sure that we can easily depend on a pinned version of
> parquet-cpp in pyarrow’s Travis builds?
>
> Uwe
>
>> On 21.09.2016 at 20:07, Wes McKinney wrote:
>>
>> I don't agree with this approach right now. Here are my reasons:
>>
>> 1. The Parquet Python integration will need to depend both on PyArrow
>> and the Arrow C++ libraries, so these libraries would generally need
>> to be developed together
>>
>> 2. PyArrow would need to define and maintain a C++ or Cython API so
>> that the equivalent of the current pyarrow.parquet library can access
>> C-level data. For example:
>>
>> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31
>>
>> Cython does permit cross-project C API access (we are already doing
>> cross-module Cython API access within pyarrow). This adds additional
>> complexity that I think we should avoid for now.
>>
>> 3. Maintaining a separate C++ build toolchain for a Python package
>> adds additional maintenance and packaging burden on us
>>
>> My inclination is to keep the code where it is and make the Parquet
>> extension optional.
>>
>> - Wes
>>
>> On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote:
>>> Hello,
>>>
>>> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
>>> still have to decide on how we are going to proceed with the Arrow<->Parquet
>>> Python integration. For the moment, it seems that the best way to go ahead
>>> is to pull the pyarrow.parquet module out into a separate Python package.
>>> From an organisational point of view, I'm unclear how I should proceed here. Should
>>> we put this in a separate repo? If so, as part of the Apache organisation?
>>>
>>> Uwe
>


Re: Python Parquet package

2016-09-21 Thread Uwe Korn
Sounds reasonable to me. I will then continue to implement the missing
interfaces for Parquet in pyarrow.parquet.

@wesm Can you make sure that we can easily depend on a pinned version of
parquet-cpp in pyarrow’s Travis builds?

Uwe

> On 21.09.2016 at 20:07, Wes McKinney wrote:
> 
> I don't agree with this approach right now. Here are my reasons:
> 
> 1. The Parquet Python integration will need to depend both on PyArrow
> and the Arrow C++ libraries, so these libraries would generally need
> to be developed together
> 
> 2. PyArrow would need to define and maintain a C++ or Cython API so
> that the equivalent of the current pyarrow.parquet library can access
> C-level data. For example:
> 
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31
> 
> Cython does permit cross-project C API access (we are already doing
> cross-module Cython API access within pyarrow). This adds additional
> complexity that I think we should avoid for now.
> 
> 3. Maintaining a separate C++ build toolchain for a Python package
> adds additional maintenance and packaging burden on us
> 
> My inclination is to keep the code where it is and make the Parquet
> extension optional.
> 
> - Wes
> 
> On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote:
>> Hello,
>> 
>> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
>> still have to decide on how we are going to proceed with the Arrow<->Parquet
>> Python integration. For the moment, it seems that the best way to go ahead
>> is to pull the pyarrow.parquet module out into a separate Python package.
>> From an organisational point of view, I'm unclear how I should proceed here. Should
>> we put this in a separate repo? If so, as part of the Apache organisation?
>> 
>> Uwe



Re: Python Parquet package

2016-09-21 Thread Wes McKinney
I don't agree with this approach right now. Here are my reasons:

1. The Parquet Python integration will need to depend both on PyArrow
and the Arrow C++ libraries, so these libraries would generally need
to be developed together

2. PyArrow would need to define and maintain a C++ or Cython API so
that the equivalent of the current pyarrow.parquet library can access
C-level data. For example:

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31

Cython does permit cross-project C API access (we are already doing
cross-module Cython API access within pyarrow). This adds additional
complexity that I think we should avoid for now.

3. Maintaining a separate C++ build toolchain for a Python package
adds additional maintenance and packaging burden on us

My inclination is to keep the code where it is and make the Parquet
extension optional.
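
As a rough illustration of the "optional extension" idea on the Python
side (a sketch only, not taken from the pyarrow code base): the helper
name optional_import and the HAVE_PARQUET flag are hypothetical, and the
sketch assumes that pyarrow.parquet simply fails to import when pyarrow
was built without Parquet support.

```python
import importlib


def optional_import(module_name):
    """Return the named module, or None if it cannot be imported.

    Hypothetical helper: lets callers treat a compiled extension
    (for example pyarrow.parquet) as an optional feature.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError as well as a missing shared library
        # that surfaces as ImportError while loading the extension.
        return None


# Assumption: the Parquet bindings live in pyarrow.parquet and are
# absent when the extension was not built.
parquet = optional_import("pyarrow.parquet")
HAVE_PARQUET = parquet is not None
```

Code that needs Parquet could then check HAVE_PARQUET and raise a clear
error, while the rest of the package keeps working without parquet-cpp
installed.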

- Wes

On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote:
> Hello,
>
> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
> still have to decide on how we are going to proceed with the Arrow<->Parquet
> Python integration. For the moment, it seems that the best way to go ahead
> is to pull the pyarrow.parquet module out into a separate Python package.
> From an organisational point of view, I'm unclear how I should proceed here. Should
> we put this in a separate repo? If so, as part of the Apache organisation?
>
> Uwe


Python Parquet package

2016-09-21 Thread Uwe Korn

Hello,

as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, 
we still have to decide on how we are going to proceed with the 
Arrow<->Parquet Python integration. For the moment, it seems that the 
best way to go ahead is to pull the pyarrow.parquet module out into a 
separate Python package. From an organisational point of view, I'm unclear how I
should proceed here. Should we put this in a separate repo? If so, as 
part of the Apache organisation?


Uwe