Re: Python Parquet package
Sure, I'm happy to do that. Do you want me to take care of refactoring to account for the arrow::io API changes I just made? Then we can go ahead and remove arrow/parquet from the Arrow project.

On Wed, Sep 21, 2016 at 3:47 PM, Uwe Korn wrote:
> Sounds reasonable to me. I will then continue to implement the missing
> interfaces for Parquet in pyarrow.parquet.
>
> @wesm Can you make sure that we can easily depend on a pinned version of
> parquet-cpp in pyarrow’s Travis builds?
>
> Uwe
Re: Python Parquet package
Sounds reasonable to me. I will then continue to implement the missing interfaces for Parquet in pyarrow.parquet.

@wesm Can you make sure that we can easily depend on a pinned version of parquet-cpp in pyarrow’s Travis builds?

Uwe

> On 21.09.2016 at 20:07, Wes McKinney wrote:
>
> I don't agree with this approach right now. Here are my reasons:
>
> 1. The Parquet Python integration will need to depend on both PyArrow
> and the Arrow C++ libraries, so these libraries would generally need
> to be developed together.
>
> 2. PyArrow would need to define and maintain a C++ or Cython API so
> that the equivalent of the current pyarrow.parquet library can access
> C-level data. For example:
>
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31
>
> Cython does permit cross-project C API access (we are already doing
> cross-module Cython API access within pyarrow). However, this adds
> complexity that I think we should avoid for now.
>
> 3. Maintaining a separate C++ build toolchain for a Python package
> adds additional maintenance and packaging burden on us.
>
> My inclination is to keep the code where it is and make the Parquet
> extension optional.
>
> - Wes
Re: Python Parquet package
I don't agree with this approach right now. Here are my reasons:

1. The Parquet Python integration will need to depend on both PyArrow and the Arrow C++ libraries, so these libraries would generally need to be developed together.

2. PyArrow would need to define and maintain a C++ or Cython API so that the equivalent of the current pyarrow.parquet library can access C-level data. For example:

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.pyx#L31

Cython does permit cross-project C API access (we are already doing cross-module Cython API access within pyarrow). However, this adds complexity that I think we should avoid for now.

3. Maintaining a separate C++ build toolchain for a Python package adds additional maintenance and packaging burden on us.

My inclination is to keep the code where it is and make the Parquet extension optional.

- Wes

On Wed, Sep 21, 2016 at 10:16 AM, Uwe Korn wrote:
> Hello,
>
> as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we
> still have to decide how we are going to proceed with the Arrow<->Parquet
> Python integration. For the moment, it seems that the best way to go ahead
> is to pull the pyarrow.parquet module out into a separate Python package.
> From an organisational point of view, I'm unclear how I should proceed here.
> Should we put this in a separate repo? If so, as part of the Apache
> organisation?
>
> Uwe
Python Parquet package
Hello,

as we have moved the Arrow<->Parquet C++ integration into parquet-cpp, we still have to decide how we are going to proceed with the Arrow<->Parquet Python integration. For the moment, it seems that the best way to go ahead is to pull the pyarrow.parquet module out into a separate Python package. From an organisational point of view, I'm unclear how I should proceed here. Should we put this in a separate repo? If so, as part of the Apache organisation?

Uwe