On 11/20/2014 08:59 AM, Peter Prettenhofer wrote:
Hi all,
I'd like to integrate Parquet with pandas, a popular Python library for
in-memory data analysis.
My plan is to build an efficient connector based on the parquet-cpp project
-- is that the recommended way to do this?
Somebody told me that Impala's parquet reader is much more performant but
also tightly integrated into Impala and hard to extract (I haven't checked
the licensing and whether that is allowed at all). Is this still correct?
The Impala codebase is licensed with the Apache Software License, version 2:
https://github.com/cloudera/Impala/blob/master/LICENSE.txt
So there are no licensing concerns about using the Impala code, though you
might want to work with the Impala community to make it externally usable
if it isn't already. I'm really not sure how parquet-cpp and Impala are
related; perhaps Nong can comment.
As far as the README goes: parquet-cpp only supports reading Parquet
files, not writing them. Do you plan to support write access via this C++
API in the future?
Patches are welcome. :)
Furthermore, predicate pushdown would be very important for my
use case -- does parquet-cpp support that? I haven't seen anything in
the codebase yet.
My current plan is to start from this example [1] and write a thin wrapper
in Cython to expose some of the column reader functionality.
Any thoughts/remarks/concerns are highly appreciated.
Cython is a good choice, and I think you're correct to be careful about
the implementation you start with, either parquet-cpp or one from Impala.
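For illustration, a rough sketch of what such a thin Cython wrapper over a
column reader could look like. The header path, class name, and method
signatures below (Int64Reader, HasNext(), ReadValue()) are placeholders,
not the actual parquet-cpp API, so treat this only as the wrapping pattern:

    # reader_wrapper.pyx
    # distutils: language = c++
    #
    # Illustrative sketch only: "parquet/column_reader.h", Int64Reader,
    # HasNext() and ReadValue() are hypothetical names standing in for
    # whatever parquet-cpp actually exposes.

    cdef extern from "parquet/column_reader.h" namespace "parquet":
        cdef cppclass Int64Reader:
            bint HasNext()
            long ReadValue()

    cdef class PyInt64Reader:
        """Thin Python-facing handle around a C++ column reader."""
        cdef Int64Reader* reader    # obtained/owned elsewhere in this sketch

        def read_all(self):
            # Drain the column into a Python list; a real connector
            # would fill a NumPy array / pandas column instead.
            values = []
            while self.reader.HasNext():
                values.append(self.reader.ReadValue())
            return values

Building that against parquet-cpp would then just be a small setup.py with
the right include and library paths.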
Thanks,
Peter
[1]
https://github.com/apache/incubator-parquet-cpp/blob/master/example/compute_stats.cc
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.