On 11/20/2014 08:59 AM, Peter Prettenhofer wrote:
Hi all,
I'd like to integrate Parquet with pandas, a popular Python library for
in-memory data analysis.
My plan is to build an efficient connector based on the parquet-cpp project
-- is that the recommended way to do this?
Somebody told me that Impala's parquet reader is much more performant but
also tightly integrated into Impala and hard to extract (I haven't checked
the licensing and whether that is allowed at all). Is this still correct?
The Impala codebase is licensed with the Apache Software License, version 2:
https://github.com/cloudera/Impala/blob/master/LICENSE.txt
So there are no licensing concerns about using the Impala code, though you
might want to work with the Impala community to make it externally usable
if it isn't already. I'm really not sure how parquet-cpp and Impala are
related; perhaps Nong can comment.
As far as the README goes: parquet-cpp only supports reading Parquet
files, not writing them. Do you plan to support write access via this C++
API in the future?
Patches are welcome. :)
Furthermore, predicate pushdown would be very important for my
use case -- does parquet-cpp support that? I haven't seen anything in
the codebase yet.
My current plan is to start from this example [1] and write a thin wrapper
in Cython to expose some of the column reader functionality.
Any thoughts/remarks/concerns are highly appreciated.
Cython is a good choice, and I think you're correct to be careful about
the implementation you start with, either parquet-cpp or one from Impala.
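For illustration, a rough sketch of what such a thin Cython wrapper over a
column reader could look like. The header path, class name, and method
signatures below (Int64Reader, HasNext(), ReadValue()) are placeholders,
not the actual parquet-cpp API, so treat this only as the wrapping pattern:

    # reader_wrapper.pyx
    # distutils: language = c++
    #
    # Illustrative sketch only: "parquet/column_reader.h", Int64Reader,
    # HasNext() and ReadValue() are hypothetical names standing in for
    # whatever parquet-cpp actually exposes.

    cdef extern from "parquet/column_reader.h" namespace "parquet":
        cdef cppclass Int64Reader:
            bint HasNext()
            long ReadValue()

    cdef class PyInt64Reader:
        """Thin Python-facing handle around a C++ column reader."""
        cdef Int64Reader* reader    # obtained/owned elsewhere in this sketch

        def read_all(self):
            # Drain the column into a Python list; a real connector
            # would fill a NumPy array / pandas column instead.
            values = []
            while self.reader.HasNext():
                values.append(self.reader.ReadValue())
            return values

Building that against parquet-cpp would then just be a small setup.py with
the right include and library paths.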
Thanks,
Peter
[1]
https://github.com/apache/incubator-parquet-cpp/blob/master/example/compute_stats.cc
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.