Hi, Tom,

This does not address the question directly, but for what it's worth, I had
the same issue and so released a Python binding for DataFusion
<https://pypi.org/project/datafusion/>. It lets you, for example, create a
pyarrow RecordBatch by reading from S3 (via pyarrow) and use it as a source
for a DataFusion plan via the SQL or DataFrame API. Because it uses the C
data interface, there is virtually no cost in moving data between
DataFusion and pyarrow. It supports UDFs and UDAFs that operate on native
pyarrow arrays, so there is also no performance hit when a UDF calls a
pyarrow/C++ kernel. Performance degrades when you need to map the pyarrow
array to some other format (e.g. numpy), typically to pass it to sklearn,
scipy, etc.

You can install it with `pip install datafusion`, but FYI it is *not*
production-ready, and many of the pyarrow types are not supported yet. :)

Best,
Jorge




On Fri, Feb 12, 2021 at 5:41 PM Tom Scheffers <t...@youngbulls.nl.invalid>
wrote:

> Dear devs,
>
> I am really interested in an in-memory query interface to Arrow tables
> (like DataFusion is for Rust), preferably in Python. In my opinion, there
> are three routes: 1. create a wrapper/interface to DataFusion directly, 2.
> copy Arrow to pandas and use an existing framework (like Ibis) and 3.
> build/extend something new based on pyarrow (with small conversions back
> and forth to numpy or pandas).
>
> The Arrow / DataFusion route currently lacks some capabilities, like
> parquet files directly from S3, but also the push down of predicates.
> Therefore, I would rather wait for things to mature. Besides, the C++
> branch of Arrow seems to be more mature and integrates nicely with Python.
>
> The pandas route is probably more convenient, however it will be much less
> efficient. Columnar storage, predicate push downs and statistics
> optimizations are the main reason for using Arrow, which will not be fully
> utilized in this route.
>
> Is there already something like DataFusion on the roadmap for C++ (and thus
> Python)? Or is there an Ibis like engine which acts directly on Pyarrow? I
> would like to help on advancements into this direction, but struggle in
> finding where to start.
>
> Thanks for your help.
>
> Kind regards,
>
> Tom
>
