Sure, it can be found at https://lists.apache.org/thread/o2nc7jnmfpt8lhcnjths1gnzvy86yfxo . Compared to that thread, the design proposed here is more mature, now that I have a reasonable version of the Ibis and Ibis-Substrait parts implemented locally (if it helps this discussion, I could provide some details about this implementation; a rough sketch of the authoring side is included below). I no longer propose registering the data-source function or using arrow::compute::Function for it, since it would be added directly to a source execution node, either manually or via deserialization of a Substrait plan. Also, I now define the data-source function as producing schema-carrying tabular data.
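
To make that concrete, here is a minimal illustrative sketch of the authoring side (not my actual local implementation; the names, the single int64 column, and the choice of returning a RecordBatchReader are just assumptions for the example):

import pyarrow as pa

def make_source(num_rows):
    """Factory function: embeds its arguments in the returned data-source."""
    schema = pa.schema([("x", pa.int64())])

    def source():
        # Zero-argument data-source: returns schema-carrying tabular data,
        # here a record-batch stream (a dataframe would also qualify).
        batch = pa.record_batch(
            [pa.array(range(num_rows), type=pa.int64())], schema=schema)
        return pa.RecordBatchReader.from_batches(schema, [batch])

    # Expose the schema on the function so the consumer (e.g., the input
    # execution node) can inspect it without invoking the source.
    source.schema = schema
    return source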
Yaron.

________________________________
From: Li Jin <ice.xell...@gmail.com>
Sent: Wednesday, June 22, 2022 2:50 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: user-defined Python-based data-sources in Arrow

Yaron,

Do you mind also linking the previous mailing list discussion here?

On Wed, Jun 22, 2022 at 11:40 AM Yaron Gvili <rt...@hotmail.com> wrote:

> Hi,
>
> I'd like to get the community's feedback about a design proposal
> (discussed below) for integrating user-defined Python-based data-sources
> in Arrow. This is part of a larger project I'm working on to provide
> end-to-end (Ibis/Ibis-Substrait/Arrow) support for such data-sources.
>
> A user-defined Python-based data-source is basically a function
> implemented in Python that takes no arguments and returns schema-carrying
> tabular data, e.g., a dataframe or a record-batch stream, and also
> exposes the schema. Normally, such a function would be generated by a
> factory function that does take arguments, in order to embed them (or
> values derived from them) in the returned data-source function. The
> data-source function is intended to be integrated within an input
> execution node of an Acero execution plan.
>
> This suggests distinguishing between a few data-source roles:
>
>   * Author: the person/component implementing the data-source factory
>     function
>   * Producer: the person/component creating a specific data-source
>     function
>   * Consumer: the person/component sourcing data using the specific
>     data-source function
>
> In an end-to-end scenario (whose design details I'm leaving out here),
> authoring would be done using Python, producing using Ibis, serialization
> using Ibis-Substrait, and consuming using PyArrow+Acero.
>
> In Arrow, the integration of a user-defined data-source would involve
> these steps:
>
>   * A data-source function is obtained, either as an argument to a
>     PyArrow API or by deserializing it from a Substrait plan in which it
>     is encoded (I have this encoding of Python functions working locally)
>   * The data-source function is wrapped using Cython (similar to Python
>     scalar UDFs - see https://github.com/apache/arrow/pull/12590) and
>     held by an input execution node implemented in C++
>   * One or more such input execution nodes are created as part of
>     assembling an Acero execution plan
>   * Each input execution node uses the data-source function it holds to
>       * expose, via Acero APIs, the schema of the data-source function
>       * source data and convert it to record-batches that are pushed on
>         to the next node in the plan
>
> Yaron.
>
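
P.S. For the consumer role in the quoted proposal, a specific data-source function would be exercised roughly as follows (again just a sketch, assuming the hypothetical make_source factory above; the loop only stands in for what the C++ input execution node would do when pushing record-batches to the next node):

source = make_source(num_rows=3)   # a Producer creates a specific data-source

print(source.schema)               # the node exposes this schema via Acero APIs

reader = source()                  # the node invokes the zero-argument function
for batch in reader:               # record-batches are pushed to the next node
    print(batch.num_rows)          # stand-in for downstream processing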