Sure, it can be found at https://lists.apache.org/thread/o2nc7jnmfpt8lhcnjths1gnzvy86yfxo . Compared to that thread, the design proposed here is more mature, now that I have a reasonable version of the Ibis and Ibis-Substrait parts implemented locally (if it helps this discussion, I could provide some details about this implementation; a rough sketch of the authoring side is included below). I no longer propose registering the data-source function or using arrow::compute::Function for it, since it would be added directly to a source execution node, either manually or via deserialization of a Substrait plan. Also, I now define the data-source function as producing schema-carrying tabular data.
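
To make that concrete, here is a minimal illustrative sketch of the authoring side (not my actual local implementation; the names, the single int64 column, and the choice of returning a RecordBatchReader are just assumptions for the example):

import pyarrow as pa

def make_source(num_rows):
    """Factory function: embeds its arguments in the returned data-source."""
    schema = pa.schema([("x", pa.int64())])

    def source():
        # Zero-argument data-source: returns schema-carrying tabular data,
        # here a record-batch stream (a dataframe would also qualify).
        batch = pa.record_batch(
            [pa.array(range(num_rows), type=pa.int64())], schema=schema)
        return pa.RecordBatchReader.from_batches(schema, [batch])

    # Expose the schema on the function so the consumer (e.g., the input
    # execution node) can inspect it without invoking the source.
    source.schema = schema
    return source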
Yaron.

________________________________
From: Li Jin <ice.xell...@gmail.com>
Sent: Wednesday, June 22, 2022 2:50 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: user-defined Python-based data-sources in Arrow

Yaron,

Do you mind also linking the previous mailing list discussion here?

On Wed, Jun 22, 2022 at 11:40 AM Yaron Gvili <rt...@hotmail.com> wrote:

> Hi,
>
> I'd like to get the community's feedback about a design proposal
> (discussed below) for integrating user-defined Python-based data-sources
> in Arrow. This is part of a larger project I'm working on to provide
> end-to-end (Ibis/Ibis-Substrait/Arrow) support for such data-sources.
>
> A user-defined Python-based data-source is basically a function
> implemented in Python that takes no arguments and returns schema-carrying
> tabular data, e.g., a dataframe or a record-batch stream, and also
> exposes the schema. Normally, such a function would be generated by a
> factory function that does take arguments, in order to embed them (or
> values derived from them) in the returned data-source function. The
> data-source function is intended to be integrated within an input
> execution node of an Acero execution plan.
>
> This suggests distinguishing between a few data-source roles:
>
>   * Author: the person/component implementing the data-source factory
>     function
>   * Producer: the person/component creating a specific data-source
>     function
>   * Consumer: the person/component sourcing data using the specific
>     data-source function
>
> In an end-to-end scenario (whose design details I'm leaving out here),
> authoring would be done using Python, producing using Ibis, serialization
> using Ibis-Substrait, and consuming using PyArrow+Acero.
>
> In Arrow, the integration of a user-defined data-source would involve
> these steps:
>
>   * A data-source function is obtained, either as an argument to a
>     PyArrow API or by deserializing it from a Substrait plan in which it
>     is encoded (I have this encoding of Python functions working locally)
>   * The data-source function is wrapped using Cython (similar to Python
>     scalar UDFs - see https://github.com/apache/arrow/pull/12590) and
>     held by an input execution node implemented in C++
>   * One or more such input execution nodes are created as part of
>     assembling an Acero execution plan
>   * Each input execution node uses the data-source function it holds to
>       * expose, via Acero APIs, the schema of the data-source function
>       * source data and convert it to record-batches that are pushed on
>         to the next node in the plan
>
> Yaron.
>
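
P.S. For the consumer role in the quoted proposal, a specific data-source function would be exercised roughly as follows (again just a sketch, assuming the hypothetical make_source factory above; the loop only stands in for what the C++ input execution node would do when pushing record-batches to the next node):

source = make_source(num_rows=3)   # a Producer creates a specific data-source

print(source.schema)               # the node exposes this schema via Acero APIs

reader = source()                  # the node invokes the zero-argument function
for batch in reader:               # record-batches are pushed to the next node
    print(batch.num_rows)          # stand-in for downstream processing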