Hi, I'd like to get the community's feedback on a design proposal (discussed below) for integrating user-defined Python-based data-sources into Arrow. This is part of a larger project I'm working on to provide end-to-end (Ibis/Ibis-Substrait/Arrow) support for such data-sources.
A user-defined Python-based data-source is essentially a function implemented in Python that takes no arguments and returns schema-carrying tabular data, e.g., a dataframe or a record-batch stream, and that also exposes its schema. Typically, such a function would be generated by a factory function that does take arguments and embeds them (or values derived from them) in the returned data-source function. The data-source function is intended to be integrated within an input execution node of an Acero execution plan.

This suggests distinguishing between a few data-source roles:

* Author: the person/component implementing the data-source factory function
* Producer: the person/component creating a specific data-source function
* Consumer: the person/component sourcing data using the specific data-source function

In an end-to-end scenario (whose design details I'm leaving out here), authoring would be done using Python, producing using Ibis, serialization using Ibis-Substrait, and consuming using PyArrow+Acero.

In Arrow, the integration of a user-defined data-source would involve these steps:

* A data-source function is obtained, either as an argument to a PyArrow API or by deserializing it from a Substrait plan in which it is encoded (I have this encoding of Python functions working locally)
* The data-source function is wrapped using Cython (similar to Python scalar UDFs - see https://github.com/apache/arrow/pull/12590) and held by an input execution node implemented in C++
* One or more such input execution nodes are created as part of assembling an Acero execution plan
* Each input execution node uses the data-source function it holds to:
  * expose, via Acero APIs, the schema of the data-source function
  * source data, convert it to record batches, and push them on to the next node in the plan

Yaron.