Hi,

I'd like to get the community's feedback on a design proposal (discussed 
below) for integrating user-defined Python-based data sources in Arrow. This is 
part of a larger project I'm working on to provide end-to-end 
(Ibis/Ibis-Substrait/Arrow) support for such data sources.

A user-defined Python-based data source is essentially a function implemented 
in Python that takes no arguments and returns schema-carrying tabular data, 
e.g., a dataframe or a record-batch stream, while also exposing the schema. 
Normally, such a function would be generated by a factory function that does 
take arguments and embeds them (or values derived from them) in the returned 
data-source function. The data-source function is intended to be integrated 
within an input execution node of an Acero execution plan.
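
To make this concrete, here is a minimal sketch of a factory function and the 
data-source function it generates, using only existing PyArrow APIs (the names 
make_range_source/data_source are illustrative, not a proposed API):

    import pyarrow as pa

    def make_range_source(num_rows, batch_size):
        # Factory: embeds its arguments in the returned data-source function.
        schema = pa.schema([("x", pa.int64())])

        def data_source():
            # Zero-argument data source: returns a record-batch stream.
            def batches():
                for start in range(0, num_rows, batch_size):
                    n = min(batch_size, num_rows - start)
                    yield pa.record_batch(
                        [pa.array(range(start, start + n), type=pa.int64())],
                        schema=schema,
                    )
            return pa.RecordBatchReader.from_batches(schema, batches())

        data_source.schema = schema  # expose the schema alongside the function
        return data_source

A producer would call, e.g., make_range_source(1000, 256) and hand the 
resulting zero-argument function to a consumer.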

This suggests distinguishing among three data-source roles:

  *   Author: the person/component implementing the data-source factory function
  *   Producer: the person/component creating a specific data-source function
  *   Consumer: the person/component sourcing data using the specific 
data-source function

In an end-to-end scenario (whose design details I'm leaving out here), 
authoring would be done with Python, producing with Ibis, serializing with 
Ibis-Substrait, and consuming with PyArrow+Acero.

In Arrow, integrating a user-defined data source would involve these 
steps:

  *   A data-source function is obtained, either as an argument to a PyArrow 
API or by deserializing it from a Substrait plan in which it is encoded (I 
have this encoding of Python functions working locally; see the serialization 
sketch after this list)
  *   The data-source function is wrapped using Cython (similar to Python 
scalar UDFs - see https://github.com/apache/arrow/pull/12590) and held by an 
input execution node implemented in C++
  *   One or more such input execution nodes are created as part of assembling 
an Acero execution plan
  *   Each input execution node uses the data-source function it holds to
     *   expose the schema of the data-source function via Acero APIs
     *   source data and convert it to record batches that are pushed to the 
next node in the plan (see the conversion sketch after this list)
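
Regarding the first step, the exact Substrait encoding is beyond the scope of 
this email, but roughly, the idea is that a Python function can be serialized 
to bytes that a plan can carry. A minimal sketch, assuming cloudpickle as the 
encoding (one possible choice, used here purely for illustration):

    import cloudpickle

    def roundtrip(data_source):
        # Producer side: encode the data-source function as bytes that could
        # be embedded in a Substrait plan.
        payload = cloudpickle.dumps(data_source)
        # Consumer side: decode the bytes back into a callable
        # data-source function.
        return cloudpickle.loads(payload)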
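
Regarding the last step, the per-node behavior (expose the schema, then stream 
record batches downstream) can be mimicked in pure Python. A minimal sketch, 
assuming the data-source function returns either a pandas dataframe or a 
RecordBatchReader (drain is an illustrative name, not a proposed API):

    import pyarrow as pa

    def drain(data_source):
        result = data_source()
        # Normalize the result to a record-batch stream with a known schema.
        if isinstance(result, pa.RecordBatchReader):
            reader = result
        else:  # e.g., a pandas dataframe
            table = pa.Table.from_pandas(result)
            reader = pa.RecordBatchReader.from_batches(
                table.schema, table.to_batches())
        print(reader.schema)  # what the node would expose via Acero APIs
        for batch in reader:  # what the node would push to the next node
            print(batch.num_rows)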

Yaron.
