I agree with Weston about dynamically loading a shared object whose
initialization code registers the node factories. For custom node factories,
I think this loading would best be done from a separate Python module,
distinct from "_exec_plan.pyx", that the user imports to trigger the
registration (once). This avoids merging custom code into "_exec_plan.pyx"
and having to maintain it there.

You would likely want to write files for your module that are analogous to
"python/pyarrow/includes/libarrow_dataset.pxd", "python/pyarrow/_dataset.pxd",
and "python/pyarrow/dataset.py". To build your module within PyArrow's build,
you would need to modify "python/setup.py" and "python/CMakeLists.txt";
alternatively, you could roll your own versions of these files and build your
Python module separately. Either way, this is where you would add a build
flag for pulling in the C++ header files for your Python module, under
"python/pyarrow/include", and for building it.
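
To make the "separate module that triggers the registration once" idea
concrete, here is a rough sketch in plain Python using ctypes rather than
Cython. Everything named here is hypothetical: the library name
"libmy_nodes.so", the exported symbol "my_nodes_initialize" (assumed to be an
extern "C" function taking no arguments that registers your factories), and
the module and function names. It is only meant to show the load-and-call-once
pattern, not how PyArrow itself does it.

```python
# my_nodes.py -- hypothetical module; not part of PyArrow.
import ctypes
import threading

_lock = threading.Lock()
_registered = False


def register_custom_nodes(lib_path="libmy_nodes.so",
                          symbol="my_nodes_initialize",
                          loader=ctypes.CDLL):
    """Load the shared object and call its init function at most once.

    lib_path and symbol are assumptions: the shared object is expected to
    export an extern "C" function with no arguments that registers the
    custom ExecNode factories. Returns True if registration ran on this
    call, False if it had already run.
    """
    global _registered
    with _lock:
        if _registered:
            return False
        lib = loader(lib_path)       # loads the shared object into the process
        init = getattr(lib, symbol)  # look up the exported init function
        init()                       # run the registration code
        _registered = True
        return True


# In a real module you would probably call this at import time, so that
# "import my_nodes" is all the user needs to do:
# register_custom_nodes()
```

With the call at the bottom enabled, the user would just "import my_nodes"
before building a plan that uses the custom nodes. It is probably safest to
import pyarrow first, so that your shared object resolves against the Arrow
libraries that PyArrow has already loaded, but I have not verified this.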


Yaron.
________________________________
From: Li Jin <ice.xell...@gmail.com>
Sent: Wednesday, September 21, 2022 3:51 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: Register custom ExecNode factories

Thanks Weston - I have not written a Python/C++ bridge before, so this is
also new to me, and I am hoping to get some information from people who know
how to do this.

I will leave this open for other people to offer help :) and will ask some
internal folks as well.

Will circle back on this.

On Tue, Sep 20, 2022 at 8:50 PM Weston Pace <weston.p...@gmail.com> wrote:

> I'm not great at this build stuff but I think the basic idea is that
> you will need to package your custom nodes into a shared object.
> You'll need to then somehow trigger that shared object to load from
> python.  This seems like a good place to invoke the initialize method.
>
> Currently pyarrow has to do this because the datasets module
> (libarrow_dataset.so) adds some custom nodes (scan node, dataset write
> node).  The datasets module defines the Initialize method.  This
> method is called in _exec_plan.pyx when the python module is loaded.
> I don't know cython well enough to know how exactly it triggers the
> datasets shared object to load.
>
> On Tue, Sep 20, 2022 at 11:01 AM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > Hi,
> >
> > Recently I have been working on adding a custom data source node to Acero
> > and was pointed to a few examples in the dataset code.
> >
> > If I understand correctly, the dataset exec nodes are currently
> > registered when this is loaded:
> >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/_exec_plan.pyx#L36
> >
> > If I have a custom "Initialize" method that registers additional
> > ExecNodes, where is the right place to invoke that initialization?
> > Eventually I want to execute my query via ibis-substrait and the Acero
> > Substrait consumer Python API.
> >
> > Thanks,
> > Li
>
