In pyarrow the named table lookup is "string(s) -> arrow Table".  However,
in the underlying C++ (e.g. relation_internal.cc) it is already "string(s) ->
compute::Declaration", which should be sufficiently general for your
needs.  A "compute::Declaration" is a combination of a node factory name
and node options, so you should be able to return something like
{"flight_source", FlightNodeOptions(...)}.
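
To make the "string(s) -> Declaration" shape concrete, here is a rough,
self-contained sketch in plain Python.  It does not use any real Arrow API:
Declaration and FlightNodeOptions are stand-in classes, and
register_provider / resolve_named_table are hypothetical names I made up to
illustrate dispatching on the URI scheme.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List
from urllib.parse import parse_qs


@dataclass
class Declaration:
    """Stand-in for compute::Declaration: factory name plus node options."""
    factory_name: str
    options: Any


@dataclass
class FlightNodeOptions:
    """Hypothetical options for a hypothetical "flight_source" node."""
    table: str
    begin: str
    end: str


# A provider maps the NamedTable names (strings) to a Declaration.
NamedTableProvider = Callable[[List[str]], Declaration]

# Hypothetical registry keyed by URI scheme.
_providers: Dict[str, NamedTableProvider] = {}


def register_provider(scheme: str, provider: NamedTableProvider) -> None:
    _providers[scheme] = provider


def resolve_named_table(names: List[str]) -> Declaration:
    # Split by hand: urllib's urlsplit rejects "_" in a scheme.
    scheme, sep, _ = names[0].partition("://")
    if not sep or scheme not in _providers:
        raise KeyError(f"no provider registered for {names[0]!r}")
    return _providers[scheme](names)


def deser_my_datasource(names: List[str]) -> Declaration:
    _, _, rest = names[0].partition("://")
    table, _, query = rest.partition("?")
    params = parse_qs(query)
    options = FlightNodeOptions(table, params["begin"][0], params["end"][0])
    return Declaration("flight_source", options)


register_provider("my_datasource", deser_my_datasource)
decl = resolve_named_table(
    ["my_datasource://tablename?begin=20200101&end=20210101"]
)
```

Here decl carries only the factory name and an options object; nothing is
executed until the declaration is handed to the engine.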

I think the only tricky part will be how to expose the declaration to
python / cython.  The node factory name is pretty straightforward, but
the options can be pretty much anything.

On Tue, Sep 27, 2022 at 7:41 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> I did some more digging into this and have some ideas -
>
> Currently, the logic for deserializing named tables is:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
> It looks up named tables in a user-provided dictionary mapping
> string -> arrow Table.
>
> My idea is to make some short-term changes that allow named tables to be
> dispatched differently. This logic can be reverted or removed once we figure
> out the proper way to support custom data sources, perhaps via Substrait
> extensions. Specifically:
>
> (1) The user creates a named table with a URI for the custom data source,
> e.g., "my_datasource://tablename?begin=20200101&end=20210101"
> (2) In the Substrait consumer, allow the user to register custom dispatch
> rules based on the URI scheme (similar to how the exec node registry works),
> e.g., something like:
>
> substrait_named_table_registry.add("my_datasource", deser_my_datasource)
>
> where deser_my_datasource is a function that takes the NamedTable Substrait
> message and returns a declaration.
>
> I know doing this just for named tables might not be a very general
> solution, but it seems the easiest path forward, and we can always remove
> this later in favor of a more generic solution.
>
> Thoughts?
>
> Li
>
>
>
>
>
> On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Hello!
> >
> > I am working on adding a custom data source node in Acero. I have a few
> > previous threads related to this topic.
> >
> > Currently, I am able to register my custom factory method with Acero and
> > create a Custom source node, i.e., I can register and execute this with
> > Acero:
> >
> > MySourceNodeOptions source_options = ...;
> > Declaration source{"my_source", source_options};
> >
> > The next step I want to do is to pass this through to the Acero Substrait
> > consumer. From previous discussions, I am going to use "NamedTable" as a
> > temporary way to define my custom data source in Substrait. My question is
> > this:
> >
> > What do I need to do in Substrait in order to register my own consumer
> > rule/function for deserializing my custom named table protobuf message
> > into the declaration above? If this is not supported right now, what is a
> > reasonable/minimal change to make this work?
> >
> > Thanks,
> > Li
> >
