Going back to the default_exec_factory_registry idea, I think ultimately
maybe we want registration API that looks like:

"""
MetaRegistry* registry = compute::default_meta_registry();
registry->RegisterNamedTableProvider(...);
registry->exec_factory_registry()->AddFactory("my_custom_node",
MakeMyCustomNode)
...
"""

On Thu, Oct 13, 2022 at 1:32 PM Li Jin <ice.xell...@gmail.com> wrote:

> Weston - was trying the pyarrow approach you suggested:
>
> >def custom_source(endpoint):
>   return pc.Declaration("my_custom_source", create_my_custom_options())
>
> (1) I didn't see "Declaration" under pyarrow.compute - which package is
> this under?
> (2) What Python object should I return with  create_my_custom_options()?
> Currently I only have a C++ class for my custom option.
>
> On Thu, Oct 13, 2022 at 12:58 PM Li Jin <ice.xell...@gmail.com> wrote:
>
>> > I may be assuming here but I think your problem is more that there is
>> no way to more flexibly describe a source in python and less that you
>> need to change the default.
>>
>> Correct.
>>
>> > For example, if you could do something like this (in pyarrow) would it
>> work?
>> I could try to see if that works. I feel registering the extension in C++
>> via one initialization seems cleaner to me because there are many other
>> extension points that we initialize (add registering in the 
>> default_exec_factory_registry
>> similar to
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30).
>> Having some of the initialization in Pyarrow and some in Acero is a bit
>> complicated IMO.
>>
>> Maybe the proper long term fix is to add something similar to
>> compute::default_exec_factory_registry for all extension points (named
>> table, substrait extension message) and the user can register custom stuff
>> similar to how they can register via default_exec_factory_registry?)
>>
>> On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com>
>> wrote:
>>
>>> > Does that sound like a reasonable way to do this?
>>>
>>> It's not ideal.
>>>
>>> I may be assuming here but I think your problem is more that there is
>>> no way to more flexibly describe a source in python and less that you
>>> need to change the default.
>>>
>>> For example, if you could do something like this (in pyarrow) would it
>>> work?
>>>
>>> ```
>>> def custom_source(endpoint):
>>>   return pc.Declaration("my_custom_source", create_my_custom_options())
>>>
>>> def table_provider(names):
>>>   return custom_sources[names[0]]
>>>
>>> pa.substrait.run_query(my_plan, table_provider=table_provider)
>>> ```
>>>
>>> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote:
>>> >
>>> > We did some work around this recently and think there needs to be some
>>> > small change to allow users to override this default provider. I will
>>> > explain in more details:
>>> >
>>> > (1) Since the variable is defined as static in the substrait/options.h
>>> > file, each translation unit will have a separate copy of the
>>> > kDefaultNamedTableProvider
>>> > variable. And therefore, the user cannot really change the default
>>> that is
>>> > used here:
>>> >
>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
>>> >
>>> > In order to allow user to override the kDefaultNamedTableProvider (and
>>> > change the behavior of
>>> >
>>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
>>> > to use a custom NamedTableProvider), we need to
>>> > (1) in substrait/options.hh, change the definition of
>>> > kDefaultNamedTableProvider to be an extern declaration
>>> > (2) move the definition of kDefaultNamedTableProvider to an
>>> > substrait/options.cc file
>>> >
>>> > We are still testing this but based on my limited C++ knowledge, I
>>> > think this would allow users to do
>>> > """
>>> > #include "arrow/engine/substrait/options.h"
>>> >
>>> > void initialize() {
>>> >     arrow::engine::kDefaultNamedTableProvider =
>>> > some_custom_name_table_provider;
>>> > }
>>> > """
>>> > And then calling `pa.substrat.run_query" should pick up the custom name
>>> > table provider.
>>> >
>>> > Does that sound like a reasonable way to do this?
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>>> >
>>> > > Thanks both. I think NamedTableProvider is close to what I want, and
>>> like
>>> > > Weston said, the tricky bit is how to use a custom
>>> NamedTableProvider when
>>> > > calling the pyarrow substrait API.
>>> > >
>>> > > It's a little hacky but I *think* I can override the value
>>> "kDefaultNamedTableProvider"
>>> > > here and pass "table_provider=None" then it "should" work:
>>> > >
>>> > >
>>> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
>>> > >
>>> > > I am going to give that a shot once I pull/build Arrow default into
>>> our
>>> > > internal build system.
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <
>>> bengil...@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> It seems to me that your use case could be handled by defining a
>>> custom
>>> > >> NamedTableProvider and
>>> > >> assigning this to ConversionOptions::named_table_provider. This was
>>> added
>>> > >> in
>>> > >> https://github.com/apache/arrow/pull/13613 to provide user
>>> configurable
>>> > >> dispatching for named tables;
>>> > >> if it doesn't address your use case then we might want to create a
>>> JIRA to
>>> > >> extend it.
>>> > >>
>>> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com>
>>> wrote:
>>> > >>
>>> > >> > I did some more digging into this and have some ideas -
>>> > >> >
>>> > >> > Currently, the logic for deserialization named table is:
>>> > >> >
>>> > >> >
>>> > >>
>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
>>> > >> > and it will look up named tables from a user provided dictionary
>>> from
>>> > >> > string -> arrow Table.
>>> > >> >
>>> > >> > My idea is to make some short term changes to allow named tables
>>> to be
>>> > >> > dispatched differently (This logic can be reverted/removed once we
>>> > >> figure
>>> > >> > out the proper way to support custom data sources, perhaps via
>>> substrait
>>> > >> > Extensions.), specifically:
>>> > >> >
>>> > >> > (1) The user creates named table with uris for custom data
>>> source, i.e.,
>>> > >> > "my_datasource://tablename?begin=20200101&end=20210101"
>>> > >> > (2) In the substrait consumer, allowing user to register custom
>>> dispatch
>>> > >> > rules based on uri scheme (similar to how exec node registry
>>> works),
>>> > >> i.e.,
>>> > >> > sth like:
>>> > >> >
>>> > >> > substrait_named_table_registry.add("my_datasource",
>>> deser_my_datasource)
>>> > >> > and deser_my_datasource is a function that takes the NamedTable
>>> > >> substrait
>>> > >> > message and returns a declaration.
>>> > >> >
>>> > >> > I know doing this just for named tables might not be a very
>>> general
>>> > >> > solution but seems the easiest path forward, and we can always
>>> remove
>>> > >> this
>>> > >> > later in favor of a more generic solution.
>>> > >> >
>>> > >> > Thoughts?
>>> > >> >
>>> > >> > Li
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com>
>>> wrote:
>>> > >> >
>>> > >> > > Hello!
>>> > >> > >
>>> > >> > > I am working on adding a custom data source node in Acero. I
>>> have a
>>> > >> few
>>> > >> > > previous threads related to this topic.
>>> > >> > >
>>> > >> > > Currently, I am able to register my custom factory method with
>>> Acero
>>> > >> and
>>> > >> > > create a Custom source node, i.e., I can register and execute
>>> this
>>> > >> with
>>> > >> > > Acero:
>>> > >> > >
>>> > >> > > MySourceNodeOptions source_options = ...
>>> > >> > > Declaration source{"my_source", source_option}
>>> > >> > >
>>> > >> > > The next step I want to do is to pass this through to the Acero
>>> > >> substrait
>>> > >> > > consumer. From previous discussions, I am going to use
>>> "NamedTable ''
>>> > >> as
>>> > >> > a
>>> > >> > > temporary way to define my custom data source in substrait. My
>>> > >> question
>>> > >> > is
>>> > >> > > this:
>>> > >> > >
>>> > >> > > What I need to do in substrait in order to register my own
>>> substrait
>>> > >> > > consumer rule/function for deserializing my custom named table
>>> > >> protobuf
>>> > >> > > message into the declaration above. If this is not supported
>>> right
>>> > >> now,
>>> > >> > > what is a reasonable/minimal change to make this work?
>>> > >> > >
>>> > >> > > Thanks,
>>> > >> > > Li
>>> > >> > >
>>> > >> >
>>> > >>
>>> > >
>>>
>>

Reply via email to