Going back to the default_exec_factory_registry idea, I think ultimately maybe we want registration API that looks like:
""" MetaRegistry* registry = compute::default_meta_registry(); registry->RegisterNamedTableProvider(...); registry->exec_factory_registry()->AddFactory("my_custom_node", MakeMyCustomNode) ... """ On Thu, Oct 13, 2022 at 1:32 PM Li Jin <ice.xell...@gmail.com> wrote: > Weston - was trying the pyarrow approach you suggested: > > >def custom_source(endpoint): > return pc.Declaration("my_custom_source", create_my_custom_options()) > > (1) I didn't see "Declaration" under pyarrow.compute - which package is > this under? > (2) What Python object should I return with create_my_custom_options()? > Currently I only have a C++ class for my custom option. > > On Thu, Oct 13, 2022 at 12:58 PM Li Jin <ice.xell...@gmail.com> wrote: > >> > I may be assuming here but I think your problem is more that there is >> no way to more flexibly describe a source in python and less that you >> need to change the default. >> >> Correct. >> >> > For example, if you could do something like this (in pyarrow) would it >> work? >> I could try to see if that works. I feel registering the extension in C++ >> via one initialization seems cleaner to me because there are many other >> extension points that we initialize (add registering in the >> default_exec_factory_registry >> similar to >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/plan.cc#L30). >> Having some of the initialization in Pyarrow and some in Acero is a bit >> complicated IMO. >> >> Maybe the proper long term fix is to add something similar to >> compute::default_exec_factory_registry for all extension points (named >> table, substrait extension message) and the user can register custom stuff >> similar to how they can register via default_exec_factory_registry?) >> >> On Thu, Oct 13, 2022 at 12:41 PM Weston Pace <weston.p...@gmail.com> >> wrote: >> >>> > Does that sound like a reasonable way to do this? >>> >>> It's not ideal. >>> >>> I may be assuming here but I think your problem is more that there is >>> no way to more flexibly describe a source in python and less that you >>> need to change the default. >>> >>> For example, if you could do something like this (in pyarrow) would it >>> work? >>> >>> ``` >>> def custom_source(endpoint): >>> return pc.Declaration("my_custom_source", create_my_custom_options()) >>> >>> def table_provider(names): >>> return custom_sources[names[0]] >>> >>> pa.substrait.run_query(my_plan, table_provider=table_provider) >>> ``` >>> >>> On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote: >>> > >>> > We did some work around this recently and think there needs to be some >>> > small change to allow users to override this default provider. I will >>> > explain in more details: >>> > >>> > (1) Since the variable is defined as static in the substrait/options.h >>> > file, each translation unit will have a separate copy of the >>> > kDefaultNamedTableProvider >>> > variable. And therefore, the user cannot really change the default >>> that is >>> > used here: >>> > >>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 >>> > >>> > In order to allow user to override the kDefaultNamedTableProvider (and >>> > change the behavior of >>> > >>> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 >>> > to use a custom NamedTableProvider), we need to >>> > (1) in substrait/options.hh, change the definition of >>> > kDefaultNamedTableProvider to be an extern declaration >>> > (2) move the definition of kDefaultNamedTableProvider to an >>> > substrait/options.cc file >>> > >>> > We are still testing this but based on my limited C++ knowledge, I >>> > think this would allow users to do >>> > """ >>> > #include "arrow/engine/substrait/options.h" >>> > >>> > void initialize() { >>> > arrow::engine::kDefaultNamedTableProvider = >>> > some_custom_name_table_provider; >>> > } >>> > """ >>> > And then calling `pa.substrat.run_query" should pick up the custom name >>> > table provider. >>> > >>> > Does that sound like a reasonable way to do this? >>> > >>> > >>> > >>> > >>> > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote: >>> > >>> > > Thanks both. I think NamedTableProvider is close to what I want, and >>> like >>> > > Weston said, the tricky bit is how to use a custom >>> NamedTableProvider when >>> > > calling the pyarrow substrait API. >>> > > >>> > > It's a little hacky but I *think* I can override the value >>> "kDefaultNamedTableProvider" >>> > > here and pass "table_provider=None" then it "should" work: >>> > > >>> > > >>> https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66 >>> > > >>> > > I am going to give that a shot once I pull/build Arrow default into >>> our >>> > > internal build system. >>> > > >>> > > >>> > > >>> > > >>> > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman < >>> bengil...@gmail.com> >>> > > wrote: >>> > > >>> > >> It seems to me that your use case could be handled by defining a >>> custom >>> > >> NamedTableProvider and >>> > >> assigning this to ConversionOptions::named_table_provider. This was >>> added >>> > >> in >>> > >> https://github.com/apache/arrow/pull/13613 to provide user >>> configurable >>> > >> dispatching for named tables; >>> > >> if it doesn't address your use case then we might want to create a >>> JIRA to >>> > >> extend it. >>> > >> >>> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> >>> wrote: >>> > >> >>> > >> > I did some more digging into this and have some ideas - >>> > >> > >>> > >> > Currently, the logic for deserialization named table is: >>> > >> > >>> > >> > >>> > >> >>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129 >>> > >> > and it will look up named tables from a user provided dictionary >>> from >>> > >> > string -> arrow Table. >>> > >> > >>> > >> > My idea is to make some short term changes to allow named tables >>> to be >>> > >> > dispatched differently (This logic can be reverted/removed once we >>> > >> figure >>> > >> > out the proper way to support custom data sources, perhaps via >>> substrait >>> > >> > Extensions.), specifically: >>> > >> > >>> > >> > (1) The user creates named table with uris for custom data >>> source, i.e., >>> > >> > "my_datasource://tablename?begin=20200101&end=20210101" >>> > >> > (2) In the substrait consumer, allowing user to register custom >>> dispatch >>> > >> > rules based on uri scheme (similar to how exec node registry >>> works), >>> > >> i.e., >>> > >> > sth like: >>> > >> > >>> > >> > substrait_named_table_registry.add("my_datasource", >>> deser_my_datasource) >>> > >> > and deser_my_datasource is a function that takes the NamedTable >>> > >> substrait >>> > >> > message and returns a declaration. >>> > >> > >>> > >> > I know doing this just for named tables might not be a very >>> general >>> > >> > solution but seems the easiest path forward, and we can always >>> remove >>> > >> this >>> > >> > later in favor of a more generic solution. >>> > >> > >>> > >> > Thoughts? >>> > >> > >>> > >> > Li >>> > >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> > >>> > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> >>> wrote: >>> > >> > >>> > >> > > Hello! >>> > >> > > >>> > >> > > I am working on adding a custom data source node in Acero. I >>> have a >>> > >> few >>> > >> > > previous threads related to this topic. >>> > >> > > >>> > >> > > Currently, I am able to register my custom factory method with >>> Acero >>> > >> and >>> > >> > > create a Custom source node, i.e., I can register and execute >>> this >>> > >> with >>> > >> > > Acero: >>> > >> > > >>> > >> > > MySourceNodeOptions source_options = ... >>> > >> > > Declaration source{"my_source", source_option} >>> > >> > > >>> > >> > > The next step I want to do is to pass this through to the Acero >>> > >> substrait >>> > >> > > consumer. From previous discussions, I am going to use >>> "NamedTable '' >>> > >> as >>> > >> > a >>> > >> > > temporary way to define my custom data source in substrait. My >>> > >> question >>> > >> > is >>> > >> > > this: >>> > >> > > >>> > >> > > What I need to do in substrait in order to register my own >>> substrait >>> > >> > > consumer rule/function for deserializing my custom named table >>> > >> protobuf >>> > >> > > message into the declaration above. If this is not supported >>> right >>> > >> now, >>> > >> > > what is a reasonable/minimal change to make this work? >>> > >> > > >>> > >> > > Thanks, >>> > >> > > Li >>> > >> > > >>> > >> > >>> > >> >>> > > >>> >>