> Does that sound like a reasonable way to do this? It's not ideal.
I may be assuming here but I think your problem is more that there is no way to more flexibly describe a source in python and less that you need to change the default. For example, if you could do something like this (in pyarrow) would it work? ``` def custom_source(endpoint): return pc.Declaration("my_custom_source", create_my_custom_options()) def table_provider(names): return custom_sources[names[0]] pa.substrait.run_query(my_plan, table_provider=table_provider) ``` On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote: > > We did some work around this recently and think there needs to be some > small change to allow users to override this default provider. I will > explain in more details: > > (1) Since the variable is defined as static in the substrait/options.h > file, each translation unit will have a separate copy of the > kDefaultNamedTableProvider > variable. And therefore, the user cannot really change the default that is > used here: > https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 > > In order to allow user to override the kDefaultNamedTableProvider (and > change the behavior of > https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125 > to use a custom NamedTableProvider), we need to > (1) in substrait/options.hh, change the definition of > kDefaultNamedTableProvider to be an extern declaration > (2) move the definition of kDefaultNamedTableProvider to an > substrait/options.cc file > > We are still testing this but based on my limited C++ knowledge, I > think this would allow users to do > """ > #include "arrow/engine/substrait/options.h" > > void initialize() { > arrow::engine::kDefaultNamedTableProvider = > some_custom_name_table_provider; > } > """ > And then calling `pa.substrat.run_query" should pick up the custom name > table provider. > > Does that sound like a reasonable way to do this? > > > > > On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote: > > > Thanks both. I think NamedTableProvider is close to what I want, and like > > Weston said, the tricky bit is how to use a custom NamedTableProvider when > > calling the pyarrow substrait API. > > > > It's a little hacky but I *think* I can override the value > > "kDefaultNamedTableProvider" > > here and pass "table_provider=None" then it "should" work: > > > > https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66 > > > > I am going to give that a shot once I pull/build Arrow default into our > > internal build system. > > > > > > > > > > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <bengil...@gmail.com> > > wrote: > > > >> It seems to me that your use case could be handled by defining a custom > >> NamedTableProvider and > >> assigning this to ConversionOptions::named_table_provider. This was added > >> in > >> https://github.com/apache/arrow/pull/13613 to provide user configurable > >> dispatching for named tables; > >> if it doesn't address your use case then we might want to create a JIRA to > >> extend it. > >> > >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> wrote: > >> > >> > I did some more digging into this and have some ideas - > >> > > >> > Currently, the logic for deserialization named table is: > >> > > >> > > >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129 > >> > and it will look up named tables from a user provided dictionary from > >> > string -> arrow Table. > >> > > >> > My idea is to make some short term changes to allow named tables to be > >> > dispatched differently (This logic can be reverted/removed once we > >> figure > >> > out the proper way to support custom data sources, perhaps via substrait > >> > Extensions.), specifically: > >> > > >> > (1) The user creates named table with uris for custom data source, i.e., > >> > "my_datasource://tablename?begin=20200101&end=20210101" > >> > (2) In the substrait consumer, allowing user to register custom dispatch > >> > rules based on uri scheme (similar to how exec node registry works), > >> i.e., > >> > sth like: > >> > > >> > substrait_named_table_registry.add("my_datasource", deser_my_datasource) > >> > and deser_my_datasource is a function that takes the NamedTable > >> substrait > >> > message and returns a declaration. > >> > > >> > I know doing this just for named tables might not be a very general > >> > solution but seems the easiest path forward, and we can always remove > >> this > >> > later in favor of a more generic solution. > >> > > >> > Thoughts? > >> > > >> > Li > >> > > >> > > >> > > >> > > >> > > >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> wrote: > >> > > >> > > Hello! > >> > > > >> > > I am working on adding a custom data source node in Acero. I have a > >> few > >> > > previous threads related to this topic. > >> > > > >> > > Currently, I am able to register my custom factory method with Acero > >> and > >> > > create a Custom source node, i.e., I can register and execute this > >> with > >> > > Acero: > >> > > > >> > > MySourceNodeOptions source_options = ... > >> > > Declaration source{"my_source", source_option} > >> > > > >> > > The next step I want to do is to pass this through to the Acero > >> substrait > >> > > consumer. From previous discussions, I am going to use "NamedTable '' > >> as > >> > a > >> > > temporary way to define my custom data source in substrait. My > >> question > >> > is > >> > > this: > >> > > > >> > > What I need to do in substrait in order to register my own substrait > >> > > consumer rule/function for deserializing my custom named table > >> protobuf > >> > > message into the declaration above. If this is not supported right > >> now, > >> > > what is a reasonable/minimal change to make this work? > >> > > > >> > > Thanks, > >> > > Li > >> > > > >> > > >> > >