> Does that sound like a reasonable way to do this?

It's not ideal.

I may be assuming too much here, but I think your real problem is that
there is no flexible way to describe a source in Python, rather than
that you need to change the default.

For example, if you could do something like this (in pyarrow) would it work?

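```
# Hypothetical API: assumes a table provider may return a source
# declaration rather than a table.
def custom_source(endpoint):
  return pc.Declaration("my_custom_source", create_my_custom_options(endpoint))

def table_provider(names):
  # names[0] is the table name; here it doubles as the endpoint
  return custom_source(names[0])

pa.substrait.run_query(my_plan, table_provider=table_provider)
```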
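
To make that concrete, here is a minimal sketch in plain Python of how such a provider could dispatch on the URI scheme suggested further down the thread. None of this is pyarrow API: `Declaration` is a stand-in dataclass, and `register_source`, `my_datasource`, and the option keys are hypothetical names. It assumes the provider receives the named table's name parts as `names` and is allowed to return a declaration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from urllib.parse import parse_qs


@dataclass
class Declaration:
    """Stand-in for an Acero declaration (hypothetical, not the pyarrow class)."""
    factory_name: str
    options: Dict[str, str] = field(default_factory=dict)


# Maps a URI scheme to a function that builds a source declaration.
SourceFactory = Callable[[str, Dict[str, str]], Declaration]
_source_factories: Dict[str, SourceFactory] = {}


def register_source(scheme: str, factory: SourceFactory) -> None:
    """Register a factory for every named table whose name uses `scheme`."""
    _source_factories[scheme] = factory


def table_provider(names: List[str]) -> Declaration:
    """Dispatch a named table such as 'scheme://table?key=value' to its factory."""
    # Split by hand: urllib.parse.urlparse does not accept '_' in a scheme.
    scheme, sep, rest = names[0].partition("://")
    if not sep:
        raise ValueError(f"named table {names[0]!r} is not a URI")
    factory = _source_factories.get(scheme)
    if factory is None:
        raise KeyError(f"no source registered for scheme {scheme!r}")
    table, _, query = rest.partition("?")
    # Keep the last value if a query parameter is repeated.
    params = {key: values[-1] for key, values in parse_qs(query).items()}
    return factory(table, params)


def my_datasource(table: str, params: Dict[str, str]) -> Declaration:
    """Hypothetical factory for the 'my_datasource' scheme."""
    return Declaration("my_custom_source", {"table": table, **params})


register_source("my_datasource", my_datasource)
decl = table_provider(["my_datasource://tablename?begin=20200101&end=20210101"])
```

The registry keeps the scheme-to-factory mapping on the Python side, so supporting a new source would only need another `register_source` call rather than a change to the C++ default.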

On Thu, Oct 13, 2022 at 8:24 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> We did some work around this recently and think a small change is
> needed to allow users to override this default provider. I will
> explain in more detail:
>
> Since the variable is defined as static in the substrait/options.h
> file, each translation unit gets a separate copy of the
> kDefaultNamedTableProvider variable. Therefore, the user cannot really
> change the default that is used here:
> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
>
> In order to allow users to override kDefaultNamedTableProvider (and
> change the behavior of
> https://github.com/apache/arrow/blob/master/python/pyarrow/_substrait.pyx#L125
> to use a custom NamedTableProvider), we need to:
> (1) in substrait/options.h, change the definition of
> kDefaultNamedTableProvider to an extern declaration
> (2) move the definition of kDefaultNamedTableProvider to a
> substrait/options.cc file
>
> We are still testing this but based on my limited C++ knowledge, I
> think this would allow users to do
> """
> #include "arrow/engine/substrait/options.h"
>
> void initialize() {
>     arrow::engine::kDefaultNamedTableProvider = some_custom_name_table_provider;
> }
> """
> And then calling `pa.substrait.run_query` should pick up the custom named
> table provider.
>
> Does that sound like a reasonable way to do this?
>
>
>
>
> On Tue, Sep 27, 2022 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Thanks both. I think NamedTableProvider is close to what I want, and like
> > Weston said, the tricky bit is how to use a custom NamedTableProvider when
> > calling the pyarrow substrait API.
> >
> > It's a little hacky, but I *think* I can override the value of
> > "kDefaultNamedTableProvider"
> > here and pass "table_provider=None"; then it "should" work:
> >
> > https://github.com/apache/arrow/blob/529f653dfa58887522af06028e5c32e8dd1a14ea/cpp/src/arrow/engine/substrait/options.h#L66
> >
> > I am going to give that a shot once I pull/build Arrow default into our
> > internal build system.
> >
> >
> >
> >
> > On Tue, Sep 27, 2022 at 10:50 AM Benjamin Kietzman <bengil...@gmail.com>
> > wrote:
> >
> >> It seems to me that your use case could be handled by defining a custom
> >> NamedTableProvider and assigning it to
> >> ConversionOptions::named_table_provider. This was added in
> >> https://github.com/apache/arrow/pull/13613 to provide user-configurable
> >> dispatching for named tables; if it doesn't address your use case then we
> >> might want to create a JIRA to extend it.
> >>
> >> On Tue, Sep 27, 2022 at 10:41 AM Li Jin <ice.xell...@gmail.com> wrote:
> >>
> >> > I did some more digging into this and have some ideas -
> >> >
> >> > Currently, the logic for deserializing named tables is:
> >> >
> >> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/relation_internal.cc#L129
> >> > and it looks up named tables in a user-provided dictionary from
> >> > string -> arrow Table.
> >> >
> >> > My idea is to make some short-term changes to allow named tables to be
> >> > dispatched differently (this logic can be reverted/removed once we figure
> >> > out the proper way to support custom data sources, perhaps via substrait
> >> > extensions), specifically:
> >> >
> >> > (1) The user creates a named table with a URI for the custom data
> >> > source, e.g.,
> >> > "my_datasource://tablename?begin=20200101&end=20210101"
> >> > (2) In the substrait consumer, allow the user to register custom dispatch
> >> > rules based on URI scheme (similar to how the exec node registry works),
> >> > e.g., something like:
> >> >
> >> > substrait_named_table_registry.add("my_datasource", deser_my_datasource)
> >> >
> >> > where deser_my_datasource is a function that takes the NamedTable
> >> > substrait message and returns a declaration.
> >> >
> >> > I know doing this just for named tables might not be a very general
> >> > solution, but it seems the easiest path forward, and we can always remove
> >> > this later in favor of a more generic solution.
> >> >
> >> > Thoughts?
> >> >
> >> > Li
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Sep 26, 2022 at 10:58 AM Li Jin <ice.xell...@gmail.com> wrote:
> >> >
> >> > > Hello!
> >> > >
> >> > > I am working on adding a custom data source node in Acero. I have a few
> >> > > previous threads related to this topic.
> >> > >
> >> > > Currently, I am able to register my custom factory method with Acero and
> >> > > create a custom source node, i.e., I can register and execute this with
> >> > > Acero:
> >> > >
> >> > > MySourceNodeOptions source_options = ...
> >> > > Declaration source{"my_source", source_options};
> >> > >
> >> > > The next step is to pass this through to the Acero substrait
> >> > > consumer. From previous discussions, I am going to use "NamedTable" as a
> >> > > temporary way to define my custom data source in substrait. My question is
> >> > > this:
> >> > >
> >> > > What do I need to do in substrait in order to register my own substrait
> >> > > consumer rule/function for deserializing my custom named table protobuf
> >> > > message into the declaration above? If this is not supported right now,
> >> > > what is a reasonable/minimal change to make this work?
> >> > >
> >> > > Thanks,
> >> > > Li
> >> > >
> >> >
> >>
> >
