[jira] [Commented] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}

Li Jin (Jira) Mon, 17 Oct 2022 08:09:09 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618950#comment-17618950
 ]


Li Jin commented on ARROW-18063:
--------------------------------

>It might be slightly nicer to throw an error when setting the default named 
>table provider if it has already been set. There are more complex alternatives 
>such as a named table provider registry or a chain of named table providers 
>but I'm not sure they are needed in this case.

I think either overriding or raising an error is fine. In practice I don't 
expect our application to run the custom registration initialization more 
than once.
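
To make the two options concrete, here is a minimal Python sketch of a 
set-once default provider. It is only an illustration: the names 
set_default_named_table_provider, get_default_named_table_provider and the 
allow_override flag are hypothetical, not part of the Arrow API, and the real 
setter would live in C++.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in: a provider maps a list of table names to a data source.
NamedTableProvider = Callable[[List[str]], object]

_default_provider: Optional[NamedTableProvider] = None


def set_default_named_table_provider(
    provider: NamedTableProvider, allow_override: bool = False
) -> None:
    """Install the process-wide default provider (hypothetical API).

    Raises RuntimeError if a default is already set, unless the caller
    explicitly opts into overriding it.
    """
    global _default_provider
    if _default_provider is not None and not allow_override:
        raise RuntimeError("default named table provider is already set")
    _default_provider = provider


def get_default_named_table_provider() -> NamedTableProvider:
    """Return the installed default provider, or raise if none was set."""
    if _default_provider is None:
        raise RuntimeError("no default named table provider has been set")
    return _default_provider
```

Since we would only call the setter once at startup, either default for 
allow_override would work for us.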

 

>Another alternative, which might be a more long term solution, is to create a 
>new Substrait extension which defines a new {{read_type}} (e.g. 
>{{ExtensionTable}}) which contains the needed information (e.g. URL).

We would then need to make it possible to construct custom sources from 
{{ExtensionTable}} though which probably puts us in roughly the same boat :). 
We would need an {{ExtensionTableProvider}} and we would probably want the 
default to be configurable.

I agree. Long term we should also allow users to register a custom 
ExtensionTableProvider, ideally in a way similar to how ExecFactoryRegistry 
and NamedTableProvider are extended.
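
A rough Python sketch of what such a registration could look like, assuming 
the ExtensionTable carries a URL and that providers are keyed by URL scheme. 
The class ExtensionTableProviderRegistry and its methods are hypothetical and 
do not exist in Arrow; plain dicts stand in for record batches.

```python
from typing import Callable, Dict, Iterator
from urllib.parse import urlsplit

# Hypothetical: a factory takes the URL carried by an ExtensionTable and
# returns an iterator of batches (a stand-in for a streaming source).
SourceFactory = Callable[[str], Iterator[dict]]


class ExtensionTableProviderRegistry:
    """Maps URL schemes to source factories (illustrative only)."""

    def __init__(self) -> None:
        self._factories: Dict[str, SourceFactory] = {}

    def add_factory(self, scheme: str, factory: SourceFactory) -> None:
        # Refuse duplicate registrations instead of silently overriding.
        if scheme in self._factories:
            raise ValueError(f"a factory for scheme {scheme!r} already exists")
        self._factories[scheme] = factory

    def make_source(self, url: str) -> Iterator[dict]:
        # Pick the factory from the URL scheme, e.g. "demo" in "demo://host/t".
        scheme = urlsplit(url).scheme
        if scheme not in self._factories:
            raise KeyError(f"no factory registered for scheme {scheme!r}")
        return self._factories[scheme](url)
```

The factories would be registered once from C++ (or a thin binding) and then 
looked up when the plan is executed, so no custom Cython is needed per source.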

> [C++][Python] Custom streaming data providers in {{run_query}}
> --------------------------------------------------------------
>
>                 Key: ARROW-18063
>                 URL: https://issues.apache.org/jira/browse/ARROW-18063
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> [Mailing list 
> thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]
> The goal is to:
> - generate a substrait plan in Python using Ibis
> - ... wherein tables are specified using custom URLs
> - use the python API {{run_query}} to execute the plan
> - ... against source data which is *streamed* from those URLs rather than 
> pulled fully into local memory
> The obstacles include:
> - The API for constructing a data stream from the custom URLs is only 
> available in c++
> - The python {{run_query}} function requires tables as input and cannot 
> accept a RecordBatchReader even if one could be constructed from a custom URL
> - Writing custom cython is not preferred
> Some potential solutions:
> - Use ExecuteSerializedPlan(), which is directly usable from C++, so that 
> construction of data sources need not be handled in Python. Passing a buffer 
> from Python/Ibis down to C++ is much simpler and can be navigated without 
> writing Cython
> - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} 
> into a registry so that data source factories can be added from c++ then 
> referenced by name from python
> - Extend {{run_query}} to support non-Table sources and require the user to 
> write a python mapping from URLs to {{pa.RecordBatchReader}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
