Ben Kietzman created ARROW-18063:
------------------------------------

             Summary: [C++][Python] 
                 Key: ARROW-18063
                 URL: https://issues.apache.org/jira/browse/ARROW-18063
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Ben Kietzman


[Mailing list thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]

The goal is to:
- generate a Substrait plan in Python using Ibis
- ... wherein tables are specified using custom URLs
- use the Python API {{run_query}} to execute the plan
- ... against source data which is *streamed* from those URLs rather than
pulled fully into local memory

The obstacles include:
- The API for constructing a data stream from the custom URLs is only available
in C++
- The Python {{run_query}} function requires tables as input and cannot accept
a RecordBatchReader, even if one could be constructed from a custom URL
- Writing custom Cython is not preferred

Some potential solutions:
- Make ExecuteSerializedPlan() directly usable from C++ so that construction of
data sources need not be handled in Python. Passing a buffer from Python/Ibis
down to C++ is much simpler and can be done without writing Cython
- Refactor NamedTableProvider from a lambda mapping {{names -> data source}}
into a registry so that data source factories can be added from C++ and then
referenced by name from Python
- Extend {{run_query}} to support non-Table sources and require the user to
write a Python mapping from URLs to {{pa.RecordBatchReader}}



