Ben Kietzman created ARROW-18063:
------------------------------------

             Summary: [C++][Python]
                 Key: ARROW-18063
                 URL: https://issues.apache.org/jira/browse/ARROW-18063
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Ben Kietzman

[Mailing list thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so]

The goal is to:
- generate a Substrait plan in Python using Ibis
- ... wherein tables are specified using custom URLs
- use the Python API {{run_query}} to execute the plan
- ... against source data which is *streamed* from those URLs rather than pulled fully into local memory

The obstacles include:
- The API for constructing a data stream from the custom URLs is only available in C++
- The Python {{run_query}} function requires tables as input and cannot accept a RecordBatchReader, even if one could be constructed from a custom URL
- Writing custom Cython is not preferred

Some potential solutions:
- Use ExecuteSerializedPlan() directly from C++ so that construction of data sources need not be handled in Python. Passing a buffer from Python/Ibis down to C++ is much simpler and can be done without writing Cython
- Refactor NamedTableProvider from a lambda mapping {{names -> data source}} into a registry so that data source factories can be added from C++ and then referenced by name from Python
- Extend {{run_query}} to support non-Table sources and require the user to write a Python mapping from URLs to {{pa.RecordBatchReader}}
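
To make the registry idea more concrete, here is a minimal, pure-Python sketch of the pattern: factories are registered under a name (here a custom URL) and only invoked when a consumer asks for the source, so data is pulled batch by batch. Everything in it is hypothetical and for illustration only: {{SourceRegistry}}, {{open_source}}, {{fake_url_stream}} and the {{custom://example/table}} URL are not part of Arrow, and plain Python iterators of row lists stand in for {{pa.RecordBatchReader}} so that the sketch runs without pyarrow.

```python
from typing import Callable, Dict, Iterator, List

# A "batch" here is just a list of row dicts. It stands in for a
# pyarrow RecordBatch so the sketch runs without pyarrow installed.
Batch = List[dict]
SourceFactory = Callable[[], Iterator[Batch]]


class SourceRegistry:
    """Hypothetical registry mapping table names (e.g. custom URLs) to factories."""

    def __init__(self) -> None:
        self._factories: Dict[str, SourceFactory] = {}

    def register(self, name: str, factory: SourceFactory) -> None:
        # Refuse silent overwrites so a name always refers to one factory.
        if name in self._factories:
            raise KeyError(f"source already registered: {name}")
        self._factories[name] = factory

    def open_source(self, name: str) -> Iterator[Batch]:
        # The factory is only invoked here, so nothing is read at registration time.
        if name not in self._factories:
            raise KeyError(f"no source registered for: {name}")
        return self._factories[name]()


def fake_url_stream() -> Iterator[Batch]:
    # Assumption: stands in for a reader that streams from a custom URL.
    # It yields one small batch at a time instead of materializing a table.
    for start in range(0, 6, 2):
        yield [{"x": start}, {"x": start + 1}]


registry = SourceRegistry()
registry.register("custom://example/table", fake_url_stream)

# A consumer (standing in for the plan executor) pulls batches one at a time.
total = 0
for batch in registry.open_source("custom://example/table"):
    total += sum(row["x"] for row in batch)
```

In the proposal itself the factories would be added from C++ and only the name would cross into Python; the sketch keeps both sides in Python purely to show the lookup-by-name and lazy-open behaviour.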