Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1
+1 (binding)

On Fri, Jun 24, 2022 at 7:36 PM Remzi Yang <1371656737...@gmail.com> wrote:
> +1 (non-binding). Verified on Mac M1.
> Thanks Andrew.
>
> Remzi
Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1
+1 (non-binding). Verified on Mac M1.
Thanks Andrew.

Remzi

On Sat, 25 Jun 2022 at 09:33, Chao Sun wrote:
> +1 (non-binding). Verified on Intel Mac.
>
> Thanks Andrew.
Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1
+1 (non-binding). Verified on Intel Mac.

Thanks Andrew.

On Fri, Jun 24, 2022 at 5:17 PM L. C. Hsieh wrote:
> +1 (non-binding)
>
> Verified on Intel Mac.
>
> Thank you, Andrew.
Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1
+1 (non-binding)

Verified on Intel Mac.

Thank you, Andrew.

On Fri, Jun 24, 2022 at 5:00 PM Andy Grove wrote:
> +1 (binding)
>
> Verified on Ubuntu 20.04.4 LTS.
>
> Thanks, Andrew.
Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1
+1 (binding)

Verified on Ubuntu 20.04.4 LTS.

Thanks, Andrew.

On Fri, Jun 24, 2022 at 2:45 PM Andrew Lamb wrote:
> Hi,
>
> I would like to propose a release of Apache Arrow Rust Implementation,
> version 17.0.0.
[VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1
Hi,

I would like to propose a release of Apache Arrow Rust Implementation, version 17.0.0.

This release candidate is based on commit: 9f7b6004d365b0c0bac8e30170b49bdd66cc7df0 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests, and vote on the release. There is a script [4] that automates some of the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust because...

[1]: https://github.com/apache/arrow-rs/tree/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0
[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-17.0.0-rc1
[3]: https://github.com/apache/arrow-rs/blob/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0/CHANGELOG.md
[4]: https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
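[Editorial note: for anyone verifying by hand rather than through the script in [4], the checksum step can be sketched in plain Python. This is a hedged illustration, not part of the official verification procedure; it assumes the release directory publishes a `.sha512` file alongside the tarball (signature verification with the `.asc` file additionally requires GPG and the Arrow KEYS file).]

```python
import hashlib

def verify_sha512(tarball_path, checksum_path):
    """Return True if the tarball matches the digest stored in the .sha512 file."""
    # The first whitespace-separated token of the checksum file is the hex digest.
    with open(checksum_path) as f:
        expected = f.read().split()[0].lower()
    digest = hashlib.sha512()
    with open(tarball_path, "rb") as f:
        # Hash in 1 MiB chunks so large tarballs don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```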
Re: vectorized processing for arrow::take()
Hi Aldrin,

Use case: I am taking a subset of a really large input_array. It is used in places where OpenMP-like and MPI-like parallelism are used, so vectorization seems to be the next low-hanging fruit. I have added this to the original Stack Overflow post.

On Thu, Jun 23, 2022 at 5:53 PM Aldrin wrote:
> Without knowing implementation details of the Take function, I know that
> Arrow uses xsimd and does try to enable the compiler to vectorize code
> where it can. To clarify, are you asking how to improve the performance
> you're seeing, or are you asking how to check if the compiled code is using
> vector instructions? I think a little bit more context about what you know
> and what you're trying to do could also help others who know more about
> this function (and vectorization in Arrow in general) to chime in.
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Thu, Jun 23, 2022 at 12:41 PM Chak-Pong Chung wrote:
>
> > correction: not clang, I meant the Vectorizers from LLVM
> >
> > https://llvm.org/docs/Vectorizers.html
> >
> > if we can use it with arrow array
> >
> > On Thu, Jun 23, 2022 at 3:35 PM Chak-Pong Chung wrote:
> >
> > > I asked a question here about vectorized processing:
> > >
> > > https://stackoverflow.com/questions/72735678/how-to-vectorize-arrowcomputetake
> > >
> > > Any idea? I am also open to approaches like Intel MKL, xsimd, clang,
> > > and so on.

--
Regards,
Chak-Pong
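[Editorial note: separately from any SIMD inside the Take kernel itself, the OpenMP/MPI-style parallelism mentioned above can also be applied around the call by splitting the index list across workers. The sketch below shows the idea in pure Python as an illustration of the semantics only — it is not Arrow's implementation, and in CPython the threads mostly illustrate the chunking structure rather than delivering true CPU parallelism.]

```python
from concurrent.futures import ThreadPoolExecutor

def take(values, indices):
    # Reference semantics of a take/gather: out[i] = values[indices[i]].
    return [values[i] for i in indices]

def parallel_take(values, indices, n_chunks=4):
    # Split the index list into contiguous chunks and gather each chunk in
    # its own worker; concatenating the parts preserves the index order.
    size = max(1, -(-len(indices) // n_chunks))  # ceiling division
    chunks = [indices[i:i + size] for i in range(0, len(indices), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = pool.map(lambda chunk: take(values, chunk), chunks)
    return [x for part in parts for x in part]
```

In a C++ setting the same chunking maps naturally onto real cores (OpenMP parallel-for over chunks), with SIMD left to the per-chunk gather loop.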
Re: user-defined Python-based data-sources in Arrow
+1 for me. What you are describing is a good idea, and having different ways to provide sources, especially through Substrait and Python, is something we can very much use right away. I think we are probably pretty close to having the components you describe. I'm not quite sure I follow all of the details, but those things can be worked out in PRs.

On Fri, Jun 24, 2022 at 12:51 AM Yaron Gvili wrote:
> I'm using schema-carrying tabular data as a more general term than a
> RecordBatch stream. For example, an ExecBatch stream or a vector of
> equal-length Arrays plus associated schema is also good. The Python
> data-source node would be responsible for converting from the Python
> data-source function to invocations of InputReceived on the node's output -
> the SourceNode and TableSourceNode classes in source_node.cc are examples of
> this structure. Over time, the Python data-source node could extend its
> support to various Python data-source interfaces.
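[Editorial note: the adapter idea discussed in this thread — wrapping an arbitrary, sequentially accessible Python data-source in a RecordBatchReader-style interface — can be sketched in plain Python. This is a duck-typed illustration only: `GeneratorSource` and its method names are hypothetical and mirror `pyarrow.RecordBatchReader` loosely; they are not an Arrow API.]

```python
class GeneratorSource:
    """Adapt a Python iterable of batches to a sequential ReadNext-style
    interface, matching the 'sequentially accessible' case discussed above."""

    def __init__(self, schema, batch_iter):
        self._schema = schema
        self._iter = iter(batch_iter)
        self._closed = False

    def schema(self):
        # The schema travels with the stream, per the schema-carrying framing.
        return self._schema

    def read_next(self):
        # Strictly sequential access: no reentrancy concerns arise.
        if self._closed:
            raise RuntimeError("source is closed")
        try:
            return next(self._iter)
        except StopIteration:
            return None  # end of stream

    def close(self):
        self._closed = True
```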
Re: [question] Arrow C GLib conda package
got it now! thanks for the explanation

sure thing, I am working on that today. thanks!!

On Fri, Jun 24, 2022 at 10:51 AM Sutou Kouhei wrote:
> Hi,
>
> Sorry, no. I wanted to say we need a new repository for
> Apache Arrow C GLib. If we mix Apache Arrow C++ and Apache
> Arrow C GLib into
> https://github.com/conda-forge/arrow-cpp-feedstock, all
> arrow-cpp users need to install Apache Arrow C GLib too. So
> we need a new repository for Apache Arrow C GLib.
>
> https://conda-forge.org/docs/maintainer/adding_pkgs.html#creating-recipes
> will help you.
>
> Thanks,
> --
> kou
Re: [question] Arrow C GLib conda package
Hi,

Sorry, no. I wanted to say we need a new repository for Apache Arrow C GLib. If we mix Apache Arrow C++ and Apache Arrow C GLib into https://github.com/conda-forge/arrow-cpp-feedstock, all arrow-cpp users need to install Apache Arrow C GLib too. So we need a new repository for Apache Arrow C GLib.

https://conda-forge.org/docs/maintainer/adding_pkgs.html#creating-recipes will help you.

Thanks,
--
kou

In "Re: [question] Arrow C GLib conda package" on Fri, 24 Jun 2022 10:36:50 -0400, Ivan Ogasawara wrote:

> Hi Sutou, thanks for your answer.
>
>> something like https://github.com/conda-forge/arrow-cpp-feedstock for it.
>
> So you mean to create a new output on arrow-cpp-feedstock for Apache Arrow
> C GLib?
>
> I like the idea, it could be easy to maintain everything in just one place.
>
> If it is ok, I can start to work on that today.
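[Editorial note: following the adding_pkgs guide linked above, a new feedstock starts from a `meta.yaml` recipe submitted to conda-forge/staged-recipes. The fragment below is only a hypothetical sketch of the rough shape such a recipe could take for Apache Arrow C GLib — the version, URL, checksum placement, and dependency pins are all placeholders, not a working recipe.]

```yaml
# meta.yaml — hypothetical sketch for an arrow-c-glib-feedstock recipe.
{% set version = "8.0.0" %}  # placeholder

package:
  name: arrow-c-glib
  version: {{ version }}

source:
  url: https://downloads.apache.org/arrow/arrow-{{ version }}/apache-arrow-{{ version }}.tar.gz
  # sha256 of the release tarball goes here

requirements:
  build:
    - {{ compiler('c') }}
    - meson
    - ninja
    - pkg-config
  host:
    - arrow-cpp {{ version }}   # depends on, but is packaged separately from, arrow-cpp
    - glib
  run:
    - glib

about:
  home: https://arrow.apache.org
  license: Apache-2.0
```

Keeping this in its own feedstock matches kou's point above: arrow-cpp users should not be forced to pull in the GLib bindings.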
Re: [question] Arrow C GLib conda package
Hi Sutou, thanks for your answer.

> something like https://github.com/conda-forge/arrow-cpp-feedstock for it.

So you mean to create a new output on arrow-cpp-feedstock for Apache Arrow C GLib?

I like the idea, it could be easy to maintain everything in just one place.

If it is ok, I can start to work on that today.

On Thu, Jun 23, 2022 at 7:58 PM Sutou Kouhei wrote:
> Hi,
>
> No. The Apache Arrow C GLib package doesn't exist on conda-forge.
>
> I think that we need to create
> https://github.com/conda-forge/arrow-c-glib-feedstock or
> something like
> https://github.com/conda-forge/arrow-cpp-feedstock for it.
>
> Uwe will help you.
>
> Thanks,
> --
> kou
>
> In "[question] Arrow C GLib conda package" on Thu, 23 Jun 2022 18:24:48 -0400,
> Ivan Ogasawara wrote:
>
> > Hi everyone,
> >
> > Is the Arrow C GLib (https://github.com/apache/arrow/tree/master/c_glib)
> > available on conda-forge?
> >
> > If not, I can contribute to add this package to conda-forge. I would just
> > need some guidance about where it would be the better place to have it.
> >
> > Thanks for the attention,
> > Ivan
[Datafusion] Streaming - integration with kafka - kafka_writer
Hi,

I am trying to integrate DataFusion with Kafka; the final goal is to have end-to-end streaming. But I started from a "different side": step 1 is to publish output to Kafka, so I copied code and created a Kafka publisher:

https://github.com/yarenty/arrow-datafusion/tree/master/datafusion/core/src/physical_plan/kafka

The test case is here:

https://github.com/yarenty/arrow-datafusion/blob/master/datafusion/core/tests/ordered_sql_to_kafka.rs

It all ends up looking something like this:

```rust
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())
        .await?;

    let df = ctx
        .sql("SELECT a, MIN(b) as bmin FROM example GROUP BY a ORDER BY a LIMIT 100")
        .await?;

    // kafka context
    let stream_ctx = KafkaContext::with_config(
        KafkaConfig::new("test_topic")
            .set("bootstrap.servers", "127.0.0.1:9092")
            .set("compression.codec", "snappy"),
    );

    df.publish_to_kafka(stream_ctx).await?;
    Ok(())
}
```

Still not sure if this is the correct way to do it, or whether I put the code in the proper places... still, learning something new every day. Is there any other place where I can share code / check ideas?

Jaro
yare...@gmail.com
Re: user-defined Python-based data-sources in Arrow
I'm using schema-carrying tabular data as a more general term than a RecordBatch stream. For example, an ExecBatch stream or a vector of equal-length Arrays plus associated schema is also good. The Python data-source node would be responsible for converting from the Python data-source function to invocations of InputReceived on the node's output - the SourceNode and TableSourceNode classes in source_node.cc are examples of this structure. Over time, the Python data-source node could extend its support to various Python data-source interfaces.

> A very similar interface is the RecordBatchReader

The PyArrow version of this (https://kou.github.io/arrow-site/docs/python/generated/pyarrow.RecordBatchReader.html) is one Python data-source interface that makes sense to support. Since the overall goal of my project is to support existing Python-based data-sources that were not necessarily designed for PyArrow, some adapters would surely be needed too.

> I think a source node that wraps a RecordBatchReader would be a great idea,
> we probably have something similar to this

Indeed, TableSourceNode in source_node.cc seems to be similar enough - it converts a Table to an ExecBatch generator, though it consumes eagerly. A RecordBatchReader would need to be consumed lazily.

> Would a python adapter to RecordBatchReader be sufficient? Or is something
> different?

As noted above, this is one interface that makes sense to support.

> How is cancellation handled? For example, the user gives up and cancels
> the query early. (Acero doesn't handle cancellation well at the moment but
> I'm hoping to work on that and cancellable sources is an important piece)?

This question is more about the implementation of (an adapter for) a Python-based data source than about its integration, which is the focus of the design I'm proposing. Taking a stab at it anyway: I think one solution involves using a sorting-queue (by timestamp or index, where insertions cover disjoint intervals) to coordinate parallel producing threads and the (source node's) consumer thread. Accordingly, a batch is allowed to be fetched only when it is current (i.e., when it is clear no batch coming before it could be inserted later), and cancellation is done by disabling the queue (e.g., making it return a suitable error upon insertion).

> Can the function be called reentrantly? In other words, can we call the
> function before the previous call finishes if we want to read the source in
> parallel?

IIUC, you have in mind a ReadNext kind of function. The design does not specify that the Python-based data-source must have this interface. If the Python-based data-source is only sequentially accessible, then it makes sense to write a RecordBatchReader (or an ExecBatch reader) adapter for it that has this kind of ReadNext function, and then the reentrancy problem is moot since no parallel access occurs. OTOH, if the Python-based data-source can be accessed in parallel, the above sorting-queue solution is better suited and would avoid the reentrancy problem of a ReadNext function.

Yaron.

________________________________
From: Weston Pace
Sent: Thursday, June 23, 2022 8:21 PM
To: dev@arrow.apache.org
Subject: Re: user-defined Python-based data-sources in Arrow

This seems reasonable to me. A very similar interface is the RecordBatchReader[1] which is roughly (glossing over details)...

```
class RecordBatchReader {
  virtual std::shared_ptr<Schema> schema() const = 0;
  virtual Result<std::shared_ptr<RecordBatch>> Next() = 0;
  virtual Status Close() = 0;
};
```

This seems pretty close to what you are describing. I think a source node that wraps a RecordBatchReader would be a great idea, we probably have something similar to this. So my questions would be:

* Would a python adapter to RecordBatchReader be sufficient? Or is something different?
* How is cancellation handled? For example, the user gives up and cancels the query early. (Acero doesn't handle cancellation well at the moment but I'm hoping to work on that and cancellable sources is an important piece)?
* Can the function be called reentrantly? In other words, can we call the function before the previous call finishes if we want to read the source in parallel?

[1] https://github.com/apache/arrow/blob/86915807af6fe10f44bc881e57b2f425f97c56c7/cpp/src/arrow/record_batch.h#L219

On Wed, Jun 22, 2022 at 9:47 AM Yaron Gvili wrote:
>
> Sure, it can be found at
> https://lists.apache.org/thread/o2nc7jnmfpt8lhcnjths1gnzvy86yfxo . Compared
> to this thread, the design proposed here is more mature, now that I have a
> reasonable version of the Ibis and Ibis-Substrait parts implemented locally
> (if it helps this discussion, I could provide some details about this
> implementation). I no longer propose registering the data-source function nor
> using arrow::compute::Function for it, since it would be directly added to a
> source execution node, be it manually or via deserialization of a Sub
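[Editorial note: the sorting-queue proposed in this thread — producers insert batches out of order, the consumer is released only when the next-in-order batch is current, and cancellation disables the queue — can be sketched in plain Python. This is an illustration of the scheme described in the email, not Arrow/Acero code; it also simplifies the "disjoint intervals" of insertions down to single integer indices.]

```python
import heapq
import threading

class CancelledError(RuntimeError):
    pass

class SortingQueue:
    """Release batches strictly in index order; producers may insert out of
    order. cancel() disables the queue: later insert()/pop() calls raise."""

    def __init__(self):
        self._heap = []          # min-heap of (index, batch)
        self._next = 0           # next index the consumer may fetch
        self._cancelled = False
        self._cv = threading.Condition()

    def insert(self, index, batch):
        with self._cv:
            if self._cancelled:
                # Cancellation is signalled to producers as a suitable error.
                raise CancelledError("queue cancelled")
            heapq.heappush(self._heap, (index, batch))
            self._cv.notify_all()

    def pop(self):
        # Block until the batch with the next expected index is available,
        # i.e. until it is "current" in the sense used above.
        with self._cv:
            while not self._cancelled and (
                not self._heap or self._heap[0][0] != self._next
            ):
                self._cv.wait()
            if self._cancelled:
                raise CancelledError("queue cancelled")
            self._next += 1
            return heapq.heappop(self._heap)[1]

    def cancel(self):
        with self._cv:
            self._cancelled = True
            self._cv.notify_all()
```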