Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1

2022-06-24 Thread QP Hou
+1 (binding)

On Fri, Jun 24, 2022 at 7:36 PM Remzi Yang <1371656737...@gmail.com> wrote:
>
> +1 (non-binding). Verified on Mac M1.
> Thanks Andrew.
>
> Remzi
>
> On Sat, 25 Jun 2022 at 09:33, Chao Sun  wrote:
>
> > +1 (non-binding). Verified on Intel Mac.
> >
> > Thanks Andrew.
> >
> > On Fri, Jun 24, 2022 at 5:17 PM L. C. Hsieh  wrote:
> > >
> > > +1 (non-binding)
> > >
> > > Verified on Intel Mac.
> > >
> > > Thank you, Andrew.
> > >
> > > On Fri, Jun 24, 2022 at 5:00 PM Andy Grove 
> > wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > Verified on Ubuntu 20.04.4 LTS.
> > > >
> > > > Thanks, Andrew.
> > > >
> > > > On Fri, Jun 24, 2022 at 2:45 PM Andrew Lamb 
> > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I would like to propose a release of Apache Arrow Rust
> > Implementation,
> > > > > version 17.0.0.
> > > > >
> > > > > This release candidate is based on commit:
> > > > > 9f7b6004d365b0c0bac8e30170b49bdd66cc7df0 [1]
> > > > >
> > > > > The proposed release tarball and signatures are hosted at [2].
> > > > >
> > > > > The changelog is located at [3].
> > > > >
> > > > > Please download, verify checksums and signatures, run the unit tests,
> > > > > and vote on the release. There is a script [4] that automates some of
> > > > > the verification.
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 Release this as Apache Arrow Rust
> > > > > [ ] +0
> > > > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > > > >
> > > > > [1]:
> > > > >
> > > > >
> > https://github.com/apache/arrow-rs/tree/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0
> > > > > [2]:
> > > > >
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-17.0.0-rc1
> > > > > [3]:
> > > > >
> > > > >
> > https://github.com/apache/arrow-rs/blob/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0/CHANGELOG.md
> > > > > [4]:
> > > > >
> > > > >
> > https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > > > > -
> > > > >
> >


Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1

2022-06-24 Thread Remzi Yang
+1 (non-binding). Verified on Mac M1.
Thanks Andrew.

Remzi

On Sat, 25 Jun 2022 at 09:33, Chao Sun  wrote:

> +1 (non-binding). Verified on Intel Mac.
>
> Thanks Andrew.
>
> On Fri, Jun 24, 2022 at 5:17 PM L. C. Hsieh  wrote:
> >
> > +1 (non-binding)
> >
> > Verified on Intel Mac.
> >
> > Thank you, Andrew.
> >
> > On Fri, Jun 24, 2022 at 5:00 PM Andy Grove 
> wrote:
> > >
> > > +1 (binding)
> > >
> > > Verified on Ubuntu 20.04.4 LTS.
> > >
> > > Thanks, Andrew.
> > >
> > > On Fri, Jun 24, 2022 at 2:45 PM Andrew Lamb 
> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to propose a release of Apache Arrow Rust
> Implementation,
> > > > version 17.0.0.
> > > >
> > > > This release candidate is based on commit:
> > > > 9f7b6004d365b0c0bac8e30170b49bdd66cc7df0 [1]
> > > >
> > > > The proposed release tarball and signatures are hosted at [2].
> > > >
> > > > The changelog is located at [3].
> > > >
> > > > Please download, verify checksums and signatures, run the unit tests,
> > > > and vote on the release. There is a script [4] that automates some of
> > > > the verification.
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Release this as Apache Arrow Rust
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > > >
> > > > [1]:
> > > >
> > > >
> https://github.com/apache/arrow-rs/tree/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0
> > > > [2]:
> > > >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-17.0.0-rc1
> > > > [3]:
> > > >
> > > >
> https://github.com/apache/arrow-rs/blob/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0/CHANGELOG.md
> > > > [4]:
> > > >
> > > >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > > > -
> > > >
>


Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1

2022-06-24 Thread Chao Sun
+1 (non-binding). Verified on Intel Mac.

Thanks Andrew.

On Fri, Jun 24, 2022 at 5:17 PM L. C. Hsieh  wrote:
>
> +1 (non-binding)
>
> Verified on Intel Mac.
>
> Thank you, Andrew.
>
> On Fri, Jun 24, 2022 at 5:00 PM Andy Grove  wrote:
> >
> > +1 (binding)
> >
> > Verified on Ubuntu 20.04.4 LTS.
> >
> > Thanks, Andrew.
> >
> > On Fri, Jun 24, 2022 at 2:45 PM Andrew Lamb  wrote:
> >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > version 17.0.0.
> > >
> > > This release candidate is based on commit:
> > > 9f7b6004d365b0c0bac8e30170b49bdd66cc7df0 [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> > > https://github.com/apache/arrow-rs/tree/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0
> > > [2]:
> > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-17.0.0-rc1
> > > [3]:
> > >
> > > https://github.com/apache/arrow-rs/blob/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0/CHANGELOG.md
> > > [4]:
> > >
> > > https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > > -
> > >


Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1

2022-06-24 Thread L. C. Hsieh
+1 (non-binding)

Verified on Intel Mac.

Thank you, Andrew.

On Fri, Jun 24, 2022 at 5:00 PM Andy Grove  wrote:
>
> +1 (binding)
>
> Verified on Ubuntu 20.04.4 LTS.
>
> Thanks, Andrew.
>
> On Fri, Jun 24, 2022 at 2:45 PM Andrew Lamb  wrote:
>
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Implementation,
> > version 17.0.0.
> >
> > This release candidate is based on commit:
> > 9f7b6004d365b0c0bac8e30170b49bdd66cc7df0 [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >
> > [1]:
> >
> > https://github.com/apache/arrow-rs/tree/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0
> > [2]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-17.0.0-rc1
> > [3]:
> >
> > https://github.com/apache/arrow-rs/blob/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0/CHANGELOG.md
> > [4]:
> >
> > https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > -
> >


Re: [VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1

2022-06-24 Thread Andy Grove
+1 (binding)

Verified on Ubuntu 20.04.4 LTS.

Thanks, Andrew.

On Fri, Jun 24, 2022 at 2:45 PM Andrew Lamb  wrote:

> Hi,
>
> I would like to propose a release of Apache Arrow Rust Implementation,
> version 17.0.0.
>
> This release candidate is based on commit:
> 9f7b6004d365b0c0bac8e30170b49bdd66cc7df0 [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. There is a script [4] that automates some of
> the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust  because...
>
> [1]:
>
> https://github.com/apache/arrow-rs/tree/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-17.0.0-rc1
> [3]:
>
> https://github.com/apache/arrow-rs/blob/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0/CHANGELOG.md
> [4]:
>
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> -
>


[VOTE][RUST] Release Apache Arrow Rust 17.0.0 RC1

2022-06-24 Thread Andrew Lamb
Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 17.0.0.

This release candidate is based on commit:
9f7b6004d365b0c0bac8e30170b49bdd66cc7df0 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:
https://github.com/apache/arrow-rs/tree/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0
[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-17.0.0-rc1
[3]:
https://github.com/apache/arrow-rs/blob/9f7b6004d365b0c0bac8e30170b49bdd66cc7df0/CHANGELOG.md
[4]:
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
-


Re: vectorized processing for arrow::take()

2022-06-24 Thread Chak-Pong Chung
Hi Aldrin,

Use Case: I am taking a subset of a really large input_array. It is used in
places where OpenMP-like and MPI-like parallelism are used. So
vectorization seems to be the next low-hanging fruit.

I have added this to the original stackoverflow post.



On Thu, Jun 23, 2022 at 5:53 PM Aldrin  wrote:

> Without knowing implementation details of the Take function, I know that
> Arrow uses xsimd and does try to enable the compiler to vectorize code
> where it can. To clarify, are you asking how to improve the performance
> you're seeing, or are you asking how to check if the compiled code is using
> vector instructions? I think a little bit more context about what you know
> and what you're trying to do could also help others who know more about
> this function (and vectorization in Arrow in general) to chime in.
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Thu, Jun 23, 2022 at 12:41 PM Chak-Pong Chung 
> wrote:
>
> > correction: not clang, I meant the Vectorizers from LLVM
> >
> > https://llvm.org/docs/Vectorizers.html
> >
> > if we can use it with arrow array
> >
> > On Thu, Jun 23, 2022 at 3:35 PM Chak-Pong Chung  >
> > wrote:
> >
> > > I asked a question here about vectorized processing.
> > >
> > >
> > >
> >
> https://stackoverflow.com/questions/72735678/how-to-vectorize-arrowcomputetake
> > >
> > > Any idea? I am also open to the approaches like Intel MKL, xsimd, clang
> > > and so on.
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Chak-Pong
> > >
> >
> >
> > --
> > Regards,
> > Chak-Pong
> >
>


-- 
Regards,
Chak-Pong


Re: user-defined Python-based data-sources in Arrow

2022-06-24 Thread Weston Pace
+1 for me.  What you are describing is a good idea and having
different ways to provide sources, especially through Substrait and
Python, is something we can very much use right away.  I think we are
probably pretty close to having the components you describe.  I'm not
quite sure I follow all of the details but those things can be worked
out in PRs.

On Fri, Jun 24, 2022 at 12:51 AM Yaron Gvili  wrote:
>
> I'm using schema-carrying tabular data as a more general term than a 
> RecordBatch stream. For example, an ExecBatch stream or a vector of 
> equal-length Arrays plus associated schema is also good. The Python 
> data-source node would be responsible for converting from the Python 
> data-source function to invocations of InputReceived on the node's output - 
> the SourceNode and TableSourceNode classes in source_node.cc are examples of 
> this structure. Over time, the Python data-source node could extend its 
> support to various Python data-source interfaces.
>
> > A very similar interface is the RecordBatchReader
>
> The PyArrow version of this 
> (https://kou.github.io/arrow-site/docs/python/generated/pyarrow.RecordBatchReader.html)
>  is one Python data-source interface that makes sense to support. Since the 
> overall goal of my project is to support existing Python-based data-sources 
> that were not necessarily designed for PyArrow, some adapters would surely be 
> needed too.
>
> > I think a source node that wraps a RecordBatchReader would be a great idea, 
> > we probably have something similar to this
>
> Indeed, TableSourceNode in source_node.cc seems to be similar enough - it 
> converts a Table to an ExecBatch generator, though it consumes eagerly. A 
> RecordBatchReader would need to be consumed lazily.
>
> > Would a python adapter to RecordBatchReader be sufficient?  Or is something 
> > different?
>
> As noted above, this is one interface that makes sense to support.
>
> > How is cancellation handled?  For example, the user gives up and cancels 
> > the query early. (Acero doesn't handle cancellation well at the moment but 
> > I'm hoping to work on that and cancellable sources is an important piece)?
>
> This question is more about the implementation of (an adapter for) a 
> Python-based data source than about its integration, which is the focus of 
> the design I'm proposing. Taking a stab at this question, I think one 
> solution involves using a sorting-queue (by timestamp or index, where 
> insertions cover disjoint intervals) to coordinate parallel producing threads 
> and the (source node's) consumer thread. Accordingly, a batch is allowed to 
> be fetched only when it is current (i.e., when it is clear no batch coming 
> before it could be inserted later) and cancellation is done by disabling the 
> queue (e.g., making it return a suitable error upon insertion).
>
> > Can the function be called reentrantly?  In other words, can we call the 
> > function before the previous call finishes if we want to read the source in 
> > parallel?
>
> IIUC, you have in mind a ReadNext kind of function. The design does not 
> specify that the Python-based data-source must have this interface. If the 
> Python-based data-source is only sequentially accessible, then it makes sense 
> to write a RecordBatchReader (or an ExecBatch reader) adapter for it that has 
> this kind of ReadNext function, and then the reentrancy problem is mot since 
> no parallel-access occurs. OTOH, if the Python-based data-source can be 
> accessed in parallel, the above sorting-queue solution is better suited and 
> would avoid the reentrancy problem of a ReadNext function.
>
>
> Yaron.
> 
> From: Weston Pace 
> Sent: Thursday, June 23, 2022 8:21 PM
> To: dev@arrow.apache.org 
> Subject: Re: user-defined Python-based data-sources in Arrow
>
> This seems reasonable to me.  A very similar interface is the
> RecordBatchReader[1] which is roughly (glossing over details)...
>
> ```
> class RecordBatchReader {
>   virtual std::shared_ptr schema() const = 0;
>   virtual Result> Next() = 0;
>   virtual Status Close() = 0;
> };
> ```
>
> This seems pretty close to what you are describing.  I think a source
> node that wraps a RecordBatchReader would be a great idea, we probably
> have something similar to this.  So my questions would be:
>
>  * Would a python adapter to RecordBatchReader be sufficient?  Or is
> something different?
>  * How is cancellation handled?  For example, the user gives up and
> cancels the query early. (Acero doesn't handle cancellation well at
> the moment but I'm hoping to work on that and cancellable sources is
> an important piece)?
>  * Can the function be called reentrantly?  In other words, can we
> call the function before the previous call finishes if we want to read
> the source in parallel?
>
> [1] 
> https://github.com/apache/arrow/blob/86915807af6fe10f44bc881e57b2f425f97c56c7/cpp/src/arrow/record_batch.h#L219
>
> On Wed, Jun 22, 2022 at 9:47 AM Yaron Gvili  

Re: [question] Arrow C GLib conda package

2022-06-24 Thread Ivan Ogasawara
got it now! thanks for the explanation

sure thing, I am working on that today. thanks!!

On Fri, Jun 24, 2022 at 10:51 AM Sutou Kouhei  wrote:

> Hi,
>
> Sorry, no. I wanted to say we need a new repository for
> Apache Arrow C GLib. If we mix Apache Arrow C++ and Apache
> Arrow C GLib into
> https://github.com/conda-forge/arrow-cpp-feedstock, all
> arrow-cpp users need to install Apache Arrow C GLib too. So
> we need a new repository for Apache Arrow C GLib.
>
> https://conda-forge.org/docs/maintainer/adding_pkgs.html#creating-recipes
> will help you.
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [question] Arrow C GLib conda package" on Fri, 24 Jun 2022 10:36:50
> -0400,
>   Ivan Ogasawara  wrote:
>
> > Hi Sutou, thanks for your answer.
> >
> >> something like https://github.com/conda-forge/arrow-cpp-feedstock for
> it.
> >
> > So you mean to create a new output on arrow-cpp-feedstock for Apache
> Arrow
> > C GLib?
> >
> > I like the idea, it could be easy to maintain everything in just one
> place.
> >
> > If it is ok, I can start to work on that today.
> >
> >
> > On Thu, Jun 23, 2022 at 7:58 PM Sutou Kouhei  wrote:
> >
> >> Hi,
> >>
> >> No. The Apache Arrow C GLib package doesn't exist on
> >> conda-forge.
> >>
> >> I think that we need to create
> >> https://github.com/conda-forge/arrow-c-glib-feedstock or
> >> something like
> >> https://github.com/conda-forge/arrow-cpp-feedstock for it.
> >>
> >> Uwe will help you.
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In 
> >>   "[question] Arrow C GLib conda package" on Thu, 23 Jun 2022 18:24:48
> >> -0400,
> >>   Ivan Ogasawara  wrote:
> >>
> >> > Hi everyone,
> >> >
> >> > Is the Arrow C GLib (
> https://github.com/apache/arrow/tree/master/c_glib)
> >> > available on conda-forge?
> >> >
> >> > If not, I can contribute to add this package to conda-forge. I would
> just
> >> > need some guidance about where it would be the better place to have
> it.
> >> >
> >> > Thanks for the attention,
> >> > Ivan
> >>
>


Re: [question] Arrow C GLib conda package

2022-06-24 Thread Sutou Kouhei
Hi,

Sorry, no. I wanted to say we need a new repository for
Apache Arrow C GLib. If we mix Apache Arrow C++ and Apache
Arrow C GLib into
https://github.com/conda-forge/arrow-cpp-feedstock, all
arrow-cpp users need to install Apache Arrow C GLib too. So
we need a new repository for Apache Arrow C GLib.

https://conda-forge.org/docs/maintainer/adding_pkgs.html#creating-recipes
will help you.

Thanks,
-- 
kou

In 
  "Re: [question] Arrow C GLib conda package" on Fri, 24 Jun 2022 10:36:50 
-0400,
  Ivan Ogasawara  wrote:

> Hi Sutou, thanks for your answer.
> 
>> something like https://github.com/conda-forge/arrow-cpp-feedstock for it.
> 
> So you mean to create a new output on arrow-cpp-feedstock for Apache Arrow
> C GLib?
> 
> I like the idea, it could be easy to maintain everything in just one place.
> 
> If it is ok, I can start to work on that today.
> 
> 
> On Thu, Jun 23, 2022 at 7:58 PM Sutou Kouhei  wrote:
> 
>> Hi,
>>
>> No. The Apache Arrow C GLib package doesn't exist on
>> conda-forge.
>>
>> I think that we need to create
>> https://github.com/conda-forge/arrow-c-glib-feedstock or
>> something like
>> https://github.com/conda-forge/arrow-cpp-feedstock for it.
>>
>> Uwe will help you.
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In 
>>   "[question] Arrow C GLib conda package" on Thu, 23 Jun 2022 18:24:48
>> -0400,
>>   Ivan Ogasawara  wrote:
>>
>> > Hi everyone,
>> >
>> > Is the Arrow C GLib (https://github.com/apache/arrow/tree/master/c_glib)
>> > available on conda-forge?
>> >
>> > If not, I can contribute to add this package to conda-forge. I would just
>> > need some guidance about where it would be the better place to have it.
>> >
>> > Thanks for the attention,
>> > Ivan
>>


Re: [question] Arrow C GLib conda package

2022-06-24 Thread Ivan Ogasawara
Hi Sutou, thanks for your answer.

> something like https://github.com/conda-forge/arrow-cpp-feedstock for it.

So you mean to create a new output on arrow-cpp-feedstock for Apache Arrow
C GLib?

I like the idea, it could be easy to maintain everything in just one place.

If it is ok, I can start to work on that today.


On Thu, Jun 23, 2022 at 7:58 PM Sutou Kouhei  wrote:

> Hi,
>
> No. The Apache Arrow C GLib package doesn't exist on
> conda-forge.
>
> I think that we need to create
> https://github.com/conda-forge/arrow-c-glib-feedstock or
> something like
> https://github.com/conda-forge/arrow-cpp-feedstock for it.
>
> Uwe will help you.
>
>
> Thanks,
> --
> kou
>
> In 
>   "[question] Arrow C GLib conda package" on Thu, 23 Jun 2022 18:24:48
> -0400,
>   Ivan Ogasawara  wrote:
>
> > Hi everyone,
> >
> > Is the Arrow C GLib (https://github.com/apache/arrow/tree/master/c_glib)
> > available on conda-forge?
> >
> > If not, I can contribute to add this package to conda-forge. I would just
> > need some guidance about where it would be the better place to have it.
> >
> > Thanks for the attention,
> > Ivan
>


[Datafusion] Streaming - integration with kafka - kafka_writer

2022-06-24 Thread Jaroslaw Nowosad
Hi,
I am just trying to integrate datafusion with kafka,  final goal is to have
end-to-end streaming. But I started from a "different side" -> step 1 is to
publish output to kafka, so I copied code/ created kafka publisher:
https://github.com/yarenty/arrow-datafusion/tree/master/datafusion/core/src/physical_plan/kafka

Test case is here:
https://github.com/yarenty/arrow-datafusion/blob/master/datafusion/core/tests/ordered_sql_to_kafka.rs


All finished with something like this:
```rust

#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
ctx.register_csv("example", "tests/example.csv",
CsvReadOptions::new()).await?;

let df = ctx
.sql("SELECT a, MIN(b) as bmin FROM example GROUP BY a ORDER BY a
LIMIT 100")
.await?;

// kafka context
let stream_ctx = KafkaContext::with_config(
KafkaConfig::new("test_topic")
.set("bootstrap.servers", "127.0.0.1:9092")
.set("compression.codec", "snappy"),
);

df.publish_to_kafka( stream_ctx).await?;

Ok(())
}
```

Still not sure if this is the correct way to do it and if I put code in the
proper places ... still: learning something new every day.

Is there any other place where you can share code / check ideas?

Jaro
yare...@gmail.com


Re: user-defined Python-based data-sources in Arrow

2022-06-24 Thread Yaron Gvili
I'm using schema-carrying tabular data as a more general term than a 
RecordBatch stream. For example, an ExecBatch stream or a vector of 
equal-length Arrays plus associated schema is also good. The Python data-source 
node would be responsible for converting from the Python data-source function 
to invocations of InputReceived on the node's output - the SourceNode and 
TableSourceNode classes in source_node.cc are examples of this structure. Over 
time, the Python data-source node could extend its support to various Python 
data-source interfaces.

> A very similar interface is the RecordBatchReader

The PyArrow version of this 
(https://kou.github.io/arrow-site/docs/python/generated/pyarrow.RecordBatchReader.html)
 is one Python data-source interface that makes sense to support. Since the 
overall goal of my project is to support existing Python-based data-sources 
that were not necessarily designed for PyArrow, some adapters would surely be 
needed too.

> I think a source node that wraps a RecordBatchReader would be a great idea, 
> we probably have something similar to this

Indeed, TableSourceNode in source_node.cc seems to be similar enough - it 
converts a Table to an ExecBatch generator, though it consumes eagerly. A 
RecordBatchReader would need to be consumed lazily.

> Would a python adapter to RecordBatchReader be sufficient?  Or is something 
> different?

As noted above, this is one interface that makes sense to support.

> How is cancellation handled?  For example, the user gives up and cancels the 
> query early. (Acero doesn't handle cancellation well at the moment but I'm 
> hoping to work on that and cancellable sources is an important piece)?

This question is more about the implementation of (an adapter for) a 
Python-based data source than about its integration, which is the focus of the 
design I'm proposing. Taking a stab at this question, I think one solution 
involves using a sorting-queue (by timestamp or index, where insertions cover 
disjoint intervals) to coordinate parallel producing threads and the (source 
node's) consumer thread. Accordingly, a batch is allowed to be fetched only 
when it is current (i.e., when it is clear no batch coming before it could be 
inserted later) and cancellation is done by disabling the queue (e.g., making 
it return a suitable error upon insertion).

> Can the function be called reentrantly?  In other words, can we call the 
> function before the previous call finishes if we want to read the source in 
> parallel?

IIUC, you have in mind a ReadNext kind of function. The design does not specify 
that the Python-based data-source must have this interface. If the Python-based 
data-source is only sequentially accessible, then it makes sense to write a 
RecordBatchReader (or an ExecBatch reader) adapter for it that has this kind of 
ReadNext function, and then the reentrancy problem is mot since no 
parallel-access occurs. OTOH, if the Python-based data-source can be accessed 
in parallel, the above sorting-queue solution is better suited and would avoid 
the reentrancy problem of a ReadNext function.


Yaron.

From: Weston Pace 
Sent: Thursday, June 23, 2022 8:21 PM
To: dev@arrow.apache.org 
Subject: Re: user-defined Python-based data-sources in Arrow

This seems reasonable to me.  A very similar interface is the
RecordBatchReader[1] which is roughly (glossing over details)...

```
class RecordBatchReader {
  virtual std::shared_ptr schema() const = 0;
  virtual Result> Next() = 0;
  virtual Status Close() = 0;
};
```

This seems pretty close to what you are describing.  I think a source
node that wraps a RecordBatchReader would be a great idea, we probably
have something similar to this.  So my questions would be:

 * Would a python adapter to RecordBatchReader be sufficient?  Or is
something different?
 * How is cancellation handled?  For example, the user gives up and
cancels the query early. (Acero doesn't handle cancellation well at
the moment but I'm hoping to work on that and cancellable sources is
an important piece)?
 * Can the function be called reentrantly?  In other words, can we
call the function before the previous call finishes if we want to read
the source in parallel?

[1] 
https://github.com/apache/arrow/blob/86915807af6fe10f44bc881e57b2f425f97c56c7/cpp/src/arrow/record_batch.h#L219

On Wed, Jun 22, 2022 at 9:47 AM Yaron Gvili  wrote:
>
> Sure, it can be found at 
> https://lists.apache.org/thread/o2nc7jnmfpt8lhcnjths1gnzvy86yfxo . Compared 
> to this thread, the design proposed here is more mature, now that I have a 
> reasonable version of the Ibis and Ibis-Substrait parts implemented locally 
> (if it helps this discussion, I could provide some details about this 
> implementation). I no longer propose registering the data-source function nor 
> using arrow::compute::Function for it, since it would be directly added to a 
> source execution node, be it manually or via deserialization of a