[DataFusion] Projection pushdown and pushed down filters

2023-04-11 Thread Markus Appel
Hello,

I hope this is the right place to ask this.

While working on a project based on arrow-datafusion, I came across some weird 
behavior where a projection did not get eliminated as expected, thus breaking a 
custom optimizer rule's assumption (into which I won't go further, as it's not 
important for this question).

Specifically, I found an execution plan like:

Projection [col1, col2]
 TableScan projection=[col1, col2, col3], full_filters=[ ... col3 ...]

to not be simplified to:

TableScan projection=[col1, col2], full_filters=[... col3 ...]

This does seem to be intentional, as I have found this snippet in the
optimizer rule:

...
LogicalPlan::TableScan(scan)
    if !scan.projected_schema.fields().is_empty() =>
{
    let mut used_columns: HashSet<Column> = HashSet::new();
    // filter expr may not exist in expr in projection.
    // like: TableScan: t1 projection=[bool_col, int_col], full_filters=[t1.id = Int32(1)]
    // projection=[bool_col, int_col] don't contain `t1.id`.
    exprlist_to_columns(&scan.filters, &mut used_columns)?;
...

However, the comment does not explain why we need to keep the extra projection 
and return the extra column - after all, the filters inside of the scan are 
internal to that scan, and should not affect the execution plan, right?
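
A toy model of the observed behavior may make the question concrete. This is only a sketch under assumptions: the function names `required_scan_columns` and `needs_outer_projection` are hypothetical and are not DataFusion APIs. It mirrors what the snippet above appears to do, namely adding the columns referenced by the pushed-down filters to the set of columns the scan must output, which is why an outer projection is then needed to drop them again.

```python
# Toy model only: hypothetical names, not DataFusion's actual API.

def required_scan_columns(projected, filter_columns):
    """Columns the scan outputs when filter columns are treated as 'used'."""
    required = list(projected)
    for col in filter_columns:
        if col not in required:
            required.append(col)
    return required


def needs_outer_projection(projected, filter_columns):
    """True if the scan outputs more columns than the query asked for."""
    return required_scan_columns(projected, filter_columns) != list(projected)


# The case from this message: the query wants col1 and col2,
# while the pushed-down filter references col3.
scan_columns = required_scan_columns(["col1", "col2"], ["col3"])
extra_projection = needs_outer_projection(["col1", "col2"], ["col3"])
print(scan_columns, extra_projection)
```

If the scan evaluated its filters purely internally, the second argument would not have to influence the output columns at all, which is the simplification asked about here.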

I am looking forward to any opinions.

Best regards, Markus Appel




Re: [ANNOUNCE] New Arrow committer: Ruihang Xia

2023-04-11 Thread Matt Topol
Congrats!! Welcome!

On Tue, Apr 11, 2023, 11:29 PM Jacob Wujciak 
wrote:

> Congratulations and welcome!
>
> On Mon, Apr 10, 2023 at 8:13 AM Wang Xudong 
> wrote:
>
> > Congratulations!
> >
> > On Mon, Apr 10, 2023 at 13:37, Yang Jiang wrote:
> >
> > >
> > > Congratulations !!!
> > >
> > > On 2023/04/09 11:25:19 Andrew Lamb wrote:
> > > > On behalf of the Arrow PMC, I'm happy to announce that Ruihang Xia
> > > > has accepted an invitation to become a committer on Apache
> > > > Arrow. Welcome, and thank you for your contributions!
> > > >
> > > > Andrew
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Ruihang Xia

2023-04-11 Thread Jacob Wujciak
Congratulations and welcome!

On Mon, Apr 10, 2023 at 8:13 AM Wang Xudong  wrote:

> Congratulations!
>
> On Mon, Apr 10, 2023 at 13:37, Yang Jiang wrote:
>
> >
> > Congratulations !!!
> >
> > On 2023/04/09 11:25:19 Andrew Lamb wrote:
> > > On behalf of the Arrow PMC, I'm happy to announce that Ruihang Xia
> > > has accepted an invitation to become a committer on Apache
> > > Arrow. Welcome, and thank you for your contributions!
> > >
> > > Andrew
> > >
> >
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.5.1 RC1

2023-04-11 Thread Jacob Quinn
Hmmm, I'm also on MacOS m1, but didn't have any issues running tests.

David, is the error reproducible? We fixed an issue for this in [this
commit](
https://github.com/apache/arrow-julia/commit/6d0ac4946f062414e2b60aa3d67c2875bb2e7958),
but it's possible that our check for this condition wasn't strong enough or
something. If it's reproducible, I'd appreciate being able to do a debug
build for you and have it report some data around our check for this.
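
As a quick sanity check on the address in the reported error, the arithmetic can be done by hand. This is a minimal sketch, and the helper name `is_aligned` is hypothetical (it is not part of Arrow.jl). The pointer ends in hex digit 8, so it sits on an 8-byte boundary but not on a 16-byte one, which is the alignment the error message says is required.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """Return True if address is a multiple of alignment (in bytes)."""
    return address % alignment == 0


address = 0x293389438  # pointer value taken from the error message

remainder_16 = address % 16          # offset past the previous 16-byte boundary
aligned_16 = is_aligned(address, 16)
aligned_8 = is_aligned(address, 8)
print(remainder_16, aligned_16, aligned_8)
```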

-Jacob

On Tue, Apr 11, 2023 at 6:36 PM David Li  wrote:

> I had an issue during verification (macOS/AArch64) [1]
>
> The gist seems to be:
>
> ```
>   nested task error: ArgumentError: unsafe_wrap: pointer 0x293389438
> is not properly aligned to 16 bytes
>   Stacktrace:
> [1] #unsafe_wrap#100
>   @ ./pointer.jl:92 [inlined]
> [2] unsafe_wrap
>   @ ./pointer.jl:90 [inlined]
> [3] reinterp(#unused#::Type{Arrow.Decimal{2, 2, Int128}},
> batch::Arrow.Batch, buf::Arrow.Flatbuf.Buffer,
> compression::Arrow.Flatbuf.BodyCompression)
>   @ Arrow
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:557
> [4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Decimal,
> batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64,
> Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
>   @ Arrow
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:685
> [5] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch,
> rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding},
> nodeidx::Int64, bufferidx::Int64, convert::Bool)
>   @ Arrow
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:498
> [6] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
>   @ Arrow
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:474
> [7] iterate
>   @
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:471
> [inlined]
> [8] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
>   @ Base ./abstractarray.jl:946
> [9] _collect
>   @ ./array.jl:713 [inlined]
>[10] collect
>   @ ./array.jl:707 [inlined]
>[11] macro expansion
>   @
> ~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:376
> [inlined]
>[12] (::Arrow.var"#108#114"{Bool, Channel{Any},
> WorkerUtilities.OrderedSynchronizer, Dict{Int64, Arrow.DictEncoding},
> Arrow.Batch, Int64})()
>   @ Arrow ./threadingconstructs.jl:341
> ```
>
> I haven't gotten a chance to look more into it/try again.
>
> [1]: https://gist.github.com/lidavidm/b8f604b60c0a2cdfb04e96d4e58bdfdb
>
> On Wed, Apr 12, 2023, at 06:50, Sutou Kouhei wrote:
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow Julia version 2.5.1.
> >
> > This release candidate is based on commit:
> > 22088f1cb59bcd99fbffbf9d8248e491690dbfd9 [1]
> >
> > The source release rc1 is hosted at [2].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [3] for how to validate a release candidate.
> >
> > The vote will be open for at least 24 hours.
> >
> > [ ] +1 Release this as Apache Arrow Julia 2.5.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Julia 2.5.1 because...
> >
> > [1]:
> >
> https://github.com/apache/arrow-julia/tree/22088f1cb59bcd99fbffbf9d8248e491690dbfd9
> > [2]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.5.1-rc1/
> > [3]:
> >
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.5.1 RC1

2023-04-11 Thread David Li
I had an issue during verification (macOS/AArch64) [1]

The gist seems to be:

```
  nested task error: ArgumentError: unsafe_wrap: pointer 0x293389438 is not 
properly aligned to 16 bytes
  Stacktrace:
[1] #unsafe_wrap#100
  @ ./pointer.jl:92 [inlined]
[2] unsafe_wrap
  @ ./pointer.jl:90 [inlined]
[3] reinterp(#unused#::Type{Arrow.Decimal{2, 2, Int128}}, 
batch::Arrow.Batch, buf::Arrow.Flatbuf.Buffer, 
compression::Arrow.Flatbuf.BodyCompression)
  @ Arrow 
~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:557
[4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Decimal, 
batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, 
Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
  @ Arrow 
~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:685
[5] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, 
rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, 
nodeidx::Int64, bufferidx::Int64, convert::Bool)
  @ Arrow 
~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:498
[6] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
  @ Arrow 
~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:474
[7] iterate
  @ 
~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:471 
[inlined]
[8] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
  @ Base ./abstractarray.jl:946
[9] _collect
  @ ./array.jl:713 [inlined]
   [10] collect
  @ ./array.jl:707 [inlined]
   [11] macro expansion
  @ 
~/Code/arrow-julia/verification/apache-arrow-julia-2.5.1/src/table.jl:376 
[inlined]
   [12] (::Arrow.var"#108#114"{Bool, Channel{Any}, 
WorkerUtilities.OrderedSynchronizer, Dict{Int64, Arrow.DictEncoding}, 
Arrow.Batch, Int64})()
  @ Arrow ./threadingconstructs.jl:341
```

I haven't gotten a chance to look more into it/try again.

[1]: https://gist.github.com/lidavidm/b8f604b60c0a2cdfb04e96d4e58bdfdb

On Wed, Apr 12, 2023, at 06:50, Sutou Kouhei wrote:
> Hi,
>
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.5.1.
>
> This release candidate is based on commit:
> 22088f1cb59bcd99fbffbf9d8248e491690dbfd9 [1]
>
> The source release rc1 is hosted at [2].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
>
> The vote will be open for at least 24 hours.
>
> [ ] +1 Release this as Apache Arrow Julia 2.5.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.5.1 because...
>
> [1]: 
> https://github.com/apache/arrow-julia/tree/22088f1cb59bcd99fbffbf9d8248e491690dbfd9
> [2]: 
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.5.1-rc1/
> [3]: 
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify


[VOTE][Julia] Release Apache Arrow Julia 2.5.1 RC1

2023-04-11 Thread Sutou Kouhei
Hi,

I would like to propose the following release candidate (RC1) of
Apache Arrow Julia version 2.5.1.

This release candidate is based on commit:
22088f1cb59bcd99fbffbf9d8248e491690dbfd9 [1]

The source release rc1 is hosted at [2].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [3] for how to validate a release candidate.

The vote will be open for at least 24 hours.

[ ] +1 Release this as Apache Arrow Julia 2.5.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow Julia 2.5.1 because...

[1]: 
https://github.com/apache/arrow-julia/tree/22088f1cb59bcd99fbffbf9d8248e491690dbfd9
[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.5.1-rc1/
[3]: 
https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify


Arrow community meeting April 12 at 16:00 UTC

2023-04-11 Thread Ian Cook
Hi all,

Our biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian


Re: [CROWDSOURCING] Apache Arrow Board Report - April 12, 2023

2023-04-11 Thread Matt Topol
My apologies, I forgot to add updates for the Go section previously, I've
added to the Google doc now for the Go updates.

On Tue, Apr 11, 2023 at 9:29 AM Andrew Lamb  wrote:

> As a reminder, I will submit the ASF board report [1]  tomorrow summarizing
> the state of the project. Thank you to everyone who has contributed content
> already.
>
> I encourage everyone who is interested in the goings on with Arrow to check
> it out -- there is lots going on in this project.
>
> Andrew
>
> [1]:
>
> https://docs.google.com/document/d/13FSDydEVXT2UUFdy4XKjVKNJW-WR8ylvG3aI6lD-dNI/edit#
>
>
> On Wed, Mar 29, 2023 at 6:58 AM Andrew Lamb  wrote:
>
> > Hello Arrow Community,
> >
> > One of the responsibilities of being part of the Apache Software
> > Foundation (ASF) is to regularly summarize the state of the project in a
> > quarterly update to the ASF board. The next report is due on April 12,
> 2023
> >
> > Historically[1], Arrow has crowd sourced the content which has worked
> > well.  Please add your comments directly to [2] or reply to this email
> and
> > I will incorporate your comments.
> >
> > While this is partly an administrative reporting exercise, I think it is
> > also valuable to reflect on the past and think about goals for the
> future.
> >
> > It would be especially interesting if anyone from the following
> > implementation communities could provide an update of around a paragraph
> or
> > so:
> >
> > ### ADBC
> > ### C++
> > ### C#
> > ### Go
> > ### Java
> > ### JavaScript
> > ### Julia
> > ### nanoarrow
> > ### Rust
> > ### C (GLib)
> > ### MATLAB
> > ### Python
> > ### R
> > ### Ruby
> >
> > Andrew
> >
> > [1]: https://lists.apache.org/thread/xjcj3lkvs76k95hrkcp76g6d6z0mlq27
> >
> > [2]:
> >
> https://docs.google.com/document/d/13FSDydEVXT2UUFdy4XKjVKNJW-WR8ylvG3aI6lD-dNI/edit#
> >
>


Re: [VOTE][RUST][DataFusion] Release DataFusion Python Bindings 22.0.0 RC1

2023-04-11 Thread Jeremy Dyer
+1 (non-binding)

Ran through verification script. Built conda packages manually and
validated. Also included in 3rd party library and validated in working
order.

Thanks Andy!

On Tue, Apr 11, 2023 at 9:39 AM Andrew Lamb  wrote:

> +1
>
> Verified on x86 mac
>
> Thanks Andy
>
> Andrew
>
> On Mon, Apr 10, 2023 at 8:10 PM L. C. Hsieh  wrote:
>
> > +1 (binding)
> >
> > Verified on Intel Mac.
> >
> > Thanks Andy.
> >
> > On Mon, Apr 10, 2023 at 4:47 PM Andy Grove 
> wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow DataFusion Python
> > > Bindings,
> > > version 22.0.0.
> > >
> > > This release candidate is based on commit:
> > > 5c90187207e218393297ff3cbe4f08b69049d224 [1]
> > > The proposed release tarball and signatures are hosted at [2].
> > > The changelog is located at [3].
> > > The Python wheels are located at [4].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> and
> > > vote
> > > on the release. The vote will be open for at least 72 hours.
> > >
> > > Only votes from PMC members are binding, but all members of the
> community
> > > are
> > > encouraged to test the release and vote with "(non-binding)".
> > >
> > > The standard verification procedure is documented at
> > >
> >
> https://github.com/apache/arrow-datafusion-python/blob/main/dev/release/README.md#verifying-release-candidates
> > > .
> > >
> > > [ ] +1 Release this as Apache Arrow DataFusion Python 22.0.0
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow DataFusion Python 22.0.0
> > > because...
> > >
> > > Here is my vote:
> > >
> > > +1
> > >
> > > [1]:
> > >
> >
> https://github.com/apache/arrow-datafusion-python/tree/5c90187207e218393297ff3cbe4f08b69049d224
> > > [2]:
> > >
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-python-22.0.0-rc1
> > > [3]:
> > >
> >
> https://github.com/apache/arrow-datafusion-python/blob/5c90187207e218393297ff3cbe4f08b69049d224/CHANGELOG.md
> > > [4]: https://test.pypi.org/project/datafusion/22.0.0/
> >
>


Re: OpenTelemetry + Arrow

2023-04-11 Thread Laurent Quérel
Thank you very much Andrew.

I should be able to work on the second article next week and I will follow
the same process.

Cheers, Laurent

On Tue, Apr 11, 2023 at 4:31 AM Andrew Lamb  wrote:

> The blog post is now live on the arrow site [1]
>
> Thanks again Laurent
>
> [1]:
>
> https://arrow.apache.org/blog/2023/04/11/our-journey-at-f5-with-apache-arrow-part-1/
>
> On Sun, Apr 2, 2023 at 9:07 PM Laurent Quérel 
> wrote:
>
> > Hi Andrew,
> >
> > The feedback seems to be good so I created a PR.
> >
> > https://github.com/apache/arrow-site/pull/340
> >
> > Best regards,
> >
> > Laurent Querel
> >
> > On Thu, Mar 30, 2023 at 3:28 PM Laurent Quérel  >
> > wrote:
> >
> > > I'm glad to know that the article has been well-received. In the second
> > > article, I will allocate a dedicated section to summarize the various
> > > challenges encountered when using Arrow for this type of project.
> > >
> > > @Matt, I want to express my gratitude for your continuous support
> > > throughout this project. Your contributions and refinements to the
> Arrow
> > Go
> > > library have enabled me to make significant progress with minimal
> > obstacles.
> > >
> > > Best Regards, Laurent
> > >
> > > On Thu, Mar 30, 2023 at 2:24 PM Matt Topol 
> > wrote:
> > >
> > >> +1 (non -binding)
> > >>
> > >> I'm glad others on here are finding this as useful and interesting as
> I
> > >> did.
> > >>
> > >> Great job Laurent!
> > >>
> > >> --Matt
> > >>
> > >> On Thu, Mar 30, 2023, 3:26 PM Raphael Taylor-Davies
> > >>  wrote:
> > >>
> > >> > Hi Laurent,
> > >> >
> > >> > I gave the first blog post a read and I also really like it and
> would
> > be
> > >> > +1 on publishing it, nice work.
> > >> >
> > >> > I would also like to echo Will's sentiment that getting real-world
> > case
> > >> > studies for the more complex Arrow schemas is invaluable and will
> help
> > >> > drive improvements in this space, so thank you for driving this
> > forward.
> > >> >
> > >> > Kind Regards,
> > >> >
> > >> > Raphael
> > >> >
> > >> > On 30/03/2023 19:52, Will Jones wrote:
> > >> > > Hi Laurent,
> > >> > >
> > >> > > I have read the first post and I really like it. I'd be +1 on
> > >> publishing
> > >> > > these to the blog. I'm interested to read the second one when it's
> > >> > finished.
> > >> > >
> > >> > > IMO the blog could use more examples of using Arrow that's not
> > >> building a
> > >> > > data frame library / query engine, and I appreciate that this blog
> > >> > provides
> > >> > > advice for some of the trickier parts of working with complex
> Arrow
> > >> > > schemas. I think this will also provide a good concrete use case
> for
> > >> us
> > >> > to
> > >> > > think about improving the ecosystem's support for nested data.
> > >> > >
> > >> > > Best,
> > >> > >
> > >> > > Will Jones
> > >> > >
> > >> > > On Thu, Mar 30, 2023 at 10:56 AM Laurent Quérel <
> > >> > laurent.que...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > >> Hello everyone,
> > >> > >>
> > >> > >> I was wondering if the Apache Arrow community would be interested
> > in
> > >> > >> featuring a two-part article series on their blog, discussing the
> > >> > >> experiences and insights gained from an experimental version of
> the
> > >> > >> OpenTelemetry protocol (OTLP) utilizing Apache Arrow. As the main
> > >> > author of
> > >> > >> the OTLP Arrow specification
> > >> > >> <
> > >> >
> > >>
> >
> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> > >> > >>> ,
> > >> > >> the reference implementation otlp-arrow-adapter
> > >> > >> , and the two articles
> > >> (see
> > >> > >> links
> > >> > >> below), I believe that fostering collaboration between
> open-source
> > >> > projects
> > >> > >> like these is essential and mutually beneficial.
> > >> > >>
> > >> > >> These articles would serve as a fitting complement to the three
> > >> > >> introductory articles that Andrew Lamb and Raphael Taylor-Davies
> > >> > >> co-authored. They delve into the practical aspects of integrating
> > >> Apache
> > >> > >> Arrow into an existing project, as well as the process of
> > converting
> > >> a
> > >> > >> hierarchical data model into its Arrow representation. The first
> > >> article
> > >> > >> examines various mapping techniques for aligning an existing data
> > >> model
> > >> > >> with the corresponding Arrow representation, while the second
> > article
> > >> > >> explores an adaptive schema technique that I implemented in the
> > >> > library's
> > >> > >> final version in greater depth. Although the second article is
> > still
> > >> > under
> > >> > >> development, the core framework description is already in place.
> > >> > >>
> > >> > >> What are your thoughts on this proposal?
> > >> > >>
> > >> > >> Article 1:
> > >> > >>
> > >> > >>
> > >> >
> > >>
> >
> https://docs.google.com/document/d/11lG7Go2IgKOyW-RReBRW6r7HIdV1X7lu5WrDGlW5LbQ/edit?usp=sharing
> > >> > >>
> > >> > >> Article 2 (WIP):
> > >> > 

Re: [VOTE][RUST][DataFusion] Release DataFusion Python Bindings 22.0.0 RC1

2023-04-11 Thread Andrew Lamb
+1

Verified on x86 mac

Thanks Andy

Andrew

On Mon, Apr 10, 2023 at 8:10 PM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on Intel Mac.
>
> Thanks Andy.
>
> On Mon, Apr 10, 2023 at 4:47 PM Andy Grove  wrote:
> >
> > Hi,
> >
> > I would like to propose a release of Apache Arrow DataFusion Python
> > Bindings,
> > version 22.0.0.
> >
> > This release candidate is based on commit:
> > 5c90187207e218393297ff3cbe4f08b69049d224 [1]
> > The proposed release tarball and signatures are hosted at [2].
> > The changelog is located at [3].
> > The Python wheels are located at [4].
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The vote will be open for at least 72 hours.
> >
> > Only votes from PMC members are binding, but all members of the community
> > are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > The standard verification procedure is documented at
> >
> https://github.com/apache/arrow-datafusion-python/blob/main/dev/release/README.md#verifying-release-candidates
> > .
> >
> > [ ] +1 Release this as Apache Arrow DataFusion Python 22.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow DataFusion Python 22.0.0
> > because...
> >
> > Here is my vote:
> >
> > +1
> >
> > [1]:
> >
> https://github.com/apache/arrow-datafusion-python/tree/5c90187207e218393297ff3cbe4f08b69049d224
> > [2]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-python-22.0.0-rc1
> > [3]:
> >
> https://github.com/apache/arrow-datafusion-python/blob/5c90187207e218393297ff3cbe4f08b69049d224/CHANGELOG.md
> > [4]: https://test.pypi.org/project/datafusion/22.0.0/
>


Re: [CROWDSOURCING] Apache Arrow Board Report - April 12, 2023

2023-04-11 Thread Andrew Lamb
As a reminder, I will submit the ASF board report [1]  tomorrow summarizing
the state of the project. Thank you to everyone who has contributed content
already.

I encourage everyone who is interested in the goings on with Arrow to check
it out -- there is lots going on in this project.

Andrew

[1]:
https://docs.google.com/document/d/13FSDydEVXT2UUFdy4XKjVKNJW-WR8ylvG3aI6lD-dNI/edit#


On Wed, Mar 29, 2023 at 6:58 AM Andrew Lamb  wrote:

> Hello Arrow Community,
>
> One of the responsibilities of being part of the Apache Software
> Foundation (ASF) is to regularly summarize the state of the project in a
> quarterly update to the ASF board. The next report is due on April 12, 2023
>
> Historically[1], Arrow has crowd sourced the content which has worked
> well.  Please add your comments directly to [2] or reply to this email and
> I will incorporate your comments.
>
> While this is partly an administrative reporting exercise, I think it is
> also valuable to reflect on the past and think about goals for the future.
>
> It would be especially interesting if anyone from the following
> implementation communities could provide an update of around a paragraph or
> so:
>
> ### ADBC
> ### C++
> ### C#
> ### Go
> ### Java
> ### JavaScript
> ### Julia
> ### nanoarrow
> ### Rust
> ### C (GLib)
> ### MATLAB
> ### Python
> ### R
> ### Ruby
>
> Andrew
>
> [1]: https://lists.apache.org/thread/xjcj3lkvs76k95hrkcp76g6d6z0mlq27
>
> [2]:
> https://docs.google.com/document/d/13FSDydEVXT2UUFdy4XKjVKNJW-WR8ylvG3aI6lD-dNI/edit#
>


Re: Best practice on populating from VectorSchemaRoot to VectorSchemaRoot, ArrowStreamReader to ArrowStreamWriter

2023-04-11 Thread David Dali Susanibar Arce
Hi Wenbo Hu,

Sorry to join late. Wenbo, what about the approach described in the Java
Flight Cookbook (1)? There, the acceptPut method acts as the upstream and
uses VectorUnloader, and the getStream method acts as the downstream and
uses VectorLoader. The cookbook initially used
ArrowRecordBatch.cloneWithTransfer, but that did not work in all scenarios,
so it was changed to VectorLoader.load (2).

Please let us know how you see that.

(1) https://arrow.apache.org/cookbook/java/flight.html
(2) https://github.com/apache/arrow-cookbook/issues/218

Best regards,

David

On Mon, Apr 3, 2023 at 7:59, Wenbo Hu wrote:

> Hi,
>
> Consider a situation, when doGet a ticket on arrow flight rpc server,
> the server retrieves several IPC upstreams (read parquet files through
> dataset api) and push into the same downstream, how to implement with
> less copies?
> Normally, with a single IPC upstream, I directly start the
> ServerStreamListener with the getVectorSchemaRoot of the reader of the
> upstream IPC.
> It seems that I have to deal with VectorSchemaRoot rather than
> ArrowRecordBatch directly.
> What is the proper implementation for populating root to root? Is it
> correct to use VectorLoader/VectorUnloader?
> Does this introduce extra steps by creating an intermediate ArrowRecordBatch
> unnecessarily? (ArrowBuf -> VectorSchemaRoot@UpstreamReader ->
> ArrowBuf@Loader -> VectorSchemaRoot@DownstreamWriter -> ArrowBuf)
>
> Maybe it relates to the allocator; is there a better implementation when
> using the same allocator?
> --
> -
> Best Regards,
> Wenbo Hu,
>


Re: OpenTelemetry + Arrow

2023-04-11 Thread Andrew Lamb
The blog post is now live on the arrow site [1]

Thanks again Laurent

[1]:
https://arrow.apache.org/blog/2023/04/11/our-journey-at-f5-with-apache-arrow-part-1/

On Sun, Apr 2, 2023 at 9:07 PM Laurent Quérel 
wrote:

> Hi Andrew,
>
> The feedback seems to be good so I created a PR.
>
> https://github.com/apache/arrow-site/pull/340
>
> Best regards,
>
> Laurent Querel
>
> On Thu, Mar 30, 2023 at 3:28 PM Laurent Quérel 
> wrote:
>
> > I'm glad to know that the article has been well-received. In the second
> > article, I will allocate a dedicated section to summarize the various
> > challenges encountered when using Arrow for this type of project.
> >
> > @Matt, I want to express my gratitude for your continuous support
> > throughout this project. Your contributions and refinements to the Arrow
> Go
> > library have enabled me to make significant progress with minimal
> obstacles.
> >
> > Best Regards, Laurent
> >
> > On Thu, Mar 30, 2023 at 2:24 PM Matt Topol 
> wrote:
> >
> >> +1 (non -binding)
> >>
> >> I'm glad others on here are finding this as useful and interesting as I
> >> did.
> >>
> >> Great job Laurent!
> >>
> >> --Matt
> >>
> >> On Thu, Mar 30, 2023, 3:26 PM Raphael Taylor-Davies
> >>  wrote:
> >>
> >> > Hi Laurent,
> >> >
> >> > I gave the first blog post a read and I also really like it and would
> be
> >> > +1 on publishing it, nice work.
> >> >
> >> > I would also like to echo Will's sentiment that getting real-world
> case
> >> > studies for the more complex Arrow schemas is invaluable and will help
> >> > drive improvements in this space, so thank you for driving this
> forward.
> >> >
> >> > Kind Regards,
> >> >
> >> > Raphael
> >> >
> >> > On 30/03/2023 19:52, Will Jones wrote:
> >> > > Hi Laurent,
> >> > >
> >> > > I have read the first post and I really like it. I'd be +1 on
> >> publishing
> >> > > these to the blog. I'm interested to read the second one when it's
> >> > finished.
> >> > >
> >> > > IMO the blog could use more examples of using Arrow that's not
> >> building a
> >> > > data frame library / query engine, and I appreciate that this blog
> >> > provides
> >> > > advice for some of the trickier parts of working with complex Arrow
> >> > > schemas. I think this will also provide a good concrete use case for
> >> us
> >> > to
> >> > > think about improving the ecosystem's support for nested data.
> >> > >
> >> > > Best,
> >> > >
> >> > > Will Jones
> >> > >
> >> > > On Thu, Mar 30, 2023 at 10:56 AM Laurent Quérel <
> >> > laurent.que...@gmail.com>
> >> > > wrote:
> >> > >
> >> > >> Hello everyone,
> >> > >>
> >> > >> I was wondering if the Apache Arrow community would be interested
> in
> >> > >> featuring a two-part article series on their blog, discussing the
> >> > >> experiences and insights gained from an experimental version of the
> >> > >> OpenTelemetry protocol (OTLP) utilizing Apache Arrow. As the main
> >> > author of
> >> > >> the OTLP Arrow specification
> >> > >> <
> >> >
> >>
> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> >> > >>> ,
> >> > >> the reference implementation otlp-arrow-adapter
> >> > >> , and the two articles
> >> (see
> >> > >> links
> >> > >> below), I believe that fostering collaboration between open-source
> >> > projects
> >> > >> like these is essential and mutually beneficial.
> >> > >>
> >> > >> These articles would serve as a fitting complement to the three
> >> > >> introductory articles that Andrew Lamb and Raphael Taylor-Davies
> >> > >> co-authored. They delve into the practical aspects of integrating
> >> Apache
> >> > >> Arrow into an existing project, as well as the process of
> converting
> >> a
> >> > >> hierarchical data model into its Arrow representation. The first
> >> article
> >> > >> examines various mapping techniques for aligning an existing data
> >> model
> >> > >> with the corresponding Arrow representation, while the second
> article
> >> > >> explores an adaptive schema technique that I implemented in the
> >> > library's
> >> > >> final version in greater depth. Although the second article is
> still
> >> > under
> >> > >> development, the core framework description is already in place.
> >> > >>
> >> > >> What are your thoughts on this proposal?
> >> > >>
> >> > >> Article 1:
> >> > >>
> >> > >>
> >> >
> >>
> https://docs.google.com/document/d/11lG7Go2IgKOyW-RReBRW6r7HIdV1X7lu5WrDGlW5LbQ/edit?usp=sharing
> >> > >>
> >> > >> Article 2 (WIP):
> >> > >>
> >> > >>
> >> >
> >>
> https://docs.google.com/document/d/1K2CqAtF4pZjpiVts8BOcq34sOcNgozvZ9ZZw-_zTv6I/edit?usp=sharing
> >> > >>
> >> > >>
> >> > >> Best regards,
> >> > >>
> >> > >> Laurent Quérel
> >> > >>
> >> > >> --
> >> > >> Laurent Quérel
> >> > >>
> >> >
> >>
> >
> >
> > --
> > Laurent Quérel
> >
> >
>
>
> --
> Laurent Quérel
>


Re: [DISCUSS] Acero roadmap / philosophy

2023-04-11 Thread Weston Pace
Yes, you could use Acero for this.  However, I would hope that someday you
could also use DuckDb and Datafusion to do the combining as well.

In my mind an "engine" is something that takes a plan (Substrait) and zero
or more input streams (Arrow C stream interface[1]) and has one output
stream (Arrow C stream interface).

So, for example, if "combine them without considering the source they came
from" means "interleave the batches into a common stream" then you could
create a substrait plan with two named table relations (input0 and input1)
and a set union all relation.  Then you could execute your DuckDb plan and
your Datafusion plan.  You should now have two Arrow streams and a
Substrait plan.  You could pass those into Acero, DuckDb, or Datafusion to
do the actual execution of the plan.
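
To illustrate what "interleave the batches into a common stream" could mean, here is a minimal sketch in plain Python. It makes several assumptions: `union_all` and `_DONE` are hypothetical names, plain lists stand in for Arrow record batches, and the round-robin order is just one possible choice (a real engine executing a set union all relation would not need to guarantee any particular order).

```python
from itertools import zip_longest

_DONE = object()  # sentinel marking an exhausted input stream


def union_all(*streams):
    """Yield batches from all input streams, round-robin, until all are exhausted."""
    for group in zip_longest(*streams, fillvalue=_DONE):
        for batch in group:
            if batch is not _DONE:
                yield batch


# Stand-ins for the two engine outputs (input0 and input1).
input0 = iter([["a1"], ["a2"], ["a3"]])
input1 = iter([["b1"]])

combined = list(union_all(input0, input1))
print(combined)
```

The point of the sketch is only that the combining step consumes streams and produces a stream, without needing to know which engine produced each input.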

Today I think only Acero has the ability to input C streams as named tables
so I think you could only do it with Acero.  But it should be a small lift
for Datafusion or DuckDb to support.  In pyarrow it would look something
like...

```
import pyarrow
import pyarrow.substrait

# Execute query with datafusion and obtain a stream
datafusion_c_stream = ...
datafusion_reader = pyarrow.RecordBatchReader._import_from_c(datafusion_c_stream)
# Execute query with duckdb and obtain a stream
duckdb_c_stream = ...
duckdb_reader = pyarrow.RecordBatchReader._import_from_c(duckdb_c_stream)
# Load plan from file or build programmatically with something
# like ibis or PySubstrait
plan = ...

def provide_table(names):
    # Map the named table relations in the plan to the input streams
    if names[0] == "input0":
        return datafusion_reader
    else:
        return duckdb_reader

pyarrow.substrait.run_query(plan, table_provider=provide_table)
```

I'm probably missing a few details and pyarrow doesn't actually let you
return a RecordBatchReader from a table provider (it's doable in C++) but
that's the rough idea.

[1] https://arrow.apache.org/docs/format/CStreamInterface.html

On Mon, Apr 10, 2023 at 4:36 PM Will Ayd 
wrote:

> I am still wrapping my head around some of the technologies so excuse
> any ignorance, but seeing as the OP mentioned the use case of /switching
> /between execution engines is there not a gap if the concern is more
> about /combining/ execution engines? AFAIU Substrait would allow me to
> submit different queries to DuckDB and Datafusion - if I wanted to take
> these results back and combine them without considering the source they
> came from is Acero not the right tool for the job?
>
> On 3/14/23 11:50, Li Jin wrote:
> > Late to the party.
> >
> > Thanks Weston for sharing the thoughts around Acero. We are actually a
> > pretty heavy Acero user right now and are trying to take part in Acero
> > maintenance and development. Internally we are using Acero for a time
> > series streaming data processing system.
> >
> > I would +1 on many of Weston's directions here, in particular to make
> > Acero extensible / customizable. IMO Acero might not be the fastest
> > "Arrow SQL/TPC-H" engine, but the ability to customize it for ordered
> > time series is a huge/killer feature.
> >
> > In addition to what Weston has already said, my other two cents is that I
> > think Acero would benefit from a separation from the Arrow core C++
> > library, similar to how Arrow Flight is. The main reason is that Arrow
> core
> > being such a widely used library, it benefits more from being stable and
> > Acero being a relatively new and standalone component, benefits more from
> > fast moving / quick experiment. My colleague and I are working on
> > https://github.com/apache/arrow/issues/15280  to make this happen.
> >
> >
> >
> >
> >
> > On Fri, Mar 10, 2023 at 5:59 AM Andrew Lamb
> wrote:
> >
> >> I don't know much about the Acero user base, but gathering some
> significant
> >> term users (e.g. Ballista, Urban Logiq, GreptimeDB, InfluxDB IOx, etc)
> has
> >> been very helpful for DataFusion. Not only do such users bring some
> amount
> >> of maintenance capacity, but perhaps more relevantly to your discussion
> >> they bring a focus to the project with their usecases.
> >>
> >> With so many possible tradeoffs (e.g. streaming vs larger batch
> execution
> >> as you mention above) having people to help focus the choice of project
> I
> >> think has served DataFusion well.
> >>
> >> If Acero has such users (or potential users) perhaps reaching out to
> them /
> >> soliciting their ideas of where they want to see the project go would
> be a
> >> valuable focusing exercise.
> >>
> >> Andrew
> >>
> >> On Thu, Mar 9, 2023 at 6:35 PM Aldrin
> wrote:
> >>
> >>> Thanks for sharing! There are a variety of things that I didn't know
> >> about
> >>> (such as ExecBatchBuilder) and it's interesting to hear about the
> >>> performance challenges.
> >>>
> >>> How much would future substrait work involve integration with Acero?
> I'm
> >>> curious how much more support of substrait is seen as valuable (should
> be
> >>> prioritized) or
> >>> if additional support is going to be "as-needed". Note that I have a
> >>> minimal understanding of how "large" substrait