Re: [ANNOUNCE] New Arrow committer: Brent Gardner

2023-01-11 Thread Kun Liu
Congratulations!



On Thu, Jan 12, 2023 at 10:01, David Li wrote:

> Congrats, Brent!
>
> On Wed, Jan 11, 2023, at 19:07, Jacob Wujciak wrote:
> > Congrats!
> >
> > On Thu, Jan 12, 2023 at 12:06 AM QP Hou  wrote:
> >
> >> Congratulations Brent!
> >>
> >> On Wed, Jan 11, 2023 at 2:56 PM Andy Grove wrote:
> >>
> >> > On behalf of the Arrow PMC, I'm happy to announce that Brent Gardner
> >> > has accepted an invitation to become a committer on Apache
> >> > Arrow. Welcome, and thank you for your contributions!
> >> >
> >> > Andy.
> >> >
> >>
>


Re: [ANNOUNCE] New Arrow committer: Brent Gardner

2023-01-11 Thread David Li
Congrats, Brent!

On Wed, Jan 11, 2023, at 19:07, Jacob Wujciak wrote:
> Congrats!
>
> On Thu, Jan 12, 2023 at 12:06 AM QP Hou  wrote:
>
>> Congratulations Brent!
>>
>> On Wed, Jan 11, 2023 at 2:56 PM Andy Grove  wrote:
>>
>> > On behalf of the Arrow PMC, I'm happy to announce that Brent Gardner
>> > has accepted an invitation to become a committer on Apache
>> > Arrow. Welcome, and thank you for your contributions!
>> >
>> > Andy.
>> >
>>


Re: [ANNOUNCE] New Arrow committer: Brent Gardner

2023-01-11 Thread Jacob Wujciak
Congrats!

On Thu, Jan 12, 2023 at 12:06 AM QP Hou  wrote:

> Congratulations Brent!
>
> On Wed, Jan 11, 2023 at 2:56 PM Andy Grove  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Brent Gardner
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > Andy.
> >
>


Adding a CODEOWNERS file

2023-01-11 Thread Jacob Wujciak
Hello Everyone,

As discussed in an issue spawned by the State of the Project thread [1], I
have created a draft PR that adds a CODEOWNERS file to apache/arrow [2].

Adding a CODEOWNERS file will allow committers to be automatically
requested for reviews that they are interested in (based on touched files),
enabling them to "subscribe" to a selection of PRs based on their
interests/competence within the monorepo without having to watch all
notifications for the repo.
The main advantage, in my opinion, is that it removes the burden of finding
an (initial) reviewer for a PR from contributors, which is a major blocker in
the Arrow dev workflow, especially for new contributors.

Note that adding a CODEOWNERS file will not automatically activate the
branch protection rules to enforce a codeowner review on the respective
code.

Please review the PR and add yourself to the file via suggestion or direct
push to the branch! Documentation on CODEOWNERS file and syntax: [3]

Thanks,

Jacob
[1]: https://github.com/apache/arrow/issues/15232
[2]: https://github.com/apache/arrow/pull/33622
[3]:
https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners
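
A minimal sketch of the review-routing idea described above, in Python. It is
not the file from the PR and not GitHub's implementation: the patterns and
reviewer handles are hypothetical, and `fnmatchcase` only approximates the
gitignore-style patterns that CODEOWNERS uses. It does follow the documented
rule that the last matching pattern takes precedence.

```python
from fnmatch import fnmatchcase

# Hypothetical CODEOWNERS-style rules; patterns and handles are made up
# for illustration and do not come from the apache/arrow PR.
RULES = [
    ("*", ["@example-default-reviewer"]),
    ("r/*", ["@example-r-dev"]),
    ("go/*", ["@example-go-dev"]),
    ("go/parquet/*", ["@example-parquet-dev"]),
]


def owners_for(path, rules=RULES):
    """Return the owners of the last rule whose pattern matches path."""
    matched = []
    for pattern, owners in rules:
        # Simplification: fnmatch's "*" also matches "/" characters.
        if fnmatchcase(path, pattern):
            matched = owners  # later rules override earlier ones
    return matched


def reviewers_for_pr(touched_files, rules=RULES):
    """Collect the owners of every touched file, without duplicates."""
    reviewers = set()
    for path in touched_files:
        reviewers.update(owners_for(path, rules))
    return sorted(reviewers)
```

With these made-up rules, a PR touching only `go/parquet/file.go` would
request the hypothetical Parquet reviewer rather than the general Go one,
which is the "subscribe by path" behaviour the proposal relies on.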


Re: [ANNOUNCE] New Arrow committer: Brent Gardner

2023-01-11 Thread QP Hou
Congratulations Brent!

On Wed, Jan 11, 2023 at 2:56 PM Andy Grove  wrote:

> On behalf of the Arrow PMC, I'm happy to announce that Brent Gardner
> has accepted an invitation to become a committer on Apache
> Arrow. Welcome, and thank you for your contributions!
>
> Andy.
>


[ANNOUNCE] New Arrow committer: Brent Gardner

2023-01-11 Thread Andy Grove
On behalf of the Arrow PMC, I'm happy to announce that Brent Gardner
has accepted an invitation to become a committer on Apache
Arrow. Welcome, and thank you for your contributions!

Andy.


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 16.0.0 RC1

2023-01-11 Thread Andy Grove
I saw that a PR related to this issue was merged, but the issue is still
open. I added a comment on the issue asking whether this issue is resolved.



On Mon, Jan 9, 2023 at 11:28 PM Andrew Lamb  wrote:

> There is a report[1] of a seemingly serious regression. I recommend we hold
> up finalizing this vote until we have resolved the issue (either in code or
> decided it is not a release blocker)
>
> Andrew
>
> [1] https://github.com/apache/arrow-datafusion/issues/4844
>
> On Mon, Jan 9, 2023 at 10:25 PM Patrick Horan  wrote:
>
> > +1 verified on Mac M1
> >
> > On Mon, Jan 9, 2023, at 9:34 AM, Ian Joiner wrote:
> > > +1 (Non-binding)
> > >
> > > Ian
> > >
> > > Verified on my System76 / Ubuntu 22.04 / AMD64
> > >
> > > > On Sat, Jan 7, 2023 at 6:18 PM Andy Grove wrote:
> > >
> > > > Hi,
> > > >
> > > > > I would like to propose a release of Apache Arrow DataFusion
> > > > > Implementation, version 16.0.0.
> > > >
> > > > This release candidate is based on commit:
> > > > dcd52ee3d87c4dd9e2c176165e9e20644f66988b [1]
> > > > The proposed release tarball and signatures are hosted at [2].
> > > > The changelog is located at [3].
> > > >
> > > > > Please download, verify checksums and signatures, run the unit tests,
> > > > > and vote on the release. The vote will be open for at least 72 hours.
> > > > >
> > > > > Only votes from PMC members are binding, but all members of the
> > > > > community are encouraged to test the release and vote with
> > > > > "(non-binding)".
> > > >
> > > > > The standard verification procedure is documented at
> > > > > https://github.com/apache/arrow-datafusion/blob/master/dev/release/README.md#verifying-release-candidates
> > > >
> > > > [ ] +1 Release this as Apache Arrow DataFusion 16.0.0
> > > > [ ] +0
> > > > > [ ] -1 Do not release this as Apache Arrow DataFusion 16.0.0 because...
> > > >
> > > > Here is my vote:
> > > >
> > > > +1
> > > >
> > > > [1]:
> > > >
> > > >
> >
> https://github.com/apache/arrow-datafusion/tree/dcd52ee3d87c4dd9e2c176165e9e20644f66988b
> > > > [2]:
> > > >
> > > >
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-16.0.0-rc1
> > > > [3]:
> > > >
> > > >
> >
> https://github.com/apache/arrow-datafusion/blob/dcd52ee3d87c4dd9e2c176165e9e20644f66988b/CHANGELOG.md
> > > >
> > >
> >
>
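
As an aside on the "verify checksums" step requested in the quoted vote
email, here is a minimal Python sketch of a SHA-512 comparison. It is an
illustration only, not the project's documented verification procedure: the
function names are hypothetical, and the checksum line is assumed to start
with the hex digest, optionally followed by a file name.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, checksum_line):
    """Compare a file against a checksum line whose first field is the digest.

    Assumption: the line looks like "<hex digest>  <file name>".
    """
    expected = checksum_line.split()[0].lower()
    return sha512_of(path) == expected
```

A matching checksum only shows that the download is intact; the signature
check mentioned in the same email is a separate step that this sketch does
not cover.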


Re: [DISCUSS] Updating what are considered reference implementations?

2023-01-11 Thread Brian Hulette
I think this [1] is the thread where the policy was proposed, but it
doesn't look like we ever settled on "Java and C++" vs. "any two
implementations", or had a vote.

I worry that requiring maintainers to add new format features to two
"complete" implementations will just lead to fragmentation. People might
opt to maintain a fork rather than unblock themselves by implementing a
backlog of features they don't need.

[1] https://lists.apache.org/thread/9t0pglrvxjhrt4r4xcsc1zmgmbtr8pxj

On Fri, Jan 6, 2023 at 12:33 PM Weston Pace  wrote:

> I think it would be reasonable to state that a reference
> implementation must be a complete implementation (i.e. supports all
> existing types) that is not derived from another implementation (e.g.
> you can't pick pyarrow and arrow-c++).  If an implementation does not
> plan on ever supporting a new array type then maintainers of that
> implementation should be empowered to vote against it.  Given that, it
> seems like a reasonable burden to ask maintainers to catch up first
> before expanding in new directions.
>
>
> On Fri, Jan 6, 2023 at 10:20 AM Micah Kornfield wrote:
> >
> > >
> > > > Note this wording talks about "two reference implementations" not
> > > > "*the* two reference implementations". So there can be more than two
> > > > reference implementations.
> >
> >
> > Maybe reference implementation is the wrong wording here.  My main
> > concern is that we try to maintain two "feature complete" implementations
> > at all times.  I worry that a "pick 2 from N reference implementations"
> > policy potentially leads to fragmentation more quickly.  But maybe this is
> > premature?
> >
> > Cheers,
> > Micah
> >
> >
> > On Fri, Jan 6, 2023 at 10:02 AM Antoine Pitrou wrote:
> >
> > >
> > > On 06/01/2023 at 18:58, Micah Kornfield wrote:
> > > > I'm having trouble finding it, but I think we've previously agreed
> > > > that new features needed implementations in 2 reference implementations
> > > > before approval (I had thought the community agreed on Java and C++ as
> > > > the two implementations but I can't find the vote thread on it).
> > >
> > > Note this wording talks about "two reference implementations" not
> > > "*the* two reference implementations". So there can be more than two
> > > reference implementations.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>


Re: DISCUSS: ADBC More Canonical Options

2023-01-11 Thread David Li
Sorry for the double email. "here [1]" should reference 
https://github.com/apache/arrow-adbc/milestone/3.

On Wed, Jan 11, 2023, at 14:16, David Li wrote:
> Thanks for bringing this up. My thought is:
>
> - We are treating ADBC's APIs as a specification, so we should vote in 
> general.
> - The changes here are minimal and don't introduce any compatibility 
> concerns - they just add more constant definitions - so I say we vote 
> and just merge them into main, instead of adding more friction.
>
> There is a set of more major proposals I have begun collecting here [1] 
> that would require some work to maintain compatibility.  For those, I 
> think we would want to do development on a branch, then vote and merge 
> them and bump the specification version.  And ideally, bundle these 
> changes and any others together to avoid introducing a lot of work for 
> implementations to maintain compatibility.
>
> -David
>
> On Wed, Jan 11, 2023, at 11:44, Matt Topol wrote:
>> Hey all,
>>
>> I've filed a PR with ADBC (https://github.com/apache/arrow-adbc/pull/316)
>> to add some more explicitly defined canonical options. This then leads to
>> an interesting question that should be posed:
>>
>> For changes like this in general along with other potential updates, should
>> we do a series of small votes that are merged into a branch and then
>> bundled up into a v1.1.0 release? Or just do votes to merge to main and
>> then bump to v1.0.1? Or some other combination of ideas? As this is
>> technically a change to the ADBC definitions, it should warrant some kind
>> of release, but it might end up spammy to bump versions frequently for
>> changes like this for now?
>>
>> Anyway, I figured it'd be good to open it up for discussion here and see
>> what people's opinions on this are.
>>
>> Thanks all!
>>
>> --Matt


Re: DISCUSS: ADBC More Canonical Options

2023-01-11 Thread David Li
Thanks for bringing this up. My thought is:

- We are treating ADBC's APIs as a specification, so we should vote in general.
- The changes here are minimal and don't introduce any compatibility concerns - 
they just add more constant definitions - so I say we vote and just merge them 
into main, instead of adding more friction.

There is a set of more major proposals I have begun collecting here [1] that 
would require some work to maintain compatibility.  For those, I think we would 
want to do development on a branch, then vote and merge them and bump the 
specification version.  And ideally, bundle these changes and any others 
together to avoid introducing a lot of work for implementations to maintain 
compatibility.

-David

On Wed, Jan 11, 2023, at 11:44, Matt Topol wrote:
> Hey all,
>
> I've filed a PR with ADBC (https://github.com/apache/arrow-adbc/pull/316)
> to add some more explicitly defined canonical options. This then leads to
> an interesting question that should be posed:
>
> For changes like this in general along with other potential updates, should
> we do a series of small votes that are merged into a branch and then
> bundled up into a v1.1.0 release? Or just do votes to merge to main and
> then bump to v1.0.1? Or some other combination of ideas? As this is
> technically a change to the ADBC definitions, it should warrant some kind
> of release, but it might end up spammy to bump versions frequently for
> changes like this for now?
>
> Anyway, I figured it'd be good to open it up for discussion here and see
> what people's opinions on this are.
>
> Thanks all!
>
> --Matt


Arrow R package development sync call - tomorrow (Thurs 12th Jan) at 17:30 UTC

2023-01-11 Thread Nic Crane
The Arrow R package dev community call is tomorrow at 17:30 UTC.

Joining instructions are below.

Thursday, 12 January · 17:30 – 18:30
Google Meet joining info
Video call link: https://meet.google.com/dbm-ybmv-evb
Or dial: (ES) +34 910 48 95 10 PIN: 919 955 818 9233#
More phone numbers: https://tel.meet/dbm-ybmv-evb?pin=9199558189233

The notes from the last call can be found at:
https://docs.google.com/document/d/1nSIfJw8mfqtvScqvSVqmktpWff80pFmkqiZT7nTtiDo/edit?usp=sharing


Thanks,

Nic


DISCUSS: ADBC More Canonical Options

2023-01-11 Thread Matt Topol
Hey all,

I've filed a PR with ADBC (https://github.com/apache/arrow-adbc/pull/316)
to add some more explicitly defined canonical options. This then leads to
an interesting question that should be posed:

For changes like this in general along with other potential updates, should
we do a series of small votes that are merged into a branch and then
bundled up into a v1.1.0 release? Or just do votes to merge to main and
then bump to v1.0.1? Or some other combination of ideas? As this is
technically a change to the ADBC definitions, it should warrant some kind
of release, but it might end up spammy to bump versions frequently for
changes like this for now?

Anyway, I figured it'd be good to open it up for discussion here and see
what people's opinions on this are.

Thanks all!

--Matt


Re: DISCUSS: ADBC Press Release

2023-01-11 Thread David Li
Sorry, I didn't mean to imply that Flight SQL was Dremio-specific (and indeed 
we want to position Flight SQL as a vendor-agnostic protocol). A PR with some 
tweaks (and a notice about the correction) would be welcome.

Possibly something like

> ...For example, applications can get Arrow data from BigQuery via the 
> BigQuery Storage API.  Other systems, like Dremio, support Arrow Flight SQL, 
> an Arrow-native protocol designed to be implemented by multiple vendors.  But 
> not all vendors will implement Arrow Flight SQL, so client applications ...

-David

On Wed, Jan 11, 2023, at 02:54, Andrew Lamb wrote:
> I believe the blog post in question is [1] and the relevant text is
>
>> Use vendor-specific protocols. For some databases, applications can use a
> database-specific protocol or SDK to directly get Arrow data. For example,
> applications could use Dremio via Arrow Flight SQL. But client applications
> that want to support multiple database vendors would need to integrate with
> each of them. (Look at all the connectors that Trino implements.) And
> databases like PostgreSQL don’t offer an option supporting Arrow in the
> first place.
>
> I did not read that to mean Flight SQL was a vendor-specific protocol, but
> if others did, clarifying the wording sounds like a good idea to me.
>
> Perhaps you could propose a specific rephrasing on a PR to [2].
>
> Andrew
>
> [1] https://arrow.apache.org/blog/2023/01/05/introducing-arrow-adbc/
> [2]
> https://github.com/apache/arrow-site/blob/master/_posts/2023-01-05-introducing-arrow-adbc.md
>
> On Wed, Jan 11, 2023 at 8:02 AM James Duong wrote:
>
>> Hi,
>>
>> In the ADBC blog entry, Flight SQL was mentioned as a vendor-specific
>> protocol, and Dremio is mentioned in the same sentence.
>>
>> The intent of Flight SQL was to be database-agnostic, and this sort of
>> implies that Flight SQL is a Dremio-specific protocol, which is not really
>> what we want.
>>
>> Perhaps this can be rephrased? Maybe highlight that ADBC can help with
>> building generic Arrow-based applications that work with both databases
>> that have a specific Arrow interface, such as BigQuery, and any
>> Flight SQL-capable sources.
>>


Re: Apache Arrow Board Report, by Jan 11 2023

2023-01-11 Thread Andrew Lamb
Here is the report that was submitted

## Description:
The mission of Apache Arrow is the creation and maintenance of software
related to columnar in-memory processing and data interchange.

## Issues:
Lack of an ASF-sponsored, invite-free chat service is a minor source of
friction for community building. Most subprojects now use GitHub for tickets
to lower the barrier to entry for new / casual contributors, but we still
have fragmented stories for group chat. ASF Slack requires an invite, and
some subcommunities use other chat-like services.

## Membership Data:
Apache Arrow was founded 2016-01-20 (7 years ago)
There are currently 89 committers and 45 PMC members in this project.
The Committer-to-PMC ratio is roughly 2:1.

Community changes, past quarter:
- Kun Liu was added to the PMC on 2022-11-13
- Jacob Quinn was added to the PMC on 2022-10-25
- Nicola Crane was added to the PMC on 2022-10-25
- Jacob Wujciak was added as committer on 2022-12-19
- Ben Baumgold was added as committer on 2022-10-26
- Bogumił Kamiński was added as committer on 2022-10-24
- Eric Hanson was added as committer on 2022-10-26
- Jie Wen was added as committer on 2023-01-08
- Jarrett Revels was added as committer on 2022-11-02
- Curtis Vogt was added as committer on 2022-11-02
- Raúl Cumplido was added as committer on 2022-12-05
- Will Jones was added as committer on 2022-10-28
- Yang Jiang was added as committer on 2022-11-02

## Project Activity:
* Switching from JIRA to GitHub issues in order to keep the overhead for new
  contributors low (no need to register for an ASF JIRA account)
* [ADBC] (Arrow Database Connectivity) first release
* Community voted to add RLE to the specification
* Additional subproject updates are below
* We continue to release several different products and releases per quarter

[ADBC]: https://arrow.apache.org/blog/2023/01/05/introducing-arrow-adbc/

Recent releases:
ADBC-0.1.0 was released on 2023-01-10.
RS-30.0.1 was released on 2023-01-08.
RS-OS-0.5.3 was released on 2023-01-08.
RS-30.0.0 was released on 2023-01-03.
RS-29.0.0 was released on 2022-12-12.
RS-OS-0.5.2 was released on 2022-12-07.
RS-DATAFUSION-15.0.0 was released on 2022-12-05.
DATAFUSION-PYTHON-0.7.0 was released on 2022-11-29.
RS-28.0.0 was released on 2022-11-28.
10.0.1 was released on 2022-11-22.
RS-BALLISTA-0.10.0 was released on 2022-11-21.
JULIA-2.4.1 was released on 2022-11-18.
RS-27.0.0 was released on 2022-11-15.
RS-DATAFUSION-14.0.0 was released on 2022-11-07.
RS-26.0.0 was released on 2022-11-03.
10.0.0 was released on 2022-10-26.
JULIA-2.4.0 was released on 2022-10-26.
RS-BALLISTA-0.9.0 was released on 2022-10-26.
RS-25.0.0 was released on 2022-10-17.

## Community Health:
The community health appears good; discussions on the mailing lists and
GitHub are productive. We recently had a nice discussion on the State of the
Project (https://lists.apache.org/thread/r8gl3wvjgy9k8n2t194r0bbdbxx6ksqc)
and discussed various ways to keep encouraging the community.

## Language Area Updates

Arrow has at least 12 different language bindings, as explained in
https://arrow.apache.org/overview/

Arrow 10.0.0 release:
https://arrow.apache.org/blog/2022/10/31/10.0.0-release/

### C++

### C#

### Go

We’re seeing significant increases in interest and usage of the Arrow Go
library, from startups like Spice.AI to being incorporated and used in
Google BigQuery’s quickstart example and more. 2022 was a big year of
updates, fixes, and drumming up interest for the Go module that we hope to
continue for increased adoption and usage. The Go module, along with C++, is
used as the initial implementation for the Run-End Encoding array
implementation.

Future development plans are to continue to expand the compute capabilities
of the Go module and extend integration with Substrait.

### Java

### JavaScript

### Julia
We’ve worked again on simplifying and streamlining the administrative side
for the Julia implementation: adding additional committers, simplifying the
release process, etc. This has increased the rate of contributions, as
expected. There’s interest in finishing the C data/stream interfaces for the
Julia implementation soon.

### Rust
Rust has several projects:
* arrow-rs: arrow, parquet, arrow-flight, and object_store implementations
* arrow-datafusion: Rust query engine
* arrow-ballista: distributed query engine

We are working to incorporate Substrait into DataFusion.

We are working on external communication, with several blog posts about
technology: "Fast and Memory Efficient Multi-Column Sorts in Apache Arrow
Rust, Part 1" (on sorting) and "Querying Parquet with Millisecond Latency".

We also continue the calendar-based release train with good results.

### C (GLib)

We’ve added support for the 16-bit float type.

### MATLAB
1. We have been focusing development efforts on implementing an "object
   dispatch layer" that uses MEX to "connect" MATLAB objects with
   corresponding C++ objects. This code is being actively developed at
   github.com/mathworks/libmexclass.