Re: [VOTE][Julia] Release Apache Arrow Julia 2.4.3 RC1

2023-02-02 Thread Jacob Quinn
+1

Ran on macos m1.

-Jacob

On Thu, Feb 2, 2023 at 7:53 PM Sutou Kouhei  wrote:

> +1
>
> I ran the following command line on Debian GNU/Linux sid:
>
>   VERIFY_FORCE_USE_JULIA_BINARY=1 dev/release/verify_rc.sh 2.4.3 1
>
>
> Thanks,
> --
> kou
>
>
> In <20230203.113400.196149433832986@clear-code.com>
>   "[VOTE][Julia] Release Apache Arrow Julia 2.4.3 RC1" on Fri, 03 Feb 2023
> 11:34:00 +0900 (JST),
>   Sutou Kouhei  wrote:
>
> > Hi,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow Julia version 2.4.3.
> >
> > This release candidate is based on commit:
> > 8c0cc4498801758064bd72ffa2fa6460cfc51fdc [1]
> >
> > The source release rc1 is hosted at [2].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [3] for how to validate a release candidate.
> >
> > The vote will be open for at least 24 hours.
> >
> > [ ] +1 Release this as Apache Arrow Julia 2.4.3
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Julia 2.4.3 because...
> >
> > [1]:
> https://github.com/apache/arrow-julia/tree/8c0cc4498801758064bd72ffa2fa6460cfc51fdc
> > [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.4.3-rc1/
> > [3]:
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
> >
> >
> > Thanks,
> > --
> > kou
>


Re: [VOTE][Julia] Release Apache Arrow Julia 2.4.3 RC1

2023-02-02 Thread Sutou Kouhei
+1

I ran the following command line on Debian GNU/Linux sid:

  VERIFY_FORCE_USE_JULIA_BINARY=1 dev/release/verify_rc.sh 2.4.3 1


Thanks,
-- 
kou


In <20230203.113400.196149433832986@clear-code.com>
  "[VOTE][Julia] Release Apache Arrow Julia 2.4.3 RC1" on Fri, 03 Feb 2023 
11:34:00 +0900 (JST),
  Sutou Kouhei  wrote:

> Hi,
> 
> I would like to propose the following release candidate (RC1) of
> Apache Arrow Julia version 2.4.3.
> 
> This release candidate is based on commit:
> 8c0cc4498801758064bd72ffa2fa6460cfc51fdc [1]
> 
> The source release rc1 is hosted at [2].
> 
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [3] for how to validate a release candidate.
> 
> The vote will be open for at least 24 hours.
> 
> [ ] +1 Release this as Apache Arrow Julia 2.4.3
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Julia 2.4.3 because...
> 
> [1]: 
> https://github.com/apache/arrow-julia/tree/8c0cc4498801758064bd72ffa2fa6460cfc51fdc
> [2]: 
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.4.3-rc1/
> [3]: 
> https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify
> 
> 
> Thanks,
> -- 
> kou


[VOTE][Julia] Release Apache Arrow Julia 2.4.3 RC1

2023-02-02 Thread Sutou Kouhei
Hi,

I would like to propose the following release candidate (RC1) of
Apache Arrow Julia version 2.4.3.

This release candidate is based on commit:
8c0cc4498801758064bd72ffa2fa6460cfc51fdc [1]

The source release rc1 is hosted at [2].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [3] for how to validate a release candidate.
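
The checksum part of that verification can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's `verify_rc.sh`: the helper names `sha256_of` and `sha256_matches` and the `<hex digest>  <file name>` checksum-line layout are assumptions, and signature verification (normally done with GnuPG) is not covered.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha256_matches(artifact_path, checksum_line):
    """Compare a file against a '<hex digest>  <file name>' line (assumed layout)."""
    expected = checksum_line.split()[0].lower()
    return sha256_of(artifact_path) == expected
```

A caller would read the checksum line from the file published next to the source archive and pass it in together with the path of the downloaded archive.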

The vote will be open for at least 24 hours.

[ ] +1 Release this as Apache Arrow Julia 2.4.3
[ ] +0
[ ] -1 Do not release this as Apache Arrow Julia 2.4.3 because...

[1]: 
https://github.com/apache/arrow-julia/tree/8c0cc4498801758064bd72ffa2fa6460cfc51fdc
[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-julia-2.4.3-rc1/
[3]: 
https://github.com/apache/arrow-julia/blob/main/dev/release/README.md#verify


Thanks,
-- 
kou


[Python] Add ClientCookieMiddleware

2023-02-02 Thread Ravjot Brar
Hi there,

I'm a contributor to Dremio's Python Arrow Flight Client example. We
currently implement CookieMiddleware to use Arrow Flight to connect from
Python to the Dremio Flight endpoint. I plan to move this into pyarrow
instead. Here is the issue I created:
https://github.com/apache/arrow/issues/34016.
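
As a rough idea of what such a middleware does, here is a minimal sketch using only the Python standard library. It is not the Dremio implementation and does not import pyarrow: the class name `CookieJar` is hypothetical, the two method names simply mirror the `received_headers` / `sending_headers` hooks of `pyarrow.flight.ClientMiddleware`, and the header layout (lowercase names mapped to lists of values) is an assumption.

```python
from http.cookies import SimpleCookie


class CookieJar:
    """Remembers cookies from 'set-cookie' headers and replays them as 'cookie'."""

    def __init__(self):
        self.cookies = {}

    def received_headers(self, headers):
        # `headers` is assumed to map header names to lists of string values.
        for name, values in headers.items():
            if name.lower() != "set-cookie":
                continue
            for value in values:
                parsed = SimpleCookie()
                parsed.load(value)
                # Keep only name=value; attributes such as Path are dropped.
                for key, morsel in parsed.items():
                    self.cookies[key] = morsel.value

    def sending_headers(self):
        # Send nothing until the server has set at least one cookie.
        if not self.cookies:
            return {}
        pairs = "; ".join(f"{k}={v}" for k, v in self.cookies.items())
        return {"cookie": pairs}
```

In a real Flight client the same logic would live in a middleware object created per call by a middleware factory, with the cookie store shared across calls.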

Thanks,

Ravjot

Ravjot Brar | Software Developer II | ravjot.b...@improving.com


[GitHub] [arrow-ballista-python] andygrove merged pull request #2: Add .asf.yaml

2023-02-02 Thread via GitHub


andygrove merged PR #2:
URL: https://github.com/apache/arrow-ballista-python/pull/2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-ballista-python] andygrove commented on pull request #1: Move the `python` directory of `arrow-ballista` to the new `arrow-ballista-python` repo

2023-02-02 Thread via GitHub


andygrove commented on PR #1:
URL: 
https://github.com/apache/arrow-ballista-python/pull/1#issuecomment-1414528630

   I think this is fine to unblock the development of Ballista core, but we 
should at least enable the RAT check in this PR to ensure that no code gets 
checked in without the appropriate license. It would also be good to file an 
issue for the remaining tasks so that others can pick those up.
   
   Thanks for taking the lead on doing this!





[GitHub] [arrow-ballista-python] iajoiner commented on pull request #1: Move the `python` directory of `arrow-ballista` to the new `arrow-ballista-python` repo

2023-02-02 Thread via GitHub


iajoiner commented on PR #1:
URL: 
https://github.com/apache/arrow-ballista-python/pull/1#issuecomment-1414407993

   @andygrove 





Re: [DISCUSS] PR automation workflow

2023-02-02 Thread Andrew Lamb
A process that we use in arrow-rs / arrow-datafusion, which is less
precise but seems to be working well enough at the moment, is:

1. Move PRs that have received feedback and need more work prior to merge
from `Ready to Review` back to `Draft`
2. Ask the author to set it back to `Ready to Review` when it is ready for
the next round of review

Andrew


On Thu, Feb 2, 2023 at 4:17 AM Antoine Pitrou  wrote:

>
> Hi Raul,
>
> Since I'm the one who proposed that we reuse CPython's existing workflow
> infrastructure, it follows logically that I'm in favour :-)
>
> I'm a CPython core developer myself (though inactive lately), and I will add
> that this workflow really eases the work of reviewing PRs, as it
> makes it obvious whether a PR needs attention from a committer.
>
> Once we start working with it, we may decide to make adjustments to
> better fit our expectations, but I think we can start with the
> unmodified workflow scheme.
>
> Regards
>
> Antoine.
>
>
> Le 01/02/2023 à 15:34, Raúl Cumplido a écrit :
> > Hi,
> >
> > I would like to start working on some automation for our PRs and issues
> > workflows.
> >
> > I've heard, and have experienced, the frustration of spending a lot of
> time
> > on our issue tracker and our PRs to follow up on their status.
> > It is very hard to keep track of which PRs and issues are waiting for
> user
> > feedback, have gone stale or are pending maintainer/committer action.
> > This means users frequently get no timely response, all the while we
> > regularly spend time on GH to look for PRs / issues needing action from
> us.
> > As a first step we should probably tackle PRs, once PRs are tackled and
> we
> > are satisfied with the solution, we can try to devise a similar one for
> GH
> > issues.
> >
> > An example of a great improvement is the CODEOWNERS addition [1]. This
> > allows us to use filters like `is:pr is:open user-review-requested:@me`
> [2]
> > which will show PRs that have requested a review from us. This does not
> > solve the problem of what are the PRs waiting for second review,
> > waiting for changes, etcetera.
> >
> > I don't think we have to reinvent the wheel, CPython has something that
> > works well and can easily be adapted/tweaked.
> > They use a GitHub bot (bedevere) with the following state machine:
> > https://github.com/python/bedevere#pr-state-machine
> >
> > PRs have one of the following workflow labels, depending on the state:
> > - `Awaiting review`
> > - `Awaiting core review`
> > - `Awaiting changes`
> > - `Awaiting change review`
> > - `Awaiting merge`
> >
> > I would like to propose adding a GitHub bot to our repo that triggers on PR
> > changes / comments, implementing a workflow similar to the one on the
> > CPython repository.
> >
> > I am going to start working on it and I would love to hear feedback about
> > that workflow. I have also created an issue on the Repo [3].
> >
> > Kind regards,
> > Raúl
> >
> > [1] https://github.com/apache/arrow/pull/33622
> > [2]
> >
> https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+user-review-requested%3A%40me+
> > [3] https://github.com/apache/arrow/issues/33977
> >
>
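
The label workflow quoted above can be pictured as a small state machine. The sketch below is purely illustrative: the five labels come from the quoted proposal, but the event names and the transition table are assumptions made up for this example and are not taken from bedevere.

```python
# Workflow labels as listed in the quoted proposal.
LABELS = [
    "Awaiting review",
    "Awaiting core review",
    "Awaiting changes",
    "Awaiting change review",
    "Awaiting merge",
]

# Hypothetical transitions: (current label, event) -> next label.
TRANSITIONS = {
    ("Awaiting review", "core_review_requested"): "Awaiting core review",
    ("Awaiting review", "changes_requested"): "Awaiting changes",
    ("Awaiting core review", "changes_requested"): "Awaiting changes",
    ("Awaiting core review", "approved"): "Awaiting merge",
    ("Awaiting changes", "author_updated"): "Awaiting change review",
    ("Awaiting change review", "changes_requested"): "Awaiting changes",
    ("Awaiting change review", "approved"): "Awaiting merge",
}


def next_label(current, event):
    """Return the label after `event`; unknown pairs leave the label unchanged."""
    return TRANSITIONS.get((current, event), current)
```

Keeping exactly one workflow label per PR is what makes a filter such as "everything awaiting merge" reliable.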


Re: [FlightSQL] servers / client reference implementations supporting parameterized statements

2023-02-02 Thread Andrew Lamb
Thank you -- this is super helpful

Andrew

On Wed, Feb 1, 2023 at 12:47 PM Matt Topol  wrote:

> To this point, the Go flightsql sqlite server example is used to test the
> Parameter Support for the ADBC flightsql driver:
>  - CI:
>
> https://github.com/apache/arrow-adbc/blob/main/.github/workflows/native-unix.yml#L293
>  - Dockerfile to run SQLite flightsql server:
>
> https://github.com/apache/arrow-adbc/blob/main/ci/docker/golang-flightsql-sqlite.dockerfile
>
> On Wed, Feb 1, 2023 at 12:02 PM David Li  wrote:
>
> > The ADBC C++ Flight SQL driver was probably the most complete Flight SQL
> > client, but it didn't make it through review:
> > https://github.com/apache/arrow/pull/14082
> >
> > The ADBC Go Flight SQL driver supports parameters:
> > https://github.com/apache/arrow-adbc/tree/main/go/adbc/driver/flightsql
> >
> > So does the ADBC Java Flight SQL driver:
> >
> https://github.com/apache/arrow-adbc/tree/main/java/driver/flight-sql/src/main/java/org/apache/arrow/adbc/driver/flightsql
> >
> > The example servers in the C++, Go, and Java source trees all support
> > parameters to varying degrees:
> >
> > -
> >
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/flight/sql/example
> > -
> >
> https://github.com/apache/arrow/tree/master/go/arrow/flight/flightsql/example
> > -
> >
> https://github.com/apache/arrow/blob/master/java/flight/flight-sql/src/test/java/org/apache/arrow/flight/sql/example/FlightSqlExample.java
> >
> > On Wed, Feb 1, 2023, at 11:16, Andrew Lamb wrote:
> > > Hi,
> > >
> > > Does anyone know of FlightSQL clients or servers that support
> > > parameterized statements (e.g. include a placeholder like
> > > `select * from cpu where time > ?`) other than [1]?
> > >
> > > Several projects are working on implementing FlightSQL in Rust (for
> > example
> > > Ballista and InfluxDB IOx). Since the key feature of FlightSQL is
> > > interoperability, we are very interested in testing against other
> > > implementations, rather than just implementing the spec.
> > >
> > > We have been using the JDBC driver as this reference implementation so
> > far
> > > but recently (re)discovered that parameterized statement support is
> > still a
> > > WIP[2]. Thus we can not yet use JDBC as the reference implementation
> for
> > > parameterized features and thus are looking for others.
> > >
> > > Thanks,
> > > Andrew
> > >
> > > [1]:
> > >
> >
> https://github.com/apache/arrow/tree/master/go/arrow/flight/flightsql/example
> > > [2]: https://github.com/apache/arrow/issues/33475
> >
>
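
For readers unfamiliar with parameterized statements, the placeholder query from the original question can be tried with Python's built-in `sqlite3` module. This is not Flight SQL and uses none of the drivers or servers listed above; it only shows how a value is bound to `?` at execution time. The `cpu` table layout and the helper name `rows_after` are assumptions for the example.

```python
import sqlite3

# In-memory database with a tiny, made-up `cpu` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (time INTEGER, usage REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?)",
    [(1, 0.10), (5, 0.42), (9, 0.87)],
)


def rows_after(connection, threshold):
    """Run the parameterized query, binding `threshold` to the `?` placeholder."""
    cursor = connection.execute(
        "SELECT * FROM cpu WHERE time > ? ORDER BY time",
        (threshold,),
    )
    return cursor.fetchall()
```

The statement text stays the same for every call; only the bound value changes, which is the behaviour a Flight SQL client and server need to agree on.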


Re: [C++] Parquet and Arrow overlap

2023-02-02 Thread Will Jones
Day to day, I think having Parquet-cpp under the Apache Arrow project could
make sense. Though I worry about two risks:

1. Would that lead to the governance of the format itself to be primarily
the responsibility of the developers of Parquet-MR?
2. Would C++ developers interested in working with Parquet outside of Arrow
recognize it as a relevant library?

On Thu, Feb 2, 2023 at 6:03 AM Neal Richardson 
wrote:

> Would it make sense to transfer all governance of the parquet-cpp
> implementation to Apache Arrow? It seems like that's where we de facto are
> already, so that would resolve these ambiguities and put it in line with
> the Rust implementation.
>
> Would the Parquet PMC be opposed to formalizing this change?
>
> Neal
>
> On Thu, Feb 2, 2023 at 6:48 AM Raphael Taylor-Davies
>  wrote:
>
> > Hi,
> >
> > > Does the parquet rust implementation have a similar issue?
> >
> > Similar to the C++ implementation, the Rust implementation lives under
> > the Apache Arrow umbrella and does not have any direct affiliation with
> > the Apache Parquet project that I am aware of, beyond using the same
> > format specification. However, as almost all of the users and
> > contributions are with respect to the arrow interfaces, and not the
> > parquet record APIs, there perhaps isn't the same ambiguity as
> > encountered with the C++ implementation. I would expect all issues to be
> > raised in the arrow-rs repository, and a PARQUET Jira only raised,
> > likely by myself or whoever is triaging the issue, if there is some
> > issue/ambiguity pertaining to the format itself.
> >
> > Kind Regards,
> >
> > Raphael
> >
> > On 02/02/2023 01:58, Gang Wu wrote:
> > > Hi Will,
> > >
> > > AFAIK, the Apache Parquet community no longer considers contribution to
> > > parquet-cpp when promoting new committers after the donation to Apache
> > > Arrow.
> > >
> > > It would be a dilemma for the parquet-cpp contributors if none of the
> > > Apache Arrow community or Apache Parquet community recognizes their
> work.
> > >
> > > Does the parquet rust implementation have a similar issue?
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, Feb 2, 2023 at 3:27 AM Will Jones 
> > wrote:
> > >
> > >> Hello,
> > >>
> > >> A while back, the Parquet C++ implementation was merged into the
> Apache
> > >> Arrow monorepo [1]. As I understand it, this helped the development
> > process
> > >> immensely. However, I am noticing some governance issues because of
> it.
> > >>
> > >> First, it's not obvious where issues are supposed to be open: In
> Parquet
> > >> Jira or Arrow GitHub issues. Looking back at some of the original
> > >> discussion, it looks like the intention was
> > >>
> > >> * use PARQUET-XXX for issues relating to Parquet core
> > >>> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
> > >>> core (e.g. changes that are in parquet/arrow right now)
> > >>>
> > >> The README for the old parquet-cpp repo [3] states instead in its
> > >> migration note:
> > >>
> > >>   JIRA issues should continue to be opened in the PARQUET JIRA
> project.
> > >>
> > >>
> > >> Either way, it doesn't seem like this process is obvious to people.
> > Perhaps
> > >> we could clarify this and add notices to Arrow's GitHub issues
> template?
> > >>
> > >> Second, committer status is a little unclear. I am a committer on
> Arrow,
> > >> but not on Parquet right now. Does that mean I should only merge
> Parquet
> > >> C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
> > >> Parquet changes at all?
> > >>
> > >> Also, are the contributions to Arrow C++ Parquet being actively
> reviewed
> > >> for potential new committers?
> > >>
> > >> Best,
> > >>
> > >> Will Jones
> > >>
> > >> [1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
> > >> [2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
> > >> [3] https://github.com/apache/parquet-cpp
> > >>
> >
>


Re: [DISCUSS] Fixed shape tensor Canonical Extension Type

2023-02-02 Thread Clark Zinzow
Hi Alenka,

Great work on the RFC, I'm super excited to see this! I was planning to
open a similar RFC at some point over the next few weeks, so this just
saved me a bunch of work. :D

At the Ray project [1], we've developed two tensor extension types
(originally adapted from the tensor extension type in
text_extension_for_pandas [2]) that we've continuously extended: a
fixed-shape tensor type [3] and a variable-shaped tensor type [4]. These
extension types include both an Arrow side [5] and a Pandas side [6]. We
would love to contribute anything upstream that's deemed appropriate for
inclusion, to share our learnings from our users using this extension type
in production data processing and AI workloads, and to hopefully stay in
the loop for this RFC as a stakeholder and dev resource.

One thing that I want to preemptively call out is the importance of
zero-copy exchange with tensor libraries in the bindings languages (e.g.
NumPy ndarrays for Python), where ideally we would hand off the underlying
ndarray data buffers directly to Arrow and vice versa, when possible
(boolean data requires a copy due to Arrow's bitpacking and NumPy's lack
thereof). This shouldn't impact the underlying extension type spec, just
the to/from layer at the bindings level, where I imagine most of the
complexity will lie.

Thanks again for pushing on this RFC, and I'll try to make time over the
next few days to review the spec, C++ implementation, and Python example!

Cheers,

Clark

[1] https://github.com/ray-project/ray/tree/master

[2] https://github.com/CODAIT/text-extensions-for-pandas

[3]
https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L55-L525

[4]
https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809

[5]
https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py

[6]
https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/pandas.py

On Thu, Feb 2, 2023 at 8:07 AM Alenka Frim 
wrote:

> Hi all!
>
> There have been quite a lot of discussions connected to the tensor support
> in Arrow Tables/RecordBatches. Issues to add support for a column in an
> Arrow table that has value cells each containing a tensor value, with all
> tensors having the same shape/dimensions [1] and a separate one for varying
> shape [2] are already created in the Arrow repository.
>
> Rok Mihevc, Joris Van den Bossche and I would like to start a discussion
> about the specification for canonicalizing the fixed shape tensor type in
> Arrow:
>
> Fixed shape tensor
> ==================
>
> * Extension name: `arrow.fixed_shape_tensor`.
>
> * The storage type of the extension: ``FixedSizeList`` where:
>
>   * **value_type** is the data type of individual tensors and
>     is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
>
>   * **list_size** is the product of all the elements in tensor shape.
>
> * Extension type parameters:
>
>   * **value_type** = Arrow DataType of the tensor elements
>
>   * **shape** = shape of the contained tensors as a tuple
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including shape of
>   the contained tensors as an array with key "shape".
>
>   For example: `{ "shape": [2, 5]}`
>
> .. note::
>
>   Elements in a fixed shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
> RFC umbrella issue [3] includes:
>
>    - Specification for Tensor canonical type extension [4]
>    - C++ implementation of the proposed specification [5]
>    - Python example implementation of the proposed specification and usage
>      (only illustrative) [6]
>
> Open questions:
>
>    - Should metadata include the "dim_names" key to pass dimension names when
>      creating the Arrow FixedShapeTensorArray? Do we standardize how to specify
>      those names and which names to use? Or should the names not be
>      standardized, leaving it up to the application to understand them?
>
>      An example for NCHW ordered data [7]: the application could pass
>      "dim_names": ["C", "H", "W"] when creating the Arrow FixedShapeTensorArray.
>
>    - Should the implementation of the tensor extension type be in Arrow C++
>      or should it be implemented in the bindings separately?
>
> In the future we would like to canonicalize variable shape tensor type in
> Arrow also.
>
> Kind regards, Alenka
>
> [1]: https://github.com/apache/arrow/issues/15483
>
> [2]: https://github.com/apache/arrow/issues/24868
>
> [3]: https://github.com/apache/arrow/issues/33924
>
> [4]: https://github.com/apache/arrow/issues/33923
>
> [5]: https://github.com/apache/arrow/issues/15483
>
> [6]: https://github.com/apache/arrow/issues/33947
> [7]: https://machinelearning.wtf/terms/nchw/
>


[DISCUSS] Fixed shape tensor Canonical Extension Type

2023-02-02 Thread Alenka Frim
Hi all!

There have been quite a lot of discussions connected to the tensor support
in Arrow Tables/RecordBatches. Issues to add support for a column in an
Arrow table that has value cells each containing a tensor value, with all
tensors having the same shape/dimensions [1] and a separate one for varying
shape [2] are already created in the Arrow repository.

Rok Mihevc, Joris Van den Bossche and I would like to start a discussion
about the specification for canonicalizing the fixed shape tensor type in
Arrow:

Fixed shape tensor
==================

* Extension name: `arrow.fixed_shape_tensor`.

* The storage type of the extension: ``FixedSizeList`` where:

  * **value_type** is the data type of individual tensors and
    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.

  * **list_size** is the product of all the elements in tensor shape.

* Extension type parameters:

  * **value_type** = Arrow DataType of the tensor elements

  * **shape** = shape of the contained tensors as a tuple

* Description of the serialization:

  The metadata must be a valid JSON object including shape of
  the contained tensors as an array with key "shape".

  For example: `{ "shape": [2, 5]}`

.. note::

  Elements in a fixed shape tensor extension array are stored
  in row-major/C-contiguous order.
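
To make the three rules above concrete (list size as the product of the shape, JSON metadata with a "shape" key, row-major element order), here is a small standard-library sketch. It does not use pyarrow and is not the proposed implementation; the function names `list_size`, `serialize_metadata` and `flat_index` are hypothetical.

```python
import json
import math


def list_size(shape):
    """Size of the FixedSizeList storage: product of all shape elements."""
    return math.prod(shape)


def serialize_metadata(shape):
    """JSON object carrying the tensor shape under the "shape" key."""
    return json.dumps({"shape": list(shape)})


def flat_index(index, shape):
    """Row-major (C-contiguous) offset of a multi-dimensional index."""
    offset = 0
    for position, dim in zip(index, shape):
        # Each step shifts the running offset by the size of the current axis.
        offset = offset * dim + position
    return offset
```

For the `[2, 5]` example shape, each tensor occupies 10 storage slots, and element `(1, 3)` lands at offset `1 * 5 + 3 = 8` within its list.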

RFC umbrella issue [3] includes:

   - Specification for Tensor canonical type extension [4]
   - C++ implementation of the proposed specification [5]
   - Python example implementation of the proposed specification and usage
     (only illustrative) [6]

Open questions:

   - Should metadata include the "dim_names" key to pass dimension names when
     creating the Arrow FixedShapeTensorArray? Do we standardize how to specify
     those names and which names to use? Or should the names not be
     standardized, leaving it up to the application to understand them?

     An example for NCHW ordered data [7]: the application could pass
     "dim_names": ["C", "H", "W"] when creating the Arrow FixedShapeTensorArray.

   - Should the implementation of the tensor extension type be in Arrow C++
     or should it be implemented in the bindings separately?

In the future we would like to canonicalize variable shape tensor type in
Arrow also.

Kind regards, Alenka

[1]: https://github.com/apache/arrow/issues/15483

[2]: https://github.com/apache/arrow/issues/24868

[3]: https://github.com/apache/arrow/issues/33924

[4]: https://github.com/apache/arrow/issues/33923

[5]: https://github.com/apache/arrow/issues/15483

[6]: https://github.com/apache/arrow/issues/33947
[7]: https://machinelearning.wtf/terms/nchw/


Re: [C++] Parquet and Arrow overlap

2023-02-02 Thread Neal Richardson
Would it make sense to transfer all governance of the parquet-cpp
implementation to Apache Arrow? It seems like that's where we de facto are
already, so that would resolve these ambiguities and put it in line with
the Rust implementation.

Would the Parquet PMC be opposed to formalizing this change?

Neal

On Thu, Feb 2, 2023 at 6:48 AM Raphael Taylor-Davies
 wrote:

> Hi,
>
> > Does the parquet rust implementation have a similar issue?
>
> Similar to the C++ implementation, the Rust implementation lives under
> the Apache Arrow umbrella and does not have any direct affiliation with
> the Apache Parquet project that I am aware of, beyond using the same
> format specification. However, as almost all of the users and
> contributions are with respect to the arrow interfaces, and not the
> parquet record APIs, there perhaps isn't the same ambiguity as
> encountered with the C++ implementation. I would expect all issues to be
> raised in the arrow-rs repository, and a PARQUET Jira only raised,
> likely by myself or whoever is triaging the issue, if there is some
> issue/ambiguity pertaining to the format itself.
>
> Kind Regards,
>
> Raphael
>
> On 02/02/2023 01:58, Gang Wu wrote:
> > Hi Will,
> >
> > AFAIK, the Apache Parquet community no longer considers contribution to
> > parquet-cpp when promoting new committers after the donation to Apache
> > Arrow.
> >
> > It would be a dilemma for the parquet-cpp contributors if none of the
> > Apache Arrow community or Apache Parquet community recognizes their work.
> >
> > Does the parquet rust implementation have a similar issue?
> >
> > Best,
> > Gang
> >
> > On Thu, Feb 2, 2023 at 3:27 AM Will Jones 
> wrote:
> >
> >> Hello,
> >>
> >> A while back, the Parquet C++ implementation was merged into the Apache
> >> Arrow monorepo [1]. As I understand it, this helped the development
> process
> >> immensely. However, I am noticing some governance issues because of it.
> >>
> >> First, it's not obvious where issues are supposed to be open: In Parquet
> >> Jira or Arrow GitHub issues. Looking back at some of the original
> >> discussion, it looks like the intention was
> >>
> >> * use PARQUET-XXX for issues relating to Parquet core
> >>> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
> >>> core (e.g. changes that are in parquet/arrow right now)
> >>>
> >> The README for the old parquet-cpp repo [3] states instead in its
> >> migration note:
> >>
> >>   JIRA issues should continue to be opened in the PARQUET JIRA project.
> >>
> >>
> >> Either way, it doesn't seem like this process is obvious to people.
> Perhaps
> >> we could clarify this and add notices to Arrow's GitHub issues template?
> >>
> >> Second, committer status is a little unclear. I am a committer on Arrow,
> >> but not on Parquet right now. Does that mean I should only merge Parquet
> >> C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
> >> Parquet changes at all?
> >>
> >> Also, are the contributions to Arrow C++ Parquet being actively reviewed
> >> for potential new committers?
> >>
> >> Best,
> >>
> >> Will Jones
> >>
> >> [1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
> >> [2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
> >> [3] https://github.com/apache/parquet-cpp
> >>
>


Re: [RESULT][VOTE] Release Apache Arrow 11.0.0 - RC0

2023-02-02 Thread Raúl Cumplido
Hi,

The current status of the post-release tasks. I will follow up on them.
Thanks everyone!

- [done] Update the released milestone Date and set to "Closed" on GitHub
- [done] Merge changes on release branch to maintenance branch for patch
releases
- [done] Add the new release to the Apache Reporter System
- [done] Upload source
- [done] Upload binaries
- [done] Update website
- [done] Upload JavaScript packages
- [done] Upload C# packages
- [done] Upload wheels/sdist to pypi
- [done] Publish Maven artifacts
- [done] Update MSYS2 package
- [done] Bump versions
- [done] Update tags for Go modules
- [done] Update docs
- [done] Publish release blog posts
- [done] Announce the new release
- [in-progress] Update Homebrew packages
- [in-progress] Update vcpkg port
- [] Upload RubyGems --> waiting for homebrew
- [] Update Conan recipe
- [] Update version in Apache Arrow Cookbook --> was waiting for conda
- [] Remove old artifacts

I will need help with the following (and might need help with some of the
above, I'll ask if needed):
- [] Make the CPP PARQUET related version as "RELEASED" on JIRA
- [] Start the new version on JIRA for the related CPP PARQUET version
- [done] Update conda recipes
- [] Update R packages

El vie, 27 ene 2023 a las 16:36, Neal Richardson (<
neal.p.richard...@gmail.com>) escribió:

> Conda often happens automatically; looks like there is already a PR:
> https://github.com/conda-forge/arrow-cpp-feedstock/pull/941
>
> On Fri, Jan 27, 2023 at 9:52 AM Raúl Cumplido 
> wrote:
>
> > Hi,
> >
> > The current status of the post-release tasks. I will keep working on the
> > rest of tasks during the next days:
> >
> > - [done] Update the released milestone Date and set to "Closed" on GitHub
> > - [done] Merge changes on release branch to maintenance branch for patch
> > releases
> > - [done] Add the new release to the Apache Reporter System
> > - [done] Upload source
> > - [done] Upload binaries
> > - [done] Update website
> > - [done] Upload JavaScript packages
> > - [done] Upload C# packages
> > - [done] Upload wheels/sdist to pypi
> > - [done] Publish Maven artifacts
> > - [in-progress] Update Homebrew packages
> > - [in-progress] Update MSYS2 package
> > - [] Upload RubyGems
> > - [] Update vcpkg port
> > - [] Update Conan recipe
> > - [] Bump versions
> > - [] Update tags for Go modules
> > - [] Update docs
> > - [] Update version in Apache Arrow Cookbook
> > - [] Announce the new release
> > - [] Publish release blog posts
> > - [] Remove old artifacts
> >
> > I will need help with the following (and might need help with some of the
> > above, I'll ask if needed):
> > - [] Make the CPP PARQUET related version as "RELEASED" on JIRA
> > - [] Start the new version on JIRA for the related CPP PARQUET version
> > - [] Update conda recipes
> > - [] Update R packages
> >
> > Thanks,
> > Raúl
> >
> > El mié, 25 ene 2023 a las 21:07, Sutou Kouhei ()
> > escribió:
> >
> > > Hi,
> > >
> > > I did the following because they require PMC:
> > >
> > > - Add the new release to the Apache Reporter System
> > > - Upload source
> > >
> > > The current status:
> > >
> > > - [done] Update the released milestone Date and set to "Closed" on
> GitHub
> > > - [] Make the CPP PARQUET related version as "RELEASED" on JIRA
> > > - [] Start the new version on JIRA for the related CPP PARQUET version
> > > - [done] Merge changes on release branch to maintenance branch for
> patch
> > > releases
> > > - [done] Add the new release to the Apache Reporter System
> > > - [done] Upload source
> > > - [] Upload binaries
> > > - [] Update website
> > > - [] Update Homebrew packages
> > > - [] Update MSYS2 package
> > > - [] Upload RubyGems
> > > - [] Upload JavaScript packages
> > > - [] Upload C# packages
> > > - [] Update conda recipes
> > > - [] Upload wheels/sdist to pypi
> > > - [] Publish Maven artifacts
> > > - [] Update R packages
> > > - [] Update vcpkg port
> > > - [] Update Conan recipe
> > > - [] Bump versions
> > > - [] Update tags for Go modules
> > > - [] Update docs
> > > - [] Update version in Apache Arrow Cookbook
> > > - [] Announce the new release
> > > - [] Publish release blog posts
> > > - [] Remove old artifacts
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In  >
> > >   "[RESULT][VOTE] Release Apache Arrow 11.0.0 - RC0" on Wed, 25 Jan
> 2023
> > > 11:17:35 +0100,
> > >   Raúl Cumplido  wrote:
> > >
> > > > Hi,
> > > >
> > > > The result of the vote is successful with 3 +1 binding votes, 3 +1
> > > > non-binding votes and no -1 votes.
> > > >
> > > > > I will start with the post release tasks [1] and Kou has volunteered
> > > > > to help me with the tasks that require PMC.
> > > >
> > > > Can someone with PARQUET permissions help with the following tasks:
> > > > - Make the CPP PARQUET related version as "RELEASED" on JIRA
> > > > - Start the new version on JIRA for the related CPP PARQUET version
> > > >
> > > > Thanks,
> > > > Raúl
> > > >
> > > > [1]
> > > >
> > >
> >
> https://arrow.apache.org/docs/d

[ANNOUNCE] Apache Arrow 11.0.0 released

2023-02-02 Thread Raúl Cumplido
The Apache Arrow community is pleased to announce the 11.0.0 release. It
includes 423 resolved issues ([1]) since the 10.0.1 release.

The release is available now from our website and [2]:
http://arrow.apache.org/install/

Read about what's new in the release
https://arrow.apache.org/blog/2023/01/25/11.0.0-release/

Changelog
https://arrow.apache.org/release/11.0.0.html

What is Apache Arrow?
---------------------

Apache Arrow is a columnar in-memory analytics layer designed to accelerate
big data. It houses a set of canonical in-memory representations of flat and
hierarchical data along with multiple language-bindings for structure
manipulation. It also provides low-overhead streaming and batch messaging,
zero-copy interprocess communication (IPC), and vectorized in-memory
analytics libraries.

Please report any feedback to the mailing lists ([3]).

Regards,
The Apache Arrow community

[1]: https://github.com/apache/arrow/milestone/1?closed=1
[2]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-11.0.0/
[3]: https://lists.apache.org/list.html?dev@arrow.apache.org


Re: [C++] Parquet and Arrow overlap

2023-02-02 Thread Raphael Taylor-Davies

Hi,


> Does the parquet rust implementation have a similar issue?


Similar to the C++ implementation, the Rust implementation lives under 
the Apache Arrow umbrella and does not have any direct affiliation with 
the Apache Parquet project that I am aware of, beyond using the same 
format specification. However, as almost all of the users and 
contributions are with respect to the arrow interfaces, and not the 
parquet record APIs, there perhaps isn't the same ambiguity as 
encountered with the C++ implementation. I would expect all issues to be 
raised in the arrow-rs repository, and a PARQUET Jira only raised, 
likely by myself or whoever is triaging the issue, if there is some 
issue/ambiguity pertaining to the format itself.


Kind Regards,

Raphael

On 02/02/2023 01:58, Gang Wu wrote:

Hi Will,

AFAIK, the Apache Parquet community no longer considers contribution to
parquet-cpp when promoting new committers after the donation to Apache
Arrow.

It would be a dilemma for the parquet-cpp contributors if none of the
Apache Arrow community or Apache Parquet community recognizes their work.

Does the parquet rust implementation have a similar issue?

Best,
Gang

On Thu, Feb 2, 2023 at 3:27 AM Will Jones  wrote:


Hello,

A while back, the Parquet C++ implementation was merged into the Apache
Arrow monorepo [1]. As I understand it, this helped the development process
immensely. However, I am noticing some governance issues because of it.

First, it's not obvious where issues are supposed to be opened: in Parquet
Jira or Arrow GitHub issues. Looking back at some of the original
discussion, it looks like the intention was

* use PARQUET-XXX for issues relating to Parquet core

* use ARROW-XXX for issues relating to Arrow's consumption of Parquet
core (e.g. changes that are in parquet/arrow right now)


The README for the old parquet-cpp repo [3] states instead in its
migration note:

  JIRA issues should continue to be opened in the PARQUET JIRA project.


Either way, it doesn't seem like this process is obvious to people. Perhaps
we could clarify this and add notices to Arrow's GitHub issues template?
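
The originally intended split between the two trackers could be sketched as a small routing rule. This is only an illustration, assuming the rule "anything under parquet/arrow goes to ARROW, other Parquet code goes to PARQUET"; the function name `tracker_for` and the example paths are hypothetical and not part of any existing Arrow tooling.

```python
from pathlib import PurePosixPath


def tracker_for(changed_path: str) -> str:
    """Return the tracker prefix ("ARROW" or "PARQUET") for a changed file.

    Assumed rule (a sketch, not an agreed policy):
    - files outside any `parquet` directory belong to ARROW
    - files under `parquet/arrow` are Arrow's consumption of Parquet: ARROW
    - every other file under `parquet` is Parquet core: PARQUET
    """
    parts = PurePosixPath(changed_path).parts
    if "parquet" not in parts:
        return "ARROW"
    idx = parts.index("parquet")
    # The directory directly below `parquet` decides the tracker.
    if idx + 1 < len(parts) and parts[idx + 1] == "arrow":
        return "ARROW"
    return "PARQUET"


# Illustrative paths only.
for path in (
    "cpp/src/parquet/arrow/reader.cc",
    "cpp/src/parquet/column_reader.cc",
    "cpp/src/arrow/array.cc",
):
    print(path, "->", tracker_for(path))
```

A rule like this could back a notice in the issue template, but the boundary it encodes is exactly the part that needs to be agreed on first.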

Second, committer status is a little unclear. I am a committer on Arrow,
but not on Parquet right now. Does that mean I should only merge Parquet
C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
Parquet changes at all?

Also, are the contributions to Arrow C++ Parquet being actively reviewed
for potential new committers?

Best,

Will Jones

[1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
[2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
[3] https://github.com/apache/parquet-cpp



Re: [C++] Parquet and Arrow overlap

2023-02-02 Thread Raúl Cumplido
Hi,

I just wanted to add that with the recent migration to GitHub issues for
Arrow we have updated our development tools (merge script, archery release
tasks, ...) to work with GitHub but we haven't been able to drop JIRA
support due to having to support Parquet issues. This means we currently
have to support two issue trackers. For context, in the 11.0.0 release
there were 6 issues tracked in the Parquet JIRA.

Thanks,
Raúl



On Thu, Feb 2, 2023 at 10:14, Antoine Pitrou wrote:

>
> Hi Will,
>
> On 01/02/2023 at 20:27, Will Jones wrote:
> >
> > First, it's not obvious where issues are supposed to be opened: in Parquet
> > Jira or Arrow GitHub issues. Looking back at some of the original
> > discussion, it looks like the intention was
> >
> > * use PARQUET-XXX for issues relating to Parquet core
> >> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
> >> core (e.g. changes that are in parquet/arrow right now)
> >>
> > The README for the old parquet-cpp repo [3] states instead in its
> > migration note:
> >
> >   JIRA issues should continue to be opened in the PARQUET JIRA project.
> >
> > Either way, it doesn't seem like this process is obvious to people. Perhaps
> > we could clarify this and add notices to Arrow's GitHub issues template?
>
> I agree we should clarify this. I have no personal preference, but I
> will note that GitHub issues decrease friction as having a GH account is
> already necessary for submitting PRs.
>
> > Second, committer status is a little unclear. I am a committer on Arrow,
> > but not on Parquet right now. Does that mean I should only merge Parquet
> > C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
> > Parquet changes at all?
>
> Since Parquet C++ is part of Arrow C++, you are allowed to merge Parquet
> C++ changes. As always you should ensure you have sufficient
> understanding of the contribution, and that it follows established
> practices:
> https://arrow.apache.org/docs/dev/developers/reviewing.html
>
> > Also, are the contributions to Arrow C++ Parquet being actively reviewed
> > for potential new committers?
>
> I would certainly do so.
>
> Regards
>
> Antoine.
>


Re: [DISCUSS] PR automation workflow

2023-02-02 Thread Antoine Pitrou



Hi Raul,

Since I'm the one who proposed that we reuse CPython's existing workflow 
infrastructure, it follows logically that I'm in favour :-)


As a CPython core developer myself (though inactive lately), I will add 
that this workflow really eases the work of reviewing PRs, as it makes it 
obvious whether a PR needs attention from a committer.


Once we start working with it, we may decide to make adjustments to 
better fit our expectations, but I think we can start with the 
unmodified workflow scheme.


Regards

Antoine.


On 01/02/2023 at 15:34, Raúl Cumplido wrote:

Hi,

I would like to start working on some automation for our PRs and issues
workflows.

I've heard, and have experienced, the frustration of spending a lot of time
on our issue tracker and our PRs to follow up on their status.
It is very hard to keep track of which PRs and issues are waiting for user
feedback, have gone stale or are pending maintainer/committer action.
This means users frequently get no timely response, all the while we
regularly spend time on GH to look for PRs / issues needing action from us.
As a first step we should probably tackle PRs. Once PRs are tackled and we
are satisfied with the solution, we can try to devise a similar one for GH
issues.

An example of a great improvement is the CODEOWNERS addition [1]. This
allows us to use filters like `is:pr is:open user-review-requested:@me` [2]
which will show PRs that have requested a review from us. This does not
solve the problem of identifying which PRs are waiting for a second review,
waiting for changes, et cetera.
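
As a small illustration, the search link for such a filter can be built programmatically. This is a minimal sketch using only the Python standard library; the helper name `review_queue_url` is hypothetical.

```python
from urllib.parse import urlencode


def review_queue_url(repo: str, query: str) -> str:
    """Build a GitHub pull-request search URL for `repo` ("owner/name")."""
    # urlencode percent-encodes ':' and '@' and turns spaces into '+'.
    return f"https://github.com/{repo}/pulls?" + urlencode({"q": query})


url = review_queue_url("apache/arrow", "is:pr is:open user-review-requested:@me")
print(url)
```

The same helper works for any other search qualifier, for example a filter on one of the workflow labels once they exist.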

I don't think we have to reinvent the wheel: CPython has something that
works well and can easily be adapted/tweaked.
They use a GitHub bot (bedevere) with the following state machine:
https://github.com/python/bedevere#pr-state-machine

PRs carry exactly one of the following workflow labels, depending on their state:
- `Awaiting review`
- `Awaiting core review`
- `Awaiting changes`
- `Awaiting change review`
- `Awaiting merge`

I would like to propose adding a GitHub bot to our repo that triggers on PR
changes / comments and implements a workflow similar to the one on the
CPython repository.
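
To make the idea concrete, here is a rough sketch of such a label state machine. The five labels are the ones listed above; the event names and the transition table are simplified assumptions for illustration and do not claim to reproduce bedevere's exact behaviour.

```python
from enum import Enum


class Label(Enum):
    AWAITING_REVIEW = "Awaiting review"
    AWAITING_CORE_REVIEW = "Awaiting core review"
    AWAITING_CHANGES = "Awaiting changes"
    AWAITING_CHANGE_REVIEW = "Awaiting change review"
    AWAITING_MERGE = "Awaiting merge"


# Hypothetical events; the table is an assumed, simplified workflow.
TRANSITIONS = {
    (Label.AWAITING_REVIEW, "contributor_approved"): Label.AWAITING_CORE_REVIEW,
    (Label.AWAITING_REVIEW, "committer_approved"): Label.AWAITING_MERGE,
    (Label.AWAITING_REVIEW, "committer_requested_changes"): Label.AWAITING_CHANGES,
    (Label.AWAITING_CORE_REVIEW, "committer_approved"): Label.AWAITING_MERGE,
    (Label.AWAITING_CORE_REVIEW, "committer_requested_changes"): Label.AWAITING_CHANGES,
    (Label.AWAITING_CHANGES, "author_updated"): Label.AWAITING_CHANGE_REVIEW,
    (Label.AWAITING_CHANGE_REVIEW, "committer_approved"): Label.AWAITING_MERGE,
    (Label.AWAITING_CHANGE_REVIEW, "committer_requested_changes"): Label.AWAITING_CHANGES,
}


def next_label(current: Label, event: str) -> Label:
    """Return the label after `event`; unknown events leave the label unchanged."""
    return TRANSITIONS.get((current, event), current)


# Walk one PR through a typical review cycle.
label = Label.AWAITING_REVIEW
for event in ("committer_requested_changes", "author_updated", "committer_approved"):
    label = next_label(label, event)
    print(f"{event} -> {label.value}")
```

Keeping the transitions in a single table means later adjustments to the workflow only touch data, not the bot's event handling.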

I am going to start working on it and I would love to hear feedback about
that workflow. I have also created an issue on the repository [3].

Kind regards,
Raúl

[1] https://github.com/apache/arrow/pull/33622
[2]
https://github.com/apache/arrow/pulls?q=is%3Apr+is%3Aopen+user-review-requested%3A%40me+
[3] https://github.com/apache/arrow/issues/33977



Re: [C++] Parquet and Arrow overlap

2023-02-02 Thread Antoine Pitrou



Hi Will,

On 01/02/2023 at 20:27, Will Jones wrote:


First, it's not obvious where issues are supposed to be opened: in Parquet
Jira or Arrow GitHub issues. Looking back at some of the original
discussion, it looks like the intention was

* use PARQUET-XXX for issues relating to Parquet core

* use ARROW-XXX for issues relating to Arrow's consumption of Parquet
core (e.g. changes that are in parquet/arrow right now)


The README for the old parquet-cpp repo [3] states instead in its
migration note:

  JIRA issues should continue to be opened in the PARQUET JIRA project.

Either way, it doesn't seem like this process is obvious to people. Perhaps
we could clarify this and add notices to Arrow's GitHub issues template?


I agree we should clarify this. I have no personal preference, but I 
will note that GitHub issues decrease friction as having a GH account is 
already necessary for submitting PRs.



Second, committer status is a little unclear. I am a committer on Arrow,
but not on Parquet right now. Does that mean I should only merge Parquet
C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
Parquet changes at all?


Since Parquet C++ is part of Arrow C++, you are allowed to merge Parquet 
C++ changes. As always you should ensure you have sufficient 
understanding of the contribution, and that it follows established 
practices:

https://arrow.apache.org/docs/dev/developers/reviewing.html


Also, are the contributions to Arrow C++ Parquet being actively reviewed
for potential new committers?


I would certainly do so.

Regards

Antoine.