Re: [VOTE][RUST] Release Apache Arrow Rust 52.2.0 RC1

2024-07-26 Thread Will Jones
+1 (binding)

Verified on M2 Mac. Thanks Andrew!

On Wed, Jul 24, 2024 at 5:13 PM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on M3 Mac.
>
> Thanks Andrew.
>
> On Wed, Jul 24, 2024 at 4:31 PM Andrew Lamb  wrote:
> >
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Implementation,
> > version 52.2.0.
> >
> > This release candidate is based on commit:
> > 49e714de6e951169d0d5e73381af247ad0230fcf [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >
> > [1]:
> >
> https://github.com/apache/arrow-rs/tree/49e714de6e951169d0d5e73381af247ad0230fcf
> > [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-52.2.0-rc1
> > [3]:
> >
> https://github.com/apache/arrow-rs/blob/49e714de6e951169d0d5e73381af247ad0230fcf/CHANGELOG.md
> > [4]:
> >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
>


Re: [VOTE][RUST] Release Apache Arrow Rust 52.0.0 RC1

2024-06-04 Thread Will Jones
+1 (binding)

Verified on M1 Mac.

On Tue, Jun 4, 2024 at 5:46 AM Patrick Horan  wrote:

> +1
>
> Verified on M1 Mac.
>
> Thank you.
>
> Paddy
>
> On Tue, Jun 4, 2024, at 5:46 AM, Andrew Lamb wrote:
> > +1 (binding)
> >
> > Verified on M3 mac.
> >
> > Thank you - this one is a good one.
> >
> > Andrew
> >
> > On Mon, Jun 3, 2024 at 12:18 PM Andy Grove 
> wrote:
> >
> > > +1 (binding)
> > >
> > > Verified on Ubuntu 22.04.4 LTS.
> > >
> > > Thanks, Raphael.
> > >
> > > On Mon, Jun 3, 2024 at 10:12 AM L. C. Hsieh  wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Verified on M3 Mac.
> > > >
> > > > Thanks Raphael.
> > > >
> > > > On Mon, Jun 3, 2024 at 9:04 AM Raphael Taylor-Davies
> > > >  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I would like to propose a release of Apache Arrow Rust
> Implementation,
> > > > > version 52.0.0.
> > > > >
> > > > > This release candidate is based on commit:
> > > > > f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71 [1]
> > > > >
> > > > > The proposed release tarball and signatures are hosted at [2].
> > > > >
> > > > > The changelog is located at [3].
> > > > >
> > > > > Please download, verify checksums and signatures, run the unit
> tests,
> > > > > and vote on the release. There is a script [4] that automates some
> of
> > > > > the verification.
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 Release this as Apache Arrow Rust
> > > > > [ ] +0
> > > > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > > > >
> > > > > I vote +1 (binding) on this release
> > > > >
> > > > > [1]:
> > > > >
> > > >
> > >
> https://github.com/apache/arrow-rs/tree/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71
> > > > > [2]:
> > > >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-52.0.0-rc1
> > > > > [3]:
> > > > >
> > > >
> > >
> https://github.com/apache/arrow-rs/blob/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71/CHANGELOG.md
> > > > > [4]:
> > > > >
> > > >
> > >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-03 Thread Will Jones
+1 (binding)

On Sun, Mar 3, 2024 at 09:43 Wayne Xia  wrote:

> +1 (non-binding)
>
> Regards,
> Wayne
>
> On Mon, Mar 4, 2024 at 1:38 AM Julian Hyde  wrote:
>
> > +1 (binding)
> >
> > > On Mar 2, 2024, at 2:28 PM, Dewey Dunnington
> >  wrote:
> > >
> > > +1 (binding)
> > >
> > >> On Sat, Mar 2, 2024 at 8:08 AM vin jake  wrote:
> > >>
> > >> +1 (binding)
> > >>
> > >>> On Fri, Mar 1, 2024 at 7:33 PM Andrew Lamb 
> > wrote:
> > >>>
> > >>> Hello,
> > >>>
> > >>> As we have discussed[1][2] I would like to vote on the proposal to
> > >>> create a new Apache Top Level Project for DataFusion. The text of the
> > >>> proposed resolution and background document is copy/pasted below
> > >>>
> > >>> If the community is in favor of this, we plan to submit the
> resolution
> > >>> to the ASF board for approval with the next Arrow report (for the
> > >>> April 2024 board meeting).
> > >>>
> > >>> The vote will be open for at least 7 days.
> > >>>
> > >>> [ ] +1 Accept this Proposal
> > >>> [ ] +0
> > >>> [ ] -1 Do not accept this proposal because...
> > >>>
> > >>> Andrew
> > >>>
> > >>> [1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
> > >>> [2] https://github.com/apache/arrow-datafusion/discussions/6475
> > >>>
> > >>> -- Proposed Resolution -
> > >>>
> > >>> Resolution to Create the Apache DataFusion Project from the Apache
> > >>> Arrow DataFusion Sub Project
> > >>>
> > >>> =
> > >>>
> > >>> X. Establish the Apache DataFusion Project
> > >>>
> > >>> WHEREAS, the Board of Directors deems it to be in the best
> > >>> interests of the Foundation and consistent with the
> > >>> Foundation's purpose to establish a Project Management
> > >>> Committee charged with the creation and maintenance of
> > >>> open-source software related to an extensible query engine
> > >>> for distribution at no charge to the public.
> > >>>
> > >>> NOW, THEREFORE, BE IT RESOLVED, that a Project Management
> > >>> Committee (PMC), to be known as the "Apache DataFusion Project",
> > >>> be and hereby is established pursuant to Bylaws of the
> > >>> Foundation; and be it further
> > >>>
> > >>> RESOLVED, that the Apache DataFusion Project be and hereby is
> > >>> responsible for the creation and maintenance of software
> > >>> related to an extensible query engine; and be it further
> > >>>
> > >>> RESOLVED, that the office of "Vice President, Apache DataFusion" be
> > >>> and hereby is created, the person holding such office to
> > >>> serve at the direction of the Board of Directors as the chair
> > >>> of the Apache DataFusion Project, and to have primary responsibility
> > >>> for management of the projects within the scope of
> > >>> responsibility of the Apache DataFusion Project; and be it further
> > >>>
> > >>> RESOLVED, that the persons listed immediately below be and
> > >>> hereby are appointed to serve as the initial members of the
> > >>> Apache DataFusion Project:
> > >>>
> > >>> * Andy Grove (agr...@apache.org)
> > >>> * Andrew Lamb (al...@apache.org)
> > >>> * Daniël Heres (dhe...@apache.org)
> > >>> * Jie Wen (jake...@apache.org)
> > >>> * Kun Liu (liu...@apache.org)
> > >>> * Liang-Chi Hsieh (vii...@apache.org)
> > >>> * Qingping Hou (ho...@apache.org)
> > >>> * Wes McKinney (w...@apache.org)
> > >>> * Will Jones (wjones...@apache.org)
> > >>>
> > >>> RESOLVED, that the Apache DataFusion Project be and hereby
> > >>> is tasked with the migration and rationalization of the Apache
> > >>> Arrow DataFusion sub-project; and be it further
> > >>>
> > >>> RESOLVED, that all responsibilities pertaining to the Apache
> > >>> Arrow DataFusion sub-project encumbered upon the
> > >>> 

Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 34.0.0 RC3

2023-12-15 Thread Will Jones
+1 (binding). Verified on x86_64 Ubuntu 22.04.

On Fri, Dec 15, 2023 at 1:13 AM Jean-Baptiste Onofré 
wrote:

> +1 (non binding)
>
> Regards
> JB
>
> On Thu, Dec 14, 2023 at 9:52 PM Andy Grove  wrote:
> >
> > Hi,
> >
> > I would like to propose a release of Apache Arrow DataFusion
> Implementation,
> > version 34.0.0.
> >
> > *Please note this is RC3 - I ran into some local testing issues with RC2*
> >
> > This release candidate is based on commit:
> > 26933842e48d69f510f9461a1f2c87af587d5986 [1]
> > The proposed release tarball and signatures are hosted at [2].
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The vote will be open for at least 72 hours.
> >
> > Only votes from PMC members are binding, but all members of the community
> > are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > The standard verification procedure is documented at
> >
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > .
> >
> > [ ] +1 Release this as Apache Arrow DataFusion 34.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow DataFusion 34.0.0 because...
> >
> > Here is my vote:
> >
> > +1
> >
> > [1]:
> >
> https://github.com/apache/arrow-datafusion/tree/26933842e48d69f510f9461a1f2c87af587d5986
> > [2]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-34.0.0-rc3
> > [3]:
> >
> https://github.com/apache/arrow-datafusion/blob/26933842e48d69f510f9461a1f2c87af587d5986/CHANGELOG.md
>


Re: decimal64

2023-11-09 Thread Will Jones
> Sidenote: I haven't seen many proposals for canonical extension types
> until now, which is a bit surprising. The barrier for standardizing a
> canonical extension type is much lower than for a new Arrow data type.

Chiming in here, since I've done some exploration implementing bfloat16 as
an extension type. The reason I haven't proposed a canonical extension type
is that I've found the ecosystem I work in doesn't feel ready for this to
really be implemented as an extension type. To give a few examples:

* In PyArrow, the ChunkedArray repr can't be extended, so users always see
the raw storage values. This does not provide a good user experience when
the storage type is fixed-size binary but the data means something other
than binary data. See the example in [1].
* In arrow-rs / DataFusion, there isn't any explicit support for extension
types; they only exist as field metadata right now. There is some
discussion of adding ExtensionType to the DataType enum, but that has
been found to be an unacceptably large refactor for now [2]. There is
instead an effort to support them in DataFusion [3].

So, as I see it, the practical barrier to extension types isn't just
defining canonical ones but also improving the ecosystem to better support
them. I don't think these issues are insurmountable; they will just take a
little more work. In fact, if we put the effort into getting just one
extension type like this well supported in the Arrow ecosystem, I think the
path for additional extension types would become rather easy. I was hoping
to make some progress on this myself, but I have had to focus elsewhere.
For now, if you really care about the user experience being good, getting a
data type formally supported still seems like an easier path than the
extension type route.
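
To make that concrete, here is a minimal sketch of the kind of bfloat16
extension type described above, written against PyArrow's extension-type
API. The "example.bfloat16" name is hypothetical, and the two stored values
are 1.0 and 2.0 encoded as little-endian bfloat16:

```python
import pyarrow as pa

class BFloat16Type(pa.ExtensionType):
    """A bfloat16 value stored as 2 bytes of fixed-size binary."""

    def __init__(self):
        super().__init__(pa.binary(2), "example.bfloat16")

    def __arrow_ext_serialize__(self):
        return b""  # no type parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(BFloat16Type())

storage = pa.array([b"\x80\x3f", b"\x00\x40"], pa.binary(2))
arr = pa.ExtensionArray.from_storage(BFloat16Type(), storage)
# Per [1], the ChunkedArray repr falls back to the raw fixed-size-binary
# storage, so users see opaque bytes rather than float values.
print(pa.chunked_array([arr]))
```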

Best,

Will Jones


[1] https://github.com/apache/arrow/issues/36648
[2] https://github.com/apache/arrow-rs/issues/4472
[3] https://github.com/apache/arrow-datafusion/issues/7923

On Thu, Nov 9, 2023 at 9:39 AM Curt Hagenlocher 
wrote:

> It certainly could be. Would float16 be done as a canonical extension type
> if it were proposed today?
>
> On Thu, Nov 9, 2023 at 9:36 AM David Li  wrote:
>
> > cuDF has decimal32/decimal64 [1].
> >
> > Would a canonical extension type [2] be appropriate here? I think that's
> > come up as a solution before.
> >
> > [1]: https://docs.rapids.ai/api/cudf/stable/user_guide/data-types/
> > [2]: https://arrow.apache.org/docs/format/CanonicalExtensions.html
> >
> > On Thu, Nov 9, 2023, at 11:56, Antoine Pitrou wrote:
> > > Or they could trivially use an int64 column for that, since the scale is
> > > fixed anyway, and you're probably not going to multiply money values
> > > together.
> > >
> > >
> > > Le 09/11/2023 à 17:54, Curt Hagenlocher a écrit :
> > >> If Arrow had a decimal64 type, someone could choose to use that for a
> > >> PostgreSQL money column knowing that there are edge cases where they
> may
> > >> get an undesired result.
> > >>
> > >> On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou 
> > wrote:
> > >>
> > >>>
> > >>> Le 09/11/2023 à 17:23, Curt Hagenlocher a écrit :
> > >>>> Or more succinctly,
> > >>>> "111,111,111,111,111." will fit into a decimal64; would you
> > prevent
> > >>> it
> > >>>> from being stored in one so that you can describe the column as
> > >>>> "decimal(18, 4)"?
> > >>>
> > >>> That's what we do for other decimal types, see PyArrow below:
> > >>> ```
> > >>>   >>> pa.array([111_111_111_111_111_1111]).cast(pa.decimal128(18, 0))
> > >>> Traceback (most recent call last):
> > >>> [...]
> > >>> ArrowInvalid: Precision is not great enough for the result. It
> > >>> should be at least 19
> > >>> ```
> > >>>
> > >>>
> > >>
> >
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-07 Thread Will Jones
I agree with the approach originally proposed by Ben. It seems like the
most straightforward way to implement within the current protocol.

On Sun, Oct 29, 2023 at 4:59 PM Dewey Dunnington
 wrote:

> In the absence of a general solution to the C data interface omitting
> buffer sizes, I think the original proposal is the best way
> forward...this is the first type to be added whose buffer sizes cannot
> be calculated without looping over every element of the array; the
> buffer sizes are needed to efficiently serialize the imported array to
> IPC if imported by a consumer that cares about buffer sizes.
>
> Using a schema's flags to indicate something about a specific paired
> array (particularly one that, if misinterpreted, would lead to a
> crash) is a precedent that is probably not worth introducing for just
> one type. Currently a schema is completely independent of any
> particular ArrowArray, and I think that is a feature that is worth
> preserving. My gripes about not having buffer sizes on the CPU to more
> efficiently copy between devices is a concept almost certainly better
> suited to the ArrowDeviceArray struct.
>
> On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman 
> wrote:
> >
> > > This begs the question of what happens if a consumer receives an
> unknown
> > > flag value.
> >
> > It seems to me that ignoring unknown flags is the primary case to
> consider
> > at
> > this point, since consumers may ignore unknown flags. Since that is the
> > case,
> > it seems adding any flag which would break such a consumer would be
> > tantamount to an ABI breakage. I don't think this can be averted unless
> all
> > consumers are required to error out on unknown flag values.
> >
> > In the specific case of Utf8View it seems certain that consumers would
> add
> > support for the buffer sizes flag simultaneously with adding support for
> the
> > new type (since Utf8View is difficult to import otherwise), so any
> consumer
> > which would error out on the new flag would already be erroring out on an
> > unsupported data type.
> >
> > > I might be the only person who has implemented
> > > a deep copy of an ArrowSchema in C, but it does blindly pass along a
> > > schema's flag value
> >
> > I think passing a schema's flag value including unknown flags is an
> error.
> > The ABI defines moving structures but does not define deep copying. I
> think
> > in order to copy deeply in terms of operations which *are* specified: we
> > import then export the schema. Since this includes an export step, it
> > should not
> > include flags which are not supported by the exporter.
> >
> > On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > > >> Is this buffer lengths buffer only present if the array type is
> > > Utf8View?
> > > >
> > > > IIUC, the proposal would add the buffer lengths buffer for all types
> if
> > > the
> > > > schema's
> > > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to
> avoid
> > > > the special case and that `n_buffers` would continue to be consistent
> > > with
> > > > IPC.
> > >
> > > This begs the question of what happens if a consumer receives an
> unknown
> > > flag value. We haven't specified that unknown flag values should be
> > > ignored, so a consumer could judiciously choose to error out instead of
> > > potentially misinterpreting the data.
> > >
> > > All in all, personally I'd rather we make a special case for Utf8View
> > > instead of adding a flag that can lead to worse interoperability.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-06 Thread Will Jones
Thanks for the clarification, Raphael. That likely narrows the scope of who
is affected. If this bug is present in DataFusion 33, then delta-rs will
likely skip upgrading until 34. If we're the only downstream project this
parsing issue affects, then I think it's fine to release.

On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
 wrote:

> Hi,
>
> To further clarify the bug concerns the serde compatibility feature that
> allows converting a serde compatible data structure to arrow [1]. It will
> not impact workloads reading JSON.
>
> I am not sure this is a sufficiently fundamental bug to warrant special
> concern, but happy to defer to others.
>
> Kind Regards,
>
> Raphael
>
> [1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility
>
> On 7 November 2023 03:20:59 GMT, Will Jones 
> wrote:
> >Hello,
> >
> >There is an upstream bug in arrow-json that can cause the JSON reader to
> >return incorrect data for large integers [1]. It was recently fixed by
> >Raphael within the last 24 hours, but is not included in any release. The
> >bug was introduced in Arrow 48, which this DataFusion release will expose
> >users to.
> >
> >Not sure what the precedent here is, but I think we should consider
> >either (a) seeing if we can release and upgrade Arrow to include the fix,
> >or else (b) calling out the regression as a known bug so downstream
> >projects can include the patch in their applications.
> >
> >Best,
> >
> >Will Jones
> >
> >[1] https://github.com/apache/arrow-rs/issues/5038
> >[2] https://github.com/apache/arrow-rs/pull/5042
> >
> >On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb  wrote:
> >
> >> +1 (the tests passed for me). I have left a comment on
> >> https://github.com/apache/arrow-datafusion/issues/8069
> >>
> >> On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 
> wrote:
> >>
> >> > I filed https://github.com/apache/arrow-datafusion/issues/8069
> >> >
> >> > On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 
> >> wrote:
> >> >
> >> > > I see the same error when I run on my M1 Macbook Air with 16 GB RAM.
> >> > >
> >> > > ---- aggregates::tests::run_first_last_multi_partitions stdout ----
> >> > > Error: ResourcesExhausted("Failed to allocate additional 632 bytes
> for
> >> > > GroupedHashAggregateStream[0] with 1829 bytes already allocated -
> >> maximum
> >> > > available is 605")
> >> > >
> >> > > It worked fine on my workstation with 128 GB RAM.
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 
> wrote:
> >> > >
> >> > >> Hmm, ran verification script and got one failure:
> >> > >>
> >> > >> failures:
> >> > >>
> >> > >> ---- aggregates::tests::run_first_last_multi_partitions stdout ----
> >> > >> Error: ResourcesExhausted("Failed to allocate additional 632 bytes
> for
> >> > >> GroupedHashAggregateStream[0] with 1829 bytes already allocated -
> >> > >> maximum available is 605")
> >> > >>
> >> > >> failures:
> >> > >> aggregates::tests::run_first_last_multi_partitions
> >> > >>
> >> > >> test result: FAILED. 557 passed; 1 failed; 1 ignored; 0 measured; 0
> >> > >> filtered out; finished in 2.21s
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Mon, Nov 6, 2023 at 6:57 AM Andy Grove 
> >> > wrote:
> >> > >> >
> >> > >> > Hi,
> >> > >> >
> >> > >> > I would like to propose a release of Apache Arrow DataFusion
> >> > >> Implementation,
> >> > >> > version 33.0.0.
> >> > >> >
> >> > >> > This release candidate is based on commit:
> >> > >> > 262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
> >> > >> > The proposed release tarball and signatures are hosted at [2].
> >> > >> > The changelog is located at [3].
> >> > >> >
> >> > >> > Please download, verify checksums and signatures, run the unit
> >> tests,
> >> > >> and
> >> > >> > vote
> >> > >> > on the release. The vote will be open for at least 72 hours.
> >> > >> >
> >> > >> > Only votes from PMC members are binding, but all members of the
> >> > >> community
> >> > >> > are
> >> > >> > encouraged to test the release and vote with "(non-binding)".
> >> > >> >
> >> > >> > The standard verification procedure is documented at
> >> > >> >
> >> > >>
> >> >
> >>
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> >> > >> > .
> >> > >> >
> >> > >> > [ ] +1 Release this as Apache Arrow DataFusion 33.0.0
> >> > >> > [ ] +0
> >> > >> > [ ] -1 Do not release this as Apache Arrow DataFusion 33.0.0
> >> > because...
> >> > >> >
> >> > >> > Here is my vote:
> >> > >> >
> >> > >> > +1
> >> > >> >
> >> > >> > [1]:
> >> > >> >
> >> > >>
> >> >
> >>
> https://github.com/apache/arrow-datafusion/tree/262f08778b8ec231d96792c01fc3e051640eb5d4
> >> > >> > [2]:
> >> > >> >
> >> > >>
> >> >
> >>
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-33.0.0-rc1
> >> > >> > [3]:
> >> > >> >
> >> > >>
> >> >
> >>
> https://github.com/apache/arrow-datafusion/blob/262f08778b8ec231d96792c01fc3e051640eb5d4/CHANGELOG.md
> >> > >>
> >> > >
> >> >
> >>
>


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-06 Thread Will Jones
Hello,

There is an upstream bug in arrow-json that can cause the JSON reader to
return incorrect data for large integers [1]. It was recently fixed by
Raphael within the last 24 hours, but is not included in any release. The
bug was introduced in Arrow 48, which this DataFusion release will expose
users to.

Not sure what the precedent here is, but I think we should consider
either (a) seeing if we can release and upgrade Arrow to include the fix,
or else (b) calling out the regression as a known bug so downstream
projects can include the patch in their applications.

Best,

Will Jones

[1] https://github.com/apache/arrow-rs/issues/5038
[2] https://github.com/apache/arrow-rs/pull/5042

On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb  wrote:

> +1 (the tests passed for me). I have left a comment on
> https://github.com/apache/arrow-datafusion/issues/8069
>
> On Mon, Nov 6, 2023 at 2:02 PM Andy Grove  wrote:
>
> > I filed https://github.com/apache/arrow-datafusion/issues/8069
> >
> > On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 
> wrote:
> >
> > > I see the same error when I run on my M1 Macbook Air with 16 GB RAM.
> > >
> > > ---- aggregates::tests::run_first_last_multi_partitions stdout ----
> > > Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
> > > GroupedHashAggregateStream[0] with 1829 bytes already allocated -
> maximum
> > > available is 605")
> > >
> > > It worked fine on my workstation with 128 GB RAM.
> > >
> > >
> > >
> > > On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh  wrote:
> > >
> > >> Hmm, ran verification script and got one failure:
> > >>
> > >> failures:
> > >>
> > >> ---- aggregates::tests::run_first_last_multi_partitions stdout ----
> > >> Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
> > >> GroupedHashAggregateStream[0] with 1829 bytes already allocated -
> > >> maximum available is 605")
> > >>
> > >> failures:
> > >> aggregates::tests::run_first_last_multi_partitions
> > >>
> > >> test result: FAILED. 557 passed; 1 failed; 1 ignored; 0 measured; 0
> > >> filtered out; finished in 2.21s
> > >>
> > >>
> > >>
> > >> On Mon, Nov 6, 2023 at 6:57 AM Andy Grove 
> > wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I would like to propose a release of Apache Arrow DataFusion
> > >> Implementation,
> > >> > version 33.0.0.
> > >> >
> > >> > This release candidate is based on commit:
> > >> > 262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
> > >> > The proposed release tarball and signatures are hosted at [2].
> > >> > The changelog is located at [3].
> > >> >
> > >> > Please download, verify checksums and signatures, run the unit
> tests,
> > >> and
> > >> > vote
> > >> > on the release. The vote will be open for at least 72 hours.
> > >> >
> > >> > Only votes from PMC members are binding, but all members of the
> > >> community
> > >> > are
> > >> > encouraged to test the release and vote with "(non-binding)".
> > >> >
> > >> > The standard verification procedure is documented at
> > >> >
> > >>
> >
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > >> > .
> > >> >
> > >> > [ ] +1 Release this as Apache Arrow DataFusion 33.0.0
> > >> > [ ] +0
> > >> > [ ] -1 Do not release this as Apache Arrow DataFusion 33.0.0
> > because...
> > >> >
> > >> > Here is my vote:
> > >> >
> > >> > +1
> > >> >
> > >> > [1]:
> > >> >
> > >>
> >
> https://github.com/apache/arrow-datafusion/tree/262f08778b8ec231d96792c01fc3e051640eb5d4
> > >> > [2]:
> > >> >
> > >>
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-33.0.0-rc1
> > >> > [3]:
> > >> >
> > >>
> >
> https://github.com/apache/arrow-datafusion/blob/262f08778b8ec231d96792c01fc3e051640eb5d4/CHANGELOG.md
> > >>
> > >
> >
>


Re: [VOTE][RUST] Release Apache Arrow Rust 48.0.0 RC2

2023-10-19 Thread Will Jones
+1

Verified on M1 Mac.

Thanks Raphael!

On Wed, Oct 18, 2023 at 1:30 PM Andrew Lamb  wrote:

> +1 (binding) -- thank you Raphael
>
> Verified on x86 Mac
>
> Hint for anyone else verifying, this is RC*2* (RC1 hit an issue[1])
>
> Andrew
>
> [1]: https://github.com/apache/arrow-rs/pull/4950
>
> On Wed, Oct 18, 2023 at 12:39 PM L. C. Hsieh  wrote:
>
> > +1 (binding)
> >
> > Verified on M1 Mac.
> >
> > Thanks Raphael.
> >
> > On Wed, Oct 18, 2023 at 6:59 AM Raphael Taylor-Davies
> >  wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > version 48.0.0 *RC2*.
> > >
> > > Please note that there were issues with the first release candidate
> that
> > > required cutting a second.
> > >
> > > This release candidate is based on commit:
> > > 51ac6fec8755147cd6b1dfe7d76bfdcfacad0463 [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> >
> https://github.com/apache/arrow-rs/tree/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463
> > > [2]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-48.0.0-rc2
> > > [3]:
> > >
> >
> https://github.com/apache/arrow-rs/blob/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463/CHANGELOG.md
> > > [4]:
> > >
> >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> >
>


Re: Language-specific discussion (with C# example)

2023-10-17 Thread Will Jones
Hi Curt,

I think the most visible place for now would be creating an issue for
discussion.

In the future, if you and some others want to have a place to discuss C#
development, you could create a channel in a chat app. For example, Arrow
Rust has both a Slack channel in the official ASF Slack as well as a
Discord channel [1]. There is also Zulip, which is used by some of the
C++, Python, and R developers. (Although I can't find where this is
documented anymore. This is still around, right?) These kinds of places
usually aren't great for holding the discussion itself, but they are
useful when you want to post an issue to draw attention to a discussion
and are looking for a targeted audience.

Best,

Will Jones

[1] https://github.com/apache/arrow-rs#arrow-rust-community

On Tue, Oct 17, 2023 at 3:20 PM Curt Hagenlocher 
wrote:

> I'm curious what other (sub-) communities do about implementation-specific
> considerations that aren't directly tied to the Arrow standard. I don't see
> much of that kind of discussion on the dev list; does that mean these
> happen largely in the context of specific pull requests -- or perhaps not
> at all?
>
> My specific motivation for asking is that there are three similar feature
> requests for C#: 23892 <https://github.com/apache/arrow/issues/23892>,
> 37359
> <https://github.com/apache/arrow/issues/37359> and 35199
> <https://github.com/apache/arrow/issues/35199>. Looking at these, I was
> thinking that the best general solution would be to have the scalar arrays
> in C# implement IReadOnlyList<T> and ICollection<T>. The former is a
> strictly-better superset of IEnumerable<T> which also allows indexing by
> position, while the latter is an unfortunate concession to working well
> with "LINQ" (pre-.NET 9). Implementing ICollection<T> would allow LINQ's
> "ToList" to just work, and work efficiently.
>
> But it feels weird to just submit a PR for this kind of implementation
> decision without more feedback from users or potential users, and at the
> same time it doesn't feel significant enough to e.g. write it up in a
> document to submit for review. I could (and will) open a new issue for this
> on GitHub, but it doesn't look like anyone proactively looks at new issues
> to find things to comment on.
>
> So what do others do?
>
> Thanks,
> -Curt
>


Re: [DISCUSS] Arrow PyCapsule Protocol

2023-10-11 Thread Will Jones
FYI, as just discussed on the Arrow Community Call, we plan to include this
in the 14.0.0 release as an experimental protocol.

In the future, we may hold a vote to remove the experimental flag, although
it's unclear if this protocol requires a vote because (under certain
interpretations) it is not "cross-language". [1]

[1] https://arrow.apache.org/docs/format/Changing.html

On Tue, Oct 10, 2023 at 10:29 AM Will Jones  wrote:

> Hello Arrow devs,
>
> We are winding down discussion and review. I have created a rendered
> version of the proposed protocol: [1]
>
> Feel free to add feedback in the PR [2] or on this thread.
>
> Best,
> Will Jones
>
> [1]
> http://crossbow.voltrondata.com/pr_docs/37797/format/CDataInterface/PyCapsuleInterface.html
> [2] https://github.com/apache/arrow/pull/37797
>
> On Fri, Sep 22, 2023 at 8:11 PM Will Jones 
> wrote:
>
>> Hello Arrow devs,
>>
>> Based on Joris' idea in [1], I've drafted up a protocol specification for
>> PyCapsules holding Arrow C Data / Stream Interface structs. PR: [2]
>>
>> This has two goals:
>>
>> 1. Provide a library-independent representation of these structs in Python
>> 2. Standardize methods to export objects to these PyCapsules.
>>
>> This will help projects like nanoarrow be able to import Arrow data
>> safely from more than just PyArrow [3]. It would also allow libraries to
>> easily interchange Arrow data without requiring going through PyArrow or
>> writing a bespoke export function.
>>
>> I would welcome feedback in the PR [2].
>>
>> Thanks for your attention,
>>
>> Will Jones
>>
>> [1] https://github.com/apache/arrow/issues/34031
>> [2] https://github.com/apache/arrow/pull/37797
>> [3]
>> https://github.com/apache/arrow-nanoarrow/blob/c4816261dc34f5f898b1658359c25b867b1079cd/python/src/nanoarrow/lib.py#L21-L35
>>
>>


Re: [DISCUSS] Arrow PyCapsule Protocol

2023-10-10 Thread Will Jones
Hello Arrow devs,

We are winding down discussion and review. I have created a rendered
version of the proposed protocol: [1]

Feel free to add feedback in the PR [2] or on this thread.

Best,
Will Jones

[1]
http://crossbow.voltrondata.com/pr_docs/37797/format/CDataInterface/PyCapsuleInterface.html
[2] https://github.com/apache/arrow/pull/37797

On Fri, Sep 22, 2023 at 8:11 PM Will Jones  wrote:

> Hello Arrow devs,
>
> Based on Joris' idea in [1], I've drafted up a protocol specification for
> PyCapsules holding Arrow C Data / Stream Interface structs. PR: [2]
>
> This has two goals:
>
> 1. Provide a library-independent representation of these structs in Python
> 2. Standardize methods to export objects to these PyCapsules.
>
> This will help projects like nanoarrow be able to import Arrow data safely
> from more than just PyArrow [3]. It would also allow libraries to easily
> interchange Arrow data without requiring going through PyArrow or writing a
> bespoke export function.
>
> I would welcome feedback in the PR [2].
>
> Thanks for your attention,
>
> Will Jones
>
> [1] https://github.com/apache/arrow/issues/34031
> [2] https://github.com/apache/arrow/pull/37797
> [3]
> https://github.com/apache/arrow-nanoarrow/blob/c4816261dc34f5f898b1658359c25b867b1079cd/python/src/nanoarrow/lib.py#L21-L35
>
>


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 32.0.0 RC1

2023-10-09 Thread Will Jones
+1 (binding)
Verified on M1 Mac.

On Mon, Oct 9, 2023 at 7:12 AM Andrew Lamb  wrote:

> +1 (binding)
> Verified on x86 mac
>
> Thanks Andy
>
> On Sun, Oct 8, 2023 at 1:22 PM Andy Grove  wrote:
>
> > Hi,
> >
> > I would like to propose a release of Apache Arrow DataFusion
> > Implementation,
> > version 32.0.0.
> >
> > This release candidate is based on commit:
> > eca48dae2447a67fcf30313c956e6c39cf739d48 [1]
> > The proposed release tarball and signatures are hosted at [2].
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The vote will be open for at least 72 hours.
> >
> > Only votes from PMC members are binding, but all members of the community
> > are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > The standard verification procedure is documented at
> >
> >
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > .
> >
> > [ ] +1 Release this as Apache Arrow DataFusion 32.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow DataFusion 32.0.0 because...
> >
> > Here is my vote:
> >
> > +1
> >
> > [1]:
> >
> >
> https://github.com/apache/arrow-datafusion/tree/eca48dae2447a67fcf30313c956e6c39cf739d48
> > [2]:
> >
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-32.0.0-rc1
> > [3]:
> >
> >
> https://github.com/apache/arrow-datafusion/blob/eca48dae2447a67fcf30313c956e6c39cf739d48/CHANGELOG.md
> >
>


Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1

2023-09-27 Thread Will Jones
+1 (binding)

Verified on M1 Mac.

On Tue, Sep 26, 2023 at 11:29 AM Andrew Lamb  wrote:

> +1 (binding)
>
> Verified on mac x86_64
>
> Looks like a good release to me -- thank you Raphael
>
> Andrew
>
> On Tue, Sep 26, 2023 at 12:05 PM Raphael Taylor-Davies
>  wrote:
>
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Object
> > Store Implementation, version 0.7.1.
> >
> > This release candidate is based on commit:
> > 4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust Object Store
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
> >
> > [1]:
> >
> >
> https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
> > [2]:
> >
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
> > [3]:
> >
> >
> https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
> > [4]:
> >
> >
> https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
> >
> >
>


Re: [VOTE] Release Apace Arrow nanoarrow 0.3.0 - RC0

2023-09-27 Thread Will Jones
+1 (binding)

Verified with Conda on MacOS M1.

On Tue, Sep 26, 2023 at 6:49 PM Jacob Wujciak-Jens
 wrote:

> +1 (non-binding)
>
> full verification with conda arrow 13.0.0 R 4.3 on pop_os 23.04, cmake
> 3.27, gcc 11
>
> On Wed, Sep 27, 2023 at 1:26 AM Bryce Mecum  wrote:
>
> > +1 (non-binding)
> >
> > Verified with `./verify-release-candidate.sh 0.3.0 0` on:
> > - Windows 10, x86_64, libarrow-main, MSVC 17 2022, R 4.3.1, Rtools 43
> > - macOS 13.6, aarch64, libarrow 13.0.0, R 4.3.1
> > - Ubuntu 23.04, aarch64, libarrow 13.0.0, R 4.2.2
> >
>


[DISCUSS] Arrow PyCapsule Protocol

2023-09-22 Thread Will Jones
Hello Arrow devs,

Based on Joris' idea in [1], I've drafted up a protocol specification for
PyCapsules holding Arrow C Data / Stream Interface structs. PR: [2]

This has two goals:

1. Provide a library-independent representation of these structs in Python
2. Standardize methods to export objects to these PyCapsules.

This will help projects like nanoarrow be able to import Arrow data safely
from more than just PyArrow [3]. It would also allow libraries to easily
interchange Arrow data without requiring going through PyArrow or writing a
bespoke export function.
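
As a rough illustration of goal 2, here is a hedged sketch of what a
producer implementing the proposed dunder methods might look like. The
wrapper class is hypothetical, the method names follow the draft in [2],
and the bodies simply delegate to PyArrow's own implementation of these
methods (available once the protocol ships in PyArrow):

```python
import pyarrow as pa

class MyArrayWrapper:
    """Hypothetical third-party object exposing Arrow data via PyCapsules."""

    def __init__(self, data):
        self._arr = pa.array(data)

    def __arrow_c_schema__(self):
        # PyCapsule named "arrow_schema" wrapping an ArrowSchema struct.
        return self._arr.type.__arrow_c_schema__()

    def __arrow_c_array__(self, requested_schema=None):
        # Pair of capsules wrapping ArrowSchema and ArrowArray structs.
        return self._arr.__arrow_c_array__(requested_schema)

# Any consumer that understands the protocol can then import the data
# without depending on MyArrayWrapper, e.g. via pa.array(...) once
# PyArrow implements the consumer side.
```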

I would welcome feedback in the PR [2].

Thanks for your attention,

Will Jones

[1] https://github.com/apache/arrow/issues/34031
[2] https://github.com/apache/arrow/pull/37797
[3]
https://github.com/apache/arrow-nanoarrow/blob/c4816261dc34f5f898b1658359c25b867b1079cd/python/src/nanoarrow/lib.py#L21-L35


Re: [LAST CALL][DISCUSS] Unsigned integers in Utf8View

2023-09-19 Thread Will Jones
Hi Ben,

I'm open to the idea of using unsigned if it allows compatibility with an
existing implementation or two. Could we name which ones it would be
compatible with? Links to implementation code would be very welcome, if
available.

As I understood it originally, the current implementations use raw pointers
rather than offsets into buffers, so these values already weren't
compatible to begin with. For example, it seems like Velox uses raw
pointers into buffers [1]. So unless I'm missing something, it seems like
these implementations will have to map the pointers to offsets anyways, so
maybe it's not much more trouble to convert to signed integers on the way.
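
For reference, here is a sketch of the 16-byte long-string view struct at
issue, using Python's struct module; "<i4sii" packs the signed variant and
"<I4sII" would be the unsigned one. With signed fields, the byte offset
into a single data buffer tops out at 2**31 - 1 (~2 GiB), which is where
the 2GB figure below comes from:

```python
import struct

# Long-form Utf8View view (strings longer than 12 bytes): 4-byte length,
# 4-byte prefix copy, data buffer index, and byte offset into that buffer.
length, prefix, buffer_index, offset = 20, b"hell", 0, 1024
view = struct.pack("<i4sii", length, prefix, buffer_index, offset)
assert len(view) == 16

# Round-trips; an unsigned reading would use "<I4sII" instead.
assert struct.unpack("<i4sii", view) == (length, prefix, buffer_index, offset)
```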

Best,

Will Jones

[1]
https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L46-L78

On Tue, Sep 19, 2023 at 8:26 AM Benjamin Kietzman 
wrote:

> Hello again all,
>
> It seems there hasn't been much interest in this point so I'm leaning
> toward keeping unsigned integers. If anyone has a concern please respond
> here and/or on the PR [1].
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
>
> On Thu, Sep 14, 2023 at 9:31 AM David Li  wrote:
>
> > I think Java was usually raised as the odd child out when this has come
> up
> > before. Since Java 8 there are standard library methods to manipulate
> > signed integers as if they were unsigned, so in principle Java shouldn't
> be
> > a blocker anymore.
> >
> > That said, ByteBuffer is still indexed by int so in practice Java
> wouldn't
> > be able to handle more than 2 GB in a single buffer, at least until we
> can
> > use the Java 21+ APIs (MemorySegment is finally indexed by (signed)
> long).
> >
> > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote:
> > > Hello all,
> > >
> > > Utf8View was recently accepted [1] and I've opened a PR to add the
> > > spec/schema changes [2]. In review [3], it was requested that signed 32
> > bit
> > > integers be used for the fields of view structs instead of 32 bit
> > unsigned.
> > >
> > > This divergence has been discussed on the ML previously [4], but in
> light
> > > of my reviewer's request for a change it should be raised again for
> > focused
> > > discussion. (At this stage, I don't *think* the change would require
> > > another vote.) I'll enumerate the motivations for signed and unsigned
> as
> > I
> > > understand them.
> > >
> > > Signed:
> > > - signed integers are conventional in the arrow format
> > > - unsigned integers may cause some difficulty of implementation in
> > > languages which don't natively support them
> > >
> > > Unsigned:
> > > - unsigned integers are used by engines which already implement
> Utf8View
> > >
> > > My own bias is toward compatibility with existing implementers, but
> using
> > > signed integers will only affect the case of arrays which include data
> > > buffers larger than 2GB. For reference, the default buffer size in
> velox
> > is
> > > 32KB so such a massive data buffer would only occur when a single slot
> > of a
> > > string array has 2.1GB of characters. This seems sufficiently unlikely
> > that
> > > I wouldn't consider it a blocker.
> > >
> > > Sincerely,
> > > Ben Kietzman
> > >
> > > [1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
> > > [2] https://github.com/apache/arrow/pull/37526
> > > [3] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
> > > [4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > [5]
> > >
> >
> https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50
> >
>


Re: [VOTE][RUST] Release Apache Arrow Rust 47.0.0 RC1

2023-09-19 Thread Will Jones
+1 (binding).

Verified on Mac M1.

Thanks for managing this release, Raphael!

On Tue, Sep 19, 2023 at 6:21 AM Raphael Taylor-Davies
 wrote:

> This time with the links..
>
> I would like to propose a release of Apache Arrow Rust Implementation,
> version 47.0.0.
>
> This release candidate is based on commit:
> 1d6feeacebb8d0d659d493b783ba381940973745 [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. There is a script [4] that automates some of
> the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust  because...
>
> [1]:
>
> https://github.com/apache/arrow-rs/tree/1d6feeacebb8d0d659d493b783ba381940973745
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-47.0.0-rc1
> [3]:
>
> https://github.com/apache/arrow-rs/blob/1d6feeacebb8d0d659d493b783ba381940973745/CHANGELOG.md
> [4]:
>
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
>
> On 19/09/2023 14:18, Raphael Taylor-Davies wrote:
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Implementation,
> > version 47.0.0.
> >
> > This release candidate is based on commit:
> > 1d6feeacebb8d0d659d493b783ba381940973745 [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >
>


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 31.0.0 RC1

2023-09-09 Thread Will Jones
+1 (binding). Verified on x86_64 Ubuntu Linux.

Thanks, Andy.

On Fri, Sep 8, 2023 at 2:16 PM Chao Sun  wrote:

> +1 (non-binding). Thanks Andy!
>
> On Fri, Sep 8, 2023 at 12:18 PM Andrew Lamb  wrote:
> >
> > +1 (binding)
> >
> > Thanks again Andy for keeping the release train moving forward
> >
> > Andrew
> >
> > On Fri, Sep 8, 2023 at 12:49 PM L. C. Hsieh  wrote:
> >
> > > +1 (binding)
> > >
> > > Verified on M1 Mac.
> > >
> > > Thanks Andy.
> > >
> > >
> > > On Fri, Sep 8, 2023 at 9:01 AM Andy Grove 
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I would like to propose a release of Apache Arrow DataFusion
> > > Implementation,
> > > > version 31.0.0.
> > > >
> > > > This release candidate is based on commit:
> > > > 44cf6f127ddfba7cda0c243b22f7e0fce70f16ec [1]
> > > > The proposed release tarball and signatures are hosted at [2].
> > > > The changelog is located at [3].
> > > >
> > > > Please download, verify checksums and signatures, run the unit
> tests, and
> > > > vote
> > > > on the release. The vote will be open for at least 72 hours.
> > > >
> > > > Only votes from PMC members are binding, but all members of the
> community
> > > > are
> > > > encouraged to test the release and vote with "(non-binding)".
> > > >
> > > > The standard verification procedure is documented at
> > > >
> > >
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > > > .
> > > >
> > > > [ ] +1 Release this as Apache Arrow DataFusion 31.0.0
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Arrow DataFusion 31.0.0
> because...
> > > >
> > > > Here is my vote:
> > > >
> > > > +1
> > > >
> > > > [1]:
> > > >
> > >
> https://github.com/apache/arrow-datafusion/tree/44cf6f127ddfba7cda0c243b22f7e0fce70f16ec
> > > > [2]:
> > > >
> > >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-31.0.0-rc1
> > > > [3]:
> > > >
> > >
> https://github.com/apache/arrow-datafusion/blob/44cf6f127ddfba7cda0c243b22f7e0fce70f16ec/CHANGELOG.md
> > >
>


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-09-01 Thread Will Jones
Thanks for pointing that out, Dane. I think that seems like an obvious
choice for Dask to be able to consume this protocol.

On Fri, Sep 1, 2023 at 10:13 AM Dane Pitkin 
wrote:

> The Python Substrait package[1] is on PyPi[2] and currently has python
> wrappers for the Substrait protobuf objects. I think this will be a great
> opportunity to identify helper features that users of this protocol would
> like to see. I'll be keeping an eye out as this develops, but also feel
> free to file feature requests in the project!
>
>
> [1]https://github.com/substrait-io/substrait-python
> [2]https://pypi.org/project/substrait/
>
>
> On Thu, Aug 31, 2023 at 10:05 PM Will Jones 
> wrote:
>
> > Hello Arrow devs,
> >
> > We discussed this further in the Arrow community call on 2023-08-30 [1],
> > and concluded we should create an entirely new protocol that uses
> Substrait
> > expressions. I have created an issue [2] to track this and will start a
> PR
> > soon.
> >
> > It does look like we might block this on creating a PyCapsule based
> > protocol for arrays, schemas, and streams. That is tracked here [3].
> > Hopefully that isn't too ambitious :)
> >
> > Best,
> >
> > Will Jones
> >
> >
> > [1]
> >
> >
> https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit
> > [2] https://github.com/apache/arrow/issues/37504
> > [3] https://github.com/apache/arrow/issues/35531
> >
> >
> > On Tue, Aug 29, 2023 at 2:59 PM Ian Cook  wrote:
> >
> > > An update about this:
> > >
> > > Weston's PR https://github.com/apache/arrow/pull/34834/ merged last
> > > week. This makes it possible to convert PyArrow expressions to/from
> > > Substrait expressions.
> > >
> > > As Fokko previously noted, the PR does not change the PyArrow Dataset
> > > interface at all. It simply enables a Substrait expression to be
> > > converted to a PyArrow expression, which can then be used to
> > > filter/project a Dataset.
> > >
> > > There is a basic example here demonstrating this:
> > > https://gist.github.com/ianmcook/f70fc185d29ae97bdf85ffe0378c68e0
> > >
> > > We might now consider whether to build upon this to create a Dataset
> > > protocol that is independent of the PyArrow Expression implementation
> > > and that could interoperate across languages.
> > >
> > > Ian
> > >
> > > On Mon, Jul 3, 2023 at 5:48 PM Will Jones 
> > wrote:
> > > >
> > > > Hello,
> > > >
> > > > After thinking about it, I think I understand the approach David Li
> and
> > > Ian
> > > > are suggesting with respect to expressions. There will be some
> > arguments
> > > > that only PyArrow's own datasets support, but that aren't in the
> > generic
> > > > protocol. Passing
> > > > PyArrow expressions to the filters argument should be considered one
> of
> > > > those. DuckDB and others are currently passing them down, so they
> > aren't
> > > > yet using the protocol properly. But once we add support in the
> > protocol
> > > > for passing filters via Substrait expressions, we'll move DuckDB and
> > > others
> > > > over to be fully compliant with the protocol.
> > > >
> > > > It's a bit of an awkward temporary state for now, but so would having
> > > > PyArrow expressions in the protocol just to be deprecated in a few
> > > months.
> > > > One caveat is that we'll need to provide DuckDB and other consumers
> > with
> > > a
> > > > way to tell whether the dataset supports passing filters as Substrait
> > > > expression or PyArrow ones, since I doubt they'll want to lose
> support
> > > for
> > > > integrating with older PyArrow versions.
> > > >
> > > > I've removed filters from the protocol for now, with the intention of
> > > > bringing them back as soon as we can get Substrait support. I think
> we
> > > can
> > > > do this in the 14.0.0 release.
> > > >
> > > > Best,
> > > >
> > > > Will Jones
> > > >
> > > >
> > > > On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong 
> > > wrote:
> > > >
> > > > > Hey everyone,
> > > > >
> > > > > Chiming in here from the PyIceberg side. I would love to see the
> > > protocol
> > > > > as p

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-08-31 Thread Will Jones
Hello Arrow devs,

We discussed this further in the Arrow community call on 2023-08-30 [1],
and concluded we should create an entirely new protocol that uses Substrait
expressions. I have created an issue [2] to track this and will start a PR
soon.
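
As a point of reference, here is a minimal sketch of the expression
round-trip that Weston's PR (linked in Ian's message below) enables. The
serialize_expressions / deserialize_expressions helpers under
pyarrow.substrait are assumed from that PR:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.substrait as substrait

schema = pa.schema([("x", pa.int64())])
filter_expr = pc.field("x") > 10

# Serialize the PyArrow expression into a Substrait ExtendedExpression
# message bound to the schema...
buf = substrait.serialize_expressions([filter_expr], ["over_10"], schema)

# ...and round-trip it back into a PyArrow expression, as a
# Substrait-based dataset protocol would do on the consumer side.
bound = substrait.deserialize_expressions(buf)
print(bound.expressions["over_10"])
```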

It does look like we might block this on creating a PyCapsule based
protocol for arrays, schemas, and streams. That is tracked here [3].
Hopefully that isn't too ambitious :)

Best,

Will Jones


[1]
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit
[2] https://github.com/apache/arrow/issues/37504
[3] https://github.com/apache/arrow/issues/35531


On Tue, Aug 29, 2023 at 2:59 PM Ian Cook  wrote:

> An update about this:
>
> Weston's PR https://github.com/apache/arrow/pull/34834/ merged last
> week. This makes it possible to convert PyArrow expressions to/from
> Substrait expressions.
>
> As Fokko previously noted, the PR does not change the PyArrow Dataset
> interface at all. It simply enables a Substrait expression to be
> converted to a PyArrow expression, which can then be used to
> filter/project a Dataset.
>
> There is a basic example here demonstrating this:
> https://gist.github.com/ianmcook/f70fc185d29ae97bdf85ffe0378c68e0
>
> We might now consider whether to build upon this to create a Dataset
> protocol that is independent of the PyArrow Expression implementation
> and that could interoperate across languages.
>
> Ian
>
> On Mon, Jul 3, 2023 at 5:48 PM Will Jones  wrote:
> >
> > Hello,
> >
> > After thinking about it, I think I understand the approach David Li and
> Ian
> > are suggesting with respect to expressions. There will be some arguments
> > that only PyArrow's own datasets support, but that aren't in the generic
> > protocol. Passing
> > PyArrow expressions to the filters argument should be considered one of
> > those. DuckDB and others are currently passing them down, so they aren't
> > yet using the protocol properly. But once we add support in the protocol
> > for passing filters via Substrait expressions, we'll move DuckDB and
> others
> > over to be fully compliant with the protocol.
> >
> > It's a bit of an awkward temporary state for now, but so would having
> > PyArrow expressions in the protocol just to be deprecated in a few
> months.
> > One caveat is that we'll need to provide DuckDB and other consumers with
> a
> > way to tell whether the dataset supports passing filters as Substrait
> > expression or PyArrow ones, since I doubt they'll want to lose support
> for
> > integrating with older PyArrow versions.
> >
> > I've removed filters from the protocol for now, with the intention of
> > bringing them back as soon as we can get Substrait support. I think we
> can
> > do this in the 14.0.0 release.
> >
> > Best,
> >
> > Will Jones
> >
> >
> > On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong 
> wrote:
> >
> > > Hey everyone,
> > >
> > > Chiming in here from the PyIceberg side. I would love to see the
> protocol
> > > as proposed in the PR. I did a small test
> > > <
> https://github.com/apache/arrow/pull/35568#pullrequestreview-1480259722>,
> > > and it seems to be quite straightforward to implement and it brings a
> lot
> > > of potential. Unsurprisingly, I leaning toward the first option:
> > >
> > > 1. We keep PyArrow expressions in the API initially, but once we have
> > > > Substrait-based alternatives we deprecate the PyArrow expression
> support.
> > > > This is what I intended with the current design, and I think it
> provides
> > > > the most obvious migration paths for existing producers and
> consumers.
> > >
> > >
> > > Let me give my vision on some of the concerns raised.
> > >
> > > Will, I see that you've already addressed this issue to some extent in
> > > > your proposal. For example, you mention that we should initially
> > > > define this protocol to include only a minimal subset of the Dataset
> > > > API. I agree, but I think there are some loose ends we should be
> > > > careful to tie up. I strongly agree with the comments made by David,
> > > > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > > > expressions in this API. Expressions are an implementation detail of
> > > > PyArrow, not a part of the Arrow standard. It would be much safer for
> > > > the initial version of this protocol to not define *any*
> > > > methods/arguments that take expressions. This will allow us to take
> > > > 

Java Arrow Flight Compression

2023-08-24 Thread Nathaniel Jones
Hello,

I have a couple questions about using compression in Java Arrow Flight.
I'll break it down into 2 parts: whether it's possible and whether it's
useful.

*1) Is it possible to do in current Java APIs?*
* I see that an ArrowMessage has a bodyCompression field that it derives
from an ArrowRecordBatch or ArrowDictionaryBatch.
* And then a FlightStream acts on these ArrowMessages, with some internal
calls where converting ArrowMessages to ArrowRecordBatches is aware of
compression via the MessageSerializer.
* However, I haven't found a way where I can use these internal details to
add compression in Flight. For example, my general workflow is to use a
VectorSchemaRoot in tandem with a ServerStreamListener to handle DoGet RPCs
- AFAIK I'm not meant to directly deal with ArrowRecordBatches, and thus
haven't found the right spot to turn on compression.
* Am I missing some APIs or other ways to enable compression in Java Flight?
* I see that it's possible in C++ via David's Stack Overflow answer, but I
don't see the same IpcOption in Java (whose IpcOption is more minimal), and
I'm not sure where that would go in the Java API regardless. (A sketch of
the C++/Python-side capability follows this list.)
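
For comparison, here is a hedged sketch of that capability on the
C++-backed Python side (not the Java API this question is about): PyArrow's
Flight bindings accept IPC write options, including a compression codec,
when building a DoGet response stream.

```python
import pyarrow as pa
import pyarrow.flight as flight

class DemoServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        table = pa.table({"x": [1, 2, 3]})
        # IpcWriteOptions carries the codec; the IPC layer then sets
        # bodyCompression on each outgoing record batch.
        options = pa.ipc.IpcWriteOptions(compression="zstd")
        return flight.RecordBatchStream(table, options=options)
```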

*2) Is compression useful in Flight?*
From the same SO answer linked above, David linked to an issue which has a
really useful thread examining whether Flight compression is helpful (I
think in C++). It seems like based on the testing in that thread, the
answer is broadly "unlikely". Perhaps in very
network-constrained environments the story could change.

If it's easy to enable in Java Flight with current APIs I'd like to do some
more testing to see how it goes in my environment, but if not, it seems
unclear if it's worth the effort to enable it.

Thank you for any help!


Re: ADBC support for the Rust ecosystem

2023-08-23 Thread Will Jones
Hi Roman,

Yes, it is still a work in progress and not ready for release. I haven't
had time recently to finish it, so I've left the in-progress PRs there.
IIRC they were mostly complete, but still need to be rebased on the current
Rust API design and get feedback from others.

The part that is merged is just the Rust API, and the two PRs linked add
the ability to interop with C-API-based drivers. The driver implementation
PR makes it easy to expose a driver written in Rust as a C-API driver. The
driver manager PR makes it easy to consume a C-API driver.

I hope to get back to finishing the Rust ADBC bindings, but I have a few
other projects I need to get done first. The project I was hoping to use it
for was delayed. If you are interested in taking over the PRs, even just
part of it, I'd be happy to give PR reviews.

Best,

Will Jones



On Wed, Aug 23, 2023 at 5:54 AM David Li  wrote:

> Hi Roman,
>
> I hope Will (the author of the Rust parts) can chime in too, but yes,
> while some parts are there, there's definitely more work needed for it to
> actually be usable. I don't think there are active plans for getting it
> published on crates.io; it needs someone to step up and plan out/finish
> the work. Contributions are of course welcome.
>
> There are a couple WIP PRs [1][2] that might serve as a starting point.
>
> [1]: https://github.com/apache/arrow-adbc/pull/446
> [2]: https://github.com/apache/arrow-adbc/pull/416
>
> -David
>
> On Wed, Aug 23, 2023, at 08:38, Roman Shaposhnik wrote:
> > Hi!
> >
> > I am really eager to see ADBC support for the
> > Rust ecosystem and so far I've stumbled on
> > what seems to be a work-in-progress here:
> >https://github.com/apache/arrow-adbc/tree/main/rust
> >
> > There are a few comments in the source that
> > make it sound like it should be at least possible
> > to use it via Rust/C linkage, but then I look through
> > the implementation and even that part of it seems to
> > be missing.
> >
> > Am I overlooking something obvious here?
> >
> > And also, what are the plans for getting it to a point
> > where it can be published on crates.io?
> >
> > Thanks,
> > Roman.
>


Re: [VOTE] Apache Arrow ADBC (API) 1.1.0

2023-08-14 Thread Will Jones
+1

These additions look excellent.

On Mon, Aug 14, 2023 at 10:40 AM David Li  wrote:

> Hello,
>
> We have been discussing revisions [1] to the ADBC APIs, which we formerly
> decided to treat as a specification [2]. These revisions clean up various
> missing features (e.g. cancellation, error metadata) and better position
> ADBC to help different data systems interoperate (e.g. by exposing more
> metadata, like table/column statistics).
>
> For details, see the PR at [3]. (The main file to read through is adbc.h.)
>
> I would like to propose that the Arrow project adopt this RFC, along with
> the linked PR, as version 1.1.0 of the ADBC API standard.
>
> Please vote to adopt the specification as described above. This is not a
> vote to release any packages; the first package release to support version
> 1.1.0 of the APIs will be 0.7.0 of the packages. (So I will not merge the
> linked PR until after we release ADBC 0.6.0.)
>
> This vote will be open for at least 72 hours.
>
> [ ] +1 Adopt the ADBC 1.1.0 specification
> [ ]  0
> [ ] -1 Do not adopt the specification because...
>
> Thanks to Sutou Kouhei, Matt Topol, Dewey Dunnington, Antoine Pitrou, Will
> Ayd, and Will Jones for feedback on the design and various work-in-progress
> PRs.
>
> [1]: https://github.com/apache/arrow-adbc/milestone/3
> [2]: https://lists.apache.org/thread/s8m4l9hccfh5kqvvd2x3gxn3ry0w1ryo
> [3]: https://github.com/apache/arrow-adbc/pull/971
>
> Thank you,
> David
>


Re: Do we need CODEOWNERS ?

2023-07-04 Thread Will Jones
I haven't had as much time to review the Parquet PRs, so I'll remove myself
from the CODEOWNERS for that.

I've found that I have a much easier time keeping up with PR reviews in
projects that are smaller, even if there are proportionally fewer
maintainers. I think that's the piece that originally appealed to me about
CODEOWNERS: that it could start to bring some clarity to how reviewing
responsibility is divided up. But I agree it hasn't really lived up to that
hope.

On Tue, Jul 4, 2023 at 1:13 PM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> I think it can be useful in certain cases, where the selection is
> specific enough (for example, if all Go-related PRs are not too much for
> Matt, this feature sounds useful for him. I can also imagine that if you
> are working on Flight, just getting notifications for changes to the
> flight-related files might be useful).
>
> Personally, I didn't add my name to the CODEOWNERS, because
> as someone doing general pyarrow maintenance, I was thinking that
> adding my name as owner of the "python" directory would lead to way too
> many notifications for me, and there is no obvious more specific
> selection.
>
> So if it's useful for some people, I wouldn't necessarily remove it,
> as long as: 1) everyone individually evaluates for themselves whether
> this is working or not (and it's fine to remove some entries again),
> and 2) we know this is not a system to properly ping reviewers for all
> PRs, and we still need to manually ping reviewers in other cases.
>
> On Tue, 4 Jul 2023 at 20:11, Matt Topol  wrote:
> >
> > I've found it useful for me so far since it auto-adds me on any
> > Go-related PRs, so I don't need to sift through the notifications or
> > active PRs, and instead can easily find them in my reviews on GitHub
> > notifications.
> >
> > But if everyone else finds it more detrimental than helpful I can set up
> a
> > custom filter or something.
> >
> > On Tue, Jul 4, 2023, 12:30 PM Weston Pace  wrote:
> >
> > > I agree the experiment isn't working very well.  I've been meaning to
> > > change my listing from `compute` to `acero` for a while.  I'd be +1 for
> > > just removing it though.
> > >
> > > On Tue, Jul 4, 2023, 6:44 AM Dewey Dunnington
> > > 
> > > wrote:
> > >
> > > > Just a note that for me, the main problem is that I get automatic
> > > > review requests for PRs that have nothing to do with R (I think this
> > > > happens when a rebase occurs that contained an R commit). Because
> that
> > > > happens a lot, it means I miss actual review requests and sometimes
> > > > mentions because they blend in. I think CODEOWNERS results in me
> > > > reviewing more PRs than if I had to set up some kind of custom
> > > > notification filter but I agree that it's not perfect.
> > > >
> > > > Cheers,
> > > >
> > > > -dewey
> > > >
> > > > On Tue, Jul 4, 2023 at 10:04 AM Antoine Pitrou 
> > > wrote:
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > Some time ago we added a `.github/CODEOWNERS` file in the main
> Arrow
> > > > > repo. The idea is that, when specific files or directories are
> touched
> > > > > by a PR, specific people are asked for review.
> > > > >
> > > > > Unfortunately, it seems that, most of the time, this produces the
> > > > > following effects:
> > > > >
> > > > > 1) the people who are automatically queried for review don't show
> up
> > > > > (perhaps they simply ignore those automatic notifications)
> > > > > 2) when several people are assigned for review, each designated
> > > reviewer
> > > > > seems to hope that the other ones will be doing the work, instead
> of
> > > > > doing it themselves
> > > > > 3) contributors expect those people to show up and are therefore
> > > > > bewildered when nobody comes to review their PR
> > > > >
> > > > > Do we want to keep CODEOWNERS? If we still think it can be
> beneficial,
> > > > > we should institute a policy where people who are listed in that
> file
> > > > > promise to respond to review requests: 1) either by doing a review
> 2)
> > > or
> > > > > by de-assigning themselves, and if possible pinging another core
> > > > developer.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > >
> > >
>


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-07-03 Thread Will Jones
Hello,

After thinking about it, I think I understand the approach David Li and Ian
are suggesting with respect to expressions. There will be some arguments
that only PyArrow's own datasets support, but that aren't in the generic
protocol. Passing
PyArrow expressions to the filters argument should be considered one of
those. DuckDB and others are currently passing them down, so they aren't
yet using the protocol properly. But once we add support in the protocol
for passing filters via Substrait expressions, we'll move DuckDB and others
over to be fully compliant with the protocol.

It's a bit of an awkward temporary state for now, but so would be having
PyArrow expressions in the protocol just to deprecate them a few months later.
One caveat is that we'll need to provide DuckDB and other consumers with a
way to tell whether the dataset supports passing filters as Substrait
expressions or PyArrow ones, since I doubt they'll want to lose support for
integrating with older PyArrow versions.
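
Something like the following is the kind of check I mean (a sketch; the
"supports_substrait_filters" capability flag is invented here and does not
exist in any PyArrow release):

def make_scanner(dataset, substrait_filter, pyarrow_filter):
    # Hypothetical consumer-side fallback, e.g. for DuckDB: prefer Substrait
    # filters when the producer advertises support, otherwise fall back to
    # legacy PyArrow expressions.
    if getattr(dataset, "supports_substrait_filters", False):
        return dataset.scanner(filter=substrait_filter)
    return dataset.scanner(filter=pyarrow_filter)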

I've removed filters from the protocol for now, with the intention of
bringing them back as soon as we can get Substrait support. I think we can
do this in the 14.0.0 release.

Best,

Will Jones


On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong  wrote:

> Hey everyone,
>
> Chiming in here from the PyIceberg side. I would love to see the protocol
> as proposed in the PR. I did a small test
> <https://github.com/apache/arrow/pull/35568#pullrequestreview-1480259722>,
> and it seems to be quite straightforward to implement and it brings a lot
> of potential. Unsurprisingly, I'm leaning toward the first option:
>
> 1. We keep PyArrow expressions in the API initially, but once we have
> > Substrait-based alternatives we deprecate the PyArrow expression support.
> > This is what I intended with the current design, and I think it provides
> > the most obvious migration paths for existing producers and consumers.
>
>
> Let me give my vision on some of the concerns raised.
>
> Will, I see that you've already addressed this issue to some extent in
> > your proposal. For example, you mention that we should initially
> > define this protocol to include only a minimal subset of the Dataset
> > API. I agree, but I think there are some loose ends we should be
> > careful to tie up. I strongly agree with the comments made by David,
> > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > expressions in this API. Expressions are an implementation detail of
> > PyArrow, not a part of the Arrow standard. It would be much safer for
> > the initial version of this protocol to not define *any*
> > methods/arguments that take expressions. This will allow us to take
> > some more time to finish up the Substrait expression implementation
> > work that is underway [7][8], then introduce Substrait-based
> > expressions in a later version of this protocol. This approach will
> > better position this protocol to be implemented in other languages
> > besides Python.
>
>
> I'm confused here. Looking at GH-33985
> <https://github.com/apache/arrow/pull/34834/files> I don't see any new
> primitives being introduced for composing an expression. As I understand
> it, in PyArrow the expression as it exists today will continue to exist. In
> the case of inter-process communication, it goes to Substrait, and then it
> gets de-serialized in the native expression construct (In PyIceberg, a
> BoundPredicate). I would say that the protocol and substrait are
> complementary.
>
> Another concern I have is that we have not fully explained why we want
> > to use Dataset instead of RecordBatchReader [9] as the basis of this
> > protocol. I would like to see an explanation of why RecordBatchReader
> > is not sufficient for this. RecordBatchReader seems like another
> > possible way to represent "unmaterialized dataframes" and there are
> > some parallels between RecordBatch/RecordBatchReader and
> > Fragment/Dataset. We should help developers and users understand why
> > Arrow needs both of these.
>
>
> Just to clarify, I think there are different use cases. For example, Lance
> provides its own readers, but PyIceberg does not have any intent to provide
> its own Parquet readers. Iceberg will generate the list of files that need
> to be read, and do the filtering/projection/deletes/etc. This would make
> the Dataset a better choice than the RecordBatchReader.
>
> That wouldn't remove the feature from DuckDB, would it? It would just mean
> > that we recognize that PyArrow expressions don't have well-defined
> > semantics that we are committing to at this time. As long as we have
> > `**kwargs` everywhere, we can in the future introduce a
> > `substrait_filter_expression` or similar argument, while allowing current
> > implementors to handle `filter` if possible.

[C++][Parquet] Handling empty files while reading Parquet files using C++

2023-07-03 Thread Luca Jones
Hi,

I've been trying to read data from a Parquet file into a stream using the
parquet::StreamReader class for a while. The first column of my data
consists of int64s - thus, I have been streaming data as follows:

std::shared_ptr<arrow::io::ReadableFile> infile;
PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open(datapath));
parquet::StreamReader stream{ parquet::ParquetFileReader::Open(infile) };

int64_t c1;

while (!stream.eof()) {
  stream >> c1;             // read the first column of the row
  stream.SkipColumns(100);  // skip the remaining columns
  stream >> parquet::EndRow;

  std::cout << c1 << std::endl;
}

My code throws a ParquetException in the CheckColumn() function when
comparing length and node->type_length() [stream_reader.cc, Line 543]:

  if (length != node->type_length()) {
    throw ParquetException("Column length mismatch.  Column '" + node->name() +
                           "' has length " + std::to_string(node->type_length()) +
                           "] not " + std::to_string(length));
  }

I figured out that this was because there are empty data fields in my
Parquet file, meaning length is 0 but node->type_length() is 64. I've looked
all over the internet trying to find a way to properly handle empty values in
Parquet files using Arrow, but have had no luck. Is there a way to check if
a data field is empty for a parquet::StreamReader object, or some other way
to manage empty fields?

Any help would be appreciated.


Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-28 Thread Will Jones
>
> That wouldn't remove the feature from DuckDB, would it? It would just mean
> that we recognize that PyArrow expressions don't have well-defined
> semantics that we are committing to at this time.
>

That's a fair point, David. I would be fine excluding it from the protocol
initially, and keeping the existing integrations in DuckDB, Polars, and
DataFusion "secret" or "not officially supported" for the time being. At
the very least, documenting the pattern to get an Arrow C stream will be a
step forward.

Best,

Will Jones

On Wed, Jun 28, 2023 at 12:35 PM Jonathan Keane  wrote:

> > I would understand this objection more if DuckDB hadn't been relying on
> > being able to pass PyArrow expressions for 18 months now [1]. Unless we
> > just think this isn't widely used enough that we don't care?
>
> This isn't a pro or a con of specifically adopting the PyArrow expression
> semantics as is / with a warning about changing / not at all, but having
> some kind of standardization in this interface would be very nice. This
> even came up while collaborating with the DuckDB folks that using some of
> the expression bits here (and in the R equivalents) was a little bit odd
> and having something like a proper API for that would have made that
> more natural (and likely that would have been used had it existed 18 months
> ago :))
>
> -Jon
>
>
> On Wed, Jun 28, 2023 at 1:17 PM David Li  wrote:
>
> > That wouldn't remove the feature from DuckDB, would it? It would just
> mean
> > that we recognize that PyArrow expressions don't have well-defined
> > semantics that we are committing to at this time. As long as we have
> > `**kwargs` everywhere, we can in the future introduce a
> > `substrait_filter_expression` or similar argument, while allowing current
> > implementors to handle `filter` if possible. (As a compromise, we could
> > reserve `filter` and existing arguments and note that PyArrow Expression
> > semantics are subject to change without notice?)
> >
> > On Wed, Jun 28, 2023, at 13:38, Will Jones wrote:
> > > Hi Ian,
> > >
> > >
> > >> I favor option 2 out of concern that option 1 could create a
> > >> temptation for users of this protocol to depend on a feature that we
> > >> intend to deprecate.
> > >>
> > >
> > > I would understand this objection more if DuckDB hadn't been relying on
> > > being able to pass PyArrow expressions for 18 months now [1]. Unless we
> > > just think this isn't widely used enough that we don't care?
> > >
> > > Best,
> > > Will
> > >
> > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > >
> > > On Tue, Jun 27, 2023 at 11:19 AM Ian Cook  wrote:
> > >
> > >> > I think there's three routes we can go here:
> > >> >
> > >> > 1. We keep PyArrow expressions in the API initially, but once we
> have
> > >> > Substrait-based alternatives we deprecate the PyArrow expression
> > support.
> > >> > This is what I intended with the current design, and I think it
> > provides
> > >> > the most obvious migration paths for existing producers and
> consumers.
> > >> > 2. We keep the overall dataset API, but don't introduce the filter
> and
> > >> > projection arguments until we have Substrait support. I'm not sure
> > what
> > >> the
> > >> > migration path looks like for producers and consumers, but I think
> > this
> > >> > just implicitly becomes the same as (1), but with worse
> documentation.
> > >> > 3. We write a protocol completely from scratch, that doesn't try to
> > >> > describe the existing dataset API. Producers and consumers would
> then
> > >> > migrate to use the new protocol and deprecate their existing dataset
> > >> > integrations. We could introduce a dunder method in that API (sort
> of
> > >> like
> > >> > __arrow_array__) that would make the migration seamless from the
> > end-user
> > >> > perspective.
> > >> >
> > >> > *Which do you all think is the best path forward?*
> > >>
> > >> I favor option 2 out of concern that option 1 could create a
> > >> temptation for users of this protocol to depend on a feature that we
> > >> intend to deprecate. I think option 2 also creates a stronger
> > >> motivation to complete the Substrait expression integration work,
> > >> which is underway in https://github.com/apache/arrow/pull/34834.

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-28 Thread Will Jones
Hi Ian,


> I favor option 2 out of concern that option 1 could create a
> temptation for users of this protocol to depend on a feature that we
> intend to deprecate.
>

I would understand this objection more if DuckDB hadn't been relying on
being able to pass PyArrow expressions for 18 months now [1]. Unless we
just think this isn't widely used enough that we don't care?

Best,
Will

[1] https://duckdb.org/2021/12/03/duck-arrow.html

On Tue, Jun 27, 2023 at 11:19 AM Ian Cook  wrote:

> > I think there's three routes we can go here:
> >
> > 1. We keep PyArrow expressions in the API initially, but once we have
> > Substrait-based alternatives we deprecate the PyArrow expression support.
> > This is what I intended with the current design, and I think it provides
> > the most obvious migration paths for existing producers and consumers.
> > 2. We keep the overall dataset API, but don't introduce the filter and
> > projection arguments until we have Substrait support. I'm not sure what
> the
> > migration path looks like for producers and consumers, but I think this
> > just implicitly becomes the same as (1), but with worse documentation.
> > 3. We write a protocol completely from scratch, that doesn't try to
> > describe the existing dataset API. Producers and consumers would then
> > migrate to use the new protocol and deprecate their existing dataset
> > integrations. We could introduce a dunder method in that API (sort of
> like
> > __arrow_array__) that would make the migration seamless from the end-user
> > perspective.
> >
> > *Which do you all think is the best path forward?*
>
> I favor option 2 out of concern that option 1 could create a
> temptation for users of this protocol to depend on a feature that we
> intend to deprecate. I think option 2 also creates a stronger
> motivation to complete the Substrait expression integration work,
> which is underway in https://github.com/apache/arrow/pull/34834.
>
> Ian
>
>
> On Fri, Jun 23, 2023 at 1:25 PM Weston Pace  wrote:
> >
> > > The trouble is that Dataset was not designed to serve as a
> > > general-purpose unmaterialized dataframe. For example, the PyArrow
> > > Dataset constructor [5] exposes options for specifying a list of
> > > source files and a partitioning scheme, which are irrelevant for many
> > > of the applications that Will anticipates. And some work is needed to
> > > reconcile the methods of the PyArrow Dataset object [6] with the
> > > methods of the Table object. Some methods like filter() are exposed by
> > > both and behave lazily on Datasets and eagerly on Tables, as a user
> > > might expect. But many other Table methods are not implemented for
> > > Dataset though they potentially could be, and it is unclear where we
> > > should draw the line between adding methods to Dataset vs. encouraging
> > > new scanner implementations to expose options controlling what lazy
> > > operations should be performed as they see fit.
> >
> > In my mind there is a distinction between the "compute domain" (e.g. a
> > pandas dataframe or something like ibis or SQL) and the "data domain"
> (e.g.
> > pyarrow datasets).  I think, in a perfect world, you could push any and
> all
> > compute up and down the chain as far as possible.  However, in practice,
> I
> > think there is a healthy set of tools and libraries that say "simple
> column
> > projection and filtering is good enough".  I would argue that there is
> room
> > for both APIs and while the temptation is always present to "shove as
> much
> > compute as you can" I think pyarrow datasets seem to have found a balance
> > between the two that users like.
> >
> > So I would argue that this protocol may never become a general-purpose
> > unmaterialized dataframe and that isn't necessarily a bad thing.
> >
> > > they are splittable and serializable, so that fragments can be
> distributed
> > > amongst processes / workers.
> >
> > Just to clarify, the proposal currently only requires the fragments to be
> > serializable, correct?
> >
> > On Fri, Jun 23, 2023 at 11:48 AM Will Jones 
> wrote:
> >
> > > Thanks Ian for your extensive feedback.
> > >
> > > I strongly agree with the comments made by David,
> > > > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > > > expressions in this API. Expressions are an implementation detail of
> > > > PyArrow, not a part of the Arrow standard. It would be much safer for
> > > > the initial version of this protocol to not define *any*
> > > > methods/arguments that take expressions.

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Will Jones
Thanks Ian for your extensive feedback.

I strongly agree with the comments made by David,
> Weston, and Dewey arguing that we should avoid any use of PyArrow
> expressions in this API. Expressions are an implementation detail of
> PyArrow, not a part of the Arrow standard. It would be much safer for
> the initial version of this protocol to not define *any*
> methods/arguments that take expressions.
>

I would agree with this point, if we were starting from scratch. But one of
my goals is for this protocol to be descriptive of the existing dataset
integrations in the ecosystem, which all currently rely on PyArrow
expressions. For example, you'll notice in the PR that there are unit tests
to verify the current PyArrow Dataset classes conform to this protocol,
without changes.

I think there's three routes we can go here:

1. We keep PyArrow expressions in the API initially, but once we have
Substrait-based alternatives we deprecate the PyArrow expression support.
This is what I intended with the current design, and I think it provides
the most obvious migration paths for existing producers and consumers.
2. We keep the overall dataset API, but don't introduce the filter and
projection arguments until we have Substrait support. I'm not sure what the
migration path looks like for producers and consumers, but I think this
just implicitly becomes the same as (1), but with worse documentation.
3. We write a protocol completely from scratch, that doesn't try to
describe the existing dataset API. Producers and consumers would then
migrate to use the new protocol and deprecate their existing dataset
integrations. We could introduce a dunder method in that API (sort of like
__arrow_array__) that would make the migration seamless from the end-user
perspective.

*Which do you all think is the best path forward?*

Another concern I have is that we have not fully explained why we want
> to use Dataset instead of RecordBatchReader [9] as the basis of this
> protocol. I would like to see an explanation of why RecordBatchReader
> is not sufficient for this. RecordBatchReader seems like another
> possible way to represent "unmaterialized dataframes" and there are
> some parallels between RecordBatch/RecordBatchReader and
> Fragment/Dataset.
>

This is a good point. I can add a section describing the differences. The
main ones I can think of are that: (1) Datasets are "pruneable": one can
select a subset of columns and apply a filter on rows to avoid IO and (2)
they are splittable and serializable, so that fragments can be distributed
amongst processes / workers.
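
To make (1) and (2) concrete, here is a quick sketch against the existing
PyArrow Dataset API (the same calls any implementer of the protocol would
expose):

import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")
# (1) Pruneable: column selection and row filters are pushed below the scan,
# so IO only happens for the columns and rows that survive.
reader = dataset.scanner(columns=["a", "b"], filter=pc.field("a") > 0).to_reader()
# (2) Splittable: fragments can be scanned independently by distributed workers.
fragments = list(dataset.get_fragments())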

Best,

Will Jones

On Fri, Jun 23, 2023 at 10:48 AM Ian Cook  wrote:

> Thanks Will for this proposal!
>
> For anyone familiar with PyArrow, this idea has a clear intuitive
> logic to it. It provides an expedient solution to the current lack of
> a practical means for interchanging "unmaterialized dataframes"
> between different Python libraries.
>
> To elaborate on that: If you look at how people use the Arrow Dataset
> API—which is implemented in the Arrow C++ library [1] and has bindings
> not just for Python [2] but also for Java [3] and R [4]—you'll see
> that Dataset is often used simply as a "virtual" variant of Table. It
> is used in cases when the data is larger than memory or when it is
> desirable to defer reading (materializing) the data into memory.
>
> So we can think of a Table as a materialized dataframe and a Dataset
> as an unmaterialized dataframe. That aspect of Dataset is I think what
> makes it most attractive as a protocol for enabling interoperability:
> it allows libraries to easily "speak Arrow" in cases where
> materializing the full data in memory upfront is impossible or
> undesirable.
>
> The trouble is that Dataset was not designed to serve as a
> general-purpose unmaterialized dataframe. For example, the PyArrow
> Dataset constructor [5] exposes options for specifying a list of
> source files and a partitioning scheme, which are irrelevant for many
> of the applications that Will anticipates. And some work is needed to
> reconcile the methods of the PyArrow Dataset object [6] with the
> methods of the Table object. Some methods like filter() are exposed by
> both and behave lazily on Datasets and eagerly on Tables, as a user
> might expect. But many other Table methods are not implemented for
> Dataset though they potentially could be, and it is unclear where we
> should draw the line between adding methods to Dataset vs. encouraging
> new scanner implementations to expose options controlling what lazy
> operations should be performed as they see fit.
>
> Will, I see that you've already addressed this issue to some extent in
> your proposal. For example, you mention that we should initially
> define this protocol to include only a minimal subset of the Dataset API.

[Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-21 Thread Will Jones
Hello Arrow devs,

I have drafted a PR defining an experimental protocol which would allow
third-party libraries to imitate the PyArrow Dataset API [3]. This protocol
is intended to endorse an integration pattern that is starting to be used
in the Python ecosystem, where some libraries are providing their own
scanners with this API, while query engines are accepting these as
duck-typed objects.

To give some background: back at the end of 2021, we collaborated with
DuckDB [1] to be able to read datasets (an Arrow C++ concept), supporting
column selection and filter pushdown. This was accomplished by having
DuckDB manipulating Python (or R) objects to get a RecordBatchReader and
then exporting over the C Stream Interface.

Since then, DataFusion [2] and Polars have both made similar
implementations for their Python bindings, allowing them to consume PyArrow
datasets. This has created an implicit protocol, whereby arbitrary compute
engines can push down queries into the PyArrow dataset scanner.

Now, libraries supporting table formats including Delta Lake, Lance, and
Iceberg are looking to be able to support these engines, while bringing
their own scanners and metadata handling implementations. One possible
route is allowing them to imitate the PyArrow datasets API.

Bringing these use cases together, I'd like to propose an experimental
protocol, made out of the minimal subset of the PyArrow Dataset API
necessary to facilitate this kind of integration. This would allow any
library to produce a scanner implementation that arbitrary query
engines could call into. I've drafted a PR [3] and there is some background
research available in a google doc [4].
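
As a rough illustration of the producer side, a minimal duck-typed dataset
could look like the sketch below. The method names mirror the PyArrow
Dataset API; the exact protocol surface is what the PR pins down, and filter
handling is elided here:

import pyarrow as pa

class MyScanner:
    """Minimal scanner: just hands back a RecordBatchReader."""

    def __init__(self, reader: pa.RecordBatchReader):
        self._reader = reader

    def to_reader(self) -> pa.RecordBatchReader:
        return self._reader

class MyDataset:
    """Duck-typed stand-in for pyarrow.dataset.Dataset (illustration only)."""

    def __init__(self, table: pa.Table):
        self._table = table

    @property
    def schema(self) -> pa.Schema:
        return self._table.schema

    def scanner(self, columns=None, **kwargs):
        # A real implementation would push the projection (and filters) into
        # its own scan; here we just slice an in-memory table.
        table = self._table.select(columns) if columns else self._table
        return MyScanner(table.to_reader())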

I've already gotten some good feedback on both, and would welcome more.

One last point: I'd like for this to be a first step rather than a
comprehensive API. This PR focuses on making explicit a protocol that is
already in use in the ecosystem, but without much concrete definition. Once
this is established, we can use our experience from this protocol to design
something more permanent that takes advantage of newer innovations in the
Arrow ecosystem (such as the PyCapsule for C Data Interface or
Substrait for passing expressions / scan plans). I am tracking such future
improvements in [5].

Best,

Will Jones

[1] https://duckdb.org/2021/12/03/duck-arrow.html
[2] https://github.com/apache/arrow-datafusion-python/pull/9
[3] https://github.com/apache/arrow/pull/35568
[4]
https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
[5]
https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit


Re: [VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC1

2023-06-19 Thread Will Jones
Thanks for fixing that issue. I can now successfully verify the release on
M1 Mac with Conda.

My vote: +1 (binding)

On Mon, Jun 19, 2023 at 12:10 PM Dewey Dunnington
 wrote:

> My vote is +1 (non-binding). Verified on MacOS M1 (both Homebrew and
> Conda).
>
> On Mon, Jun 19, 2023 at 3:58 PM Dewey Dunnington 
> wrote:
> >
> > Hello,
> >
> > I would like to propose the following release candidate (RC1) of
> > Apache Arrow nanoarrow version 0.2.0. This release consists of 17
> > resolved GitHub issues [1].
> >
> > This release candidate is based on commit:
> > f71063605e288d9a8dd73cfdd9578773519b6743 [2]
> >
> > The source release rc1 is hosted at [3].
> > The changelog is located at [4].
> > The draft release post is located at [5].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [6] for how to validate a release
> > candidate.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow nanoarrow 0.2.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow nanoarrow 0.2.0 because...
> >
> > [0] https://github.com/apache/arrow-nanoarrow
> > [1] https://github.com/apache/arrow-nanoarrow/milestone/2?closed=1
> > [2]
> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.2.0-rc1
> > [3]
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.2.0-rc1/
> > [4]
> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.2.0-rc1/CHANGELOG.md
> > [5] https://github.com/apache/arrow-site/pull/364
> > [6]
> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
>


Re: [VOTE][RUST] Release Apache Arrow Rust 42.0.0 RC1

2023-06-18 Thread Will Jones
+1, verified on MacOS M1.

Thanks Andrew!

On Sun, Jun 18, 2023 at 7:02 AM Wayne Xia  wrote:

> +1, verified on amd64 linux, thanks!
>
>
> vin jake wrote:
>
> > +1 (binding)
> >
> > Verified on M1 macbook.
> >
> > Thanks Andrew.
> >
> > On Sat, Jun 17, 2023, 02:40 Andrew Lamb  wrote:
> >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > version 42.0.0.
> > >
> > > Please note that there is one known regression in this release related
> to
> > > parsing intervals like '.5 months' [5], but I do not believe it should
> > > block the release (see [6] for rationale). However, if others feel
> > > differently, there is a proposed fix [7] and once it is reviewed /
> > merged I
> > > can create a new RC as well
> > >
> > > This release candidate is based on commit:
> > > 2c7b4efc1701d9db5a0cc6decacf1df22123645f [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> > >
> >
> https://github.com/apache/arrow-rs/tree/2c7b4efc1701d9db5a0cc6decacf1df22123645f
> > > [2]:
> > >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-42.0.0-rc1
> > > [3]:
> > >
> > >
> >
> https://github.com/apache/arrow-rs/blob/2c7b4efc1701d9db5a0cc6decacf1df22123645f/CHANGELOG.md
> > > [4]:
> > >
> > >
> >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > > [5]: https://github.com/apache/arrow-rs/issues/4424
> > > [6]:
> https://github.com/apache/arrow-rs/pull/4425#discussion_r1232573299
> > > [7]: https://github.com/apache/arrow-rs/pull/4425
> > >
> >
>


Re: [VOTE] Release Apache Arrow nanoarrow 0.2.0 - RC0

2023-06-18 Thread Will Jones
Hello,

I attempted to verify on M1 MacOS within a conda environment. But sadly
encountered some issues that I don't think are nanoarrow's fault:
* The gnupg from conda segfaults on MacOS. The homebrew one works fine.
* I got a segfault on this test: BitmapTest.BitmapTestCountSetSingleByte
(SEGFAULT). It seems to originate from googletest. Stack trace:

std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::__is_long[abi:v160006]() const
(/Users/willjones/mambaforge/envs/nanoarrow-verify-rc/include/c++/v1/string:1682)
std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::~basic_string()
(/Users/willjones/mambaforge/envs/nanoarrow-verify-rc/include/c++/v1/string:2361)
std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::~basic_string()
(/Users/willjones/mambaforge/envs/nanoarrow-verify-rc/include/c++/v1/string:2359)
std::__1::default_delete<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>::operator()[abi:v160006](std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*) const
(/Users/willjones/mambaforge/envs/nanoarrow-verify-rc/include/c++/v1/__memory/unique_ptr.h:65)
std::__1::unique_ptr<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::default_delete<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::reset[abi:v160006](std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*)
(/Users/willjones/mambaforge/envs/nanoarrow-verify-rc/include/c++/v1/__memory/unique_ptr.h:297)
std::__1::unique_ptr<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::default_delete<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::~unique_ptr[abi:v160006]()
(/Users/willjones/mambaforge/envs/nanoarrow-verify-rc/include/c++/v1/__memory/unique_ptr.h:263)
std::__1::unique_ptr<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::default_delete<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::~unique_ptr[abi:v160006]()
(/Users/willjones/mambaforge/envs/nanoarrow-verify-rc/include/c++/v1/__memory/unique_ptr.h:263)
testing::AssertionResult::~AssertionResult()
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/include/gtest/gtest.h:283)
testing::AssertionResult::~AssertionResult()
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/include/gtest/gtest.h:283)
BitmapTest_BitmapTestCountSetSingleByte_Test::TestBody()
(/Users/willjones/Documents/arrow-nanoarrow/src/nanoarrow/buffer_test.cc:334)
void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:2607)
void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:2643)
testing::Test::Run()
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:2682)
testing::TestInfo::Run()
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:2861)
testing::TestSuite::Run()
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:3015)
testing::internal::UnitTestImpl::RunAllTests()
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:5855)
bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:2607)
bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:2643)
testing::UnitTest::Run()
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/src/gtest.cc:5438)
RUN_ALL_TESTS()
(/Users/willjones/Documents/arrow-nanoarrow/out/build/default-with-tests/_deps/googletest-src/googletest/include/gtest/gtest.h:2490)


On Fri, Jun 16, 2023 at 6:59 PM Jacob Wujciak-Jens
 wrote:

> +1 (non-binding) verified fully on R 4.3 and GCC 12 on manjaro
>
> On Fri, Jun 16, 2023 at 11:13 PM David Li  wrote:
>
> > +1
> >
> > Tested on Ubuntu 20.04/x86_64
> >
> > On Fri, Jun 16, 2023, at 16:15, Dewey Dunnington wrote:
> > > Hello,
> > >
> > > I would like to propose the following release candidate (RC0) of
> > > Apache Arrow nanoarrow version 0.2.0. This release consists of 17
> > > resolved GitHub issues [1].
> > >
> > > This release candidate is based on commit:
> > > a7b824de6cb99ce458e1a5cd311d69588ceb0570 [2]
> > >
> > > The source release rc0 is hosted at [3].
> > > The changelog is located at [4].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. See [5] for how to validate a release
> > > candidate.

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Will Jones
Cool. Thanks for doing that!

On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman  wrote:

> I've added https://github.com/apache/arrow/issues/36112 to track
> deduplication of buffers on write.
> I don't think it would require modification of the IPC format.
>
> Ben
>
> On Thu, Jun 15, 2023 at 1:30 PM Matt Topol  wrote:
>
> > Based on my understanding, in theory a buffer *could* be shared within a
> > batch since the flatbuffers message just uses an offset and length to
> > identify the buffers.
> >
> > That said, I don't believe any current implementation actually does this
> or
> > takes advantage of this in any meaningful way.
> >
> > --Matt
> >
> > On Thu, Jun 15, 2023 at 1:00 PM Will Jones 
> > wrote:
> >
> > > Hi Ben,
> > >
> > > It's exciting to see this move along.
> > >
> > > The buffers will be duplicated. If buffer duplication becomes a
> > concern,
> > > > I'd prefer to handle
> > > > that in the IPC writer. Then buffers which are duplicated could be
> > > detected
> > > > by checking
> > > > pointer identity and written only once.
> > >
> > >
> > > Question: to be able to write a buffer only once and reference it in
> multiple
> > > arrays, does that require a change to the IPC format? Or is sharing
> > buffers
> > > within the same batch already allowed in the IPC format?
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman  >
> > > wrote:
> > >
> > > > Hello again all,
> > > >
> > > > The PR [1] to add string view to the format and the C++
> implementation
> > is
> > > > hovering around passing CI and has been undrafted. Furthermore, there
> > is
> > > > now also a PR [2] to add string view to the Go implementation. Code
> > > review
> > > > is underway for each PR and I'd like to move toward a vote for
> > > acceptance-
> > > > are there any other preliminaries which I've neglected?
> > > >
> > > > To reiterate the answers to some past questions:
> > > > - Benchmarks are added in the C++ PR [1] to demonstrate the
> performance
> > > of
> > > >   conversion between the various string formats. In addition, there
> are
> > > >   some benchmarks which demonstrate the performance gains available
> > with
> > > >   the new format [3].
> > > > - Adding string view to the C ABI is a natural follow up, but should
> be
> > > >   handled independently. An issue has been added to track that
> > > >   enhancement [4].
> > > >
> > > > Sincerely,
> > > > Ben Kietzman
> > > >
> > > > [1] https://github.com/apache/arrow/pull/35628
> > > > [2] https://github.com/apache/arrow/pull/35769
> > > > [3]
> https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > > > [4] https://github.com/apache/arrow/issues/36099
> > > >
> > > > On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <
> > bengil...@gmail.com>
> > > > wrote:
> > > >
> > > > > @Jacob
> > > > > > You mention benchmarks multiple times, are these results
> published
> > > > > somewhere?
> > > > >
> > > > > I benchmarked the performance of raw pointer vs index offset views
> in
> > > my
> > > > PR to Velox;
> > > > I do intend to port them to my Arrow PR but I haven't gotten there
> > yet.
> > > > > Furthermore, it
> > > > > seemed less urgent to me since coexistence of the two types in the
> > c++
> > > > > implementation
> > > > > defers the question of how aggressively one should be preferred
> over
> > > the
> > > > > other.
> > > > >
> > > > > @Dewey
> > > > > > I don't see the C Data interface in the PR
> > > > >
> > > > > I have not addressed the C ABI in this PR. As you mention, it may
> be
> > > > > useful to transmit
> > > > > arrays with raw pointer views between implementations which allow
> > > them. I
> > > > > can address
> > > > > this in a follow up PR.
> > > > >
> > > > > @Will
> > > > > > If I understand correctly, multiple arrays 

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Will Jones
Hi Ben,

It's exciting to see this move along.

The buffers will be duplicated. If buffer duplication becomes a concern,
> I'd prefer to handle
> that in the IPC writer. Then buffers which are duplicated could be detected
> by checking
> pointer identity and written only once.


Question: to be able to write a buffer only once and reference it in multiple
arrays, does that require a change to the IPC format? Or is sharing buffers
within the same batch already allowed in the IPC format?
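
For my own understanding, I read the suggestion as plain bookkeeping in the
writer, along these lines (a sketch with an invented helper, not an existing
Arrow API):

def emit_buffer(buf, written, do_write):
    # Deduplicate by buffer identity: pyarrow.Buffer exposes address/size.
    key = (buf.address, buf.size)
    if key not in written:
        written[key] = do_write(buf)  # serialize the bytes exactly once
    return written[key]               # duplicates reuse the recorded offset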

Best,

Will Jones

On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman 
wrote:

> Hello again all,
>
> The PR [1] to add string view to the format and the C++ implementation is
> hovering around passing CI and has been undrafted. Furthermore, there is
> now also a PR [2] to add string view to the Go implementation. Code review
> is underway for each PR and I'd like to move toward a vote for acceptance-
> are there any other preliminaries which I've neglected?
>
> To reiterate the answers to some past questions:
> - Benchmarks are added in the C++ PR [1] to demonstrate the performance of
>   conversion between the various string formats. In addition, there are
>   some benchmarks which demonstrate the performance gains available with
>   the new format [3].
> - Adding string view to the C ABI is a natural follow up, but should be
>   handled independently. An issue has been added to track that
>   enhancement [4].
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/35628
> [2] https://github.com/apache/arrow/pull/35769
> [3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> [4] https://github.com/apache/arrow/issues/36099
>
> On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman 
> wrote:
>
> > @Jacob
> > > You mention benchmarks multiple times, are these results published
> > somewhere?
> >
> > I benchmarked the performance of raw pointer vs index offset views in my
> > PR to Velox;
> > I do intend to port them to my Arrow PR but I haven't gotten there yet.
> > Furthermore, it
> > seemed less urgent to me since coexistence of the two types in the c++
> > implementation
> > defers the question of how aggressively one should be preferred over the
> > other.
> >
> > @Dewey
> > > I don't see the C Data interface in the PR
> >
> > I have not addressed the C ABI in this PR. As you mention, it may be
> > useful to transmit
> > arrays with raw pointer views between implementations which allow them. I
> > can address
> > this in a follow up PR.
> >
> > @Will
> > > If I understand correctly, multiple arrays can reference the same
> buffers
> > > in memory, but once they are written to IPC their data buffers will be
> > > duplicated. Is that right?
> > The buffers will be duplicated. If buffer duplication becomes a
> > concern, I'd prefer to handle
> > that in the IPC writer. Then buffers which are duplicated could be
> > detected by checking
> > pointer identity and written only once.
> >
> >
> > On Wed, May 17, 2023 at 12:07 AM Will Jones 
> > wrote:
> >
> >> Hello Ben,
> >>
> >> Thanks for your work on this. I think this will be an excellent addition
> >> to
> >> the format.
> >>
> >> If I understand correctly, multiple arrays can reference the same
> buffers
> >> in memory, but once they are written to IPC their data buffers will be
> >> duplicated. Is that right?
> >>
> >> Dictionary types have a special message so they can be reused across
> >> batches and even fields. Did we consider adding a similar message for
> >> string view buffers?
> >>
> >> One relevant use case I'm curious about is substring extraction. For
> >> example, if I have a column of URIs and I create columns where I've
> >> extracted substrings like the hostname, path, and a list column of query
> >> parameters, I'd like for those latter columns to be views into the URI
> >> buffers, rather than full copies.
> >>
> >> However, I've never touched the IPC read code paths, so it's quite
> >> possible
> >> I'm overlooking something obvious.
> >>
> >> Best,
> >>
> >> Will Jones
> >>
> >>
> >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
> >>  wrote:
> >>
> >> > Very cool!
> >> >
> >> > In addition to performance mentioned above, I could see this being
> >> > useful for the R bindings - we already have a global string pool and a
> >> > mechanism fo

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-13 Thread Will Jones
Hello Arrow devs,

Just a quick note. To answer one of my earlier questions:

1. Is this array type currently only used in Velox? (not DuckDB like some
> of the other new types?) What evidence do we have that it will become used
> outside of Velox?
>

This type is also used by DuckDB. I found discussion of it today in a talk from
Mark Raasveldt [1]. That does improve the case for adding this type in my
eyes.

Best,

Will Jones

[1] https://youtu.be/bZOvAKGkzpQ?t=1570



On Tue, Jun 6, 2023 at 7:40 PM Weston Pace  wrote:

> > This implies that each canonical alternative layout would codify a
> > primary layout as its "fallback."
>
> Yes, that was part of my proposal:
>
> >  * A new layout, if it is semantically equivalent to another, is
> considered an alternative layout
>
> Or, to phrase it another way.  If there is not a "fallback" then it is not
> an alternative layout.  It's a brand new primary layout.  I'd expect this
> to be quite rare.  I can't really even hypothesize any examples.  I think
> the only truly atomic layouts are fixed-width, list, and struct.
>
> > This seems reasonable but it opens
> > up some cans of worms, such as how two components communicating
> > through an Arrow interface would negotiate which layout is supported
>
> Most APIs that I'm aware of already do this.  For example,
> pyarrow.parquet.read_table has a "read_dictionary" property that can be
> used to control whether or not a column is returned with the dictionary
> encoding.  There is no way (that I'm aware of) to get a column in REE
> encoding today without explicitly requesting it.  In fact, this could be as
> simple as a boolean "use_advanced_features" flag although I would
> discourage something so simplistic.  The point is that arrow-compatible
> software should, by default, emit types that are supported by all arrow
> implementations.
>
> Of course, there is no way to enforce this, it's just a guideline / strong
> recommendation on how software should behave if it wants to state "arrow
> compatible" as a feature.
>
> On Tue, Jun 6, 2023 at 3:33 PM Ian Cook  wrote:
>
> > Thanks Weston. That all sounds reasonable to me.
> >
> > >  with the caveat that the primary layout must be emitted if the user
> > does not specifically request the alternative layout
> >
> > This implies that each canonical alternative layout would codify a
> > primary layout as its "fallback." This seems reasonable but it opens
> > up some cans of worms, such as how two components communicating
> > through an Arrow interface would negotiate which layout is supported.
> > I suppose such details should be discussed in a separate thread, but I
> > raise this here just to point out that it implies an expansion in the
> > scope of what Arrow interfaces can do.
> >
> > On Tue, Jun 6, 2023 at 6:17 PM Weston Pace 
> wrote:
> > >
> > > From Micah:
> > >
> > > > This sounds reasonable to me but my main concern is, I'm not sure
> > there is
> > > > a great mechanism to enforce that canonical layouts don't somehow become
> > > default
> > > > (or the only implementation).
> > >
> > > I'm not sure I understand.  Is the concern that an alternative layout
> is
> > > eventually
> > > used more and more by implementations until it is used more often than
> > the
> > > primary
> > > layouts?  In that case I think that is ok and we can promote the
> > alternative
> > > to a primary layout.
> > >
> > > Or is the concern that some applications will only support the
> > alternative
> > > layouts and
> > > not the primary layout?  In that case I would argue the application is
> > not
> > > "arrow compatible".
> > > I don't know that we prevent or enforce this today either.  An author
> can
> > > always falsely
> > > claim they support Arrow even if they are using their own bespoke
> format.
> > >
> > > From Ian:
> > >
> > > > It seems to me that most projects that are implementing Arrow today
> > > > are not aiming to provide complete coverage of Arrow; rather they are
> > > > adopting Arrow because of its role as a standard and they are
> > > > implementing only as much of the Arrow standard as they require to
> > > > achieve some goal. I believe that such projects are important Arrow
> > > > stakeholders, and I believe that this proposed notion of canonical
> > > > alternative layouts will serve them well and will create efficiencies
> > >

Re: [DISCUSS] JSON Canonical Extension Type

2023-06-07 Thread Will Jones
Hello,

Sorry this hasn't gotten much attention recently. I just brought this up at
the Arrow community meeting, as I'd like to revive it.

It looks like there is a draft implementation up already [1].

I'm generally supportive of this, but I have a few questions:

1. Would we be able to make this extension type work on top of any of the
string types, including Utf8, LargeUtf8, and the (under consideration [2])
StringView types?
2. Does this imply a potential canonical extension type for every
text-based data format, such as HOCON, XML, and so on? If we agree JSON is
special, I think it's fine to have its own extension type. On the other
hand, it might be worth considering making a generic extension type for
serialized data that is parameterized by the media type
("application/json" in this case).  This doesn't preclude the possibility
of building an extension type class / struct within Arrow implementations
that is specific to JSON; I don't think there's any hard rule that there
has to be a 1-1 correspondence between extension types in the format and
the concrete data structures in libraries.
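
For reference, sketching such a type with pyarrow's extension-type machinery
is straightforward (this is an illustration only, not the draft
implementation in [1]):

import pyarrow as pa

class JsonType(pa.ExtensionType):
    """Sketch of an "arrow.json" extension type over utf8 storage."""

    def __init__(self):
        super().__init__(pa.utf8(), "arrow.json")

    def __arrow_ext_serialize__(self):
        return b""  # the type has no parameters

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(JsonType())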

Best,

Will Jones

[1] https://github.com/apache/arrow/pull/13901
[2] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt


On Thu, Dec 1, 2022 at 12:23 AM Antoine Pitrou  wrote:

>
> HOCON is a superset of JSON, so I'm not sure making it an extension type
> based on JSON would be a good idea.
>
>
> On 01/12/2022 at 06:23, Micah Kornfield wrote:
> >>
> >> Can a logical extension be based on another logical extension?
> >
> > Potentially, but this is mostly an implementation detail; each type
> should
> > have its own specification IMO.
> >
> > HOCON support might be nice..
> >
> > I'm not sure if this is common enough to warrant a canonical type within
> > Arrow but you are welcome to propose something if you would like.
> >
> > Cheers,
> > Micah
> >
> > On Mon, Nov 28, 2022 at 11:55 AM Lee, David
> > wrote:
> >
> >> Can a logical extension be based on another logical extension?
> >>
> >> HOCON support might be nice..
> >>
> >> -Original Message-
> >> From: Micah Kornfield 
> >> Sent: Monday, November 28, 2022 11:50 AM
> >> To: dev@arrow.apache.org
> >> Subject: Re: [DISCUSS] JSON Canonical Extension Type
> >>
> >> External Email: Use caution with links and attachments
> >>
> >>
> >> This seems like a reasonable definition to me.  Since there hasn't been
> >> much feedback, I think maybe following through with an implementation +
> >> this description in a PR would be the next step.  If there isn't further
> >> feedback on this, once the PR is up we can try to vote (which might
> >> bring up some more feedback, but hopefully wouldn't cause too much
> >> implementation churn).
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Thu, Nov 17, 2022 at 3:58 PM Pradeep Gollakota
> >>  wrote:
> >>
> >>> Hi folks!
> >>>
> >>> I put together this specification for canonicalizing the JSON type in
> >>> Arrow.
> >>>
> >>> ## Introduction
> >>> JSON is a widely used text based data interchange format. There are
> >>> many use cases where a user has a column whose contents are a JSON
> >>> encoded string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical
> >>> Type][2] are two such examples.
> >>>
> >>> The JSON specification is defined in [RFC-8259][3]. However, many of
> >>> the most popular parsers support non standard extensions. Examples of
> >>> non standard extensions to JSON include comments, unquoted keys,
> >>> trailing commas, etc.
> >>>
> >>> ## Extension Specification
> >>> * The name of the extension is `arrow.json`
> >>> * The storage type of the extension is `utf8`
> >>> * The extension type has no parameters
> >>> * The metadata MUST be either empty or a valid JSON object
> >>>  - There is no canonical metadata
> >>>  - Implementations MAY include implementation-specific metadata by
> >>> using a namespaced key. For example `{"google.bigquery": {"my":
> >>> "metadata"}}`
> >>> * Implementations...
> >>>  - MUST produce valid UTF-8 encoded text
> >>>  - SHOULD produce valid standard JSON
> >>>  - MAY produce valid non-standard JSON
> >>>  - MUST support parsing standard JSON

Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 26.0.0 RC1

2023-06-05 Thread Will Jones
+1 (binding). Verified on Ubuntu 22 x86_64. Thanks, Andy!

On Sat, Jun 3, 2023 at 9:41 AM vin jake  wrote:

> +1 (non-binding)
>
> verified on my M1 mac book.
>
> Thanks Andy.
>
> Andy Grove wrote on Saturday, June 3, 2023 at 23:20:
>
> > Hi,
> >
> > I would like to propose a release of Apache Arrow DataFusion
> > Implementation,
> > version 26.0.0.
> >
> > This release candidate is based on commit:
> > 06240ab87e7e7d8ac4b43feaa95377bf607d18eb [1]
> > The proposed release tarball and signatures are hosted at [2].
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The vote will be open for at least 72 hours.
> >
> > Only votes from PMC members are binding, but all members of the community
> > are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > The standard verification procedure is documented at
> >
> >
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > .
> >
> > [ ] +1 Release this as Apache Arrow DataFusion 26.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow DataFusion 26.0.0 because...
> >
> > Here is my vote:
> >
> > +1
> >
> > [1]:
> >
> >
> https://github.com/apache/arrow-datafusion/tree/06240ab87e7e7d8ac4b43feaa95377bf607d18eb
> > [2]:
> >
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-26.0.0-rc1
> > [3]:
> >
> >
> https://github.com/apache/arrow-datafusion/blob/06240ab87e7e7d8ac4b43feaa95377bf607d18eb/CHANGELOG.md
> >
>


Re: [VOTE][RUST] Release Apache Arrow Rust 41.0.0 RC1

2023-06-05 Thread Will Jones
+1 (binding). Verified on Ubuntu 22 x86_64. Thanks, Raphael!

On Fri, Jun 2, 2023 at 12:47 PM Andrew Lamb  wrote:

> +1 (binding)
> Verified on x86_64 mac
>
> The content of this release looks very good 
>
> Thank you Raphael
>
> Andrew
>
> On Fri, Jun 2, 2023 at 2:59 PM L. C. Hsieh  wrote:
>
> > +1 (binding)
> >
> > Verified on M1 Mac.
> >
> > Thanks Raphael.
> >
> > On Fri, Jun 2, 2023 at 11:55 AM Raphael Taylor-Davies
> >  wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > version 41.0.0.
> > >
> > > This release candidate is based on commit:
> > > e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> >
> https://github.com/apache/arrow-rs/tree/e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb
> > > [2]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-41.0.0-rc1
> > > [3]:
> > >
> >
> https://github.com/apache/arrow-rs/blob/e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb/CHANGELOG.md
> > > [4]:
> > >
> >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > >
> >
>


Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.6.1 RC1

2023-06-05 Thread Will Jones
+1 (binding), verified on M1 MacOS. Thanks Raphael!

On Fri, Jun 2, 2023 at 11:56 AM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on M1 Mac.
>
> Thanks Raphael.
>
> On Fri, Jun 2, 2023 at 11:38 AM Andrew Lamb  wrote:
> >
> > +1 (binding)
> >
> > I verified the signature and ran the verification script on mac x86_64
> > Thank you Raphael
> >
> > On Fri, Jun 2, 2023 at 2:23 PM Raphael Taylor-Davies
> >  wrote:
> >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Object
> > > Store Implementation, version 0.6.1.
> > >
> > > This release candidate is based on commit:
> > > f323097584eaa8edb1193b4fb67bccadd39594f6 [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust Object Store
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
> > >
> > > [1]:
> > >
> > >
> https://github.com/apache/arrow-rs/tree/f323097584eaa8edb1193b4fb67bccadd39594f6
> > > [2]:
> > >
> > >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.6.1-rc1
> > > [3]:
> > >
> > >
> https://github.com/apache/arrow-rs/blob/f323097584eaa8edb1193b4fb67bccadd39594f6/object_store/CHANGELOG.md
> > > [4]:
> > >
> > >
> https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
> > >
> > >
>


Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Will Jones
> The main downside with using the mask (or any solution based on a filter
> node / filtering) is that it requires that the delete indices go into the
> plan itself.  So you need to first read the delete files and then create
> the plan.  And, if there are many deleted rows, this can be costly.

Ah, I see. I was assuming you could load the indices within the fragment
scan, at the same time the page index was read. That's how I'm implementing
it in Lance, and how I plan to implement it for Delta Lake. But if you can't
do that, then filtering with an anti-join makes sense. You wouldn't want to
include those in a plan.
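
For readers skimming the archive: a minimal sketch of the streaming
merge-based anti-join Weston mentions further down in the quoted thread,
assuming both the scanned row indices and the delete indices arrive in
ascending order (an illustration only, not code from Acero or Lance):

    /// Keep only row indices that do not appear in the sorted delete list.
    fn surviving_rows(
        rows: impl IntoIterator<Item = u64>,
        deletes: impl IntoIterator<Item = u64>,
    ) -> Vec<u64> {
        let mut deletes = deletes.into_iter().peekable();
        let mut out = Vec::new();
        for row in rows {
            // Advance past delete entries that fall before the current row.
            while matches!(deletes.peek(), Some(&d) if d < row) {
                deletes.next();
            }
            match deletes.peek() {
                Some(&d) if d == row => {
                    deletes.next(); // row is deleted; drop it
                }
                _ => out.push(row),
            }
        }
        out
    }

Because both inputs are consumed in order, this is a single pass with no
hash table, which is why delete files already sorted by row index make the
streaming approach attractive.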

On Fri, Jun 2, 2023 at 7:38 AM Weston Pace  wrote:

> Also, for clarity, I do agree with Gang that these are both valuable
> features in their own right.  A mask makes a lot of sense for page indices.
>
> On Fri, Jun 2, 2023 at 7:36 AM Weston Pace  wrote:
>
> > > then I think the incremental cost of adding the
> > > positional deletes to the mask is probably lower than the anti-join.
> > Do you mean developer cost?  Then yes, I agree.  Although there may be
> > some subtlety in the pushdown to connect a dataset filter to a parquet
> > reader filter.
> >
> > The main downside with using the mask (or any solution based on a filter
> > node / filtering) is that it requires that the delete indices go into the
> > plan itself.  So you need to first read the delete files and then create
> > the plan.  And, if there are many deleted rows, this can be costly.
> >
> > On Thu, Jun 1, 2023 at 7:13 PM Will Jones 
> wrote:
> >
> >> That's a good point, Gang. To perform deletes, we definitely need the
> row
> >> index, so we'll want that regardless of whether it's used in scans.
> >>
> >> > I'm not sure a mask would be the ideal solution for Iceberg (though it
> >> is
> >> a reasonable feature in its own right) because I think position-based
> >> deletes, in Iceberg, are still done using an anti-join and not a filter.
> >>
> >> For just positional deletes in isolation, I agree the mask wouldn't be
> >> more
> >> optimal than the anti-join. But if they end up using the mask for
> >> filtering
> >> with the page index, then I think the incremental cost of adding the
> >> positional deletes to the mask is probably lower than the anti-join.
> >>
> >> On Thu, Jun 1, 2023 at 6:33 PM Gang Wu  wrote:
> >>
> >> > IMO, adding a row_index column from the reader is orthogonal to
> >> > the mask implementation. Table formats (e.g. Apache Iceberg and
> >> > Delta) require the knowledge of row index to finalize row deletion. It
> >> > would be trivial to natively support row index from the file reader.
> >> >
> >> > Best,
> >> > Gang
> >> >
> >> > On Fri, Jun 2, 2023 at 3:40 AM Weston Pace 
> >> wrote:
> >> >
> >> > > I agree that having a row_index is a good approach.  I'm not sure a
> >> mask
> >> > > would be the ideal solution for Iceberg (though it is a reasonable
> >> > feature
> >> > > in its own right) because I think position-based deletes, in
> Iceberg,
> >> are
> >> > > still done using an anti-join and not a filter.
> >> > >
> >> > > That being said, we probably also want to implement a streaming
> >> > merge-based
> >> > > anti-join because I believe delete files are ordered by row_index
> and
> >> so
> >> > a
> >> > > streaming approach is likely to be much more performant.
> >> > >
> >> > > On Mon, May 29, 2023 at 4:01 PM Will Jones  >
> >> > > wrote:
> >> > >
> >> > > > Hi Rusty,
> >> > > >
> >> > > > At first glance, I think adding a row_index column would make
> >> sense. To
> >> > > be
> >> > > > clear, this would be an index within a file / fragment, not across
> >> > > multiple
> >> > > > files, which don't necessarily have a known ordering in Acero
> >> (IIUC).
> >> > > >
> >> > > > However, another approach would be to take a mask argument in the
> >> > Parquet
> >> > > > reader. We may wish to do this anyway to support
> >> predicate
> >> > > > pushdown with Parquet's page index. While Arrow C++ hasn't yet
> >> > > implemented
> >> > > > predicate pushdown on page index (right now just supports row groups) [...]

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-01 Thread Will Jones
That's a good point, Gang. To perform deletes, we definitely need the row
index, so we'll want that regardless of whether it's used in scans.

> I'm not sure a mask would be the ideal solution for Iceberg (though it is
a reasonable feature in its own right) because I think position-based
deletes, in Iceberg, are still done using an anti-join and not a filter.

For just positional deletes in isolation, I agree the mask wouldn't be more
optimal than the anti-join. But if they end up using the mask for filtering
with the page index, then I think the incremental cost of adding the
positional deletes to the mask is probably lower than the anti-join.

On Thu, Jun 1, 2023 at 6:33 PM Gang Wu  wrote:

> IMO, adding a row_index column from the reader is orthogonal to
> the mask implementation. Table formats (e.g. Apache Iceberg and
> Delta) require the knowledge of row index to finalize row deletion. It
> would be trivial to natively support row index from the file reader.
>
> Best,
> Gang
>
> On Fri, Jun 2, 2023 at 3:40 AM Weston Pace  wrote:
>
> > I agree that having a row_index is a good approach.  I'm not sure a mask
> > would be the ideal solution for Iceberg (though it is a reasonable
> feature
> > in its own right) because I think position-based deletes, in Iceberg, are
> > still done using an anti-join and not a filter.
> >
> > That being said, we probably also want to implement a streaming
> merge-based
> > anti-join because I believe delete files are ordered by row_index and so
> a
> > streaming approach is likely to be much more performant.
> >
> > On Mon, May 29, 2023 at 4:01 PM Will Jones 
> > wrote:
> >
> > > Hi Rusty,
> > >
> > > At first glance, I think adding a row_index column would make sense. To
> > be
> > > clear, this would be an index within a file / fragment, not across
> > multiple
> > > files, which don't necessarily have a known ordering in Acero (IIUC).
> > >
> > > However, another approach would be to take a mask argument in the
> Parquet
> > > reader. We may wish to do this anyway to support predicate
> > > pushdown with Parquet's page index. While Arrow C++ hasn't yet
> > implemented
> > > predicate pushdown on page index (right now just supports row groups),
> > > Arrow Rust has and provides an API to pass in a mask to support it. The
> > > reason for this implementation is described in the blog post "Querying
> > > Parquet with Millisecond Latency" [1], under "Page Pruning". The
> > > RowSelection struct API is worth a look [2].
> > >
> > > I'm not yet sure which would be preferable, but I think adopting a
> > similar
> > > pattern to what the Rust community has done may be wise. It's possible
> > that
> > > row_index is easy to implement while the mask will take time, in which
> > case
> > > row_index makes sense as an interim solution.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > [1]
> > >
> > >
> >
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> > > [2]
> > >
> > >
> >
> https://docs.rs/parquet/40.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html
> > >
> > > On Mon, May 29, 2023 at 2:12 PM Rusty Conover  >
> > > wrote:
> > >
> > > > Hi Arrow Team,
> > > >
> > > > I wanted to suggest an improvement regarding Acero's Scan node.
> > > > Currently, it provides useful information such as __fragment_index,
> > > > __batch_index, __filename, and __last_in_fragment. However, it would
> > > > be beneficial to have an additional column that returns an overall
> > > > "row index" from the source.
> > > >
> > > > The row index would start from zero and increment for each row
> > > > retrieved from the source, particularly in the case of Parquet files.
> > > > Is it currently possible to obtain this row index or would expanding
> > > > the Scan node's behavior be required?
> > > >
> > > > Having this row index column would be valuable in implementing
> support
> > > > for Iceberg's positional-based delete files, as outlined in the
> > > > following link:
> > > >
> > > > https://iceberg.apache.org/spec/#delete-formats
> > > >
> > > > While Iceberg's value-based deletes can already be performed using
> the
> > > > support for anti joins, using a projection node does not guarantee
> the
> > > > row ordering within an Acero graph. Hence, the inclusion of a
> > > > dedicated row index column would provide a more reliable solution in
> > > > this context.
> > > >
> > > > Thank you for considering this suggestion.
> > > >
> > > > Rusty
> > > >
> > >
> >
>


Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-05-29 Thread Will Jones
Hi Rusty,

At first glance, I think adding a row_index column would make sense. To be
clear, this would be an index within a file / fragment, not across multiple
files, which don't necessarily have a known ordering in Acero (IIUC).

However, another approach would be to take a mask argument in the Parquet
reader. We may wish to do this anyway to support predicate
pushdown with Parquet's page index. While Arrow C++ hasn't yet implemented
predicate pushdown on page index (right now just supports row groups),
Arrow Rust has and provides an API to pass in a mask to support it. The
reason for this implementation is described in the blog post "Querying
Parquet with Millisecond Latency" [1], under "Page Pruning". The
RowSelection struct API is worth a look [2].

I'm not yet sure which would be preferable, but I think adopting a similar
pattern to what the Rust community has done may be wise. It's possible that
row_index is easy to implement while the mask will take time, in which case
row_index makes sense as an interim solution.

Best,

Will Jones

[1]
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
[2]
https://docs.rs/parquet/40.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html
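
To make the mask approach concrete, a small sketch against the parquet
crate API linked in [2]; the file name and row counts are hypothetical
(a 1000-row file in which rows 100-119 have been positionally deleted):

    use std::fs::File;

    use parquet::arrow::arrow_reader::{
        ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
    };

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let file = File::open("data.parquet")?;
        let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

        // Selectors are run-length encoded keep/skip spans over row indices:
        // keep rows 0..100, skip the 20 deleted rows, keep the remaining 880.
        let selection = RowSelection::from(vec![
            RowSelector::select(100),
            RowSelector::skip(20),
            RowSelector::select(880),
        ]);

        let reader = builder.with_row_selection(selection).build()?;
        for batch in reader {
            println!("read {} rows", batch?.num_rows());
        }
        Ok(())
    }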

On Mon, May 29, 2023 at 2:12 PM Rusty Conover 
wrote:

> Hi Arrow Team,
>
> I wanted to suggest an improvement regarding Acero's Scan node.
> Currently, it provides useful information such as __fragment_index,
> __batch_index, __filename, and __last_in_fragment. However, it would
> be beneficial to have an additional column that returns an overall
> "row index" from the source.
>
> The row index would start from zero and increment for each row
> retrieved from the source, particularly in the case of Parquet files.
> Is it currently possible to obtain this row index or would expanding
> the Scan node's behavior be required?
>
> Having this row index column would be valuable in implementing support
> for Iceberg's positional-based delete files, as outlined in the
> following link:
>
> https://iceberg.apache.org/spec/#delete-formats
>
> While Iceberg's value-based deletes can already be performed using the
> support for anti joins, using a projection node does not guarantee the
> row ordering within an Acero graph. Hence, the inclusion of a
> dedicated row index column would provide a more reliable solution in
> this context.
>
> Thank you for considering this suggestion.
>
> Rusty
>


Re: New datatype: Huge integers & decimals

2023-05-23 Thread Will Jones
Hello Arrow devs,

I actually have a use case where we'd like to support a new number type in
Arrow, but instead of larger numbers, smaller ones. :) For machine learning
use cases, we at Lance would like to support bfloat16 [2]. These are 16-bit
floating-point numbers that trade significand bits for exponent bits, so they
have the same range as float32 but less precision than float16. They are
natively supported on newer AI-focused silicon [1].

I'm just starting to look at this, so not yet sure what the pros and cons
are of implementing it as an extension type versus a native Arrow type. My
initial ideas:

Pros of an extension type:
* It can be moved through Arrow-native systems that don't implement it, as
long as they preserve extension type information.

Pros of a native type:
* We have established patterns for writing compute kernels for natively
supported types.

If we were to implement these as extension types, I think bfloat16 and the
number types Ian Joiner mentions would be best implemented as extension
types based on fixed-size binary. We have a native float16 type already,
but I think if we made bfloat16 an extension type based on that, it could get
accidentally manipulated as a float16, which IIUC would be invalid.

If anyone has any advice from our work thus far on extension types, I'd
welcome your input.

Best,

Will Jones

[1]
https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
[2] https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
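
As a concrete sketch of the fixed-size-binary route, assuming arrow-rs's
Field metadata API; the `lance.bfloat16` extension name here is hypothetical:

    use std::collections::HashMap;
    use arrow_schema::{DataType, Field};

    // A bfloat16 value is 2 bytes wide, so store it as fixed-size binary
    // and tag the field so consumers don't treat it as a native float16.
    fn bfloat16_field(name: &str) -> Field {
        let metadata = HashMap::from([(
            "ARROW:extension:name".to_string(),
            "lance.bfloat16".to_string(), // hypothetical namespaced name
        )]);
        Field::new(name, DataType::FixedSizeBinary(2), true)
            .with_metadata(metadata)
    }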

On Tue, May 23, 2023 at 10:49 AM Antoine Pitrou  wrote:

>
> Your question seems unspecific, but we now have the possibility of
> standardizing canonical extension types (which are, of course, optional
> to implement and support):
>
> https://arrow.apache.org/docs/format/CanonicalExtensions.html
>
>
> On 23/05/2023 at 19:45, Ian Joiner wrote:
> > That’s a possibility. Do we consider officially support them?
> >
> >
> > On Tuesday, May 23, 2023, Antoine Pitrou  wrote:
> >
> >>
> >> I'm not sure what you're actually proposing here. A new extension type
> >> perhaps?
> >>
> >>
> >> On 23/05/2023 at 19:13, Ian Joiner wrote:
> >>
> >>> Hi,
> >>>
> >>> We need to have really large integers (with 128, 256 and 512 bits) as
> well
> >>> as decimals (up to at least decimal1024) because they do actually
> exist in
> >>> crypto / web3 space.
> >>>
> >>> See https://docs.rs/primitive-types/latest/primitive_types/ for an
> >>> example
> >>> of what needs to be supported.
> >>>
> >>> If accepted we can implement the types for C++/Python and Rust.
> >>>
> >>> Thanks,
> >>> Ian
> >>>
> >>>
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Will Jones
Hello Arrow devs,

> I don't understand why we would start deprecating features in the Arrow
> format. Even starting this talk might already be a bad idea PR-wise.
>

I agree we don't want to make breaking changes to the Arrow format. But
several maintainers have already stated they have no interest in
maintaining both list types with full compute functionality [1][2], so I
think it's very likely one list type or the other will be
implicitly preferred in the ecosystem if this data type were added.  If
that's the case, I'd prefer that we agree as a community on which one should
be preferred. Maybe that's not the best path; it's just one way for us to
balance stability, maintenance burden, and relevance.

> Can someone help distill down the primary rationale and usecase for
> adding ArrayView to the Arrow Spec?
>

Looking back at that old thread, I think one of the main motivations is to
try to prevent query engine implementers from feeling there is a tradeoff
between having state-of-the-art performance and being Arrow-native. For
some of the new array types, we had both Velox and DuckDB use them, so it
was reasonable to expect they were innovations that might proliferate. I'm
not sure if the ArrayView is part of that. From Wes earlier [3]:

> The idea is that in a world of data and query federation (for example,
> consider [1] where Arrow is being used as a data federation layer with many
> query engines), we want to increase the amount of data in-flight and
> in-memory that is in Arrow format. So if query engines are having to depart
> substantially from the Arrow format to get performance, then this creates a
> potential lose-lose situation:
> * Depart from Arrow: get better performance but pay serialization costs to
> read and write Arrow (the performance and resource utilization benefits
> outweigh the serialization costs). This puts additional pressure on query
> engines to build specialized components for solving problems rather than
> making use of off-the-shelf components that use Arrow. This has knock-on
> effects on ecosystem fragmentation.
> * Or use Arrow, and accept suboptimal query processing performance
>


> Will mentions one usecase is Velox consuming python UDF output, which seems
> to be mostly about how fast Velox can consume this format, not how fast it
> can be written. Are there other usecases?
>

To be clear, I don't know if that's the use case they want. That's just me
speculating.

I still have some questions myself:

1. Is this array type currently only used in Velox? (not DuckDB like some
of the other new types?) What evidence do we have that it will become used
outside of Velox?
2. We already have three list types: list, large list (64-bit offsets), and
fixed size list. Do we think we will only want a view version of the 32-bit
offset variable length list? Or are we potentially talking about view
variants for all three?

Best,

Will Jones

[1] https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
[2] https://lists.apache.org/thread/cc4w3vs3foj1fmpq9x888k51so60ftr3
[3] https://lists.apache.org/thread/mk2yn62y6l8qtngcs1vg2qtwlxzbrt8t

On Mon, May 22, 2023 at 3:48 AM Andrew Lamb  wrote:

> Can someone help distill down the primary rationale and usecase for
> adding ArrayView to the Arrow Spec?
>
> From the above discussions, the stated rationale seems to be fast
> (zero-copy) interchange with Velox.
>
> This thread has qualitatively enumerated the benefits of (offset+len)
> encoding over the existing Arrow ListArray (offets) approach, but I haven't
> seen any performance measurements that might help us to gauge the tradeoff
> in additional complexity vs runtime overhead.
>
> Will mentions one usecase is Velox consuming python UDF output, which seems
> to be mostly about how fast Velox can consume this format, not how fast it
> can be written. Are there other usecases?
>
> Do we have numbers showing how much overhead converting to /from Velox's
> internal representation and the existing ListArray adds? Has anyone in
> Velox land considered adding faster support for Arrow style ListArray
> encoding?
>
>
> Andrew
>
> On Mon, May 22, 2023 at 4:38 AM Antoine Pitrou  wrote:
>
> >
> > Hi,
> >
> > I don't understand why we would start deprecating features in the Arrow
> > format. Even starting this talk might already be a bad idea PR-wise.
> >
> > As for implementing conversions at the I/O boundary, it's a reasonably
> > policy, but it still requires work by implementors and it's not granted
> > that all consumers of the Arrow format will grow such conversions
> > if/when we add non-trivial types such as ListView or StringView.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 22/05/2023 at 00:39, Will Jones wrote:
> > > One more thing: Looking back on the previous discussion [...]

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Will Jones
One more thing: Looking back on the previous discussion[1] (which Weston
pointed out in his earlier message), Jorge suggested that the old list
types might be deprecated in favor of view variants [2]. Others were
worried that it might undermine the perception that the Arrow format is
stable. I think it might be worth thinking about "soft deprecating" the old
list type: suggesting new implementations prefer the list view, but
reassuring users that implementations should still support the old format, even if they
just convert to the new format. To be clear, this wouldn't mean we plan to
create breaking changes in the format; but if we ever did for other
reasons, the old list type might go.

Arrow compute libraries could choose either format for compute support, and
plan to do conversion at the boundaries. Libraries that use the new type
will have cheap conversion when reading the old type. Meanwhile those that
are still on the old type will have some incentive to move towards the new
one, since that conversion will not be as efficient.

[1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[2] https://lists.apache.org/thread/smn13j1rnt23mb3fwx75sw3f877k3nwx
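
As a sketch of why the old-to-new direction is cheap: ListView offsets and
sizes can be derived from List offsets in one pass, reusing the values
buffer unchanged (an illustration only; no Arrow implementation shipped a
ListView type at the time of writing):

    // A List with n elements carries n + 1 offsets; a ListView reuses the
    // first n of them and computes each size as the gap to the next offset.
    fn list_to_list_view(offsets: &[i32]) -> (Vec<i32>, Vec<i32>) {
        let n = offsets.len().saturating_sub(1);
        let view_offsets = offsets[..n].to_vec();
        let sizes: Vec<i32> = offsets.windows(2).map(|w| w[1] - w[0]).collect();
        (view_offsets, sizes)
    }

The reverse direction is only this easy when the views are already sorted
and non-overlapping; otherwise the values buffer must be rewritten, which
is the asymmetry described above.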

On Sun, May 21, 2023 at 3:07 PM Will Jones  wrote:

> Hello,
>
> I think Sasha brings up a good point, that the advantages of this format
> seem to be primarily about query processing. Other encodings like REE and
> dictionary have space-saving advantages that justify them simply in terms
> of space efficiency (although they have query processing advantages as
> well). As discussed, most use cases are already well served by existing
> list types and dictionary encoding.
>
> I agree that there are cases where transferring this type without
> conversion would be ideal. One use case I can think of is if Velox wants to
> be able to take Arrow-based UDFs (possibly written with PyArrow, for
> example) that operate on this column type and therefore wants zero-copy
> exchange over the C Data Interface.
>
> One big question I have: we already have three list types: list, large
> list (64-bit offsets), and fixed size list. Do we think we will only want a
> view version of the 32-bit offset variable length list? Or are we
> potentially talking about view variants for all three?
>
> Best,
>
> Will Jones
>
>
> On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
>> The benefit of having a memory format that’s friendly to non-deterministic
>> order writes is unlocked by the transport and processing of the data being
>> agnostic to the physical order as much as possible.
>>
>> Requiring a conversion could cancel out that benefit. But it can be a
>> provisory step for compatibility between systems that don’t understand the
>> format yet. This is similar to the situation with compression schemes like
>> run-end encoding — the goal is processing the compressed data directly
>> without an expansion step whenever possible.
>>
>> This is why having it as part of the open Arrow format is so important:
>> everyone can agree on a format that’s friendly to parallel and/or
>> vectorized compute kernels without introducing multiple incompatible
>> formats to the ecosystem and without imposing a conversion step between
>> the
>> different systems.
>>
>> —
>> Felipe
>>
>> On Sat, 20 May 2023 at 20:04 Aldrin  wrote:
>>
>> > I don't feel like this representation is necessarily a detail of the
>> query
>> > engine, but I am also not sure why this representation would have to be
>> > converted to a non-view format when serializing. Could you clarify
>> that? My
>> > impression is that this representation could be used for persistence or
>> > data transfer, though it can be more complex to guarantee the portion of
>> > the buffer that an index points to is also present in memory.
>> >
>> > Sent from Proton Mail for iOS
>> >
>> >
>> > On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
>> krassovskysa...@gmail.com
>> > > wrote:
>> >
>> > Hi everyone,
>> > I understand that there are numerous benefits to this representation
>> > during query processing, but would it be fair to say that this is an
>> > implementation detail of the query engine? Query engines don’t
>> necessarily
>> > need to conform to the Arrow format internally, only at ingest/egress
>> > points, and performing a conversion from the non-view to view format
>> seems
>> > like it would be very cheap (though I understand not necessarily the
>> other
>> > way around, but you’d need to do that anyway if you’re serializing).
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Will Jones
Hello,

I think Sasha brings up a good point, that the advantages of this format
seem to be primarily about query processing. Other encodings like REE and
dictionary have space-saving advantages that justify them simply in terms
of space efficiency (although they have query processing advantages as
well). As discussed, most use cases are already well served by existing
list types and dictionary encoding.

I agree that there are cases where transferring this type without
conversion would be ideal. One use case I can think of is if Velox wants to
be able to take Arrow-based UDFs (possibly written with PyArrow, for
example) that operate on this column type and therefore wants zero-copy
exchange over the C Data Interface.

One big question I have: we already have three list types: list, large list
(64-bit offsets), and fixed size list. Do we think we will only want a view
version of the 32-bit offset variable length list? Or are we potentially
talking about view variants for all three?

Best,

Will Jones


On Sun, May 21, 2023 at 2:19 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> The benefit of having a memory format that’s friendly to non-deterministic
> order writes is unlocked by the transport and processing of the data being
> agnostic to the physical order as much as possible.
>
> Requiring a conversion could cancel out that benefit. But it can be a
> provisory step for compatibility between systems that don’t understand the
> format yet. This is similar to the situation with compression schemes like
> run-end encoding — the goal is processing the compressed data directly
> without an expansion step whenever possible.
>
> This is why having it as part of the open Arrow format is so important:
> everyone can agree on a format that’s friendly to parallel and/or
> vectorized compute kernels without introducing multiple incompatible
> formats to the ecosystem and without imposing a conversion step between the
> different systems.
>
> —
> Felipe
>
> On Sat, 20 May 2023 at 20:04 Aldrin  wrote:
>
> > I don't feel like this representation is necessarily a detail of the
> query
> > engine, but I am also not sure why this representation would have to be
> > converted to a non-view format when serializing. Could you clarify that?
> My
> > impression is that this representation could be used for persistence or
> > data transfer, though it can be more complex to guarantee the portion of
> > the buffer that an index points to is also present in memory.
> >
> > Sent from Proton Mail for iOS
> >
> >
> > On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> krassovskysa...@gmail.com
> > > wrote:
> >
> > Hi everyone,
> > I understand that there are numerous benefits to this representation
> > during query processing, but would it be fair to say that this is an
> > implementation detail of the query engine? Query engines don’t
> necessarily
> > need to conform to the Arrow format internally, only at ingest/egress
> > points, and performing a conversion from the non-view to view format
> seems
> > like it would be very cheap (though I understand not necessarily the
> other
> > way around, but you’d need to do that anyway if you’re serializing).
> >
> > Sasha Krassovsky
> >
> > > On May 20, 2023, at 13:00, Will Jones 
> > wrote:
> > >
> > > Thanks for sharing these details, Pedro. The conditional branches
> > argument
> > > makes a lot of sense to me.
> > >
> > > The tensors point brings up some interesting issues. For now, we've
> > defined
> > > our only tensor extension type to be built on a fixed size list. If a
> use
> > > case of this might be manipulating tensors with zero copy, perhaps that
> > > suggests that we want a fixed size list variant? In addition, would we
> > have
> > > to define another extension type to be a ListView variant? Or would we
> > want
> > > to think about making extension types somehow valid across various
> > > encodings of the same "logical type"?
> > >
> > >> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
> > >>  wrote:
> > >>
> > >> Hi all,
> > >>
> > >> This is Pedro from the Velox team at Meta. This is my first time here,
> > so
> > >> nice to e-meet you!
> > >>
> > >> Adding to what Felipe said, the main reason we created “ListView”
> > (though
> > >> we just call them ArrayVector/MapVector in Velox) is that, along with
> > >> StringViews for strings, they allow us to write any columnar buffer
> > >> out-or-

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Will Jones
ing conversion at the
> edges. We already have issues with the amount of code generation
> resulting in binary bloat and long compile times, and I worry this would
> worsen this situation whilst not really providing compelling advantages
> for the vast majority of workloads that don't interact with Velox.
> Whilst I can definitely see that the ListView representation is probably
> a better way to represent variable length lists than what arrow settled
> upon, I'm not yet convinced it is sufficiently better to incentivise
> broad ecosystem adoption.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> On 11/05/2023 21:20, Will Jones wrote:
> > Hi Felipe,
> >
> > Thanks for the additional details.
> >
> >
> >> Velox kernels benefit from being able to append data to the array from
> >> different threads without care for strict ordering. Only the offsets
> array
> >> has to be written according to logical order but that is potentially a
> much
> >> smaller buffer than the values buffer.
> >>
> > It seems to me like the applications are still pretty niche, as I
> suspect
> > in most cases the benefits are outweighed by the costs. The benefit here
> > seems pretty limited: if you are trying to split work between threads,
> > usually you will have other levels such as array chunks to parallelize.
> And
> > if you have an incoming stream of row data, you'll want to append in
> > predictable order to match the order of the other arrays. Am I missing
> > something?
> >
> > And, IIUC, the cost of using ListView with out-of-order values over
> > ListArray is you lose memory locality; the values of element 2 are no
> > longer adjacent to the values of element 1. What do you think about that
> > tradeoff?
> >
> > I don't mean to be difficult about this. I'm excited for both the REE and
> > StringView arrays, but this one I'm not so sure about yet. I suppose
> what I
> > am trying to ask is, if we added this, do we think many Arrow and query
> > engine implementations (for example, DataFusion) will be eager to add
> full
> > support for the type, including compute kernels? Or are they likely to
> just
> > convert this type to ListArray at import boundaries?
> >
> > Because if it turns out to be the latter, then we might as well ask Velox
> > to export this type as ListArray and save the rest of the ecosystem some
> > work.
> >
> > Best,
> >
> > Will Jones
> >
> > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
> > felipe...@gmail.com> wrote:
> >
> >> Initial reason for ListView arrays in Arrow is zero-copy compatibility
> with
> >> Velox which uses this format.
> >>
> >> Velox kernels benefit from being able to append data to the array from
> >> different threads without care for strict ordering. Only the offsets
> array
> >> has to be written according to logical order but that is potentially a
> much
> >> smaller buffer than the values buffer.
> >>
> >> Acero kernels could take advantage of that in the future.
> >>
> >> In implementing ListViewArray/Type I was able to reuse some C++
> templates
> >> used for ListArray which can reduce some of the burden on kernel
> >> implementations that aim to work with all the types.
> >>
> >> I’m can fix Acero kernels for working with ListView. This is similar to
> the
> >> work I’ve doing in kernels dealing with run-end encoded arrays.
> >>
> >> —
> >> Felipe
> >>
> >>
> >> On Wed, 26 Apr 2023 at 01:03 Will Jones  wrote:
> >>
> >>> I suppose one common use case is materializing list columns after some
> >>> expanding operation like a join or unnest. That's a case where I could
> >>> imagine a lot of repetition of values. Haven't yet thought of common
> >> cases
> >>> where there is overlap but not full duplication, but am eager to hear
> >> any.
> >>> The dictionary encoding point Raphael makes is interesting, especially
> >>> given the existence of LargeList and FixedSizeList. For many
> operations,
> >> it
> >>> might make more sense to just compose those existing types.
> >>>
> >>> IIUC the operations that would be unique to the ArrayView are ones
> >> altering
> >>> the shape. One could truncate each array to a certain length cheaply
> >> simply
> >>> by replacing the sizes buffer. [...]

Re: [VOTE][RUST] Release Apache Arrow Rust 40.0.0 RC1

2023-05-20 Thread Will Jones
+1 (binding)

Verified on Ubuntu 22.04. Thanks Raphael!

On Fri, May 19, 2023 at 10:05 AM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on M1 Mac.
>
> Thanks Raphael
>
> On Fri, May 19, 2023 at 6:37 AM Andrew Lamb  wrote:
> >
> > +1 (binding)
> >
> > Verified on mac osx x86_64
> >
> > Thank you Raphael
> >
> > On Fri, May 19, 2023 at 8:49 AM Raphael Taylor-Davies
> >  wrote:
> >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > version 40.0.0.
> > >
> > > This release candidate is based on commit:
> > > 25bfccca58ff219d9f59ba9f4d75550493238a4f [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> > >
> https://github.com/apache/arrow-rs/tree/25bfccca58ff219d9f59ba9f4d75550493238a4f
> > > [2]:
> > >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-40.0.0-rc1
> > > [3]:
> > >
> > >
> https://github.com/apache/arrow-rs/blob/25bfccca58ff219d9f59ba9f4d75550493238a4f/CHANGELOG.md
> > > [4]:
> > >
> > >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > >
> > >
>


Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.6.0 RC1

2023-05-20 Thread Will Jones
+1 (binding)

Verified on Ubuntu 22.04.

Thanks Raphael!

On Thu, May 18, 2023 at 8:57 AM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on M1 Mac.
>
> Thanks Raphael.
>
> On Thu, May 18, 2023 at 3:31 AM Andrew Lamb  wrote:
> >
> > +1 (binding)
> >
> > I ran the release verification script (Mac x86_64) and reviewed the
> > changelog . Looks like a good release.
> >
> > Thank you,
> > Andrew
> >
> > On Thu, May 18, 2023 at 5:25 AM Raphael Taylor-Davies
> >  wrote:
> >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Object
> > > Store Implementation, version 0.6.0.
> > >
> > > This release candidate is based on commit:
> > > ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust Object Store
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
> > >
> > > [1]:
> > >
> > >
> https://github.com/apache/arrow-rs/tree/ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb
> > > [2]:
> > >
> > >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.6.0-rc1
> > > [3]:
> > >
> > >
> https://github.com/apache/arrow-rs/blob/ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb/object_store/CHANGELOG.md
> > > [4]:
> > >
> > >
> https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
> > >
> > >
>


Re: [DISCUSS] Interest in a 12.0.1 patch?

2023-05-18 Thread Will Jones
Thanks for bringing this up Weston.

Joris has already created a 12.0.1 milestone that contains several fixes
that are candidates for backport [1], including this one. I think this is
the most severe issue though.

As a maintainer of the Python deltalake package, which uses the PyArrow
Parquet writer and is often passed pandas data, I would appreciate a patch
release.

Best,

Will Jones

[1]
https://github.com/apache/arrow/issues?q=is%3Aopen+is%3Aissue+milestone%3A12.0.1

On Thu, May 18, 2023 at 10:18 AM Ian Cook  wrote:

> There is also a major issue with the 12.0.0 R package that has now
> been fixed in the repo [2] and needs to be resubmitted to CRAN soon.
> The R package developers are supportive of a 12.0.1 patch release
> happening soon so that the resubmission of the R package to CRAN can
> also include the fix for the performance regression you mention.
>
> Ian
>
> [2] https://github.com/apache/arrow/pull/35612
>
> On Thu, May 18, 2023 at 1:04 PM Weston Pace  wrote:
> >
> > Regrettably, 12.0.0 had a significant performance regression (I'll take
> the
> > blame for not thinking through all the use cases), most easily exposed
> when
> > writing datasets from pandas / numpy data, which is being addressed in
> > [1].  I believe this to be a fairly common use case and it may warrant a
> > 12.0.1 patch.  Are there other issues that would need a patch?  Do we
> feel
> > this issue is significant enough to justify the work?
> >
> > [1] https://github.com/apache/arrow/pull/35565
>


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-05-16 Thread Will Jones
Hello Ben,

Thanks for your work on this. I think this will be an excellent addition to
the format.

If I understand correctly, multiple arrays can reference the same buffers
in memory, but once they are written to IPC their data buffers will be
duplicated. Is that right?

Dictionary types have a special message so they can be reused across
batches and even fields. Did we consider adding a similar message for
string view buffers?

One relevant use case I'm curious about is substring extraction. For
example, if I have a column of URIs and I create columns where I've
extracted substrings like the hostname, path, and a list column of query
parameters, I'd like for those latter columns to be views into the URI
buffers, rather than full copies.

However, I've never touched the IPC read code paths, so it's quite possible
I'm overlooking something obvious.

Best,

Will Jones


On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington
 wrote:

> Very cool!
>
> In addition to performance mentioned above, I could see this being
> useful for the R bindings - we already have a global string pool and a
> mechanism for keeping a vector of them alive.
>
> I don't see the C Data interface in the PR although I may have missed
> it - is that a part of the proposal? It seems like it would be
> possible to use raw pointers as long as they can be guaranteed to be
> valid until the release callback is called?
>
> On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak
>  wrote:
> >
> > Hello Everyone,
> > I think keeping interoperability with the large ecosystem is a very
> > important goal for arrow so I am overall in favor of this proposal!
> >
> > You mention benchmarks multiple times, are these results published
> > somewhere?
> >
> > Thanks
> >
> > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman 
> > wrote:
> >
> > > Hello all,
> > >
> > > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox
> > > compatible
> > > "string view" type could bring several performance benefits to access
> and
> > > authoring of string data in the arrow format [2]. Additionally better
> > > interoperability with engines already using this format could be
> > > established.
> > >
> > > PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation
> and
> > > to
> > > the IPC format. For the purposes of IPC raw pointers are not used.
> Instead,
> > > each view contains a pair of 32 bit unsigned integers which encode the
> > > index of
> > > a character buffer (string view arrays may consist of a variable
> number of
> > > such buffers) and the offset of a view's data within that buffer
> > > respectively.
> > > Benefits of this substitution include:
> > > - This makes explicit the guarantee that lifetime of all character
> data is
> > > equal
> > >   to that of the array which views it, which is critical for confident
> > >   consumption across an interface boundary.
> > > - As with other types in the arrow format, such arrays are
> serializable and
> > >   venue agnostic; directly usable in shared memory without
> modification.
> > > - Indices and offsets are easily validated.
> > >
> > > Accessing the data requires some trivial pointer arithmetic, but in
> > > benchmarking
> > > this had negligible impact on sequential access and only minor impact
> on
> > > random
> > > access.
> > >
> > > In the C++ implementation, raw pointer string views are supported as an
> > > extended
> > > case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`.
> > > Branching on
> > > this access pattern bit at the data type level has negligible impact on
> > > performance since the branch resides outside any hot loops. Utility
> > > functions
> > > are provided for efficient (potentially in-place) conversion between
> raw
> > > pointer
> > > and index offset views. For example, the C++ implementation could zero
> copy
> > > a raw pointer array from Velox, filter it, then convert to
> index/offset for
> > > serialization. Other implementations may choose to accommodate or
> eschew
> > > raw
> > > pointer views as their communities direct.
> > >
> > > Where desirous in a rigorously controlled context this still enables
> > > construction
> > > and safe consumption of string view arrays which reference memory not
> > > directly bound to the lifetime of the array. I'm not sure when or if we
> > > would
> > > find 

Re: Reusing RecordBatch objects and their memory space

2023-05-12 Thread Will Jones
Hello,

I'm not sure if there are easy ways to avoid calling the destructors.
However, I would point out memory space reuse is handled through memory
pools; if you have one enabled it shouldn't be handing memory back to the
OS between each iteration.

Best,

Will Jones

On Fri, May 12, 2023 at 9:59 AM SHI BEI  wrote:

> Hi community,
>
>
> I'm using the RecordBatchReader::ReadNext interface to read Parquet
> data in my project, and I've noticed that there are a lot of temporary
> object destructors being generated during usage. Has the community
> considered providing an interface to reuse RecordBatch objects
> and their memory space for storing data?
>
>
>
>
> SHIBEI
> shibei...@foxmail.com


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Will Jones
Hi Felipe,

Thanks for the additional details.


> Velox kernels benefit from being able to append data to the array from
> different threads without care for strict ordering. Only the offsets array
> has to be written according to logical order but that is potentially a much
> smaller buffer than the values buffer.
>

It seems to me like the applications are still pretty niche, as I suspect
in most cases the benefits are outweighed by the costs. The benefit here
seems pretty limited: if you are trying to split work between threads,
usually you will have other levels such as array chunks to parallelize. And
if you have an incoming stream of row data, you'll want to append in
predictable order to match the order of the other arrays. Am I missing
something?

And, IIUC, the cost of using ListView with out-of-order values over
ListArray is you lose memory locality; the values of element 2 are no
longer adjacent to the values of element 1. What do you think about that
tradeoff?
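
To illustrate the tradeoff (my own example, not taken from the Velox docs):

    // Logical array: [[1, 2], [], [3, 4, 5]]
    //
    // List (offsets only; values must land in logical order):
    //   offsets: [0, 2, 2, 5]
    //   values:  [1, 2, 3, 4, 5]
    //
    // ListView (offsets + sizes; values may be written out of order, e.g.
    // when two threads finish in a different order than the logical one):
    //   offsets: [3, 0, 0]
    //   sizes:   [2, 0, 3]
    //   values:  [3, 4, 5, 1, 2]

Note how element 2's values now precede element 0's, which is exactly the
locality question above.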

I don't mean to be difficult about this. I'm excited for both the REE and
StringView arrays, but this one I'm not so sure about yet. I suppose what I
am trying to ask is, if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be eager to add full
support for the type, including compute kernels? Or are they likely to just
convert this type to ListArray at import boundaries?

Because if it turns out to be the latter, then we might as well ask Velox
to export this type as ListArray and save the rest of the ecosystem some
work.

Best,

Will Jones

On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> Initial reason for ListView arrays in Arrow is zero-copy compatibility with
> Velox which uses this format.
>
> Velox kernels benefit from being able to append data to the array from
> different threads without care for strict ordering. Only the offsets array
> has to be written according to logical order but that is potentially a much
> smaller buffer than the values buffer.
>
> Acero kernels could take advantage of that in the future.
>
> In implementing ListViewArray/Type I was able to reuse some C++ templates
> used for ListArray which can reduce some of the burden on kernel
> implementations that aim to work with all the types.
>
> I’m can fix Acero kernels for working with ListView. This is similar to the
> work I’ve doing in kernels dealing with run-end encoded arrays.
>
> —
> Felipe
>
>
> On Wed, 26 Apr 2023 at 01:03 Will Jones  wrote:
>
> > I suppose one common use case is materializing list columns after some
> > expanding operation like a join or unnest. That's a case where I could
> > imagine a lot of repetition of values. Haven't yet thought of common
> cases
> > where there is overlap but not full duplication, but am eager to hear
> any.
> >
> > The dictionary encoding point Raphael makes is interesting, especially
> > given the existence of LargeList and FixedSizeList. For many operations,
> it
> > might make more sense to just compose those existing types.
> >
> > IIUC the operations that would be unique to the ArrayView are ones
> altering
> > the shape. One could truncate each array to a certain length cheaply
> simply
> > by replacing the sizes buffer. Or perhaps there are interesting
> operations
> > on tensors that would benefit.
> >
> > On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
> >  wrote:
> >
> > > Unless I am missing something, I think the selection use-case could be
> > > equally well served by a dictionary-encoded BinarArray/ListArray, and
> > would
> > > have the benefit of not requiring any modifications to the existing
> > format
> > > or kernels.
> > >
> > > The major additional flexibility of the proposed encoding would be
> > > permitting disjoint or overlapping ranges, are these common enough in
> > > practice to represent a meaningful bottleneck?
> > >
> > >
> > > On 26 April 2023 01:40:14 BST, David Li  wrote:
> > > >Is there a need for a 64-bit offsets version the same way we have List
> > > and LargeList?
> > > >
> > > >And just to be clear, the difference with List is that the lists don't
> > > have to be stored in their logical order (or in other words, offsets do
> > not
> > > have to be nondecreasing and so we also need sizes)?
> > > >
> > > >On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> > > >> For context, there was some discussion on this back in [1].  At that
> > > time
> > > >> this was called "sequence view" but I do not like tha

Re: [VOTE] Release Apache Arrow ADBC 0.4.0 - RC0

2023-05-10 Thread Will Jones
+1 (binding)

Verified on Ubuntu 22 with USE_CONDA=1
dev/release/verify-release-candidate.sh 0.4.0 0

On Wed, May 10, 2023 at 2:27 PM Matt Topol  wrote:

> Using a manjaro linux image (in honor of the issues we found for Arrow v12
> rc) I ran:
> USE_CONDA=1 ./dev/release/verify-release-candidate.sh 0.4.0 0
>
> My first attempt failed because the default base image doesn't have make
> and such installed. Should we install that via conda too, since we install
> the compilers and toolchains through conda when USE_CONDA=1?
>
> After installing `base-devel` package which gives make/autoconf/etc
> everything ran properly and worked just fine for verifying the candidate.
>
> So I'm +1 on the release (I'm fine with requiring that the base-devel package
> be installed), but I wanted to bring up/suggest the idea of installing make
> through conda also. That said, it still has the same libcrypt.so.1 issue
> that we saw with the Arrow v12 release, maybe we should add a note in the
> documentation that the `libxcrypt-compat` package is needed to build on any
> `pacman` / ArchLinux based systems?
>
> On Wed, May 10, 2023 at 7:03 AM Raúl Cumplido 
> wrote:
>
> > +1
> >
> > I ran the following on Ubuntu 22.04:
> > USE_CONDA=1 ./dev/release/verify-release-candidate.sh 0.4.0 0
> >
> > El mié, 10 may 2023 a las 9:59, Sutou Kouhei ()
> > escribió:
> > >
> > > +1
> > >
> > > I ran the following on Debian GNU/Linux sid:
> > >
> > >   JAVA_HOME=/usr/lib/jvm/default-java \
> > > TEST_PYTHON_VERSIONS=3.11 \
> > > dev/release/verify-release-candidate.sh 0.4.0 0
> > >
> > > with:
> > >
> > >   * Python 3.11.2
> > >   * g++ (Debian 12.2.0-14) 12.2.0
> > >   * go version go1.19.8 linux/amd64
> > >   * openjdk version "17.0.6" 2023-01-17
> > >   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> > >   * R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > >
> > > In <831038c5-0ab3-4dae-80e3-07c882dce...@app.fastmail.com>
> > >   "[VOTE] Release Apache Arrow ADBC 0.4.0 - RC0" on Tue, 09 May 2023
> > 21:46:48 -0400,
> > >   "David Li"  wrote:
> > >
> > > > Hello,
> > > >
> > > > I would like to propose the following release candidate (RC0) of
> > Apache Arrow ADBC version 0.4.0. This is a release consisting of 47
> > resolved GitHub issues [1].
> > > >
> > > > This release candidate is based on commit:
> > cdb8fba8f6ca26647863224fb7fd9fc74097 [2]
> > > >
> > > > The source release rc0 is hosted at [3].
> > > > The binary artifacts are hosted at [4][5][6][7][8].
> > > > The changelog is located at [9].
> > > >
> > > > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [10] for how to validate a release
> candidate.
> > > >
> > > > See also a verification result on GitHub Actions [11].
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Release this as Apache Arrow ADBC 0.4.0
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Arrow ADBC 0.4.0 because...
> > > >
> > > > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> > DOCKER_DEFAULT_ARCHITECTURE=linux/amd64`. (Or skip this step by `export
> > TEST_APT=0 TEST_YUM=0`.)
> > > >
> > > > [1]:
> >
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.4.0%22+is%3Aclosed
> > > > [2]:
> >
> https://github.com/apache/arrow-adbc/commit/cdb8fba8f6ca26647863224fb7fd9fc74097
> > > > [3]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.4.0-rc0/
> > > > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > > > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > > > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > > > [7]:
> >
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > > > [8]:
> >
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.4.0-rc0
> > > > [9]:
> >
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.4.0-rc0/CHANGELOG.md
> > > > [10]:
> >
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > > > [11]: https://github.com/apache/arrow-adbc/actions/runs/4931782221
> >
>


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 24.0.0 RC1

2023-05-06 Thread Will Jones
+1 (binding)

Verified on x86_64 Ubuntu 22. Thanks Andy!

On Sat, May 6, 2023 at 1:28 PM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on M1 Mac.
>
> Thanks Andy.
>
> On Sat, May 6, 2023 at 6:26 AM Andy Grove  wrote:
> >
> > Hi,
> >
> > I would like to propose a release of Apache Arrow DataFusion
> Implementation,
> > version 24.0.0.
> >
> > This release candidate is based on commit:
> > 37b2c53f281b9550034e7e69f5acf1ae666a0da7 [1]
> > The proposed release tarball and signatures are hosted at [2].
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The vote will be open for at least 72 hours.
> >
> > Only votes from PMC members are binding, but all members of the community
> > are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > The standard verification procedure is documented at
> >
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > .
> >
> > [ ] +1 Release this as Apache Arrow DataFusion 24.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow DataFusion 24.0.0 because...
> >
> > Here is my vote:
> >
> > +1
> >
> > [1]:
> >
> https://github.com/apache/arrow-datafusion/tree/37b2c53f281b9550034e7e69f5acf1ae666a0da7
> > [2]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-24.0.0-rc1
> > [3]:
> >
> https://github.com/apache/arrow-datafusion/blob/37b2c53f281b9550034e7e69f5acf1ae666a0da7/CHANGELOG.md
>


Re: [VOTE][RUST] Release Apache Arrow Rust 39.0.0 RC1

2023-05-06 Thread Will Jones
+1 (binding)

Verified on x86_64 Ubuntu 22

Thanks Raphael!

On Fri, May 5, 2023 at 9:57 AM Andrew Lamb  wrote:

> +1 (binding)
>
> Verified on x86_64 mac
>
> Thanks Raphael
>
> On Fri, May 5, 2023 at 12:36 PM L. C. Hsieh  wrote:
>
> > +1 (binding)
> >
> > Verified on M1 Mac.
> >
> > Thanks Raphael.
> >
> > On Fri, May 5, 2023 at 7:46 AM Raphael Taylor-Davies
> >  wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > version 39.0.0.
> > >
> > > This release candidate is based on commit:
> > > 575a199fa669d75833c13a2a69d71255b9a9f2e6 [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> >
> https://github.com/apache/arrow-rs/tree/575a199fa669d75833c13a2a69d71255b9a9f2e6
> > > [2]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-39.0.0-rc1
> > > [3]:
> > >
> >
> https://github.com/apache/arrow-rs/blob/575a199fa669d75833c13a2a69d71255b9a9f2e6/CHANGELOG.md
> > > [4]:
> > >
> >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > >
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Matt Topol

2023-05-03 Thread Will Jones
Congrats and welcome Matt. Thank you for all your contributions thus far.

On Wed, May 3, 2023 at 10:44 AM vin jake  wrote:

> Congratulations, Matt!
>
> Felipe Oliveira Carvalho  于 2023年5月4日周四 01:42写道:
>
> > Congratulations, Matt!
> >
> > On Wed, 3 May 2023 at 14:37 Andrew Lamb  wrote:
> >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Matt Topol (zeroshade) to become a PMC member and we are pleased to
> > > announce
> > > that Matt has accepted.
> > >
> > > Congratulations and welcome!
> > >
> >
>


Re: [VOTE] Release Apache Arrow 12.0.0 - RC0

2023-04-29 Thread Will Jones
Sounds good. My vote is +1.

On Fri, Apr 28, 2023 at 9:59 PM Sutou Kouhei  wrote:

> Hi,
>
> Thanks for digging into this!
> I hadn't noticed that we could fix the libcrypt.so.1 error
> from Perl by installing libxcrypt-compat.
>
> https://github.com/apache/arrow/pull/35362 is another
> approach to avoid the error.
>
>
> Anyway, I think that we solved all reported problems for
> RC0. I think that we can close this vote and release 12.0.0.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: [VOTE] Release Apache Arrow 12.0.0 - RC0" on Fri, 28 Apr 2023
> 11:59:09 -0400,
>   Matthew Topol  wrote:
>
> > Okay, I confirmed that by running the following two commands, the test
> > failures that Jacob found on Manjarolinux were solved (at least in the
> > container I was using)
> >
> > $ pacman -S libxcrypt-compat
> > $ ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
> >
> > For the second command it looks like the /etc/localtime symbolic link
> > wasn't being set in the container and is leveraged by the Orc adapter
> > tests. So setting the localtime (to any valid zone info) was sufficient
> to
> > let the tests run and pass.
> >
> > Hope this helps!
> >
> >
> > On Fri, Apr 28, 2023 at 11:49 AM Matthew Topol 
> wrote:
> >
> >> Looks like this might be related:
> >>
> https://unix.stackexchange.com/questions/691479/how-to-deal-with-missing-libcrypt-so-1-on-arch-linux
> >> as manjaro also uses pacman and Arch Linux's packages.
> >>
> >> I'm re-running the verification right now after installing the
> recommended
> >> package in that thread. I'll report back if it solves the issue.
> >>
> >> --Matt
> >>
> >> On Fri, Apr 28, 2023 at 11:29 AM Matthew Topol 
> >> wrote:
> >>
> >>> @Kou: I was able to reproduce the libcrypto failure that Jacob saw
> using
> >>> https://hub.docker.com/r/manjarolinux/base though I did need to
> manually
> >>> install git first since it doesn't come with it.
> >>>
> >>> $ pacman -Syu git
> >>> $ git clone https://github.com/apache/arrow.git
> >>> $ cd arrow
> >>> $ TEST_DEFAULT=0 TEST_SOURCE=0 TEST_CPP=1 USE_CONDA=1
> >>> dev/release/verify-release-candidate.sh 12.0.0 0
> >>>
> >>> That set of commands was sufficient to reproduce the error, I believe (I
> >>> did this on Monday when I was poking around the failures but I
> definitely
> >>> managed to see the same error pop up in a run). I'm running it again
> right
> >>> now to confirm.
> >>>
> >>> --Matt
> >>>
> >>> On Thu, Apr 27, 2023 at 8:28 PM Sutou Kouhei 
> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Thanks for sharing the log.
> >>>>
> >>>> libcrypto.so isn't related to the segmentation fault. It's
> >>>> only involved in showing the backtrace.
> >>>>
> >>>> > perl: error while loading shared libraries: libcrypt.so.1:
> >>>> > cannot open shared object file: No such file or directory
> >>>>
> >>>> This happens at
> >>>>
> >>>>
> https://github.com/apache/arrow/blob/main/cpp/build-support/run-test.sh#L42
> >>>> :
> >>>>
> >>>>   TEST_NAME=$(echo $TEST_FILENAME | perl -pe 's/\..+?$//') # Remove path and extension (if any).
> >>>>
> >>>> BTW, it seems that we should remove a Perl dependency from
> >>>>
> https://github.com/apache/arrow/blob/main/cpp/build-support/run-test.sh
> >>>> ...
> >>>>
> >>>>
> >>>> I want to reproduce this problem in my environment. Could
> >>>> you share your environment information? Did you use Manjaro
> >>>> Linux for this too?
> >>>>
> >>>>
> >>>> Thanks,
> >>>> --
> >>>> kou
> >>>>
> >>>>
> >>>> In <
> canva0dgkodpfde7_b8xuvmtkh5kdmzvmtpbofo82hqj17gu...@mail.gmail.com>
> >>>>   "Re: [VOTE] Release Apache Arrow 12.0.0 - RC0" on Thu, 27 Apr 2023
> >>>> 23:54:58 +0200,
> >>>>   Jacob Wujciak  wrote:
> >>>>
> >>>> > I have uploaded the log [1] for the run using conda with gandiva
> >>>> > active. It looks like there is an issue with libcrypt.so causing these
> >>>> > tests to segfault.

Re: [WEBSITE] [DISCUSS] Arrow-Site blog post

2023-04-28 Thread Will Jones
Thanks for highlighting this, Matt.

I have added some comments in the document.

On Fri, Apr 28, 2023 at 2:34 PM Ian Cook  wrote:

> Hi Matt,
>
> I reviewed it and left a few very minor comments. Looks great to me.
>
> Do any PMC members wish to chime in? If not, it seems OK to give it 72
> hours from the time of your email here and then merge it.
>
> Thanks,
> Ian
>
>
> On Fri, Apr 28, 2023 at 11:41 AM Matt Topol 
> wrote:
> >
> > Hey All,
> >
> > Yevgeny Pats has contributed a blog post to the Arrow Site via PR[1].
> > detailing his company's usage of Arrow for their type system. I've
> reviewed
> > it and it looks good to me, but as I'm not a PMC member I didn't want to
> go
> > merging it and having it get published without input from others first
> > along with potentially coordinating *when* we should merge it to publish.
> >
> > So I'm hoping I can get a bit more eyes on this to give it a look over.
> >
> > Thanks all!
> >
> > --Matt
> >
> > [1]: https://github.com/apache/arrow-site/pull/348
>


Re: [VOTE] Release Apache Arrow 12.0.0 - RC0

2023-04-28 Thread Will Jones
I have managed to successfully verify this RC today, confirming the Pandas
issue was the only thing blocking my vote. If we think the issue currently
under discussion is non-blocking, I'm happy to give my +1 vote.

Here are the verifications commands I ran (and runtimes), for others'
reference:

# Verify binaries other than wheels (took 2h 24m)
TEST_DEFAULT=0 TEST_BINARIES=1 TEST_WHEELS=0
./dev/release/verify-release-candidate.sh 12.0.0 0

# Test main languages and integrations (took 49m)
TEST_DEFAULT=0 TEST_CPP=1 TEST_JAVA=1 TEST_GO=1 TEST_JS=1 TEST_CSHARP=1
TEST_INTEGRATION=1 \
./dev/release/verify-release-candidate.sh 12.0.0 0

# Test Ruby (took 24m)
TEST_DEFAULT=0 TEST_RUBY=1 \
./dev/release/verify-release-candidate.sh 12.0.0 0

And then if I added the line
  pip install 'pandas<2'
after verify-release-candidate.sh:726, I could successfully run:
# Test Python (took 23m)
TEST_DEFAULT=0 TEST_PYTHON=1 \
./dev/release/verify-release-candidate.sh 12.0.0 0


On Fri, Apr 28, 2023 at 8:59 AM Matthew Topol 
wrote:

> Okay, I confirmed that by running the following two commands, the test
> failures that Jacob found on Manjarolinux were solved (at least in the
> container I was using)
>
> $ pacman -S libxcrypt-compat
> $ ln -s /usr/share/zoneinfo/America/New_York /etc/localtime
>
> For the second command it looks like the /etc/localtime symbolic link
> wasn't being set in the container and is leveraged by the Orc adapter
> tests. So setting the localtime (to any valid zone info) was sufficient to
> let the tests run and pass.
>
> Hope this helps!
>
>
> On Fri, Apr 28, 2023 at 11:49 AM Matthew Topol 
> wrote:
>
> > Looks like this might be related:
> >
> https://unix.stackexchange.com/questions/691479/how-to-deal-with-missing-libcrypt-so-1-on-arch-linux
> > as manjaro also uses pacman and Arch Linux's packages.
> >
> > I'm re-running the verification right now after installing the
> recommended
> > package in that thread. I'll report back if it solves the issue.
> >
> > --Matt
> >
> > On Fri, Apr 28, 2023 at 11:29 AM Matthew Topol 
> > wrote:
> >
> >> @Kou: I was able to reproduce the libcrypto failure that Jacob saw using
> >> https://hub.docker.com/r/manjarolinux/base though I did need to
> manually
> >> install git first since it doesn't come with it.
> >>
> >> $ pacman -Syu git
> >> $ git clone https://github.com/apache/arrow.git
> >> $ cd arrow
> >> $ TEST_DEFAULT=0 TEST_SOURCE=0 TEST_CPP=1 USE_CONDA=1
> >> dev/release/verify-release-candidate.sh 12.0.0 0
> >>
> >> That set of commands was sufficient to reproduce the error, I believe (I
> >> did this on Monday when I was poking around the failures but I
> definitely
> >> managed to see the same error pop up in a run). I'm running it again
> right
> >> now to confirm.
> >>
> >> --Matt
> >>
> >> On Thu, Apr 27, 2023 at 8:28 PM Sutou Kouhei 
> wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks for sharing the log.
> >>>
> >>> libcrypto.so isn't related to the segmentation fault. It's
> >>> only involved in showing the backtrace.
> >>>
> >>> > perl: error while loading shared libraries: libcrypt.so.1:
> >>> > cannot open shared object file: No such file or directory
> >>>
> >>> This happens at
> >>>
> >>>
> https://github.com/apache/arrow/blob/main/cpp/build-support/run-test.sh#L42
> >>> :
> >>>
> >>>   TEST_NAME=$(echo $TEST_FILENAME | perl -pe 's/\..+?$//') # Remove path and extension (if any).
> >>>
> >>> BTW, it seems that we should remove a Perl dependency from
> >>>
> https://github.com/apache/arrow/blob/main/cpp/build-support/run-test.sh
> >>> ...
> >>>
> >>>
> >>> I want to reproduce this problem in my environment. Could
> >>> you share your environment information? Did you use Manjaro
> >>> Linux for this too?
> >>>
> >>>
> >>> Thanks,
> >>> --
> >>> kou
> >>>
> >>>
> >>> In  >
> >>>   "Re: [VOTE] Release Apache Arrow 12.0.0 - RC0" on Thu, 27 Apr 2023
> >>> 23:54:58 +0200,
> >>>   Jacob Wujciak  wrote:
> >>>
> >>> > I have uploaded the log [1] for the run using conda with gandiva
> >>> active. It
> >>> > looks like there is an issue with libcrypt.so causing these tests to
> >>> > segfault.

Re: [VOTE] Release Apache Arrow 12.0.0 - RC0

2023-04-27 Thread Will Jones
gt; > /usr/lib/cmake/llvm/LLVM-Config.cmake(159):  elseif(c STREQUAL
> > > nativecodegen )
> > > [snip]
> > > /usr/lib/cmake/llvm/LLVM-Config.cmake(212):  list(APPEND
> > > expanded_components debuginfodwarf )
> > > /usr/lib/cmake/llvm/LLVM-Config.cmake(130):  list(FIND
> > > LLVM_TARGETS_TO_BUILD X86 idx )
> > > /usr/lib/cmake/llvm/LLVM-Config.cmake(131):  if(NOT idx LESS 0 )
> > > /usr/lib/cmake/llvm/LLVM-Config.cmake(132):  if(TARGET LLVMX86CodeGen )
> > > /usr/lib/cmake/llvm/LLVM-Config.cmake(134):  else()
> > > /usr/lib/cmake/llvm/LLVM-Config.cmake(135):  if(TARGET LLVMX86 )
> > > /usr/lib/cmake/llvm/LLVM-Config.cmake(137):  else()
> > > */usr/lib/cmake/llvm/LLVM-Config.cmake(138):  message(FATAL_ERROR
> Target
> > > X86 is not in the set of libraries. )*
> > >
> > > On Tue, Apr 25, 2023 at 10:57 AM Raúl Cumplido  >
> > > wrote:
> > >
> > >> I have created the following issue for the new wheels test failure
> > >> around pandas 2.0.1 : https://github.com/apache/arrow/issues/35321
> > >>
> > >> I don't think we should create a new RC for that issue but I'm happy
> > >> to know other people's thoughts around that.
> > >>
> > >> El lun, 24 abr 2023 a las 21:12, Raúl Cumplido
> > >> () escribió:
> > >> >
> > >> > El lun, 24 abr 2023 a las 18:53, Will Jones
> > >> > () escribió:
> > >> > >
> > >> > > I'm seeing failing Pandas tests in PyArrow when verifying with
> > >> > >
> > >> > > USE_CONDA=1 dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > >
> > >> > >
> > >>
> pyarrow/tests/test_extension_type.py::test_extension_to_pandas_storage_type[registered_period_type0]
> > >> > > - NotImplementedError: extension>
> > >> >
> > >> > This is also happening on our nightlies from today:
> > >> >
> > >>
> https://github.com/ursacomputing/crossbow/actions/runs/4786502455/jobs/8510514881
> > >> >
> > >> > There has been a new pandas release: 2.0.1 around 9 hours ago which
> > >> > seems to be the causing issue:
> > >> > https://pypi.org/project/pandas/#history
> > >> >
> > >> > >
> > >> > > No one else is getting that?
> > >> > >
> > >> > >
> > >> > > On Sun, Apr 23, 2023 at 9:21 AM Raúl Cumplido <
> raulcumpl...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > +1 (non binding)
> > >> > > >
> > >> > > > I have tested both SOURCES and BINARIES successfully with:
> > >> > > > TEST_DEFAULT=0 TEST_SOURCE=1
> dev/release/verify-release-candidate.sh
> > >> > > > 12.0.0 0
> > >> > > > TEST_DEFAULT=0 TEST_WHEELS=1
> dev/release/verify-release-candidate.sh
> > >> > > > 12.0.0 0
> > >> > > > TEST_DEFAULT=0 TEST_BINARIES=1
> > >> dev/release/verify-release-candidate.sh
> > >> > > > 12.0.0 0
> > >> > > > with:
> > >> > > >   * Python 3.10.6
> > >> > > >   * gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
> > >> > > >   * NVIDIA CUDA cuda_11.5.r11.5/compiler.30672275_0
> > >> > > >   * openjdk version "17.0.6" 2023-01-17
> > >> > > >   * ruby 3.0.2p107 (2021-07-07 revision 0db68f0233)
> > >> [x86_64-linux-gnu]
> > >> > > >   * dotnet 7.0.203
> > >> > > >   * Ubuntu 22.04 LTS
> > >> > > >
> > >> > > > El dom, 23 abr 2023 a las 12:59, Yibo Cai ()
> > >> escribió:
> > >> > > > >
> > >> > > > > +1
> > >> > > > >
> > >> > > > > I ran the followings on Ubuntu-22.04, aarch64.
> > >> > > > >
> > >> > > > > TEST_DEFAULT=0 \
> > >> > > > >TEST_CPP=1 \
> > >> > > > >TEST_PYTHON=1 \
> > >> > > > >TEST_GO=1 \
> > >> > > > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > > > >
> > >> > > > > TEST_DEFAULT=0 \
> > >> > > > >TEST_WHEELS=1 \
> > >> > > > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > > > >
> > >> > > > >
> > >> > > > > On 4/23/23 14:40, Sutou Kouhei wrote:
> > >> > > > > > +1
> > >> > > > > >
> > >> > > > > > I ran the followings on Debian GNU/Linux sid:
> > >> > > > > >
> > >> > > > > >* TEST_DEFAULT=0 \
> > >> > > > > >TEST_SOURCE=1 \
> > >> > > > > >LANG=C \
> > >> > > > > >TZ=UTC \
> > >> > > > > >CUDAToolkit_ROOT=/usr \
> > >> > > > > >ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON
> > >> > > > -Dxsimd_SOURCE=BUNDLED" \
> > >> > > > > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > > > > >
> > >> > > > > >* TEST_DEFAULT=0 \
> > >> > > > > >TEST_APT=1 \
> > >> > > > > >LANG=C \
> > >> > > > > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > > > > >
> > >> > > > > >* TEST_DEFAULT=0 \
> > >> > > > > >TEST_BINARY=1 \
> > >> > > > > >LANG=C \
> > >> > > > > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > > > > >
> > >> > > > > >* TEST_DEFAULT=0 \
> > >> > > > > >TEST_JARS=1 \
> > >> > > > > >LANG=C \
> > >> > > > > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > > > > >
> > >> > > > > >* TEST_DEFAULT=0 \
> > >> > > > > >TEST_PYTHON_VERSIONS=3.11 \
> > >> > > > > >TEST_WHEELS=1 \
> > >> > > > > >LANG=C \
> > >> > > > > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > > > > >
> > >> > > > > >* TEST_DEFAULT=0 \
> > >> > > > > >TEST_YUM=1 \
> > >> > > > > >LANG=C \
> > >> > > > > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >> > > > > >
> > >> > > > > > with:
> > >> > > > > >
> > >> > > > > >* .NET SDK (6.0.406)
> > >> > > > > >* Python 3.11.2
> > >> > > > > >* gcc (Debian 12.2.0-14) 12.2.0
> > >> > > > > >* nvidia-cuda-dev 11.7.99~11.7.1-4
> > >> > > > > >* openjdk version "17.0.6" 2023-01-17
> > >> > > > > >* ruby 3.1.2p20 (2022-04-12 revision 4491bb740a)
> > >> [x86_64-linux-gnu]
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Thanks,
> > >> > > >
> > >>
>


Re: [VOTE] Formalize how to change format

2023-04-26 Thread Will Jones
+1. Thanks Kou.

On Wed, Apr 26, 2023 at 10:27 AM Matt Topol  wrote:

> +1 (Non-binding)
>
> On Wed, Apr 26, 2023 at 5:16 AM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > +1
> >
> > On Wed, 26 Apr 2023 at 04:18, Sutou Kouhei  wrote:
> > >
> > > Hi,
> > >
> > > I've added one more note about documentation:
> > >
> > >   We must update the corresponding documentation (files in
> > >   ``_)
> > >   too.
> > >
> > > https://github.com/apache/arrow/pull/35174#issuecomment-1522572677
> > >
> > > See also the preview URL:
> > > http://crossbow.voltrondata.com/pr_docs/35174/format/Changing.html
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In <20230424.103259.664806138128874521@clear-code.com>
> > >   "[VOTE] Formalize how to change format" on Mon, 24 Apr 2023 10:32:59
> > +0900 (JST),
> > >   Sutou Kouhei  wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to formalize how to change format process.
> > > >
> > > > See the following pull request and discussion for details:
> > > >
> > > > * GH-35084: [Docs][Format] Add how to change format specification
> > > >   https://github.com/apache/arrow/pull/35174
> > > >
> > > >   Preview:
> > http://crossbow.voltrondata.com/pr_docs/35174/format/Changing.html
> > > >
> > > > * [DISCUSS] Formalize how to change format
> > > >   https://lists.apache.org/thread/cox8wz8y458n4dsko0rx6z5w9nqvcld3
> > > >
> > > > This is based on the following discussion:
> > > >
> > > >   [DISCUSS] Format changes: process and requirements
> > > >   https://lists.apache.org/thread/9t0pglrvxjhrt4r4xcsc1zmgmbtr8pxj
> > > >
> > > > Summary:
> > > >
> > > > * The format means files in
> > https://github.com/apache/arrow/tree/main/format
> > > > * We need to discuss and vote to change the format
> > > > * We need at least two reference implementations and
> > > >   integration tests to change the format
> > > > * We can choose at least two implementations from the
> > > >   followings:
> > > >   * The C++ implementation
> > > >   * The Java implementation
> > > >   * The Rust (arrow-rs) implementation
> > > >   * The Go implementation
> > > > * NOTE: The C++ and Java implementations were candidates in
> > > >   the initial discussion:
> > > > [DISCUSS] Format changes: process and requirements
> > > > https://lists.apache.org/thread/9t0pglrvxjhrt4r4xcsc1zmgmbtr8pxj
> > > > * We can add a new implementation to the above "at least two
> > > >   implementations" candidate list by "discuss and vote".
> > > >
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 Accept this proposal
> > > > [ ] +0
> > > > [ ] -1 Do not accept this proposal because...
> > > >
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> >
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Will Jones
I suppose one common use case is materializing list columns after some
expanding operation like a join or unnest. That's a case where I could
imagine a lot of repetition of values. Haven't yet thought of common cases
where there is overlap but not full duplication, but am eager to hear any.

The dictionary encoding point Raphael makes is interesting, especially
given the existence of LargeList and FixedSizeList. For many operations, it
might make more sense to just compose those existing types.

IIUC the operations that would be unique to the ArrayView are ones altering
the shape. One could truncate each array to a certain length cheaply, simply
by replacing the sizes buffer. Or perhaps there are interesting operations
on tensors that would benefit.
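
To make the cheap-truncation point concrete, here is a small Python model of
the proposed layout, with plain lists standing in for the buffers (no Arrow
APIs involved):

    # Logical value: [[1, 2, 3], None, [7]] over a shared values child.
    values   = [1, 2, 3, 4, 5, 6, 7]
    validity = [1, 0, 1]
    offsets  = [0, 0, 6]  # need not be ordered, and views may overlap
    sizes    = [3, 0, 1]

    def element(i):
        if not validity[i]:
            return None
        return values[offsets[i] : offsets[i] + sizes[i]]

    # Truncate every list to at most 2 items by rewriting sizes only:
    # O(number of lists), and the values buffer is untouched.
    sizes = [min(s, 2) for s in sizes]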

On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
 wrote:

> Unless I am missing something, I think the selection use-case could be
> equally well served by a dictionary-encoded BinaryArray/ListArray, and would
> have the benefit of not requiring any modifications to the existing format
> or kernels.
>
> The major additional flexibility of the proposed encoding would be
> permitting disjoint or overlapping ranges, are these common enough in
> practice to represent a meaningful bottleneck?
>
>
> On 26 April 2023 01:40:14 BST, David Li  wrote:
> >Is there a need for a 64-bit offsets version the same way we have List
> and LargeList?
> >
> >And just to be clear, the difference with List is that the lists don't
> have to be stored in their logical order (or in other words, offsets do not
> have to be nondecreasing and so we also need sizes)?
> >
> >On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> >> For context, there was some discussion on this back in [1].  At that
> time
> >> this was called "sequence view" but I do not like that name.  However,
> >> array-view array is a little confusing.  Given this is similar to list
> can
> >> we go with list-view array?
> >>
> >>> Thanks for the introduction. I'd be interested to hear about the
> >>> applications Velox has found for these vectors, and in what situations
> >> they
> >>> are useful. This could be contrasted with the current ListArray
> >>> implementations.
> >>
> >> I believe one significant benefit is that take (and by proxy, filter)
> and
> >> sort are O(# of items) with the proposed format and O(# of bytes) with
> the
> >> current format.  Jorge did some profiling to this effect in [1].
> >>
> >> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> >>
> >> On Tue, Apr 25, 2023 at 3:13 PM Will Jones 
> wrote:
> >>
> >>> Hi Felipe,
> >>>
> >>> Thanks for the introduction. I'd be interested to hear about the
> >>> applications Velox has found for these vectors, and in what situations
> they
> >>> are useful. This could be contrasted with the current ListArray
> >>> implementations.
> >>>
> >>> IIUC it would be fairly cheap to transform a ListArray to an
> ArrayView, but
> >>> expensive to go the other way.
> >>>
> >>> Best,
> >>>
> >>> Will Jones
> >>>
> >>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
> >>> felipe...@gmail.com> wrote:
> >>>
> >>> > Hi folks,
> >>> >
> >>> > I would like to start a public discussion on the inclusion of a new
> array
> >>> > format to Arrow — array-view array. The name is also up for debate.
> >>> >
> >>> > This format is inspired by Velox's ArrayVector format [1]. Logically,
> >>> this
> >>> > array represents an array of arrays. Each element is an array-view
> >>> (offset
> >>> > and size pair) that points to a range within a nested "values" array
> >>> > (called "elements" in Velox docs). The nested array can be of any
> type,
> >>> > which makes this format very flexible and powerful.
> >>> >
> >>> > [image: ../_images/array-vector.png]
> >>> > <https://facebookincubator.github.io/velox/_images/array-vector.png>
> >>> >
> >>> > I'm currently working on a C++ implementation and plan to work on a
> Go
> >>> > implementation to fulfill the two-implementations requirement for
> format
> >>> > changes.
> >>> >
> >>> > The draft design:
> >>> >
> >>> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes
> buffer]
> >>> > - 1 child array: "values" as an array of the type parameter
> >>> >
> >>> > validity_bitmap is used to differentiate between empty array views
> >>> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
> >>> >
> >>> > When the validity_bitmap[i] is 0, both sizes and offsets are
> undefined
> >>> (as
> >>> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is
> recommended
> >>> > if setting a value is not an issue to the system producing the
> arrays.
> >>> >
> >>> > offsets buffer is not required to be ordered and views don't have to
> be
> >>> > disjoint.
> >>> >
> >>> > [1]
> >>> >
> >>>
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> >>> >
> >>> > Thanks,
> >>> > Felipe O. Carvalho
> >>> >
> >>>
>


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Will Jones
Hi Felipe,

Thanks for the introduction. I'd be interested to hear about the
applications Velox has found for these vectors, and in what situations they
are useful. This could be contrasted with the current ListArray
implementations.

IIUC it would be fairly cheap to transform a ListArray to an ArrayView, but
expensive to go the other way.

Best,

Will Jones

On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> Hi folks,
>
> I would like to start a public discussion on the inclusion of a new array
> format to Arrow — array-view array. The name is also up for debate.
>
> This format is inspired by Velox's ArrayVector format [1]. Logically, this
> array represents an array of arrays. Each element is an array-view (offset
> and size pair) that points to a range within a nested "values" array
> (called "elements" in Velox docs). The nested array can be of any type,
> which makes this format very flexible and powerful.
>
> [image: ../_images/array-vector.png]
> <https://facebookincubator.github.io/velox/_images/array-vector.png>
>
> I'm currently working on a C++ implementation and plan to work on a Go
> implementation to fulfill the two-implementations requirement for format
> changes.
>
> The draft design:
>
> - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
> - 1 child array: "values" as an array of the type parameter
>
> validity_bitmap is used to differentiate between empty array views
> (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
>
> When the validity_bitmap[i] is 0, both sizes and offsets are undefined (as
> usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
> if setting a value is not an issue for the system producing the arrays.
>
> offsets buffer is not required to be ordered and views don't have to be
> disjoint.
>
> [1]
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>
> Thanks,
> Felipe O. Carvalho
>


Re: [VOTE] Release Apache Arrow 12.0.0 - RC0

2023-04-24 Thread Will Jones
I'm seeing failing Pandas tests in PyArrow when verifying with

USE_CONDA=1 dev/release/verify-release-candidate.sh 12.0.0 0

pyarrow/tests/test_extension_type.py::test_extension_to_pandas_storage_type[registered_period_type0]
- NotImplementedError: extension>

No one else is getting that?


On Sun, Apr 23, 2023 at 9:21 AM Raúl Cumplido 
wrote:

> +1 (non binding)
>
> I have tested both SOURCES and BINARIES successfully with:
> TEST_DEFAULT=0 TEST_SOURCE=1 dev/release/verify-release-candidate.sh
> 12.0.0 0
> TEST_DEFAULT=0 TEST_WHEELS=1 dev/release/verify-release-candidate.sh
> 12.0.0 0
> TEST_DEFAULT=0 TEST_BINARIES=1 dev/release/verify-release-candidate.sh
> 12.0.0 0
> with:
>   * Python 3.10.6
>   * gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
>   * NVIDIA CUDA cuda_11.5.r11.5/compiler.30672275_0
>   * openjdk version "17.0.6" 2023-01-17
>   * ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux-gnu]
>   * dotnet 7.0.203
>   * Ubuntu 22.04 LTS
>
> El dom, 23 abr 2023 a las 12:59, Yibo Cai () escribió:
> >
> > +1
> >
> > I ran the followings on Ubuntu-22.04, aarch64.
> >
> > TEST_DEFAULT=0 \
> >TEST_CPP=1 \
> >TEST_PYTHON=1 \
> >TEST_GO=1 \
> >dev/release/verify-release-candidate.sh 12.0.0 0
> >
> > TEST_DEFAULT=0 \
> >TEST_WHEELS=1 \
> >dev/release/verify-release-candidate.sh 12.0.0 0
> >
> >
> > On 4/23/23 14:40, Sutou Kouhei wrote:
> > > +1
> > >
> > > I ran the followings on Debian GNU/Linux sid:
> > >
> > >* TEST_DEFAULT=0 \
> > >TEST_SOURCE=1 \
> > >LANG=C \
> > >TZ=UTC \
> > >CUDAToolkit_ROOT=/usr \
> > >ARROW_CMAKE_OPTIONS="-DBoost_NO_BOOST_CMAKE=ON
> -Dxsimd_SOURCE=BUNDLED" \
> > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >
> > >* TEST_DEFAULT=0 \
> > >TEST_APT=1 \
> > >LANG=C \
> > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >
> > >* TEST_DEFAULT=0 \
> > >TEST_BINARY=1 \
> > >LANG=C \
> > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >
> > >* TEST_DEFAULT=0 \
> > >TEST_JARS=1 \
> > >LANG=C \
> > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >
> > >* TEST_DEFAULT=0 \
> > >TEST_PYTHON_VERSIONS=3.11 \
> > >TEST_WHEELS=1 \
> > >LANG=C \
> > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >
> > >* TEST_DEFAULT=0 \
> > >TEST_YUM=1 \
> > >LANG=C \
> > >dev/release/verify-release-candidate.sh 12.0.0 0
> > >
> > > with:
> > >
> > >* .NET SDK (6.0.406)
> > >* Python 3.11.2
> > >* gcc (Debian 12.2.0-14) 12.2.0
> > >* nvidia-cuda-dev 11.7.99~11.7.1-4
> > >* openjdk version "17.0.6" 2023-01-17
> > >* ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
> > >
> > >
> > > Thanks,
>


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 23.0.0 RC1

2023-04-21 Thread Will Jones
+1 (binding)

Verified on Ubuntu 22.04

On Fri, Apr 21, 2023 at 4:27 PM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on Intel Mac.
>
> Thanks Andy.
>
> On Fri, Apr 21, 2023 at 1:40 PM Andy Grove  wrote:
> >
> > Hi,
> >
> > I would like to propose a release of Apache Arrow DataFusion
> Implementation,
> > version 23.0.0.
> >
> > This release candidate is based on commit:
> > caa60337c7a57572d93d8bd3cbc18006aabe55e6 [1]
> > The proposed release tarball and signatures are hosted at [2].
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The vote will be open for at least 72 hours.
> >
> > Only votes from PMC members are binding, but all members of the community
> > are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > The standard verification procedure is documented at
> >
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > .
> >
> > [ ] +1 Release this as Apache Arrow DataFusion 23.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow DataFusion 23.0.0 because...
> >
> > Here is my vote:
> >
> > +1
> >
> > [1]:
> >
> https://github.com/apache/arrow-datafusion/tree/caa60337c7a57572d93d8bd3cbc18006aabe55e6
> > [2]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-23.0.0-rc1
> > [3]:
> >
> https://github.com/apache/arrow-datafusion/blob/caa60337c7a57572d93d8bd3cbc18006aabe55e6/CHANGELOG.md
>


Re: [VOTE][RUST] Release Apache Arrow Rust 38.0.0 RC1

2023-04-21 Thread Will Jones
+1 (binding)

Verified on Ubuntu 22.04.

On Fri, Apr 21, 2023 at 12:10 PM Andrew Lamb  wrote:

> +1 (binding)
>
> Verified on x86 mac
>
> Thank you Raphael
>
> On Fri, Apr 21, 2023 at 2:00 PM L. C. Hsieh  wrote:
>
> > +1 (binding)
> >
> > Verified on M1 Mac.
> >
> > Thanks Raphael.
> >
> > On Fri, Apr 21, 2023 at 7:47 AM Raphael Taylor-Davies
> >  wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a release of Apache Arrow Rust Implementation,
> > > version 38.0.0.
> > >
> > > This release candidate is based on commit:
> > > bbd57c615213bc6e80fb0192674942f688e5f6a8 [1]
> > >
> > > The proposed release tarball and signatures are hosted at [2].
> > >
> > > The changelog is located at [3].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > > and vote on the release. There is a script [4] that automates some of
> > > the verification.
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow Rust
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow Rust  because...
> > >
> > > [1]:
> > >
> >
> https://github.com/apache/arrow-rs/tree/bbd57c615213bc6e80fb0192674942f688e5f6a8
> > > [2]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-38.0.0-rc1
> > > [3]:
> > >
> >
> https://github.com/apache/arrow-rs/blob/bbd57c615213bc6e80fb0192674942f688e5f6a8/CHANGELOG.md
> > > [4]:
> > >
> >
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > >
> >
>


Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.5.6 RC2

2023-03-31 Thread Will Jones
+1
Verified on M1 MacOS.


On Fri, Mar 31, 2023 at 4:29 AM Raphael Taylor-Davies
 wrote:

> Hi,
>
> I would like to propose a release of Apache Arrow Rust Object
> Store Implementation, version 0.5.6.
>
> This release candidate is based on commit:
> 234b7847ecb737e96df3f4623df7b330b34b3d1b [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. There is a script [4] that automates some of
> the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust Object Store
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
>
> [1]:
>
> https://github.com/apache/arrow-rs/tree/234b7847ecb737e96df3f4623df7b330b34b3d1b
> [2]:
>
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.5.6-rc2
> [3]:
>
> https://github.com/apache/arrow-rs/blob/234b7847ecb737e96df3f4623df7b330b34b3d1b/object_store/CHANGELOG.md
> [4]:
>
> https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
>
>


Re: Proposal: add a bot to close PRs that haven't been updated in 30 days

2023-03-31 Thread Will Jones
> Also good to know: contributors apparently can't re-open PRs if it was
> closed by someone else, so we have to be careful with messages like
> "feel free to reopen".

Thanks for bringing this up, Joris. That does make closing via bot much
less appealing to me.

I like your idea of (1) having the bot provide a friendly message asking
the contributor whether they plan to continue their work (and maybe provide
suggestions on how to get reviewer attention if needed) and (2) if there is
no response to that message after 30 days, we can then close the PR.
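
For illustration, a rough Python sketch of that two-stage policy against the
GitHub REST API. The token, label name, and message are placeholders, and a
real bot would also need pagination and rate-limit handling:

    import datetime as dt
    import requests

    API = "https://api.github.com/repos/apache/arrow"
    HEADERS = {"Authorization": "token <GITHUB_TOKEN>"}  # placeholder
    REMINDER = "stale-reminder"  # hypothetical label marking stage 1

    def idle_days(pr):
        updated = dt.datetime.strptime(pr["updated_at"], "%Y-%m-%dT%H:%M:%SZ")
        return (dt.datetime.utcnow() - updated).days

    prs = requests.get(f"{API}/pulls", params={"state": "open"},
                       headers=HEADERS).json()
    for pr in prs:
        n = pr["number"]
        labels = {label["name"] for label in pr["labels"]}
        if idle_days(pr) < 30:
            continue
        if REMINDER not in labels:
            # Stage 1: friendly reminder; commenting bumps updated_at,
            # which restarts the 30-day clock.
            requests.post(f"{API}/issues/{n}/comments", headers=HEADERS,
                          json={"body": "Is this PR still active? ..."})
            requests.post(f"{API}/issues/{n}/labels", headers=HEADERS,
                          json={"labels": [REMINDER]})
        else:
            # Stage 2: still idle 30+ days after the reminder; close.
            requests.patch(f"{API}/pulls/{n}", headers=HEADERS,
                           json={"state": "closed"})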



On Fri, Mar 31, 2023 at 3:57 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> I am personally not a huge fan of auto-closing PRs. Especially not
> after a short period like 30 days (I think that's too short for an
> open source project), and we have to be careful with messaging. Very
> often such a PR is "stale" because it is waiting for reviews. I know
> we have the labels now that could indicate this, but those are not
> (yet) bulletproof (for example, if I quickly reply to one comment it
> will already be marked as "awaiting changes", while in fact it might
> still be waiting on actual review). I think in general it is difficult
> to know the exact reason why something is stale, a good reason to be
> careful with automated actions that can be perceived as unfriendly.
>
> Personally, I think commenting on a PR instead of closing it might be
> a good alternative, if we craft a good and helpful message. That can
> act as a useful reminder, both towards the author and the maintainer, and
> can also *ask* to close if they are not planning to further work on it
> (and for example, we could still auto-close PRs if nothing happened
> (no push, no comment, ..) on such a PR after an additional period of
> time).
>
> Also good to know: contributors apparently can't re-open PRs if it was
> closed by someone else, so we have to be careful with messages like
> "feel free to reopen".
>
> On Thu, 30 Mar 2023 at 23:11, Will Jones  wrote:
> >
> > I'm +0 on the reviewer bot pings. Closing PRs where the author hasn't
> > updated in 30 days is something a maintainer would have to do anyways, so
> > it seems like a useful automation. And there's only one author, so it's
> > guaranteed to ping the right person. Things are not so clean with
> reviewers.
> >
> > With the labels and codeowners file [1] I think we have supplied
> sufficient
> > tools so that each subproject in the monorepo can manage their review
> > process in their own way. For example, I have a bookmark that takes me
> to a
> > filtered view of PRs that only shows me the C++ Parquet ones that are
> ready
> > for review [2]. I'd encourage each reviewer to have a similar view of the
> > project that they regularly check.
> >
> > [1] https://github.com/apache/arrow/blob/main/.github/CODEOWNERS
> > [2]
> >
> https://github.com/apache/arrow/pulls?q=is%3Aopen+is%3Apr+label%3A%22Component%3A+C%2B%2B%22+label%3A%22Component%3A+Parquet%22+draft%3Afalse
> >
> > On Thu, Mar 30, 2023 at 1:37 PM Anja  wrote:
> >
> > > Using those labels is a clever idea!
> > >
> > > Would there be a benefit to pinging reviewers for PRs that have been
> > > "awaiting X review" for more than 30 days?
> > >
> > > On Thu, 30 Mar 2023 at 12:31, Will Jones 
> wrote:
> > >
> > > > Thanks Raul. Perhaps we could limit the stale bot to PRs that have
> been
> > > in
> > > > "awaiting changes" for 30 or more days?
> > > >
> > > > On Thu, Mar 30, 2023 at 11:36 AM Raúl Cumplido <
> raulcumpl...@gmail.com>
> > > > wrote:
> > > >
> > > > > I suppose we could use the new labels for "awaiting review",
> "awaiting
> > > > > committer review", "awaiting changes" and "awaiting change review"
> to
> > > > know
> > > > > whether is stale due to the contributor or the reviewer.
> > > > >
> > > > > El jue, 30 mar 2023, 20:08, Will Jones 
> > > > escribió:
> > > > >
> > > > > > First, to clarify: we are discussing for the monorepo only, not
> for
> > > > Rust
> > > > > /
> > > > > > Julia / etc.? This is a big project, so best to be specific which
> > > > > > subprojects you are addressing.
> > > > > >
> > > > > > I am +0.5 on this. 30 days seems like an appropriate window for this
> > > > > > project. If the PR was stale because the contributor had not updated
> > > > > > it, it seems appropriate. But sometimes it's because it hasn't had an
> > > > > > update from reviewers for a while, and in that situation it doesn't
> > > > > > seem as ideal.

Re: Proposal: add a bot to close PRs that haven't been updated in 30 days

2023-03-30 Thread Will Jones
I'm +0 on the reviewer bot pings. Closing PRs where the author hasn't
updated in 30 days is something a maintainer would have to do anyways, so
it seems like a useful automation. And there's only one author, so it's
guaranteed to ping the right person. Things are not so clean with reviewers.

With the labels and codeowners file [1] I think we have supplied sufficient
tools so that each subproject in the monorepo can manage their review
process in their own way. For example, I have a bookmark that takes me to a
filtered view of PRs that only shows me the C++ Parquet ones that are ready
for review [2]. I'd encourage each reviewer to have a similar view of the
project that they regularly check.

[1] https://github.com/apache/arrow/blob/main/.github/CODEOWNERS
[2]
https://github.com/apache/arrow/pulls?q=is%3Aopen+is%3Apr+label%3A%22Component%3A+C%2B%2B%22+label%3A%22Component%3A+Parquet%22+draft%3Afalse

On Thu, Mar 30, 2023 at 1:37 PM Anja  wrote:

> Using those labels is a clever idea!
>
> Would there be a benefit to pinging reviewers for PRs that have been
> "awaiting X review" for more than 30 days?
>
> On Thu, 30 Mar 2023 at 12:31, Will Jones  wrote:
>
> > Thanks Raul. Perhaps we could limit the stale bot to PRs that have been
> in
> > "awaiting changes" for 30 or more days?
> >
> > On Thu, Mar 30, 2023 at 11:36 AM Raúl Cumplido 
> > wrote:
> >
> > > I suppose we could use the new labels for "awaiting review", "awaiting
> > > committer review", "awaiting changes" and "awaiting change review" to
> > know
> > > whether is stale due to the contributor or the reviewer.
> > >
> > > El jue, 30 mar 2023, 20:08, Will Jones 
> > escribió:
> > >
> > > > First, to clarify: we are discussing for the monorepo only, not for
> > Rust
> > > /
> > > > Julia / etc.? This is a big project, so best to be specific which
> > > > subprojects you are addressing.
> > > >
> > > > I am +0.5 on this. 30 days seems like an appropriate window for this
> > > > project. If the PR was stale because the contributor had not updated
> > it,
> > > it
> > > > seems appropriate. But sometimes it's because it hasn't had an update
> > > from
> > > > reviewers for a while, and in that situation it doesn't seem as
> ideal.
> > > >
> > > >
> > > >
> > > > On Thu, Mar 30, 2023 at 11:01 AM Anja  wrote:
> > > >
> > > > > Also, perhaps it can be two bots in an escalated process. A
> "reminder
> > > > ping"
> > > > > bot every X days, and then a stalebot every X+Y days.
> > > > >
> > > > > On Thu, 30 Mar 2023 at 10:54, Anja  wrote:
> > > > >
> > > > > > When checked this morning, there were 119 PRs that haven't been
> > > updated
> > > > > in
> > > > > > 30 days. The oldest was nearly 3 years old.
> > > > > >
> > > > > > I propose the addition of a bot that will automatically close any
> > PRs
> > > > > that
> > > > > > haven't been updated in 30 days. The closing will act as a
> > > notification
> > > > > to
> > > > > > the reviewers and submitter to evaluate if the work still has
> > value,
> > > > and
> > > > > > just outright close work that is too out-dated for a
> > straightforward
> > > > > merge.
> > > > > >
> > > > > > If the behaviour is done by a bot, it could reduce maintenance
> > > burden,
> > > > > and
> > > > > > simplify the emotional response. A bot can link to a policy, and
> it
> > > > feels
> > > > > > neutral in its consistent tone.
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Proposal: add a bot to close PRs that haven't been updated in 30 days

2023-03-30 Thread Will Jones
Thanks Raul. Perhaps we could limit the stale bot to PRs that have been in
"awaiting changes" for 30 or more days?

On Thu, Mar 30, 2023 at 11:36 AM Raúl Cumplido 
wrote:

> I suppose we could use the new labels for "awaiting review", "awaiting
> committer review", "awaiting changes" and "awaiting change review" to know
> whether is stale due to the contributor or the reviewer.
>
> El jue, 30 mar 2023, 20:08, Will Jones  escribió:
>
> > First, to clarify: we are discussing for the monorepo only, not for Rust
> /
> > Julia / etc.? This is a big project, so best to be specific which
> > subprojects you are addressing.
> >
> > I am +0.5 on this. 30 days seems like an appropriate window for this
> > project. If the PR was stale because the contributor had not updated it,
> it
> > seems appropriate. But sometimes it's because it hasn't had an update
> from
> > reviewers for a while, and in that situation it doesn't seem as ideal.
> >
> >
> >
> > On Thu, Mar 30, 2023 at 11:01 AM Anja  wrote:
> >
> > > Also, perhaps it can be two bots in an escalated process. A "reminder
> > ping"
> > > bot every X days, and then a stalebot every X+Y days.
> > >
> > > On Thu, 30 Mar 2023 at 10:54, Anja  wrote:
> > >
> > > > When checked this morning, there were 119 PRs that haven't been
> updated
> > > in
> > > > 30 days. The oldest was nearly 3 years old.
> > > >
> > > > I propose the addition of a bot that will automatically close any PRs
> > > that
> > > > haven't been updated in 30 days. The closing will act as a
> notification
> > > to
> > > > the reviewers and submitter to evaluate if the work still has value,
> > and
> > > > just outright close work that is too out-dated for a straightforward
> > > merge.
> > > >
> > > > If the behaviour is done by a bot, it could reduce maintenance
> burden,
> > > and
> > > > simplify the emotional response. A bot can link to a policy, and it
> > feels
> > > > neutral in its consistent tone.
> > > >
> > > >
> > > >
> > >
> >
>


Re: OpenTelemetry + Arrow

2023-03-30 Thread Will Jones
Hi Laurent,

I have read the first post and I really like it. I'd be +1 on publishing
these to the blog. I'm interested to read the second one when it's finished.

IMO the blog could use more examples of using Arrow that aren't about building
a data frame library / query engine, and I appreciate that this blog provides
advice for some of the trickier parts of working with complex Arrow
schemas. I think this will also provide a good concrete use case for us to
think about improving the ecosystem's support for nested data.

Best,

Will Jones

On Thu, Mar 30, 2023 at 10:56 AM Laurent Quérel 
wrote:

> Hello everyone,
>
> I was wondering if the Apache Arrow community would be interested in
> featuring a two-part article series on their blog, discussing the
> experiences and insights gained from an experimental version of the
> OpenTelemetry protocol (OTLP) utilizing Apache Arrow. As the main author of
> the OTLP Arrow specification
> <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> >,
> the reference implementation otlp-arrow-adapter
> <https://github.com/f5/otel-arrow-adapter>, and the two articles (see
> links
> below), I believe that fostering collaboration between open-source projects
> like these is essential and mutually beneficial.
>
> These articles would serve as a fitting complement to the three
> introductory articles that Andrew Lamb and Raphael Taylor-Davies
> co-authored. They delve into the practical aspects of integrating Apache
> Arrow into an existing project, as well as the process of converting a
> hierarchical data model into its Arrow representation. The first article
> examines various mapping techniques for aligning an existing data model
> with the corresponding Arrow representation, while the second article
> explores an adaptive schema technique that I implemented in the library's
> final version in greater depth. Although the second article is still under
> development, the core framework description is already in place.
>
> What are your thoughts on this proposal?
>
> Article 1:
>
> https://docs.google.com/document/d/11lG7Go2IgKOyW-RReBRW6r7HIdV1X7lu5WrDGlW5LbQ/edit?usp=sharing
>
> Article 2 (WIP):
>
> https://docs.google.com/document/d/1K2CqAtF4pZjpiVts8BOcq34sOcNgozvZ9ZZw-_zTv6I/edit?usp=sharing
>
>
> Best regards,
>
> Laurent Quérel
>
> --
> Laurent Quérel
>


Re: Proposal: add a bot to close PRs that haven't been updated in 30 days

2023-03-30 Thread Will Jones
First, to clarify: we are discussing for the monorepo only, not for Rust /
Julia / etc.? This is a big project, so best to be specific which
subprojects you are addressing.

I am +0.5 on this. 30 days seems like an appropriate window for this
project. If the PR was stale because the contributor had not updated it, it
seems appropriate. But sometimes it's because it hasn't had an update from
reviewers for a while, and in that situation it doesn't seem as ideal.



On Thu, Mar 30, 2023 at 11:01 AM Anja  wrote:

> Also, perhaps it can be two bots in an escalated process. A "reminder ping"
> bot every X days, and then a stalebot every X+Y days.
>
> On Thu, 30 Mar 2023 at 10:54, Anja  wrote:
>
> > When checked this morning, there were 119 PRs that haven't been updated
> in
> > 30 days. The oldest was nearly 3 years old.
> >
> > I propose the addition of a bot that will automatically close any PRs
> that
> > haven't been updated in 30 days. The closing will act as a notification
> to
> > the reviewers and submitter to evaluate if the work still has value, and
> > just outright close work that is too out-dated for a straightforward
> merge.
> >
> > If the behaviour is done by a bot, it could reduce maintenance burden,
> and
> > simplify the emotional response. A bot can link to a policy, and it feels
> > neutral in its consistent tone.
> >
> >
> >
>


Re: [RUST] Somewhat regular sync today

2023-03-30 Thread Will Jones
Hi Andrew,

This is great information. When all three are recorded, perhaps these could
be shared in a blog post on the Arrow website?

On Thu, Mar 30, 2023 at 10:20 AM Andrew Lamb  wrote:

> Here are the recording and slides from today:
> Recording: [1]
> Slides: [2]
>
> I plan to present (and record)  Part 2 and Part 3 next week April 4 and
> April 5 at the same 15:00 UTC slot. Details are in the sync up document [3]
>
> [1]: https://youtu.be/NVKujPxwSBA
> [2]:
>
> https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p
> [3]:
>
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit
>
> On Thu, Mar 30, 2023 at 8:35 AM Andrew Lamb  wrote:
>
> > In case anyone is interested, I plan to present (and record) the first in
> > a 3 part series of "DataFusion architecture" presentations today Thursday
> > 2023-03-30 at 15:00 UTC. I will post a link to the recording and slides
> > afterwards
> >
> > Part 1 will basically be a "what is a query engine and why might you need
> > one", and then Part 2 will cover logical planning / exprs and Part 3 will
> > cover physical planing.
> >
> > More details are on [1]
> >
> > Andrew
> >
> > [1]
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit
> >
>


Re: Plasma will be removed in Arrow 12.0.0

2023-03-29 Thread Will Jones
Thanks for the feedback on the benchmark. By switching from a Unix domain
socket to TCP and reducing the batch size to under 5 MB, I was able to get
nearly 5 Gbps throughput. I think Unix domain sockets are just slower on
Macs. Updated that repo [1]

[1] https://github.com/wjones127/arrow-ipc-bench/tree/main

On Fri, Mar 17, 2023 at 9:16 AM Antoine Pitrou  wrote:

>
> Le 17/03/2023 à 16:34, Alessandro Molina a écrit :
> > How does PyArrow cope with multiprocessing.Manager?
>
> I'm not sure anyone tried it. Also, I don't think
> multiprocessing.Manager was updated to use pickle v5 out-of-band buffers
> (which would help reduce copying), so I wouldn't expect very high
> performance.
>
> Generally, I don't think multiprocessing.Manager is very much used these
> days. It also doesn't receive a lot of maintenance.
>
>


Re: zero-copy Take?

2023-03-28 Thread Will Jones
Hi John,

Arrays have a `Slice()` method that allows getting zero-copy slices of
the array given an offset and a length. If you had a set of ranges it
wouldn't be too hard to write a function that creates a new chunked array
made up of these slices.

Of course, there are likely cases where the overhead of creating lots of
slices costs more than materializing a whole new array. You wouldn't want
to get back a chunked array of length 1000 that is made up of 500 sliced
arrays. This is even more true if you are taking indices. I think that's
why no one has implemented such a function; it's complicated to detect when
making a zero-copy slice is better than creating a new array, so we always
just create a new array for take. But if you have a particular use case
where you know it makes sense, then I would go ahead and write a function
for that specific case.

Best,

Will

On Tue, Mar 28, 2023 at 10:14 AM John Muehlhausen  wrote:

> Is there a way to pass a RecordBatch (or a batch wrapped as a Table) to
> Take and get back a Table composed of in-place (zero copy) slices of the
> input?  I suppose this is not too hard to code, just wondered if there is
> already a utility.
>
Result<Datum> Take(const Datum& values, const Datum& indices,
>  const TakeOptions& options = TakeOptions::Defaults(),
>  ExecContext* ctx = NULLPTR);
>


Re: Arrow/C++14

2023-03-24 Thread Will Jones
Yes, we make sure the data is compatible over time, or at least detect that
data has new features. Our format versioning is explained here [1]. You can
see various additions here [2]. The only one that's newer than 9.0.0 is the
Run-end encoded arrays, but those aren't in widespread use yet.

[1] https://arrow.apache.org/docs/format/Versioning.html
[2] https://github.com/apache/arrow/blob/main/format/Schema.fbs

On Fri, Mar 24, 2023 at 12:57 PM Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) <
avertl...@bloomberg.net> wrote:

> Thanks Will.
>
> Is that version still data-format compatible with today?
>
>
> From: dev@arrow.apache.org At: 03/24/23 15:39:12 UTC-4:00To:
> dev@arrow.apache.org
> Subject: Re: Arrow/C++14
>
> That should be Apache Arrow 9.0.0. It was in Arrow 10 that we mandated C++
> 17. [1][2]
>
> [1] https://arrow.apache.org/release/10.0.0.html
> [2] https://issues.apache.org/jira/browse/ARROW-17545
>
> On Fri, Mar 24, 2023 at 12:27 PM Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) <
> avertl...@bloomberg.net> wrote:
>
> > Hi,
> >
> > What is the latest Arrow version that supports C++14 standard?
> >
> > Thank you,
> > Arkadiy
>
>
>


Re: Arrow/C++14

2023-03-24 Thread Will Jones
That should be Apache Arrow 9.0.0. It was in Arrow 10 that we mandated C++
17. [1][2]

[1] https://arrow.apache.org/release/10.0.0.html
[2] https://issues.apache.org/jira/browse/ARROW-17545

On Fri, Mar 24, 2023 at 12:27 PM Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) <
avertl...@bloomberg.net> wrote:

> Hi,
>
> What is the latest Arrow version that supports C++14 standard?
>
> Thank you,
> Arkadiy


RE: Re: [Java] [Flight] Questions around DoGet implementations with flow control

2023-03-19 Thread Nathaniel Jones
Hi David,

Thanks so much for the fast reply, that’s really helpful context.

I like the idea you mentioned where the application provides an ArrowReader. 
Would that story go something like the following?
1. Today in FlightService doGetCustom, a GetListener is passed to 
producer.getStream - this leaves the application developer responsible for 
handling gRPC details discussed above (they can choose to respect 
OutboundStreamListener.isReady or not, etc.).
2. In the ArrowReader case you mentioned, that same call to producer.getStream 
would instead (or as another option) return an ArrowReader… then the 
FlightService code could say something like “while loadNextBatch is true, 
(somehow) unload that RecordBatch and pipe it through to the 
OutboundStreamListener’s putNext” (see the sketch below)
— One nice thing about interacting with the OutboundStreamListener directly is 
that application developers can putNext with metadata - I wonder where metadata 
fits in the ArrowReader case?
— That makes sense that it would be hard to get that async writing out to gRPC 
to work flawlessly - maybe one option in this scenario where Flight is deciding 
how to write out on the stream would be to default to a “noop 
BackpressureStrategy” or a simple blocking one and let application developers 
optionally override?
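
For concreteness, a sketch of the pull loop from point 2 above, in Python.
Here reader and listener are hypothetical stand-ins mirroring the Java
ArrowReader and OutboundStreamListener interfaces discussed in this thread,
not real pyarrow.flight APIs:

    import time

    def drive_stream(reader, listener, poll_interval=0.001):
        # reader.load_next_batch() stands in for ArrowReader.loadNextBatch().
        while reader.load_next_batch():
            # Respect gRPC flow control before writing each batch.
            while not listener.is_ready():
                if listener.is_cancelled():
                    return
                time.sleep(poll_interval)  # option-1-style wait; a real
                                           # impl could block on onReady
            listener.put_next(reader.current_batch())
        listener.completed()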

Thanks,
Nate  

On 2023/03/12 17:12:41 David Li wrote:
> Hi Nate,
> 
> That sounds about right to me (it's been a while since I've dug into 
> gRPC-Java behavior). A better server API is something I've long wanted to 
> consider and haven't had the time to; the current APIs try to let you write 
> blocking/procedural code as much as possible which then does not mesh well 
> with the actual gRPC APIs they attempt to wrap.
> 
> We could expose setOnReadyHandler, IMO. Though from what I recall, it's very 
> tricky to use correctly and it's easy to get yourself "stuck" by missing a 
> callback.
> 
> My hope for a better server API would be to eventually just have the 
> application provide an ArrowReader (asynchronously) and then have the Flight 
> implementation pull from that reader in the most efficient possible manner 
> (though gRPC-Java makes that hard, what with the hardcoded backpressure 
> threshold [1] - I think that may have been another reason why I didn't expose 
> setOnReadyListener before since it would artificially limit your throughput)
> 
> [1]: https://github.com/grpc/proposal/pull/135
> 
> -David
> 
> On Fri, Mar 10, 2023, at 18:46, Nathaniel Jones wrote:
> > Hello,
> >
> > I'm hoping to check my understanding around various ways to implement a
> > DoGet handler with respect to flow control, and then inquire about
> > potential future API changes.
> >
> > First, I'm aware of 3 ways to respect flow control when implementing a Java
> > server's DoGet that have different characteristics:
> >
> >1. Busy-Waiting / Thread.sleep()-ing:
> >   1. Implement a blocking body that loops (and maybe sleeps) while 
> > the
> >   ServerStreamListener's isReady is false (and respect isCancelled, 
> > too)
> >2. Using BackpressureStrategy
> >
> > <https://sourcegraph.com/github.com/apache/arrow/-/blob/java/flight/flight-core/src/main/java/org/apache/arrow/flight/BackpressureStrategy.java?L28:8>
> > (and
> >specifically CallbackBackpressureStrategy
> >
> > <https://sourcegraph.com/github.com/apache/arrow/-/blob/java/flight/flight-core/src/main/java/org/apache/arrow/flight/BackpressureStrategy.java?L75>
> > since
> >one could implement option #1 above as a simple strategy):
> >   1. My own experiments with CallbackBackpressureStrategy /
> >   understanding from initial PR discussion
> >   <https://github.com/apache/arrow/pull/8476#issuecomment-710777211>
> > demonstrate
> >   that the DoGet handler must be run on a separate thread; you 
> > can't invoke
> >   the "waitForListener" on the thread that gRPC uses to invoke this 
> > RPC
> >   because if you're blocking (in this case Thread await-ing) on 
> > this gRPC
> >   thread, gRPC can't process onReady callbacks for this RPC, and 
> > thus
> >   CallbackBackpressureStrategy would never be notify-ed to wake up
> >3. Write a fully async implementation relying directly on underlying
> >CallStreamObserver's setOnReadyHandler:
> >   1. This is similar in spirit to #2 above, but now operates 
> > completely
> >   on threads from gRPC's thread pool (the onReady handler *is* the
> >   DoGet logic). The code looks very roughly like:
> >  1. make VectorSchemaRoot with some schema and allocator
> >  2. On our ServerStr

Re: [VOTE][RUST][DataFusion] Release DataFusion Python Bindings 20.0.0 RC2

2023-03-17 Thread Will Jones
+1 (binding)

Verified on Ubuntu 22.04.

On Fri, Mar 17, 2023 at 9:44 AM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on M1 Mac.
>
> Thanks Andy.
>
> On Fri, Mar 17, 2023 at 8:01 AM Andy Grove  wrote:
> >
> > Hi,
> >
> > I would like to propose a release of Apache Arrow DataFusion Python
> > Bindings,
> > version 20.0.0.
> >
> > This release candidate is based on commit:
> > f7446cb9fc28c0bb30bd5d40ed75263a65819ade [1]
> > The proposed release tarball and signatures are hosted at [2].
> > The changelog is located at [3].
> > The Python wheels are located at [4].
> >
> > Please download, verify checksums and signatures, run the unit tests, and
> > vote
> > on the release. The vote will be open for at least 72 hours.
> >
> > Only votes from PMC members are binding, but all members of the community
> > are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > The standard verification procedure is documented at
> >
> https://github.com/apache/arrow-datafusion-python/blob/main/dev/release/README.md#verifying-release-candidates
> > .
> >
> > [ ] +1 Release this as Apache Arrow DataFusion Python 20.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow DataFusion Python 20.0.0
> > because...
> >
> > Here is my vote:
> >
> > +1
> >
> > [1]:
> >
> https://github.com/apache/arrow-datafusion-python/tree/f7446cb9fc28c0bb30bd5d40ed75263a65819ade
> > [2]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-python-20.0.0-rc2
> > [3]:
> >
> https://github.com/apache/arrow-datafusion-python/blob/f7446cb9fc28c0bb30bd5d40ed75263a65819ade/CHANGELOG.md
> > [4]: https://test.pypi.org/project/datafusion/20.0.0/
>


Re: [VOTE] Release Apache Arrow ADBC 0.3.0 - RC1

2023-03-17 Thread Will Jones
+1

I successfully ran on Mac OS 12.6 with
USE_CONDA=1 TEST_APT=0 TEST_YUM=0 ./dev/release/verify-release-candidate.sh
0.3.0 1

And also on Ubuntu 22.04 with:
USE_CONDA=1 ./dev/release/verify-release-candidate.sh 0.3.0 1


On Fri, Mar 17, 2023 at 12:09 PM Matt Topol  wrote:

> +1 (non-binding)
>
> I successfully ran the following on Pop!_OS 22.04 LTS
USE_CONDA=1 ./dev/release/verify-release-candidate.sh 0.3.0 1
>
>
>
> On Fri, Mar 17, 2023 at 12:01 PM Raúl Cumplido 
> wrote:
>
> > +1 (non-binding)
> >
> > I have successfully run the following on Ubuntu 22.04:
> > USE_CONDA=1 ./dev/release/verify-release-candidate.sh 0.3.0 1
> >
> >
> > El vie, 17 mar 2023 a las 16:22, David Li ()
> > escribió:
> > >
> > > Hello,
> > >
> > > I would like to propose the following release candidate (RC1) of Apache
> > Arrow ADBC version 0.3.0. This is a release consisting of 24 resolved
> > GitHub issues [1].
> > >
> > > This release candidate is based on commit:
> > ebcb87d8df41798d82171d81b7650b6bdfbe295a [2]
> > >
> > > The source release rc1 is hosted at [3].
> > > The binary artifacts are hosted at [4][5][6][7][8].
> > > The changelog is located at [9].
> > >
> > > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [10] for how to validate a release
> candidate.
> > >
> > > See also a verification result on GitHub Actions [11].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Release this as Apache Arrow ADBC 0.3.0
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Arrow ADBC 0.3.0 because...
> > >
> > > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> > DOCKER_DEFAULT_ARCHITECTURE=linux/amd64`. (Or skip this step by `export
> > TEST_APT=0 TEST_YUM=0`.)
> > >
> > > Thanks to Kou for helping prepare the release.
> > >
> > > [1]:
> >
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.3.0%22+is%3Aclosed
> > > [2]:
> >
> https://github.com/apache/arrow-adbc/commit/ebcb87d8df41798d82171d81b7650b6bdfbe295a
> > > [3]:
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.3.0-rc1/
> > > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > > [7]:
> >
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > > [8]:
> >
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.3.0-rc1
> > > [9]:
> >
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.3.0-rc1/CHANGELOG.md
> > > [10]:
> >
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > > [11]: https://github.com/apache/arrow-adbc/actions/runs/4448256840
> >
>


Plasma will be removed in Arrow 12.0.0

2023-03-15 Thread Will Jones
Hello all,

First, a reminder that Plasma has been deprecated and will be removed in
the 12.0.0 release of the C++, Python, and Java Arrow libraries. [1]

I know some used Plasma as a convenient way to share Arrow data between
Python processes, so I pulled together a quick performance comparison
against two supported alternatives: Flight over unix domain socket and the
Python sharedmemory module. [2] The shared memory example performs
comparably to Plasma, but I don't think is accessible from other languages.
The Flight test is slower than shared memory, but still fairly fast, and of
course works across languages. I wrote a little more about the shared
memory case in a stackoverflow answer [3].

If you have migrated off of Plasma and want to share with other users what
you moved to, please do so in this thread.

Best,

Will Jones

[1] https://github.com/apache/arrow/issues/33243
[2] https://github.com/wjones127/arrow-ipc-bench
[3] https://stackoverflow.com/a/75402621/2048858


Re: [Rust][MaybeNotJustRust] PR titles and generating change logs

2023-03-14 Thread Will Jones
Thanks for sharing this script!

> I noticed that some contributors are already prefixing PR titles with
> "feat:", "feature:", "fix:", "docs:", etc. I plan on updating the
changelog
> generator to recognize these prefixes as well, to help automate my job.

I believe most people are doing this out of inspiration from the
Conventional Commits standard [1] (at least I am). This standard is used
and enforced in CI in the Substrait main repository, for example. [2]
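
For example, titles under that convention look like this (made-up examples,
just to show the shape):

feat: add sort kernel for dictionary arrays
fix(parquet): correct null count in column statistics
docs: clarify release verification steps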

I've found them not too bad to work with, but I definitely am rebasing and
squashing commits more often to make my messages conform. This can make it
harder for people to see the incremental changes in a PR when re-reviewing.
Other than that, I see no downside.

[1] https://www.conventionalcommits.org/en/v1.0.0/
[2]
https://github.com/substrait-io/substrait/actions/runs/4408812075/jobs/7724306882

On Tue, Mar 14, 2023 at 8:38 AM Andy Grove  wrote:

> I filed an issue in the datafusion repo as well, since not everyone is on
> the mailing list.
>
> https://github.com/apache/arrow-datafusion/issues/5601
>
> On Tue, Mar 14, 2023 at 9:36 AM Andy Grove  wrote:
>
> > We have been using github-changelog-generator [1] to generate changelogs
> > for the Rust projects for some time now. It has served us well but is no
> > longer workable, at least for DataFusion. This tool seems to pull down
> the
> > entire project history using the GitHub API and we had to artificially
> slow
> > it down to avoid hitting API rate limits, and it is now unusable due to
> the
> > number of issues and PRs in this repo.
> >
> > This weekend, I built a much simpler changelog generator in Python [2],
> > that I am now using for the projects that I am the release manager for
> > (datafusion, datafusion-python, ballista). It has almost the same
> > functionality that we were getting from the previous generator, but takes
> > less than a minute to run, compared to 30+ minutes for the old generator.
> > It only hits the GitHub API for information about commits and pull
> requests
> > in the release being documented, rather than accessing the entire project
> > history.
> >
> > I followed the same approach of using GitHub labels to categorize PRs
> > (enhancements, bug fixes, docs, etc) but this requires a small amount of
> > manual effort to add those labels and re-generate the changelog.
> >
> > I noticed that some contributors are already prefixing PR titles with
> > "feat:", "feature:", "fix:", "docs:", etc. I plan on updating the
> changelog
> > generator to recognize these prefixes as well, to help automate my job.
> >
> > I wonder if it is worth formalizing these "semantic titles" more, and
> > maybe having CI enforce them. It would improve the quality of our
> > changelogs and reduce the burden on release managers.
> >
> > I would appreciate any feedback on this idea.
> >
> > Thanks,
> >
> > Andy.
> >
> >
> > [1]
> > https://github.com/github-changelog-generator/github-changelog-generator
> > [2] https://github.com/andygrove/changelog-genie
> >
>


[Java] [Flight] Questions around DoGet implementations with flow control

2023-03-10 Thread Nathaniel Jones
Hello,

I'm hoping to check my understanding around various ways to implement a
DoGet handler with respect to flow control, and then inquire about
potential future API changes.

First, I'm aware of 3 ways to respect flow control when implementing a Java
server's DoGet that have different characteristics:

   1. Busy-Waiting / Thread.sleep()-ing:
  1. Implement a blocking body that loops (and maybe sleeps) while the
  ServerStreamListener's isReady is false (and respect isCancelled, too)
   2. Using BackpressureStrategy
   <https://sourcegraph.com/github.com/apache/arrow/-/blob/java/flight/flight-core/src/main/java/org/apache/arrow/flight/BackpressureStrategy.java?L28:8>
   (and specifically CallbackBackpressureStrategy
   <https://sourcegraph.com/github.com/apache/arrow/-/blob/java/flight/flight-core/src/main/java/org/apache/arrow/flight/BackpressureStrategy.java?L75>
   since one could implement option #1 above as a simple strategy):
  1. My own experiments with CallbackBackpressureStrategy /
  understanding from initial PR discussion
  <https://github.com/apache/arrow/pull/8476#issuecomment-710777211>
  demonstrate
  that the DoGet handler must be run on a separate thread; you can't invoke
  the "waitForListener" on the thread that gRPC uses to invoke this RPC
  because if you're blocking (in this case Thread await-ing) on this gRPC
  thread, gRPC can't process onReady callbacks for this RPC, and thus
  CallbackBackpressureStrategy would never be notify-ed to wake up
   3. Write a fully async implementation relying directly on underlying
   CallStreamObserver's setOnReadyHandler:
  1. This is similar in spirit to #2 above, but now operates completely
  on threads from gRPC's thread pool (the onReady handler *is* the
  DoGet logic). The code looks very roughly like:
 1. make VectorSchemaRoot with some schema and allocator
 2. On our ServerStreamListener, invoke listener.start(root)
 immediately
 3. set listener.setOnReadyHandler()
 4. So, no blocking
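
For reference, here is roughly what I mean by option #3, sketched directly
against gRPC-Java rather than the Flight API (since Flight doesn't expose the
onReady callback today). ServerCallStreamObserver, isReady, isCancelled, and
setOnReadyHandler are the real gRPC-Java APIs; AsyncDoGetSketch and nextBatch
are made-up names, with nextBatch standing in for the application logic that
produces the next response and returns null when done:

import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;
import java.util.function.Supplier;

class AsyncDoGetSketch<Resp> {
  void serve(StreamObserver<Resp> rawObserver, Supplier<Resp> nextBatch) {
    ServerCallStreamObserver<Resp> observer =
        (ServerCallStreamObserver<Resp>) rawObserver;
    observer.setOnReadyHandler(() -> {
      // gRPC re-invokes this handler whenever the outbound buffer drains
      // below its threshold; drain until the transport pushes back.
      while (observer.isReady() && !observer.isCancelled()) {
        Resp next = nextBatch.get();   // null signals end of stream here
        if (next == null) {
          observer.onCompleted();
          return;
        }
        observer.onNext(next);
      }
    });
  }
}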

My understanding of the trade-offs between options #2 and #3 above:

   - In CallbackBackpressureStrategy with background threads, there'll be
   (# threads in gRPC pool) + (sum of background threads across currently
   running DoGet streams) threads. If the actual streaming logic is intensive
   / CPU bound, it might be good for that to live on a background thread,
   because if the gRPC threads were tied up in intensive callbacks (and even
   exhausted), new RPC requests / ready and cancel callbacks could be slow /
   stuck.
   - On the other hand, for quick I/O bound work in DoGet logic, option #3
   might work well if it's not worth the extra threads / context switching,
   and gRPC threads can handle everything quickly.
   - So overall, it seems like different workloads could demand a different
   model, which should inform how DoGet logic is written.


*Q1: *Is my understanding of the above points approximately correct? I'd
appreciate any pointers on items I'm misunderstanding.

*Q2: *Assuming option #3 has utility in some cases, I'm curious if there is
a way in Flight to expose an easier API to implement DoGet fully
asynchronously. I find it tricky to manage cleanup of the VectorSchemaRoot
as well as being careful about things like respecting isCancelled becoming
true. Developers also "need to know" that the async setOnReadyHandler
exists at the gRPC layer and understand its benefits. I noticed that on
ARROW-4484 <https://issues.apache.org/jira/browse/ARROW-4484>, which
focuses on DoPut's busy-wait, the first comment mentions that this applies
to DoGet as well - though it seems GetListener's waitUntilStreamReady doesn't
busy-wait, I was curious if the spirit of the comment was similar to my
question around an easier way to write the logic asynchronously? If so,
would the idea be for Flight to wrap gRPC's callbacks and expose those to
DoGet authors, while helping to abstract some cleanup items?

Thank you for any feedback,
Nate


Re: [ADBC][Rust] Proposal for Rust ADBC API

2023-03-09 Thread Will Jones
I've been thinking about the process here, and I'd like to propose an
alternate path. As I understand it, currently the process is:

1. Approve the Rust API as a stable API
2. "Release" Rust API as part of a new version of the ADBC format
3. Release Rust libraries in tandem with other ADBC libraries, matching
their version number.

On step (1), I don't think we have enough developers yet working on Rust
ADBC to get adequate feedback on the API design to make something we want
to be stable. I've had some time to accumulate work in WIP PRs, but still
don't feel fully ready to declare any stable API.

Also on step (3), it doesn't seem obvious to me there's a benefit for Rust
to match the versions of other ADBC libraries. Rust's release process is
lightweight enough where it's not much additional effort to release it
separately. I would be willing to do that, at least initially.

So I'd propose:

1. Plan to merge Rust API as "experimental", and not stabilize it until a
little later. TBH I don't even think having a stable API in Rust is a big
deal right now, as none of the other Rust Arrow libraries have one.
2. Plan to release Rust ADBC library independently from other libraries,
with its own (semantic) versioning.

I think these two can be considered independently; if we find a strong
reason to keep the Rust versions the same as other libraries, I'd still
hope it would be fine to have the Rust API be experimental for some period.

On Wed, Mar 1, 2023 at 8:12 PM Will Jones  wrote:

> Hello Arrow devs,
>
> I have created a PR to define a Rust API for ADBC [1], meant to parallel
> the ones in C, Go, and Java [2]. I'd like to get feedback on that. This API
> will be considered part of the ADBC format. Once I have addressed all
> feedback, we will put forward a vote for adoption into the ADBC standard.
>
> In the meantime, I will continue prototyping the Rust driver manager and
> implementation module (for building C API drivers with Rust). [3] These
> prototypes informed the proposed API design and will rely on this API.
> Hopefully those will be ready for review shortly after we adopt the API
> standard.
>
> Best,
>
> Will Jones
>
> [1] https://github.com/apache/arrow-adbc/pull/478
> [2] https://arrow.apache.org/adbc/0.2.0/format/specification.html
> [3] https://github.com/apache/arrow-adbc/pull/446
>


Re: [EXTERNAL] Re: Field class in Java vs C#

2023-03-09 Thread Will Jones
Hi David Coe,

As David Li pointed out, ADBC implementations can either be based purely
within a language (C#-specific drivers that can only be used by C#
programs) or use C API drivers written in other languages (C, C++, Go). For
the latter, we won't be able to implement this until we finish implementing
the C Data Interface [1] and C Stream Interface [2]. And for both
approaches, I think we need to implement Union and Map types for GetInfo,
which currently aren't implemented in Arrow C#.

Best,
Will Jones

[1] https://github.com/apache/arrow/issues/33856
[2] https://github.com/apache/arrow/issues/33857


On Thu, Mar 9, 2023 at 1:38 PM David Coe 
wrote:

> Yes, OK, I see the pattern now. Thank you.
>
> -Original Message-
> From: David Li 
> Sent: Thursday, March 9, 2023 4:30 PM
> To: dev@arrow.apache.org
> Subject: Re: [EXTERNAL] Re: Field class in Java vs C#
>
>
> I believe it would be something like (pseudocode since the last time I
> touched C♯ was, 2009?)
>
> List<Field> TABLE_SCHEMA = new[]{
>   ...,
>   new Field("table_columns", new ListType(new StructType(COLUMN_SCHEMA))),
>   ...,
> };
>
> i.e. COLUMN_SCHEMA gets passed as the fields of a StructType itself
> instead of the field containing the StructType. (Which saves you some
> typing too since you don't have to explicitly name the list child field.)
>
> On Thu, Mar 9, 2023, at 16:20, David Coe wrote:
> > I am investigating whether ADBC can be a replacement for ODBC in
> > certain scenarios and help with more efficient copying.
> >
> > For example, in
> > https://github.com/apache/arrow-adbc/blob/923e0408fe5a32cc6501b997fafa8316ace25fe0/java/core/src/main/java/org/apache/arrow/adbc/core/StandardSchemas.java#L116
> > it wants COLUMN_SCHEMA and CONSTRAINT_SCHEMA as children but
> > there's not an obvious way to add those children to the respective
> > fields.
> >
> > -Original Message-
> > From: David Li 
> > Sent: Thursday, March 9, 2023 3:37 PM
> > To: dev@arrow.apache.org
> > Subject: [EXTERNAL] Re: Field class in Java vs C#
> >
> >
> > I'd be very interested if I can help in any way with porting ADBC to
> > more languages, and learning more about use cases/what functionality
> > is useful (e.g. are you looking to have a full driver/client ecosystem
> > in C♯, or are you interested in being able to leverage drivers written
> > in
> > C/C++/Go?)
> >
> > From a quick look, C♯ follows C++, Python, etc. in putting child
> > fields as part of the nested type, rather than as part of the field
> > itself. I can't say why precisely one implementation chose one design
> > or another, but Java basically follows the IPC format exactly in this
> > regard (and others, e.g. it has a parameterized Int type rather than
> > Int32, Int64, UInt32, etc.), while the other languages model it at a
> > higher level (because only some types can have children).
> >
> > What specifically is difficult with how the APIs are structured?
> >
> > On Thu, Mar 9, 2023, at 15:19, David Coe wrote:
> >> I am interested in the difference between how a Field is structured
> >> in Java (with children) and in C# (no children) and why that's the case.
> >>
> >> I am looking to port apache/arrow-adbc: Apache arrow
> >> (github.com) <https://github.com/apache/arrow-adbc>
> >> to C# but the concept of children is making it a little hairy.
> >>
> >>
> >>   *   David
>


[ADBC][Rust] Proposal for Rust ADBC API

2023-03-01 Thread Will Jones
Hello Arrow devs,

I have created a PR to define a Rust API for ADBC [1], meant to parallel
the ones in C, Go, and Java [2]. I'd like to get feedback on that. This API
will be considered part of the ADBC format. Once I have addressed all
feedback, we will put forward a vote for adoption into the ADBC standard.

In the meantime, I will continue prototyping the Rust driver manager and
implementation module (for building C API drivers with Rust). [3] These
prototypes informed the proposed API design and will rely on this API.
Hopefully those will be ready for review shortly after we adopt the API
standard.

Best,

Will Jones

[1] https://github.com/apache/arrow-adbc/pull/478
[2] https://arrow.apache.org/adbc/0.2.0/format/specification.html
[3] https://github.com/apache/arrow-adbc/pull/446


Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.5.5 RC1

2023-03-01 Thread Will Jones
+1 (non-binding).

Verified on MacOS aarch64.

On Mon, Feb 27, 2023 at 12:53 PM Andrew Lamb  wrote:

> +1 (binding)
>
> Verified on mac x86 -- the release train is quite impressive this month
>
> Thank you Raphael
>
>
> p.s I get one local failure, tracked with [1] , but I don't think it is
> serious and I have seen it in other releases as well
>
> [1] https://github.com/apache/arrow-rs/issues/3772
>
>
> On Mon, Feb 27, 2023 at 3:33 PM Raphael Taylor-Davies
>  wrote:
>
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Object
> > Store Implementation, version 0.5.5.
> >
> > This release candidate is based on commit:
> > 5cc0f9b634393008ea6136a228470b6612b2dee1 [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust Object Store
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust Object Store because...
> >
> > [1]:
> >
> >
> https://github.com/apache/arrow-rs/tree/5cc0f9b634393008ea6136a228470b6612b2dee1
> > [2]:
> >
> >
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.5.5-rc1
> > [3]:
> >
> >
> https://github.com/apache/arrow-rs/blob/5cc0f9b634393008ea6136a228470b6612b2dee1/object_store/CHANGELOG.md
> > [4]:
> >
> >
> https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
> >
> >
>


Re: [VOTE] Release Apache Arrow nanoarrow 0.1.0 - RC1

2023-03-01 Thread Will Jones
+1 (non-binding). Verified with Conda on MacOS aarch64.

Note: I needed to add gtest to the conda environment. Otherwise it went
smoothly. [1]

[1] https://github.com/apache/arrow-nanoarrow/pull/138

On Wed, Mar 1, 2023 at 9:04 AM Dewey Dunnington
 wrote:

> Hello,
>
> I would like to propose the following release candidate (RC1) of Apache
> Arrow nanoarrow [0] version 0.1.0. This is an initial release consisting of
> 31 resolved GitHub issues [1].
>
> Special thanks to David Li for his reviews and support during the
> preparation of this initial release candidate!
>
> This release candidate is based on commit:
> 341279af1b2fdede36871d212f339083ffbd75eb [2]
>
> The source release rc1 is hosted at [3].
> The changelog is located at [4].
>
> Please download, verify checksums and signatures, run the unit tests, and
> vote on the release. See [5] for how to validate a release candidate.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow nanoarrow 0.1.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow nanoarrow 0.1.0 because...
>
> [0] https://github.com/apache/arrow-nanoarrow
> [1] https://github.com/apache/arrow-nanoarrow/milestone/1?closed=1
> [2]
>
> https://github.com/apache/arrow-nanoarrow/tree/apache-arrow-nanoarrow-0.1.0-rc1
> [3]
>
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-nanoarrow-0.1.0-rc1/
> [4]
>
> https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.1.0-rc1/CHANGELOG.md
> [5]
> https://github.com/apache/arrow-nanoarrow/blob/main/dev/release/README.md
>


Re: [DISCUSS] Flight RPC/Flight SQL/ADBC enhancements

2023-02-14 Thread Will Jones
Hi David,

The proposals in the Flight/Flight SQL document look excellent. As I've
been looking at ADBC I've been wondering about polling / async execution,
cancellation, and progress indicators. Glad to see those in the Flight
document, but where are they in the ADBC issues? Do they still need to be
created?

Best,

Will Jones

On Tue, Feb 14, 2023 at 12:58 PM David Li  wrote:

> Hello,
>
> I would like to submit some Flight RPC and Flight SQL enhancements for
> discussion. They cover the following:
>
> - Executing 'queries' in a retryable, nonblocking way
> - Handling ordered result sets
> - Handling expiration of/re-reading result sets
>
> In addition, there are corresponding proposals for ADBC in anticipation of
> these features, James's catalogs proposal for Flight SQL, and other
> feedback.
>
> The Flight proposals are described in this document [1]. It should be open
> for comments.
> The ADBC proposals are filed as individual issues in this milestone [2].
>
> Any feedback is much appreciated. There are not yet prototype
> implementations, but if there is a rough consensus then I can begin on that.
>
> [1]:
> https://docs.google.com/document/d/1jhPyPZSOo2iy0LqIJVUs9KWPyFULVFJXTILDfkadx2g/edit?usp=sharing
> [2]: https://github.com/apache/arrow-adbc/milestone/3
>
> Thanks,
> David
>


Re: [VOTE] Release Apache Arrow ADBC 0.2.0 - RC1

2023-02-09 Thread Will Jones
+1 (non-binding)

Verified on MacOS with (TEST_APT=0 TEST_YUM=0 USE_CONDA=1) and Ubuntu 22.04
with (USE_CONDA=1).

On Thu, Feb 9, 2023 at 11:53 AM David Li  wrote:

> Hi Jianfeng,
>
> Glad it's already helping you! It is definitely still in my plans, but I
> haven't gotten to it yet. Of course, contributions are also welcome.
>
> It was also brought to my attention that other ODBC wrappers exist (also:
> ConnectorX, IIRC) which could be evaluated in lieu of Turbodbc for this
> purpose [1]. If you have experience with any of them, that would also be
> interesting! (Will Jones has been working on a project to allow ADBC
> drivers to be built with Rust, which would let us take advantage of
> arrow-odbc or other Rust libraries.)
>
> [1]: https://github.com/apache/arrow-adbc/issues/72
>
> -David
>
> On Thu, Feb 9, 2023, at 14:29, Jianfeng Mao wrote:
> > Hi David, it is great to see that the ADBC project is moving so fast. We
> at
> > Deephaven implemented a new feature that relies on ADBC/ODBC to ingest
> > relational data.  When one of our dev-rel developers tried to set up some
> > demos for this feature, the experience with ADBC was much smoother than
> > that with Turbodbc/ODBC. I remember that you mentioned an intention to
> add
> > ODBC support in ADBC through Turbodbc, is that still the case?
> >
> > Best,
> > Jianfeng
> >
> > On Thu, Feb 9, 2023 at 8:06 AM David Li  wrote:
> >
> >> Hello,
> >>
> >> I would like to propose the following release candidate (RC1) of Apache
> >> Arrow ADBC version 0.2.0. This is a release consisting of 34 resolved
> >> GitHub issues [1].
> >>
> >> This release candidate is based on commit:
> >> de79252f70dfc145b853530f328b0c6dfed3085f [2]
> >>
> >> The source release rc1 is hosted at [3].
> >> The binary artifacts are hosted at [4][5][6][7][8].
> >> The changelog is located at [9].
> >>
> >> Please download, verify checksums and signatures, run the unit tests,
> and
> >> vote on the release. See [10] for how to validate a release candidate.
> >>
> >> See also a verification result on GitHub Actions [11].
> >>
> >> The vote will be open for at least 72 hours.
> >>
> >> [ ] +1 Release this as Apache Arrow ADBC 0.2.0
> >> [ ] +0
> >> [ ] -1 Do not release this as Apache Arrow ADBC 0.2.0 because...
> >>
> >> Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> >> DOCKER_DEFAULT_ARCHITECTURE=linux/amd64`. (Or skip this step by `export
> >> TEST_APT=0 TEST_YUM=0`.)
> >>
> >> Thanks to Kou for his help with the release.
> >>
> >> [1]:
> >>
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+is%3Aclosed+milestone%3A%22ADBC+Libraries+0.2.0%22
> >> [2]:
> https://github.com/apache/arrow-adbc/tree/apache-arrow-adbc-0.2.0-rc1
> >> [3]:
> >>
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.2.0-rc1/
> >> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> [7]:
> >>
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> >> [8]:
> >>
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.2.0-rc1
> >> [9]:
> >>
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.2.0-rc1/CHANGELOG.md
> >> [10]:
> >>
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> >> [11]: https://github.com/apache/arrow-adbc/actions/runs/4135206064
> >>
>


[FLIGHT] Question about Flight Protocol Usage

2023-02-03 Thread Nate Jones
Hello,

We've been using the Flight protocol similar to the way that the read path is 
described in the documentation. That is, services have a separate metadata 
server (at least logically 
separated such that a network round trip occurs for GetFlightInfo), which 
returns FlightInfo to be used to access data server(s). We follow a similar 
pattern for writes and exchanges, as well.

While the separate metadata concept is crucial for certain applications, we 
think other use cases could be made much simpler by skipping the metadata step 
altogether - in this case, clients would craft their own Tickets and talk 
directly to data servers for reads, writes, and exchanges. This would be nice 
when we're just looking for a "normal gRPC streaming call but with the benefits 
of Flight." For example, some services have a metadata server that returns 
FlightInfo that simply points clients back to itself, resulting in an 
unnecessary round trip since the GetFlightInfo is essentially a “noop” here.

I notice in the docs the statement "Of course, applications may ignore 
compatibility and simply treat the Flight RPC methods as low-level building 
blocks for their own purposes." Despite this, I wanted to reach out to see if 
there are any reference use cases that use Flight in this way. Are there any 
concerns that come to mind when adapting the Flight pattern like this?
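
To make that concrete, the client side we have in mind looks roughly like the
following (a hedged sketch: FlightClient.getStream and Ticket are the real
flight-core APIs, but the host, port, and ticket payload convention here are
made up):

import java.nio.charset.StandardCharsets;
import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightStream;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.memory.RootAllocator;

class DirectTicketSketch {
  void readDirect() throws Exception {
    try (RootAllocator allocator = new RootAllocator();
         FlightClient client = FlightClient.builder(
             allocator, Location.forGrpcInsecure("data-server", 8815)).build();
         // No GetFlightInfo round trip: the client crafts the Ticket itself
         // using an application-defined payload convention.
         FlightStream stream = client.getStream(new Ticket(
             "my-dataset/partition=7".getBytes(StandardCharsets.UTF_8)))) {
      while (stream.next()) {
        // consume stream.getRoot(), one batch at a time
      }
    }
  }
}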

Thanks,
Nate


Re: [C++] Parquet and Arrow overlap

2023-02-02 Thread Will Jones
Day to day, I think having Parquet-cpp under the Apache Arrow project could
make sense. Though I worry about two risks:

1. Would that lead to the governance of the format itself becoming primarily
the responsibility of the developers of Parquet-MR?
2. Would C++ developers interested in working with Parquet outside of Arrow
recognize it as a relevant library?

On Thu, Feb 2, 2023 at 6:03 AM Neal Richardson 
wrote:

> Would it make sense to transfer all governance of the parquet-cpp
> implementation to Apache Arrow? It seems like that's where we de facto are
> already, so that would resolve these ambiguities and put it in line with
> the Rust implementation.
>
> Would the Parquet PMC be opposed to formalizing this change?
>
> Neal
>
> On Thu, Feb 2, 2023 at 6:48 AM Raphael Taylor-Davies
>  wrote:
>
> > Hi,
> >
> > > Does the parquet rust implementation have a similar issue?
> >
> > Similar to the C++ implementation, the Rust implementation lives under
> > the Apache Arrow umbrella and does not have any direct affiliation with
> > the Apache Parquet project that I am aware of, beyond using the same
> > format specification. However, as almost all of the users and
> > contributions are with respect to the arrow interfaces, and not the
> > parquet record APIs, there perhaps isn't the same ambiguity as
> > encountered with the C++ implementation. I would expect all issues to be
> > raised in the arrow-rs repository, and a PARQUET Jira only raised,
> > likely by myself or whoever is triaging the issue, if there is some
> > issue/ambiguity pertaining to the format itself.
> >
> > Kind Regards,
> >
> > Raphael
> >
> > On 02/02/2023 01:58, Gang Wu wrote:
> > > Hi Will,
> > >
> > > AFAIK, the Apache Parquet community no longer considers contributions to
> > > parquet-cpp when promoting new committers after the donation to Apache
> > > Arrow.
> > >
> > > It would be a dilemma for the parquet-cpp contributors if neither the
> > > Apache Arrow community nor the Apache Parquet community recognizes their
> > > work.
> > >
> > > Does the parquet rust implementation have a similar issue?
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, Feb 2, 2023 at 3:27 AM Will Jones 
> > wrote:
> > >
> > >> Hello,
> > >>
> > >> A while back, the Parquet C++ implementation was merged into the
> Apache
> > >> Arrow monorepo [1]. As I understand it, this helped the development
> > process
> > >> immensely. However, I am noticing some governance issues because of
> it.
> > >>
> > >> First, it's not obvious where issues are supposed to be open: In
> Parquet
> > >> Jira or Arrow GitHub issues. Looking back at some of the original
> > >> discussion, it looks like the intention was
> > >>
> > >> * use PARQUET-XXX for issues relating to Parquet core
> > >>> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
> > >>> core (e.g. changes that are in parquet/arrow right now)
> > >>>
> > >> The README for the old parquet-cpp repo [3] states instead in its
> > >> migration note:
> > >>
> > >>   JIRA issues should continue to be opened in the PARQUET JIRA
> project.
> > >>
> > >>
> > >> Either way, it doesn't seem like this process is obvious to people.
> > Perhaps
> > >> we could clarify this and add notices to Arrow's GitHub issues
> template?
> > >>
> > >> Second, committer status is a little unclear. I am a committer on
> Arrow,
> > >> but not on Parquet right now. Does that mean I should only merge
> Parquet
> > >> C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
> > >> Parquet changes at all?
> > >>
> > >> Also, are the contributions to Arrow C++ Parquet being actively
> reviewed
> > >> for potential new committers?
> > >>
> > >> Best,
> > >>
> > >> Will Jones
> > >>
> > >> [1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
> > >> [2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
> > >> [3] https://github.com/apache/parquet-cpp
> > >>
> >
>


[C++] Parquet and Arrow overlap

2023-02-01 Thread Will Jones
Hello,

A while back, the Parquet C++ implementation was merged into the Apache
Arrow monorepo [1]. As I understand it, this helped the development process
immensely. However, I am noticing some governance issues because of it.

First, it's not obvious where issues are supposed to be open: In Parquet
Jira or Arrow GitHub issues. Looking back at some of the original
discussion, it looks like the intention was

* use PARQUET-XXX for issues relating to Parquet core
> * use ARROW-XXX for issues relation to Arrow's consumption of Parquet
> core (e.g. changes that are in parquet/arrow right now)
>

The README for the old parquet-cpp repo [3] states instead in its
migration note:

 JIRA issues should continue to be opened in the PARQUET JIRA project.


Either way, it doesn't seem like this process is obvious to people. Perhaps
we could clarify this and add notices to Arrow's GitHub issues template?

Second, committer status is a little unclear. I am a committer on Arrow,
but not on Parquet right now. Does that mean I should only merge Parquet
C++ PRs for code changes in parquet/arrow? Or that I shouldn't merge
Parquet changes at all?

Also, are the contributions to Arrow C++ Parquet being actively reviewed
for potential new committers?

Best,

Will Jones

[1] https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
[2] https://lists.apache.org/thread/dkh6vjomcfyjlvoy83qdk9j5jgxk7n4j
[3] https://github.com/apache/parquet-cpp


Re: [Monorepo] Add labels breaking-change and critical-fix

2023-01-25 Thread Will Jones
"backport candidate" sounds like an interesting idea! If someone wants to
propose and help implement that, I would be supportive.

For now, I've merged the definitions of "Breaking Change" and "Critical
Fix". I've also labelled the relevant issues I've found for 11.0.0 and
10.0.0:

Breaking changes in 11.0.0:
https://github.com/apache/arrow/issues?q=milestone%3A%2211.0.0%22+label%3A%22Breaking+Change%22
Critical fixes in 11.0.0:
https://github.com/apache/arrow/issues?q=milestone%3A%2211.0.0%22+label%3A%22Critical+Fix%22+

Breaking changes in 10.0.0:
https://github.com/apache/arrow/issues?q=milestone%3A%2210.0.0%22+label%3A%22Breaking+Change%22+
Critical fixes in 10.0.0:
https://github.com/apache/arrow/issues?q=milestone%3A%2210.0.0%22+label%3A%22Critical+Fix%22+

On Tue, Jan 17, 2023 at 8:17 AM Antoine Pitrou  wrote:

>
> Hi,
>
> I would also suggest a "bugfix" or "backport candidate" label if we want
> to make it easier to cherrypick changes for bugfix releases.
>
> Regards
>
> Antoine.
>
>
> Le 06/01/2023 à 17:57, Will Jones a écrit :
> > Hello Arrow devs,
> >
> > For the monorepo, I would like to propose adding two new labels to
> issues:
> >
> > 1. breaking-change: for changes that break API compatibility.
> > 2. critical-fix: for bug fixes that address bugs that users will want
> > to know about, but may not realize affect them. The primary type I
> > have observed in the Arrow repo is bugs that produce incorrect or
> > invalid data. Security vulnerabilities are another type. Bugs that
> > caused errors or crashes generally wouldn't count, since users are
> > aware of the errors they get. (Though I am definitely open to
> > arguments for a
> > different definition or name.)
> >
> > I would additionally propose that these labels are validated during the
> > release process and included in the change notes. By validated, I mean
> > someone should review all the changes in a particular release to make
> sure
> > all relevant issues are tagged with these labels. These are the two kinds
> > of issues I think users will most want to know about when considering
> > upgrading an Arrow library: what APIs changed? And what's the risk of not
> > upgrading?
> >
> > I am willing to be responsible for maintaining these labels for the next
> > few releases for Python, R, and C++. I have been compiling the list of
> > these issues for past versions, as part of my work for my employer, so
> I'm
> > on the hook for this regardless. Having these labels available and used
> by
> > developers and reviewers would make that work much easier. And, of
> course,
> > our users would benefit by having this information easily available in
> our
> > release notes.
> >
> > It's also worth mentioning these two labels are useful if we decide to
> > change how we do releases. The breaking-change label can help decide
> > whether we actually need to increment the major version. And the
> > critical-fix label can help guide which bug fixes are worth applying to
> > older supported releases. I don't think we are ready for either of those
> > yet, but I thought it's worth connecting those dots.
> >
> > Best,
> >
> > Will Jones
> >
>


Re: [Monorepo] Add labels breaking-change and critical-fix

2023-01-17 Thread Will Jones
Antoine and Weston,

You make a very good point about crashes, particularly the security risk.
I'll add that to the scope of the definition.

On Sat, Jan 14, 2023 at 9:54 AM Antoine Pitrou  wrote:

>
> A crash on invalid *user* input can easily turn into a security
> vulnerability (if only because it's a possible vector for DoS attacks),
> and so should definitely be considered critical.
>
> What's not critical is a crash when the caller of a C++ API doesn't
> respect the API contract (e.g. passes a null pointer where non-null is
> expected).
>
> Regards
>
> Antoine.
>
>
> Le 14/01/2023 à 17:47, Weston Pace a écrit :
> > On further thought it seems a little odd to me that crashes are not
> > critical.  However, many of our crashes are from a failure to properly
> > validate user input, which I agree isn't as critical.  Would it be too
> > nuanced to say that:
> >
> >   * A crash, given valid input, is critical
> >   * A crash, given invalid input, is not critical
> >
> >
> >
> > On Sat, Jan 14, 2023, 8:12 AM Antoine Pitrou  wrote:
> >
> >>
> >> Hi Will,
> >>
> >> Le 14/01/2023 à 17:06, Will Jones a écrit :
> >>>>
> >>>> I'm quite skeptical about this. My experience is that many people
> have a
> >>>> very subjective idea of what is critical or not, and the
> categorization
> >>>> ends up not very informative.
> >>>
> >>> Antoine, skeptical about the definition of "Critical Fix"? Or something
> >>> else? On "Critical Fix", I tried to make the definition provided not
> very
> >>> ambiguous, but the PR is open for feedback.
> >>>
> >>> Keep in mind, I am planning on grooming these labels once every
> release,
> >>> and including them in the generation of the changes notes. So any drift
> >> in
> >>> the definition will be corrected before the final list of breaking
> >> changes
> >>> and critical fixes are published.
> >>
> >> That clears my concerns then :-)
> >>
> >> However, I think that an additional "Priority: critical" isn't very
> >> useful and will end up confusing people.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
>


Re: [ANNOUNCE] Apache Arrow ADBC 0.1.0 Released

2023-01-16 Thread Will Jones
>
> Thanks for the reference. I feel like this must've been shared earlier but
> I missed it.


If it seems familiar, it was mentioned in this earlier user ML thread [1].

The other thing I'd be curious about is if we can generalize this subset of
> SQL/Substrait to drivers for other 'storage layers' like Apache Iceberg and
> Apache Hudi.
>

Yeah it's basically Ian Cook's idea of plan delegation [2] to storage
systems. Not sure if it's storage layers in general or table formats
specifically that we might want to define, but it's an interesting idea.

On Substrait, I'm holding off implementing anything for now, in the hopes
that later on we might get the Spark implementation of Delta Lake to align
on Substrait definitions for operations on Delta Lake tables. But if we can
get shared messages with Iceberg and Hudi, that would be even better. Seems
feasible at first glance.

[1] https://lists.apache.org/thread/fywntyryy7pr1ttzv25s3ghf6sqy7zjl
[2] https://youtu.be/5JjaB7p3Sjk?t=550

On Mon, Jan 16, 2023 at 4:01 PM David Li  wrote:

> Thanks for the reference. I feel like this must've been shared earlier but
> I missed it.
>
> Another direction I mean to explore: implementing an Arrow Dataset backend
> using ADBC, so that we can feed SQL databases (and now Delta Lake) into
> (Py)Arrow Dataset, and then further into Acero (and the R package's dplyr
> bindings, ...).
>
> The other thing I'd be curious about is if we can generalize this subset
> of SQL/Substrait to drivers for other 'storage layers' like Apache Iceberg
> and Apache Hudi.
>
> On Mon, Jan 16, 2023, at 17:53, Will Jones wrote:
> >>
> >> You could do something like what Matt Topol's done for Go
> >>
> >
> > Thank you for the link! That's very similar to what I am thinking for
> Rust.
> > I will look at that as a reference. :)
> >
> > What do you plan for a "query" to mean to the ADBC Delta Lake driver?
> Would
> >> that be a subset of Substrait that gets mapped to a table scan (with
> >> optional filter/selection)?
> >>
> >
> > Reads are basically a Substrait read relation. Other queries like CREATE
> > TABLE, DELETE, UPDATE are passed as simple SQL or Substrait queries. And
> > then engines can use the driver as a sink (binding output data as a
> record
> > batch stream) for INSERT, OVERWRITE, and MERGE operations. Further
> details
> > are in the design doc [1].
> >
> > The audience is query engines that want to add Delta Lake support (read,
> > write, modify) without getting deep into the details of the format and
> > writer protocol. The latter is especially complex. Whereas a database
> like
> > Postgres will validate new data and handle transaction logic, in Delta
> Lake
> > that responsibility falls on each write.
> >
> > [1]
> >
> https://docs.google.com/document/d/1ud-iBPg8VVz2N3HxySz9qbrffw6a9I7TiGZJ2MBs7ZE/edit?usp=sharing
> >
> >
> > On Mon, Jan 16, 2023 at 2:26 PM David Li  wrote:
> >
> >> Exciting!
> >>
> >> You could do something like what Matt Topol's done for Go: define a
> native
> >> Go API for ADBC, then a generic adapter to wrap any Go ADBC driver as a
> C
> >> one. See [1]. As a bonus,  you can then have a more natural (and safe)
> API
> >> for implementing the actual driver, and relegate the fiddly FFI bits to
> the
> >> adapter.
> >>
> >> What do you plan for a "query" to mean to the ADBC Delta Lake driver?
> >> Would that be a subset of Substrait that gets mapped to a table scan
> (with
> >> optional filter/selection)?
> >>
> >> [1]: https://github.com/apache/arrow-adbc/pull/347
> >>
> >> On Mon, Jan 16, 2023, at 16:09, Will Jones wrote:
> >> > Andrew and David,
> >> >
> >> > I'm starting to work on the ADBC connector for Delta Lake (in the
> >> delta-rs
> >> > repo) [1], written in Rust.
> >> >
> >> > I'm thinking there's some general code I can factor out to make it
> easier
> >> > for Rust developers to create ADBC drivers. I've created an issue to
> >> track
> >> > that in the arrow-rs repo [2]. If there's anyone else planning on
> working
> >> > with ADBC in Rust, I would be happy to collaborate.
> >> >
> >> > Best,
> >> >
> >> > Will Jones
> >> >
> >> > [1] https://github.com/delta-io/delta-rs/pull/945
> >> > [2] https://github.com/apache/arrow-rs/issues/3540
> >> >
> >> > On Sun, Jan 15, 2023 at 5:33 AM Andrew Lamb 
> >> wrote

Re: [ANNOUNCE] Apache Arrow ADBC 0.1.0 Released

2023-01-16 Thread Will Jones
>
> You could do something like what Matt Topol's done for Go
>

Thank you for the link! That's very similar to what I am thinking for Rust.
I will look at that as a reference. :)

What do you plan for a "query" to mean to the ADBC Delta Lake driver? Would
> that be a subset of Substrait that gets mapped to a table scan (with
> optional filter/selection)?
>

Reads are basically a Substrait read relation. Other queries like CREATE
TABLE, DELETE, UPDATE are passed as simple SQL or Substrait queries. And
then engines can use the driver as a sink (binding output data as a record
batch stream) for INSERT, OVERWRITE, and MERGE operations. Further details
are in the design doc [1].

The audience is query engines that want to add Delta Lake support (read,
write, modify) without getting deep into the details of the format and
writer protocol. The latter is especially complex. Whereas a database like
Postgres will validate new data and handle transaction logic, in Delta Lake
that responsibility falls on each write.

[1]
https://docs.google.com/document/d/1ud-iBPg8VVz2N3HxySz9qbrffw6a9I7TiGZJ2MBs7ZE/edit?usp=sharing


On Mon, Jan 16, 2023 at 2:26 PM David Li  wrote:

> Exciting!
>
> You could do something like what Matt Topol's done for Go: define a native
> Go API for ADBC, then a generic adapter to wrap any Go ADBC driver as a C
> one. See [1]. As a bonus,  you can then have a more natural (and safe) API
> for implementing the actual driver, and relegate the fiddly FFI bits to the
> adapter.
>
> What do you plan for a "query" to mean to the ADBC Delta Lake driver?
> Would that be a subset of Substrait that gets mapped to a table scan (with
> optional filter/selection)?
>
> [1]: https://github.com/apache/arrow-adbc/pull/347
>
> On Mon, Jan 16, 2023, at 16:09, Will Jones wrote:
> > Andrew and David,
> >
> > I'm starting to work on the ADBC connector for Delta Lake (in the
> delta-rs
> > repo) [1], written in Rust.
> >
> > I'm thinking there's some general code I can factor out to make it easier
> > for Rust developers to create ADBC drivers. I've created an issue to
> track
> > that in the arrow-rs repo [2]. If there's anyone else planning on working
> > with ADBC in Rust, I would be happy to collaborate.
> >
> > Best,
> >
> > Will Jones
> >
> > [1] https://github.com/delta-io/delta-rs/pull/945
> > [2] https://github.com/apache/arrow-rs/issues/3540
> >
> > On Sun, Jan 15, 2023 at 5:33 AM Andrew Lamb 
> wrote:
> >
> >> Thanks David -- I think currently the Rust implementation of
> arrow-flight
> >> and arrow-sql are being hammered out
> >>
> >> There are several projects that are working to implement FlightSQL in
> >> various stages of completeness (I know of Ballista and IOx) and so I
> expect
> >> FlightSQL support to be better in arrow-rs over the next few months. As
> >> part of that I expect we'll be using the integration tests and
> contribute
> >> back to other implementations as needed.
> >>
> >>
> >>
> >> On Sat, Jan 14, 2023 at 9:11 AM David Li  wrote:
> >>
> >> > Thanks Andrew! Several people helped, particularly Kou, Matt, and
> Jacob,
> >> > and this release also builds heavily on the nanoarrow project that
> Dewey
> >> is
> >> > spearheading.
> >> >
> >> > I know Rust was neglected for this initial push, but I would like to
> get
> >> > around to that someday. (If you're interested, feel free to propose
> >> > something or start a discussion. My Rust is too, well, rusty to put
> >> forward
> >> > a coherent proposal at the moment.)
> >> >
> >> > -David
> >> >
> >> > On Fri, Jan 13, 2023, at 16:00, Andrew Lamb wrote:
> >> > > Thank you David and everyone else who helped make this happen --
> really
> >> > > nice work filling in the Arrow / Database integration story.
> >> > >
> >> > > Andrew
> >> > >
> >> > > On Tue, Jan 10, 2023 at 8:00 PM David Li 
> wrote:
> >> > >
> >> > >> The Apache Arrow community is pleased to announce the 0.1.0
> release of
> >> > the
> >> > >> Apache Arrow ADBC libraries. It includes 63 resolved GitHub issues
> >> > ([1]).
> >> > >>
> >> > >> The release is available now from [2] and [3].
> >> > >>
> >> > >> Release notes are available at:
> >> > >>
> >> > >>
> >> >
> >>
