Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-07 Thread Andrew Lamb
Perhaps we can create an arrow 48.1.0 patch release to include the fix?

On Tue, Nov 7, 2023 at 12:48 AM Will Jones  wrote:

> Thanks for the clarification, Raphael. That likely narrows the scope of who
> is affected. If this bug is present in DataFusion 33, then delta-rs will
> likely skip upgrading until 34. If we're the only downstream project this
> parsing issue affects, then I think it's fine to release.
>
> On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
>  wrote:
>
> > Hi,
> >
> > To further clarify, the bug concerns the serde compatibility feature that
> > allows converting a serde-compatible data structure to arrow [1]. It will
> > not impact workloads reading JSON.
> >
> > I am not sure this is a sufficiently fundamental bug to warrant special
> > concern, but happy to defer to others.
> >
> > Kind Regards,
> >
> > Raphael
> >
> > [1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility
> >
> > On 7 November 2023 03:20:59 GMT, Will Jones 
> > wrote:
> > >Hello,
> > >
> > >There is an upstream bug in arrow-json that can cause the JSON reader to
> > >return incorrect data for large integers [1]. It was recently fixed by
> > >Raphael within the last 24 hours, but is not included in any release. The
> > >bug was introduced in Arrow 48, which this DataFusion release will expose
> > >users to.
> > >
> > >Not sure what the precedent here is, but I think we should consider
> > >either (a) seeing if we can release and upgrade Arrow to include the fix,
> > >or else (b) calling out the regression as a known bug so downstream
> > >projects can include the patch in their applications.
> > >
> > >Best,
> > >
> > >Will Jones
> > >
> > >[1] https://github.com/apache/arrow-rs/issues/5038
> > >[2] https://github.com/apache/arrow-rs/pull/5042
> > >
> > >On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb 
> wrote:
> > >
> > >> +1 (the tests passed for me). I have left a comment on
> > >> https://github.com/apache/arrow-datafusion/issues/8069
> > >>
> > >> On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 
> > wrote:
> > >>
> > >> > I filed https://github.com/apache/arrow-datafusion/issues/8069
> > >> >
> > >> > On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 
> > >> wrote:
> > >> >
> > >> > > I see the same error when I run on my M1 Macbook Air with 16 GB
> RAM.
> > >> > >
> > >> > >  aggregates::tests::run_first_last_multi_partitions stdout
> 
> > >> > > Error: ResourcesExhausted("Failed to allocate additional 632 bytes
> > for
> > >> > > GroupedHashAggregateStream[0] with 1829 bytes already allocated -
> > >> maximum
> > >> > > available is 605")
> > >> > >
> > >> > > It worked fine on my workstation with 128 GB RAM.
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 
> > wrote:
> > >> > >
> > >> > >> Hmm, ran verification script and got one failure:
> > >> > >>
> > >> > >> failures:
> > >> > >>
> > >> > >>  aggregates::tests::run_first_last_multi_partitions stdout
> 
> > >> > >> Error: ResourcesExhausted("Failed to allocate additional 632
> bytes
> > for
> > >> > >> GroupedHashAggregateStream[0] with 1829 bytes already allocated -
> > >> > >> maximum available is 605")
> > >> > >>
> > >> > >> failures:
> > >> > >> aggregates::tests::run_first_last_multi_partitions
> > >> > >>
> > >> > >> test result: FAILED. 557 passed; 1 failed; 1 ignored; 0
> measured; 0
> > >> > >> filtered out; finished in 2.21s
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> On Mon, Nov 6, 2023 at 6:57 AM Andy Grove  >
> > >> > wrote:
> > >> > >> >
> > >> > >> > Hi,
> > >> > >> >
> > >> > >> > I would like to propose a release of Apache Arrow DataFusion
> > >> > >> Implementation,
> > >> > >> > version 33.0.0.
> > >> > >> >
> > >> > >> > This release candidate is based on commit:
> > >> > >> > 262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
> > >> > >> > The proposed release tarball and signatures are hosted at [2].
> > >> > >> > The changelog is located at [3].
> > >> > >> >
> > >> > >> > Please download, verify checksums and signatures, run the unit
> > >> tests,
> > >> > >> and
> > >> > >> > vote
> > >> > >> > on the release. The vote will be open for at least 72 hours.
> > >> > >> >
> > >> > >> > Only votes from PMC members are binding, but all members of the
> > >> > >> community
> > >> > >> > are
> > >> > >> > encouraged to test the release and vote with "(non-binding)".
> > >> > >> >
> > >> > >> > The standard verification procedure is documented at
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > >> > >> > .
> > >> > >> >
> > >> > >> > [ ] +1 Release this as Apache Arrow DataFusion 33.0.0
> > >> > >> > [ ] +0
> > >> > >> > [ ] -1 Do not release this as Apache Arrow DataFusion 33.0.0
> > >> > because...
> > >> > >> >
> > >> > >> > Here is my vote:
> > >> > >> >
> > >> > >> > +1
> > >> > >> >
> > >> > >> > [1]:
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> https://github.c

Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-07 Thread Raphael Taylor-Davies
I intend to cut a new arrow release later this week; I would prefer we
wait for this.


On 07/11/2023 11:39, Andrew Lamb wrote:

Perhaps we can create an arrow 48.1.0 patch release to include the fix?

On Tue, Nov 7, 2023 at 12:48 AM Will Jones  wrote:


Thanks for the clarification, Raphael. That likely narrows the scope of who
is affected. If this bug is present in DataFusion 33, then delta-rs will
likely skip upgrading until 34. If we're the only downstream project this
parsing issue affects, then I think it's fine to release.

On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
 wrote:


Hi,

To further clarify, the bug concerns the serde compatibility feature that
allows converting a serde-compatible data structure to arrow [1]. It will
not impact workloads reading JSON.

I am not sure this is a sufficiently fundamental bug to warrant special
concern, but happy to defer to others.

Kind Regards,

Raphael

[1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility
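
For readers less familiar with that feature, here is a minimal sketch of the
conversion path being described, assuming the Decoder::serialize API documented
at the link above; the struct, schema, and values are illustrative and not
taken from the issue:

    use std::sync::Arc;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::json::ReaderBuilder;
    use arrow::record_batch::RecordBatch;
    use serde::Serialize;

    #[derive(Serialize)]
    struct Row {
        // Large integer values were reported to come through incorrectly
        // (apache/arrow-rs#5038).
        value: i64,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![Field::new(
            "value",
            DataType::Int64,
            false,
        )]));
        // Serde compatibility path: serialize Rust values directly to Arrow
        // without ever producing or parsing JSON text.
        let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
        decoder.serialize(&[Row { value: i64::MAX }, Row { value: -1 }])?;
        let batch: RecordBatch = decoder.flush()?.expect("one batch");
        println!("{:?}", batch.column(0));
        Ok(())
    }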

On 7 November 2023 03:20:59 GMT, Will Jones 
wrote:

Hello,

There is an upstream bug in arrow-json that can cause the JSON reader to
return incorrect data for large integers [1]. It was recently fixed by
Raphael within the last 24 hours, but is not included in any release. The
bug was introduced in Arrow 48, which this DataFusion release will expose
users to.

Not sure what the precedent here is, but I think we should consider
either (a) seeing if we can release and upgrade Arrow to include the fix,
or else (b) calling out the regression as a known bug so downstream
projects can include the patch in their applications.

Best,

Will Jones

[1] https://github.com/apache/arrow-rs/issues/5038
[2] https://github.com/apache/arrow-rs/pull/5042

On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb 

wrote:

+1 (the tests passed for me). I have left a comment on
https://github.com/apache/arrow-datafusion/issues/8069

On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 

wrote:

I filed https://github.com/apache/arrow-datafusion/issues/8069

On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 

wrote:

I see the same error when I run on my M1 Macbook Air with 16 GB RAM.

 aggregates::tests::run_first_last_multi_partitions stdout

Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
GroupedHashAggregateStream[0] with 1829 bytes already allocated -
maximum available is 605")

It worked fine on my workstation with 128 GB RAM.



On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 

wrote:

Hmm, ran verification script and got one failure:

failures:

 aggregates::tests::run_first_last_multi_partitions stdout

Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
GroupedHashAggregateStream[0] with 1829 bytes already allocated -
maximum available is 605")

failures:
 aggregates::tests::run_first_last_multi_partitions

test result: FAILED. 557 passed; 1 failed; 1 ignored; 0 measured; 0
filtered out; finished in 2.21s



On Mon, Nov 6, 2023 at 6:57 AM Andy Grove 
wrote:

Hi,

I would like to propose a release of Apache Arrow DataFusion

Implementation,

version 33.0.0.

This release candidate is based on commit:
262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
The proposed release tarball and signatures are hosted at [2].
The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests, and
vote on the release. The vote will be open for at least 72 hours.

Only votes from PMC members are binding, but all members of the community
are encouraged to test the release and vote with "(non-binding)".

The standard verification procedure is documented at


https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates

.

[ ] +1 Release this as Apache Arrow DataFusion 33.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow DataFusion 33.0.0

because...

Here is my vote:

+1

[1]:


https://github.com/apache/arrow-datafusion/tree/262f08778b8ec231d96792c01fc3e051640eb5d4

[2]:


https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-33.0.0-rc1

[3]:


https://github.com/apache/arrow-datafusion/blob/262f08778b8ec231d96792c01fc3e051640eb5d4/CHANGELOG.md


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-07 Thread Andrew Lamb
If the release later in the week doesn't have any breaking API changes,
perhaps it can be 48.1.0 (and thus also get the bugfix to datafusion)

On Tue, Nov 7, 2023 at 6:41 AM Raphael Taylor-Davies
 wrote:

> I intend to cut a new arrow release later this week, I would prefer we
> wait for this.
>
> On 07/11/2023 11:39, Andrew Lamb wrote:
> > Perhaps we can create an arrow 48.1.0 patch release to include the fix?
> >
> > On Tue, Nov 7, 2023 at 12:48 AM Will Jones 
> wrote:
> >
> >> Thanks for the clarification, Raphael. That likely narrows the scope of
> who
> >> is affected. If this bug is present in DataFusion 33, then delta-rs will
> >> likely skip upgrading until 34. If we're the only downstream project
> this
> >> parsing issue affects, then I think it's fine to release.
> >>
> >> On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
> >>  wrote:
> >>
> >>> Hi,
> >>>
> >>> To further clarify the bug concerns the serde compatibility feature
> that
> >>> allows converting a serde compatible data structure to arrow [1]. It
> will
> >>> not impact workloads reading JSON.
> >>>
> >>> I am not sure this is a sufficiently fundamental bug to warrant special
> >>> concern, but happy to defer to others.
> >>>
> >>> Kind Regards,
> >>>
> >>> Raphael
> >>>
> >>> [1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility
> >>>
> >>> On 7 November 2023 03:20:59 GMT, Will Jones 
> >>> wrote:
>  Hello,
> 
>  There is an upstream bug in arrow-json that can cause the JSON reader
> to
>  return incorrect data for large integers [1]. It was recently fixed by
>  Raphael within the last 24 hours, but is not included in any release.
> >> The
>  bug was introduced in Arrow 48, which this DataFusion release will
> >> expose
>  users to.
> 
>  Not sure what the precedent here is, but I think either we should
> >> consider
>  either (a) seeing if we can release and upgrade Arrow to include the
> >> fix,
>  or else (b) calling out the regression as a known bug so downstream
>  projects can include the path in their applications.
> 
>  Best,
> 
>  Will Jones
> 
>  [1] https://github.com/apache/arrow-rs/issues/5038
>  [2] https://github.com/apache/arrow-rs/pull/5042
> 
>  On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb 
> >> wrote:
> > +1 (the tests passed for me). I have left a comment on
> > https://github.com/apache/arrow-datafusion/issues/8069
> >
> > On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 
> >>> wrote:
> >> I filed https://github.com/apache/arrow-datafusion/issues/8069
> >>
> >> On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 
> > wrote:
> >>> I see the same error when I run on my M1 Macbook Air with 16 GB
> >> RAM.
> >>>  aggregates::tests::run_first_last_multi_partitions stdout
> >> 
> >>> Error: ResourcesExhausted("Failed to allocate additional 632 bytes
> >>> for
> >>> GroupedHashAggregateStream[0] with 1829 bytes already allocated -
> > maximum
> >>> available is 605")
> >>>
> >>> It worked fine on my workstation with 128 GB RAM.
> >>>
> >>>
> >>>
> >>> On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 
> >>> wrote:
>  Hmm, ran verification script and got one failure:
> 
>  failures:
> 
>   aggregates::tests::run_first_last_multi_partitions stdout
> >> 
>  Error: ResourcesExhausted("Failed to allocate additional 632
> >> bytes
> >>> for
>  GroupedHashAggregateStream[0] with 1829 bytes already allocated -
>  maximum available is 605")
> 
>  failures:
>   aggregates::tests::run_first_last_multi_partitions
> 
>  test result: FAILED. 557 passed; 1 failed; 1 ignored; 0
> >> measured; 0
>  filtered out; finished in 2.21s
> 
> 
> 
>  On Mon, Nov 6, 2023 at 6:57 AM Andy Grove  >> wrote:
> > Hi,
> >
> > I would like to propose a release of Apache Arrow DataFusion
>  Implementation,
> > version 33.0.0.
> >
> > This release candidate is based on commit:
> > 262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
> > The proposed release tarball and signatures are hosted at [2].
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit
> > tests,
>  and
> > vote
> > on the release. The vote will be open for at least 72 hours.
> >
> > Only votes from PMC members are binding, but all members of the
>  community
> > are
> > encouraged to test the release and vote with "(non-binding)".
> >
> > The standard verification procedure is documented at
> >
> >>
> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
> > .
> >
> > [ ] +

Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-07 Thread Raphael Taylor-Davies

It will contain breaking dependency updates, including object_store.

I hope to cut it today.

On 07/11/2023 11:43, Andrew Lamb wrote:

If the release later in the week doesn't have any breaking API changes,
perhaps it can be 48.1.0 (and thus also get the bugfix to datafusion)

On Tue, Nov 7, 2023 at 6:41 AM Raphael Taylor-Davies
 wrote:


I intend to cut a new arrow release later this week, I would prefer we
wait for this.

On 07/11/2023 11:39, Andrew Lamb wrote:

Perhaps we can create an arrow 48.1.0 patch release to include the fix?

On Tue, Nov 7, 2023 at 12:48 AM Will Jones 

wrote:

Thanks for the clarification, Raphael. That likely narrows the scope of

who

is affected. If this bug is present in DataFusion 33, then delta-rs will
likely skip upgrading until 34. If we're the only downstream project

this

parsing issue affects, then I think it's fine to release.

On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
 wrote:


Hi,

To further clarify the bug concerns the serde compatibility feature

that

allows converting a serde compatible data structure to arrow [1]. It

will

not impact workloads reading JSON.

I am not sure this is a sufficiently fundamental bug to warrant special
concern, but happy to defer to others.

Kind Regards,

Raphael

[1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility

On 7 November 2023 03:20:59 GMT, Will Jones 
wrote:

Hello,

There is an upstream bug in arrow-json that can cause the JSON reader

to

return incorrect data for large integers [1]. It was recently fixed by
Raphael within the last 24 hours, but is not included in any release.

The

bug was introduced in Arrow 48, which this DataFusion release will

expose

users to.

Not sure what the precedent here is, but I think either we should

consider

either (a) seeing if we can release and upgrade Arrow to include the

fix,

or else (b) calling out the regression as a known bug so downstream
projects can include the path in their applications.

Best,

Will Jones

[1] https://github.com/apache/arrow-rs/issues/5038
[2] https://github.com/apache/arrow-rs/pull/5042

On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb 

wrote:

+1 (the tests passed for me). I have left a comment on
https://github.com/apache/arrow-datafusion/issues/8069

On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 

wrote:

I filed https://github.com/apache/arrow-datafusion/issues/8069

On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 

wrote:

I see the same error when I run on my M1 Macbook Air with 16 GB

RAM.

 aggregates::tests::run_first_last_multi_partitions stdout



Error: ResourcesExhausted("Failed to allocate additional 632 bytes

for

GroupedHashAggregateStream[0] with 1829 bytes already allocated -

maximum

available is 605")

It worked fine on my workstation with 128 GB RAM.



On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 

wrote:

Hmm, ran verification script and got one failure:

failures:

 aggregates::tests::run_first_last_multi_partitions stdout



Error: ResourcesExhausted("Failed to allocate additional 632

bytes

for

GroupedHashAggregateStream[0] with 1829 bytes already allocated -
maximum available is 605")

failures:
  aggregates::tests::run_first_last_multi_partitions

test result: FAILED. 557 passed; 1 failed; 1 ignored; 0

measured; 0

filtered out; finished in 2.21s



On Mon, Nov 6, 2023 at 6:57 AM Andy Grove 
wrote:

Hi,

I would like to propose a release of Apache Arrow DataFusion

Implementation,

version 33.0.0.

This release candidate is based on commit:
262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
The proposed release tarball and signatures are hosted at [2].
The changelog is located at [3].

Please download, verify checksums and signatures, run the unit

tests,

and

vote
on the release. The vote will be open for at least 72 hours.

Only votes from PMC members are binding, but all members of the

community

are
encouraged to test the release and vote with "(non-binding)".

The standard verification procedure is documented at


https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates

.

[ ] +1 Release this as Apache Arrow DataFusion 33.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow DataFusion 33.0.0

because...

Here is my vote:

+1

[1]:


https://github.com/apache/arrow-datafusion/tree/262f08778b8ec231d96792c01fc3e051640eb5d4

[2]:


https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-33.0.0-rc1

[3]:


https://github.com/apache/arrow-datafusion/blob/262f08778b8ec231d96792c01fc3e051640eb5d4/CHANGELOG.md



Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-07 Thread Andrew Lamb
I think we should create a ticket to discuss releasing 48.1.0 (in addition
to 49.0.0) -- I can do so later today if no one beats me to it

On Tue, Nov 7, 2023 at 6:47 AM Raphael Taylor-Davies
 wrote:

> It will contain breaking dependency updates, including object_store.
>
> I hope to cut it today.
>
> On 07/11/2023 11:43, Andrew Lamb wrote:
> > If the release later in the week doesn't have any breaking API changes,
> > perhaps it can be 48.1.0 (and thus also get the bugfix to datafusion)
> >
> > On Tue, Nov 7, 2023 at 6:41 AM Raphael Taylor-Davies
> >  wrote:
> >
> >> I intend to cut a new arrow release later this week, I would prefer we
> >> wait for this.
> >>
> >> On 07/11/2023 11:39, Andrew Lamb wrote:
> >>> Perhaps we can create an arrow 48.1.0 patch release to include the fix?
> >>>
> >>> On Tue, Nov 7, 2023 at 12:48 AM Will Jones 
> >> wrote:
>  Thanks for the clarification, Raphael. That likely narrows the scope
> of
> >> who
>  is affected. If this bug is present in DataFusion 33, then delta-rs
> will
>  likely skip upgrading until 34. If we're the only downstream project
> >> this
>  parsing issue affects, then I think it's fine to release.
> 
>  On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
>   wrote:
> 
> > Hi,
> >
> > To further clarify the bug concerns the serde compatibility feature
> >> that
> > allows converting a serde compatible data structure to arrow [1]. It
> >> will
> > not impact workloads reading JSON.
> >
> > I am not sure this is a sufficiently fundamental bug to warrant
> special
> > concern, but happy to defer to others.
> >
> > Kind Regards,
> >
> > Raphael
> >
> > [1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility
> >
> > On 7 November 2023 03:20:59 GMT, Will Jones  >
> > wrote:
> >> Hello,
> >>
> >> There is an upstream bug in arrow-json that can cause the JSON
> reader
> >> to
> >> return incorrect data for large integers [1]. It was recently fixed
> by
> >> Raphael within the last 24 hours, but is not included in any
> release.
>  The
> >> bug was introduced in Arrow 48, which this DataFusion release will
>  expose
> >> users to.
> >>
> >> Not sure what the precedent here is, but I think either we should
>  consider
> >> either (a) seeing if we can release and upgrade Arrow to include the
>  fix,
> >> or else (b) calling out the regression as a known bug so downstream
> >> projects can include the path in their applications.
> >>
> >> Best,
> >>
> >> Will Jones
> >>
> >> [1] https://github.com/apache/arrow-rs/issues/5038
> >> [2] https://github.com/apache/arrow-rs/pull/5042
> >>
> >> On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb 
>  wrote:
> >>> +1 (the tests passed for me). I have left a comment on
> >>> https://github.com/apache/arrow-datafusion/issues/8069
> >>>
> >>> On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 
> > wrote:
>  I filed https://github.com/apache/arrow-datafusion/issues/8069
> 
>  On Mon, Nov 6, 2023 at 11:59 AM Andy Grove  >
> >>> wrote:
> > I see the same error when I run on my M1 Macbook Air with 16 GB
>  RAM.
> >  aggregates::tests::run_first_last_multi_partitions stdout
>  
> > Error: ResourcesExhausted("Failed to allocate additional 632
> bytes
> > for
> > GroupedHashAggregateStream[0] with 1829 bytes already allocated -
> >>> maximum
> > available is 605")
> >
> > It worked fine on my workstation with 128 GB RAM.
> >
> >
> >
> > On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 
> > wrote:
> >> Hmm, ran verification script and got one failure:
> >>
> >> failures:
> >>
> >>  aggregates::tests::run_first_last_multi_partitions stdout
>  
> >> Error: ResourcesExhausted("Failed to allocate additional 632
>  bytes
> > for
> >> GroupedHashAggregateStream[0] with 1829 bytes already allocated
> -
> >> maximum available is 605")
> >>
> >> failures:
> >>   aggregates::tests::run_first_last_multi_partitions
> >>
> >> test result: FAILED. 557 passed; 1 failed; 1 ignored; 0
>  measured; 0
> >> filtered out; finished in 2.21s
> >>
> >>
> >>
> >> On Mon, Nov 6, 2023 at 6:57 AM Andy Grove <
> andygrov...@gmail.com
>  wrote:
> >>> Hi,
> >>>
> >>> I would like to propose a release of Apache Arrow DataFusion
> >> Implementation,
> >>> version 33.0.0.
> >>>
> >>> This release candidate is based on commit:
> >>> 262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
> >>> The proposed release tarball and signatures are hosted at [2].
> >>> The changelog is l

Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-07 Thread Andrew Lamb
I filed  https://github.com/apache/arrow-rs/issues/5050 to discuss some
different possibilities (for example, do an arrow patch release or hold the
datafusion release)

Please share your thoughts on the ticket

On Tue, Nov 7, 2023 at 6:51 AM Andrew Lamb  wrote:

> I think we should create a ticket to discuss releasing 48.1.0 (in addition
> to 49.0.0) -- I can do so later today if no one beats me to it
>
> On Tue, Nov 7, 2023 at 6:47 AM Raphael Taylor-Davies
>  wrote:
>
>> It will contain breaking dependency updates, including object_store.
>>
>> I hope to cut it today.
>>
>> On 07/11/2023 11:43, Andrew Lamb wrote:
>> > If the release later in the week doesn't have any breaking API changes,
>> > perhaps it can be 48.1.0 (and thus also get the bugfix to datafusion)
>> >
>> > On Tue, Nov 7, 2023 at 6:41 AM Raphael Taylor-Davies
>> >  wrote:
>> >
>> >> I intend to cut a new arrow release later this week, I would prefer we
>> >> wait for this.
>> >>
>> >> On 07/11/2023 11:39, Andrew Lamb wrote:
>> >>> Perhaps we can create an arrow 48.1.0 patch release to include the
>> fix?
>> >>>
>> >>> On Tue, Nov 7, 2023 at 12:48 AM Will Jones 
>> >> wrote:
>>  Thanks for the clarification, Raphael. That likely narrows the scope
>> of
>> >> who
>>  is affected. If this bug is present in DataFusion 33, then delta-rs
>> will
>>  likely skip upgrading until 34. If we're the only downstream project
>> >> this
>>  parsing issue affects, then I think it's fine to release.
>> 
>>  On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
>>   wrote:
>> 
>> > Hi,
>> >
>> > To further clarify the bug concerns the serde compatibility feature
>> >> that
>> > allows converting a serde compatible data structure to arrow [1]. It
>> >> will
>> > not impact workloads reading JSON.
>> >
>> > I am not sure this is a sufficiently fundamental bug to warrant
>> special
>> > concern, but happy to defer to others.
>> >
>> > Kind Regards,
>> >
>> > Raphael
>> >
>> > [1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility
>> >
>> > On 7 November 2023 03:20:59 GMT, Will Jones <
>> will.jones...@gmail.com>
>> > wrote:
>> >> Hello,
>> >>
>> >> There is an upstream bug in arrow-json that can cause the JSON
>> reader
>> >> to
>> >> return incorrect data for large integers [1]. It was recently
>> fixed by
>> >> Raphael within the last 24 hours, but is not included in any
>> release.
>>  The
>> >> bug was introduced in Arrow 48, which this DataFusion release will
>>  expose
>> >> users to.
>> >>
>> >> Not sure what the precedent here is, but I think either we should
>>  consider
>> >> either (a) seeing if we can release and upgrade Arrow to include
>> the
>>  fix,
>> >> or else (b) calling out the regression as a known bug so downstream
>> >> projects can include the path in their applications.
>> >>
>> >> Best,
>> >>
>> >> Will Jones
>> >>
>> >> [1] https://github.com/apache/arrow-rs/issues/5038
>> >> [2] https://github.com/apache/arrow-rs/pull/5042
>> >>
>> >> On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb 
>>  wrote:
>> >>> +1 (the tests passed for me). I have left a comment on
>> >>> https://github.com/apache/arrow-datafusion/issues/8069
>> >>>
>> >>> On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 
>> > wrote:
>>  I filed https://github.com/apache/arrow-datafusion/issues/8069
>> 
>>  On Mon, Nov 6, 2023 at 11:59 AM Andy Grove <
>> andygrov...@gmail.com>
>> >>> wrote:
>> > I see the same error when I run on my M1 Macbook Air with 16 GB
>>  RAM.
>> >  aggregates::tests::run_first_last_multi_partitions stdout
>>  
>> > Error: ResourcesExhausted("Failed to allocate additional 632
>> bytes
>> > for
>> > GroupedHashAggregateStream[0] with 1829 bytes already allocated
>> -
>> >>> maximum
>> > available is 605")
>> >
>> > It worked fine on my workstation with 128 GB RAM.
>> >
>> >
>> >
>> > On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 
>> > wrote:
>> >> Hmm, ran verification script and got one failure:
>> >>
>> >> failures:
>> >>
>> >>  aggregates::tests::run_first_last_multi_partitions stdout
>>  
>> >> Error: ResourcesExhausted("Failed to allocate additional 632
>>  bytes
>> > for
>> >> GroupedHashAggregateStream[0] with 1829 bytes already
>> allocated -
>> >> maximum available is 605")
>> >>
>> >> failures:
>> >>   aggregates::tests::run_first_last_multi_partitions
>> >>
>> >> test result: FAILED. 557 passed; 1 failed; 1 ignored; 0
>>  measured; 0
>> >> filtered out; finished in 2.21s
>> >>
>> >>
>> >>
>> >> On Mon, Nov 6, 2023 at 6:57

Re: [VOTE] Release Apache Arrow ADBC 0.8.0 - RC0

2023-11-07 Thread David Li
Thanks for fixing those, Kou!

On Mon, Nov 6, 2023, at 22:02, Sutou Kouhei wrote:
> +1
>
> I ran the following on Debian GNU/Linux sid:
>
>   JAVA_HOME=/usr/lib/jvm/default-java \
> dev/release/verify-release-candidate.sh 0.8.0 0
>
> with:
>
>   * g++ (Debian 13.2.0-4) 13.2.0
>   * go version go1.21.0 linux/amd64
>   * openjdk version "17.0.9-ea" 2023-10-17
>   * Python 3.11.5
>   * ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
>   * R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
>
>
> Note that I need the following fixes:
>
>   * https://github.com/apache/arrow-adbc/pull/1264
>   * https://github.com/apache/arrow-adbc/pull/1266
>
> but they are verification script problems. It's not a
> problem of ADBC 0.8.0.
>
>
> Thanks,
> -- 
> kou
>
>
> In 
>   "[VOTE] Release Apache Arrow ADBC 0.8.0 - RC0" on Fri, 03 Nov 2023 
> 11:18:13 -0400,
>   "David Li"  wrote:
>
>> Hello,
>> 
>> I would like to propose the following release candidate (RC0) of Apache 
>> Arrow ADBC version 0.8.0. This is a release consisting of 42 resolved GitHub 
>> issues [1].
>> 
>> This release candidate is based on commit: 
>> 95f13231f49494bcf78df45de1f65aa25620981b [2]
>> 
>> The source release rc0 is hosted at [3].
>> The binary artifacts are hosted at [4][5][6][7][8].
>> The changelog is located at [9].
>> 
>> Please download, verify checksums and signatures, run the unit tests, and 
>> vote on the release. See [10] for how to validate a release candidate.
>> 
>> See also a verification result on GitHub Actions [11].
>> 
>> The vote will be open for at least 72 hours.
>> 
>> [ ] +1 Release this as Apache Arrow ADBC 0.8.0
>> [ ] +0
>> [ ] -1 Do not release this as Apache Arrow ADBC 0.8.0 because...
>> 
>> Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
>> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export 
>> TEST_APT=0 TEST_YUM=0`.)
>> 
>> Note: it is not currently possible to verify with Conda and Python 3.12 
>> (some test dependencies do not yet have a Python 3.12 build available). The 
>> verification script defaults to Python 3.11. Binary artifacts are available 
>> for 3.12.
>> 
>> [1]: 
>> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.8.0%22+is%3Aclosed
>> [2]: 
>> https://github.com/apache/arrow-adbc/commit/95f13231f49494bcf78df45de1f65aa25620981b
>> [3]: 
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.8.0-rc0/
>> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
>> [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
>> [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
>> [7]: 
>> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
>> [8]: 
>> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.8.0-rc0
>> [9]: 
>> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.8.0-rc0/CHANGELOG.md
>> [10]: 
>> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
>> [11]: https://github.com/apache/arrow-adbc/actions/runs/6746653191


Re: [VOTE] Release Apache Arrow ADBC 0.8.0 - RC0

2023-11-07 Thread Dewey Dunnington
+1!

I ran: TEST_APT=0 TEST_YUM=0 USE_CONDA=1
dev/release/verify-release-candidate.sh 0.8.0 0

On Fri, Nov 3, 2023 at 12:18 PM David Li  wrote:
>
> Hello,
>
> I would like to propose the following release candidate (RC0) of Apache Arrow 
> ADBC version 0.8.0. This is a release consisting of 42 resolved GitHub issues 
> [1].
>
> This release candidate is based on commit: 
> 95f13231f49494bcf78df45de1f65aa25620981b [2]
>
> The source release rc0 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8].
> The changelog is located at [9].
>
> Please download, verify checksums and signatures, run the unit tests, and 
> vote on the release. See [10] for how to validate a release candidate.
>
> See also a verification result on GitHub Actions [11].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow ADBC 0.8.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow ADBC 0.8.0 because...
>
> Note: to verify APT/YUM packages on macOS/AArch64, you must `export 
> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export 
> TEST_APT=0 TEST_YUM=0`.)
>
> Note: it is not currently possible to verify with Conda and Python 3.12 (some 
> test dependencies do not yet have a Python 3.12 build available). The 
> verification script defaults to Python 3.11. Binary artifacts are available 
> for 3.12.
>
> [1]: 
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.8.0%22+is%3Aclosed
> [2]: 
> https://github.com/apache/arrow-adbc/commit/95f13231f49494bcf78df45de1f65aa25620981b
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.8.0-rc0/
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [7]: 
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> [8]: 
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.8.0-rc0
> [9]: 
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.8.0-rc0/CHANGELOG.md
> [10]: 
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> [11]: https://github.com/apache/arrow-adbc/actions/runs/6746653191


Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-07 Thread Will Jones
I agree with the approach originally proposed by Ben. It seems like the
most straightforward way to implement within the current protocol.

On Sun, Oct 29, 2023 at 4:59 PM Dewey Dunnington
 wrote:

> In the absence of a general solution to the C data interface omitting
> buffer sizes, I think the original proposal is the best way
> forward...this is the first type to be added whose buffer sizes cannot
> be calculated without looping over every element of the array; the
> buffer sizes are needed to efficiently serialize the imported array to
> IPC if imported by a consumer that cares about buffer sizes.
>
> Using a schema's flags to indicate something about a specific paired
> array (particularly one that, if misinterpreted, would lead to a
> crash) is a precedent that is probably not worth introducing for just
> one type. Currently a schema is completely independent of any
> particular ArrowArray, and I think that is a feature that is worth
> preserving. My gripes about not having buffer sizes on the CPU to more
> efficiently copy between devices is a concept almost certainly better
> suited to the ArrowDeviceArray struct.
>
> On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman 
> wrote:
> >
> > > This begs the question of what happens if a consumer receives an
> unknown
> > > flag value.
> >
> > It seems to me that ignoring unknown flags is the primary case to
> consider
> > at
> > this point, since consumers may ignore unknown flags. Since that is the
> > case,
> > it seems adding any flag which would break such a consumer would be
> > tantamount to an ABI breakage. I don't think this can be averted unless
> all
> > consumers are required to error out on unknown flag values.
> >
> > In the specific case of Utf8View it seems certain that consumers would
> add
> > support for the buffer sizes flag simultaneously with adding support for
> the
> > new type (since Utf8View is difficult to import otherwise), so any
> consumer
> > which would error out on the new flag would already be erroring out on an
> > unsupported data type.
> >
> > > I might be the only person who has implemented
> > > a deep copy of an ArrowSchema in C, but it does blindly pass along a
> > > schema's flag value
> >
> > I think passing a schema's flag value including unknown flags is an
> error.
> > The ABI defines moving structures but does not define deep copying. I
> think
> > in order to copy deeply in terms of operations which *are* specified: we
> > import then export the schema. Since this includes an export step, it
> > should not
> > include flags which are not supported by the exporter.
> >
> > On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > > >> Is this buffer lengths buffer only present if the array type is
> > > Utf8View?
> > > >
> > > > IIUC, the proposal would add the buffer lengths buffer for all types
> if
> > > the
> > > > schema's
> > > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to
> avoid
> > > > the special case and that `n_buffers` would continue to be consistent
> > > with
> > > > IPC.
> > >
> > > This begs the question of what happens if a consumer receives an
> unknown
> > > flag value. We haven't specified that unknown flag values should be
> > > ignored, so a consumer could judiciously choose to error out instead of
> > > potentially misinterpreting the data.
> > >
> > > All in all, personally I'd rather we make a special case for Utf8View
> > > instead of adding a flag that can lead to worse interoperability.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
>
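
To make the buffer-size point above concrete: for Utf8View, the minimal size of
each variadic data buffer can only be recovered by visiting every view, since
each 16-byte view either inlines a short string or points into one of several
data buffers. A sketch under that layout follows; the field names are
illustrative and not the arrow-rs API:

    /// One 16-byte Utf8View view element in its long-string form; strings of
    /// 12 bytes or fewer instead store their data inline in the last 12 bytes.
    /// Field names are illustrative, not taken from any implementation.
    #[derive(Clone, Copy)]
    struct StringView {
        length: i32,
        prefix: [u8; 4],   // first four bytes of the string
        buffer_index: i32, // which variadic data buffer holds the rest
        offset: i32,       // byte offset into that buffer
    }

    /// Minimal required size of each variadic data buffer. Without buffer
    /// sizes in the C data interface, a consumer has to do this scan over
    /// every element to recover them.
    fn minimal_buffer_sizes(views: &[StringView], n_data_buffers: usize) -> Vec<i64> {
        let mut sizes = vec![0i64; n_data_buffers];
        for v in views {
            if v.length > 12 {
                let end = v.offset as i64 + v.length as i64;
                if end > sizes[v.buffer_index as usize] {
                    sizes[v.buffer_index as usize] = end;
                }
            }
        }
        sizes
    }

    fn main() {
        // Two long strings stored in data buffer 0: bytes [0, 20) and [20, 50).
        let views = [
            StringView { length: 20, prefix: *b"exam", buffer_index: 0, offset: 0 },
            StringView { length: 30, prefix: *b"anot", buffer_index: 0, offset: 20 },
        ];
        assert_eq!(minimal_buffer_sizes(&views, 1), vec![50]);
    }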


Re: [VOTE] Release Apache Arrow ADBC 0.8.0 - RC0

2023-11-07 Thread Dane Pitkin
+1 (non-binding)

Verified on M1 MacOS 13 with:

USE_CONDA=1 TEST_YUM=0 TEST_APT=0 ./dev/release/verify-release-candidate.sh
0.8.0 0

On Tue, Nov 7, 2023 at 9:10 AM Dewey Dunnington
 wrote:

> +1!
>
> I ran: TEST_APT=0 TEST_YUM=0 USE_CONDA=1
> dev/release/verify-release-candidate.sh 0.8.0 0
>
> On Fri, Nov 3, 2023 at 12:18 PM David Li  wrote:
> >
> > Hello,
> >
> > I would like to propose the following release candidate (RC0) of Apache
> Arrow ADBC version 0.8.0. This is a release consisting of 42 resolved
> GitHub issues [1].
> >
> > This release candidate is based on commit:
> 95f13231f49494bcf78df45de1f65aa25620981b [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8].
> > The changelog is located at [9].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [10] for how to validate a release candidate.
> >
> > See also a verification result on GitHub Actions [11].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow ADBC 0.8.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow ADBC 0.8.0 because...
> >
> > Note: to verify APT/YUM packages on macOS/AArch64, you must `export
> DOCKER_DEFAULT_PLATFORM=linux/amd64`. (Or skip this step by `export
> TEST_APT=0 TEST_YUM=0`.)
> >
> > Note: it is not currently possible to verify with Conda and Python 3.12
> (some test dependencies do not yet have a Python 3.12 build available). The
> verification script defaults to Python 3.11. Binary artifacts are available
> for 3.12.
> >
> > [1]:
> https://github.com/apache/arrow-adbc/issues?q=is%3Aissue+milestone%3A%22ADBC+Libraries+0.8.0%22+is%3Aclosed
> > [2]:
> https://github.com/apache/arrow-adbc/commit/95f13231f49494bcf78df45de1f65aa25620981b
> > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-adbc-0.8.0-rc0/
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [7]:
> https://repository.apache.org/content/repositories/staging/org/apache/arrow/adbc/
> > [8]:
> https://github.com/apache/arrow-adbc/releases/tag/apache-arrow-adbc-0.8.0-rc0
> > [9]:
> https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.8.0-rc0/CHANGELOG.md
> > [10]:
> https://arrow.apache.org/adbc/main/development/releasing.html#how-to-verify-release-candidates
> > [11]: https://github.com/apache/arrow-adbc/actions/runs/6746653191
>


[DISCUSS][MATLAB] Proposal for incremental point releases of the MATLAB interface

2023-11-07 Thread Kevin Gurney
Hi All,

A considerable amount of new functionality has been added to the MATLAB 
interface over the last few months. We appreciate all the community's support 
in making this possible and are happy to see all the progress that is being 
made.

At this point, we would like to create an initial "0.1" release of the MATLAB 
interface. Incremental point releases will enable MATLAB users to provide early 
feedback. In addition, learning how to navigate the release process is an 
important step towards eventually releasing a stable 1.0 version of the MATLAB 
interface.

Our proposed approach to creating an initial release would be to:

1. Manually build the MATLAB interface on Windows, macOS, and Linux
2. Combine all of the cross platform build artifacts into a single MLTBX file 
[1] for distribution
3. Host the MLTBX somewhere that is easily accessible for download

For reference - MLTBX is a standard packaging format for MATLAB which enables 
simple "one-click" installation - analogous to a Python pip package or a Ruby 
gem.

Creating an MLTBX file manually should be relatively low effort. However, in 
the long term, we would love to enable semi-automated "push button" releases 
via GitHub Actions (and possibly even "nightly builds").

Since this is our first time creating a release of the MATLAB interface, we 
wanted to draw on the community's expertise to answer a few questions:

1. Is there a recommended location where we can host the MLTBX file? e.g. 
GitHub Releases [2], JFrog [3], etc.?
2. Is there a recommended location for hosting release notes?
3. Is there a recommended cadence for incremental point releases?
4. Are there any notable ASF procedures [4] [5] (e.g. voting on a new release 
proposal) that we should be aware of as we consider creating an initial release?
5. How should the Arrow project release (i.e. 14.0.0) relate to the MATLAB 
interface version (i.e. 0.1)? As a point of reference, we noticed that PyArrow 
is on version 14.0.0, but it feels "misleading" to say that the MATLAB 
interface is at version 14.0.0 when we haven't yet implemented or stabilized 
all core Arrow APIs. Is there any precedent for using independent release 
versions for language bindings which are not fully stabilized and are also part 
of the main apache/arrow repository?

We've noticed that Arrow-related projects which are not part of the main 
apache/arrow GitHub repository (e.g. DataFusion) follow a mailing list-based 
voting and release process. However, it's not clear whether it makes sense to 
follow this process for the MATLAB interface since it is part of the main 
apache/arrow repository.

We sincerely appreciate the community's help and guidance on this topic!

Please let us know if you have any questions.

[1] https://www.mathworks.com/help/matlab/creating-help.html?s_tid=CRUX_lftnav
[2] https://github.com/apache/arrow/releases
[3] https://apache.jfrog.io/ui/native/arrow/
[4] https://www.apache.org/foundation/voting.html
[5] https://www.apache.org/legal/release-policy.html#release-approval

Best Regards,

Kevin Gurney


[VOTE][RUST] Release Apache Arrow Rust 49.0.0 RC1

2023-11-07 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 49.0.0.


This release candidate is based on commit: 
747dcbf0670aeab2ede474edb3c4f22028d6a7e6 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/747dcbf0670aeab2ede474edb3c4f22028d6a7e6

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-49.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/747dcbf0670aeab2ede474edb3c4f22028d6a7e6/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [VOTE][RUST] Release Apache Arrow Rust 49.0.0 RC1

2023-11-07 Thread Andy Grove
+1 (binding)

Verified on Ubuntu 22.04.3 LTS.

Thanks, Raphael.

On Tue, Nov 7, 2023 at 2:22 PM Raphael Taylor-Davies
 wrote:

> Hi,
>
> I would like to propose a release of Apache Arrow Rust Implementation,
> version 49.0.0.
>
> This release candidate is based on commit:
> 747dcbf0670aeab2ede474edb3c4f22028d6a7e6 [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. There is a script [4] that automates some of
> the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust  because...
>
> [1]:
>
> https://github.com/apache/arrow-rs/tree/747dcbf0670aeab2ede474edb3c4f22028d6a7e6
> [2]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-49.0.0-rc1
> [3]:
>
> https://github.com/apache/arrow-rs/blob/747dcbf0670aeab2ede474edb3c4f22028d6a7e6/CHANGELOG.md
> [4]:
>
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
>
>


Re: [VOTE][RUST] Release Apache Arrow Rust 49.0.0 RC1

2023-11-07 Thread L. C. Hsieh
+1 (binding)

Verified on Intel Mac.

Thanks Raphael.

On Tue, Nov 7, 2023 at 1:38 PM Andy Grove  wrote:
>
> +1 (binding)
>
> Verified on Ubuntu 22.04.3 LTS.
>
> Thanks, Raphael.
>
> On Tue, Nov 7, 2023 at 2:22 PM Raphael Taylor-Davies
>  wrote:
>
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Implementation,
> > version 49.0.0.
> >
> > This release candidate is based on commit:
> > 747dcbf0670aeab2ede474edb3c4f22028d6a7e6 [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >
> > [1]:
> >
> > https://github.com/apache/arrow-rs/tree/747dcbf0670aeab2ede474edb3c4f22028d6a7e6
> > [2]:
> > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-49.0.0-rc1
> > [3]:
> >
> > https://github.com/apache/arrow-rs/blob/747dcbf0670aeab2ede474edb3c4f22028d6a7e6/CHANGELOG.md
> > [4]:
> >
> > https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> >
> >


Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-07 Thread Weston Pace
+1 for the original proposal as well.

---

The (minor) problem I see with flags is that there isn't much point to this
feature if you are gating on a flag.  I'm assuming the goal is what Dewey
originally mentioned, which is making buffer calculations easier.  However,
if you're gating the feature with a flag then you are either:

 * Rejecting input from producers that don't support this feature
(undesirable, better to align on one use model if we can)
 * Doing all the work anyways to handle producers that don't support the
feature

Maybe it makes sense for a long term migration (e.g. we all agree this is
something we want to move towards but we need to handle old producers in
the meantime) but we can always discuss that separately and I don't think
the benefit here is worth the confusion.

On Tue, Nov 7, 2023 at 7:46 AM Will Jones  wrote:

> I agree with the approach originally proposed by Ben. It seems like the
> most straightforward way to implement within the current protocol.
>
> On Sun, Oct 29, 2023 at 4:59 PM Dewey Dunnington
>  wrote:
>
> > In the absence of a general solution to the C data interface omitting
> > buffer sizes, I think the original proposal is the best way
> > forward...this is the first type to be added whose buffer sizes cannot
> > be calculated without looping over every element of the array; the
> > buffer sizes are needed to efficiently serialize the imported array to
> > IPC if imported by a consumer that cares about buffer sizes.
> >
> > Using a schema's flags to indicate something about a specific paired
> > array (particularly one that, if misinterpreted, would lead to a
> > crash) is a precedent that is probably not worth introducing for just
> > one type. Currently a schema is completely independent of any
> > particular ArrowArray, and I think that is a feature that is worth
> > preserving. My gripes about not having buffer sizes on the CPU to more
> > efficiently copy between devices is a concept almost certainly better
> > suited to the ArrowDeviceArray struct.
> >
> > On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman 
> > wrote:
> > >
> > > > This begs the question of what happens if a consumer receives an
> > unknown
> > > > flag value.
> > >
> > > It seems to me that ignoring unknown flags is the primary case to
> > consider
> > > at
> > > this point, since consumers may ignore unknown flags. Since that is the
> > > case,
> > > it seems adding any flag which would break such a consumer would be
> > > tantamount to an ABI breakage. I don't think this can be averted unless
> > all
> > > consumers are required to error out on unknown flag values.
> > >
> > > In the specific case of Utf8View it seems certain that consumers would
> > add
> > > support for the buffer sizes flag simultaneously with adding support
> for
> > the
> > > new type (since Utf8View is difficult to import otherwise), so any
> > consumer
> > > which would error out on the new flag would already be erroring out on
> an
> > > unsupported data type.
> > >
> > > > I might be the only person who has implemented
> > > > a deep copy of an ArrowSchema in C, but it does blindly pass along a
> > > > schema's flag value
> > >
> > > I think passing a schema's flag value including unknown flags is an
> > error.
> > > The ABI defines moving structures but does not define deep copying. I
> > think
> > > in order to copy deeply in terms of operations which *are* specified:
> we
> > > import then export the schema. Since this includes an export step, it
> > > should not
> > > include flags which are not supported by the exporter.
> > >
> > > On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > > > >> Is this buffer lengths buffer only present if the array type is
> > > > Utf8View?
> > > > >
> > > > > IIUC, the proposal would add the buffer lengths buffer for all
> types
> > if
> > > > the
> > > > > schema's
> > > > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to
> > avoid
> > > > > the special case and that `n_buffers` would continue to be
> consistent
> > > > with
> > > > > IPC.
> > > >
> > > > This begs the question of what happens if a consumer receives an
> > unknown
> > > > flag value. We haven't specified that unknown flag values should be
> > > > ignored, so a consumer could judiciously choose to error out instead
> of
> > > > potentially misinterpreting the data.
> > > >
> > > > All in all, personally I'd rather we make a special case for Utf8View
> > > > instead of adding a flag that can lead to worse interoperability.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> >
>


Arrow community meeting November 8 at 17:00 UTC

2023-11-07 Thread Ian Cook
Our next biweekly Arrow community meeting is tomorrow at 17:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian


Re: [DISCUSS][MATLAB] Proposal for incremental point releases of the MATLAB interface

2023-11-07 Thread Sutou Kouhei
Hi,

> As a point of reference, we noticed that PyArrow is on
> version 14.0.0, but it feels "misleading" to say that the
> MATLAB interface is at version 14.0.0 when we haven't yet
> implemented or stabilized all core Arrow APIs.

I can understand this, but I suggest that we use the same
version as other packages in apache/arrow. Because:

* Using an isolated version increases release complexity.
* Using an isolated version may introduce another kind of
  confusion: for example, "the MATLAB interface 1.0.0 uses
  Apache Arrow C++ 20.0.0" may be misleading:
  * The MATLAB interface 1.0.0 doesn't use Apache Arrow C++
    1.0.0.
  * It may be difficult to find the corresponding Apache
    Arrow C++ version from the MATLAB interface version.

Can we just mention "This is not stable yet!!!" in the
documentation instead of using an isolated version?

We may want to use the status page for it:
https://arrow.apache.org/docs/status.html

> 1. Manually build the MATLAB interface on Windows, macOS, and Linux

It's better that we use CI for this, like other binary
packages such as .deb/.rpm/.wheel/.jar/...

If we release the MATLAB interface separately, which Apache
Arrow C++ version is used? If we release the MATLAB
interface right now, is Apache Arrow C++ 14.0.0 (the latest
release) used, or is Apache Arrow C++ main (not released
yet) used? Since the MATLAB interface on main depends on
Apache Arrow C++ main, we may not be able to use the latest
release for the MATLAB interface on main.

> 2. Combine all of the cross platform build artifacts into
>a single MLTBX file [1] for distribution

Does the MLTBX file include Apache Arrow C++ binaries too
like .wheel/.jar?

> 3. Host the MLTBX somewhere that is easliy accessible for download

MATLAB doesn't provide an official package repository such
as PyPI for Python or https://rubygems.org/ for Ruby, right?

> 1. Is there a recommended location where we can host the MLTBX file? e.g. 
> GitHub Releases [2], JFrog [3], etc.?

If an official package repository for MATLAB doesn't exist,
JFrog is better because the MLTBX file will be large (Apache
Arrow C++ binaries are large).

> 2. Is there a recommended location for hosting release notes?

How about creating https://arrow.apache.org/docs/matlab/ ?
We can use Sphinx like the Python docs
https://arrow.apache.org/docs/python/ or another
documentation tool like the R docs
https://arrow.apache.org/docs/r/ .
If we use Sphinx, we can create
https://github.com/apache/arrow/tree/main/docs/source/matlab/
.

> 3. Is there a recommended cadence for incremental point releases?

I suggest avoiding a separate release, as mentioned above.

> 4. Are there any notable ASF procedures [4] [5] (e.g. voting on a new release 
> proposal) that we should be aware of as we consider creating an initial 
> release?

We don't need any additional tasks for an initial release.

> 5. How should the Arrow project release (i.e. 14.0.0)
>relate to the MATLAB interface version (i.e. 0.1)? As a
>point of reference, we noticed that PyArrow is on
>version 14.0.0, but it feels "misleading" to say that
>the MATLAB interface is at version 14.0.0 when we
>haven't yet implemented or stabilized all core Arrow
>APIs. Is there any precedent for using independent
>release versions for language bindings which are not
>fully stabilized and are also part of the main
>apache/arrow repository?

We don't have any precedent for using independent release
versions for language bindings. All language bindings used
the same version.

Apache Arrow JavaScript isn't a language binding, but it
used separate releases and isolated versions before
0.4.1. It joined the apache/arrow release after 0.4.1. (The
next Apache Arrow JavaScript version after 0.4.1 was 13.0.0.)

> We've noticed that Arrow-related projects which are not
> part of the main apache/arrow GitHub repository
> (e.g. DataFusion) follow a mailing list-based voting and
> release process. However, it's not clear whether it makes
> sense to follow this process for the MATLAB interface
> since it is part of the main apache/arrow repository.

If we want to use a separate release for the MATLAB
interface, we should follow the same release process as
apache/arrow and the other apache/arrow-* repositories,
because it's the standard ASF release process.


Thanks,
-- 
kou

In 

  "[DISCUSS][MATLAB] Proposal for incremental point releases of the MATLAB 
interface" on Tue, 7 Nov 2023 20:31:31 +,
  Kevin Gurney  wrote:

> Hi All,
> 
> A considerable amount of new functionality has been added to the MATLAB 
> interface over the last few months. We appreciate all the community's support 
> in making this possible and are happy to see all the progress that is being 
> made.
> 
> At this point, we would like to create an initial "0.1" release of the MATLAB 
> interface. Incremental point releases will enable MATLAB users to provide 
> early feedback. In addition, learning how to navigate the release process is 
> an important step

Re: [DISCUSS][MATLAB] Proposal for incremental point releases of the MATLAB interface

2023-11-07 Thread Dewey Dunnington
For argument's sake, I might suggest that the process you described in
your initial note would probably work best in another repo: you would
be able to iterate faster and release/version at your own pace. The
flexibility you get from moving to a separate repo comes at the cost
of extra responsibility: you have to set up your own CI, manage your
own issues, and set up your own release verification scripts + release
votes on the mailing list. Because you bind Arrow C++, you would have
to take sufficient steps to ensure that the Arrow C++ developers are
made aware of changes that break the Matlab bindings and vice versa
(i.e., test against dev Arrow C++ in a CI job).

Setting up that infrastructure for apache/arrow-nanoarrow took ~a week
of development time, and it now takes ~half a day to release a new
version (it took more for the first few versions, and the matlab
version has considerably higher complexity). Probably the biggest
barrier to releasing from another repo is that you have to ensure a
critical mass of PMC members can/will run your release verification
script and vote.

I happen to feel that it's the PMC's/wider community's responsibility
to help language binding contributors adopt a workflow that suits
their needs. If active Matlab contributors agree that they want to
release version 0.1 from another repo, (I feel that) we're here to
help you do that. If the active contributors want to stay in
apache/arrow, there is less flexibility about what you release and
when; however, the release process is well-defined.

On Tue, Nov 7, 2023 at 8:43 PM Sutou Kouhei  wrote:
>
> Hi,
>
> > As a point of reference, we noticed that PyArrow is on
> > version 14.0.0, but it feels "misleading" to say that the
> > MATLAB interface is at version 14.0.0 when we haven't yet
> > implemented or stabilized all core Arrow APIs.
>
> I can understand this but I suggest that we use the same
> version as other packages in apache/arrow. Because:
>
> * Using isolated version increases release complexity.
> * Using isolated version may introduce another
>   "misleading"/"confusion": For example, "the MATLAB
>   interface 1.0.0 uses Apache Arrow C++ 20.0.0" may be
>   misleading/confused:
>   * The MATLAB interface 1.0.0 doesn't use Apache Arrow C++
> 1.0.0.
>   * It may be difficult to find the corresponding
> Apache Arrow C++ version from the MATLAB interface
> version.
>
> Can we just mention "This is not stable yet!!!" in the
> documentation instead of using isolated version?
>
> We may want to use the status page for it:
> https://arrow.apache.org/docs/status.html
>
> > 1. Manually build the MATLAB interface on Windows, macOS, and Linux
>
> It's better that we use CI for this like other binary
> packages such as .deb/.rpm/.wheel/.jar/...
>
> If we release the MATLAB interface separately, which Apache
> Arrow C++ version is used? If we release the MATALB
> interface right now, is Apache Arrow C++ 14.0.0 (the latest
> release) used or is Apache Arrow C++ main (not released yet)
> used? The MATLAB interface on main will depend on Apache
> Arrow C++ main, we may not be able to use the latest release
> for the MATLAB interface on main.
>
> > 2. Combine all of the cross platform build artifacts into
> >a single MLTBX file [1] for distribution
>
> Does the MLTBX file include Apache Arrow C++ binaries too
> like .wheel/.jar?
>
> > 3. Host the MLTBX somewhere that is easliy accessible for download
>
> MATLAB doesn't provide the official package repository such
> as PyPI for Python and https://rubygems.org/ for Ruby, right?
>
> > 1. Is there a recommended location where we can host the MLTBX file? e.g. 
> > GitHub Releases [2], JFrog [3], etc.?
>
> If the official package repository for MATLAB doesn't exist,
> JFrog is better because the MLTBX file will be large (Apache
> Arrow C++ binaries are large).
>
> > 2. Is there a recommended location for hosting release notes?
>
> How about creating https://arrow.apache.org/docs/matlab/ ?
> We can use Sphinx like the Python docs
> https://arrow.apache.org/docs/python/ or another
> documentation tools like the R docs
> https://arrow.apache.org/docs/r/ .
> If we use Sphinx, we can create
> https://github.com/apache/arrow/tree/main/docs/source/matlab/
> .
>
> > 3. Is there a recommended cadence for incremental point releases?
>
> I suggest avoiding separated release as above.
>
> > 4. Are there any notable ASF procedures [4] [5] (e.g. voting on a new 
> > release proposal) that we should be aware of as we consider creating an 
> > initial release?
>
> We don't need additional task for an initial release.
>
> > 5. How should the Arrow project release (i.e. 14.0.0)
> >relate to the MATLAB interface version (i.e. 0.1)? As a
> >point of reference, we noticed that PyArrow is on
> >version 14.0.0, but it feels "misleading" to say that
> >the MATLAB interface is at version 14.0.0 when we
> >haven't yet implemented or stabilized all core Arrow
> >APIs. Is there an