Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.10.2 RC1

2024-07-17 Thread Raphael Taylor-Davies

+1 (binding)

Verified on x86_64 GNU/Linux

Kind Regards,

Raphael

On 17/07/2024 18:36, Andrew Lamb wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.10.2.

This release candidate is based on commit:
b44497e1cdd84933b49b56dd00506411c040b46c [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store  because...

[1]:
https://github.com/apache/arrow-rs/tree/b44497e1cdd84933b49b56dd00506411c040b46c
[2]:
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.10.2-rc1
[3]:
https://github.com/apache/arrow-rs/blob/b44497e1cdd84933b49b56dd00506411c040b46c/object_store/CHANGELOG.md
[4]:
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh



Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-12 Thread Raphael Taylor-Davies

> Many people
> are familiar with object stores these days.  You could create a new
> abstraction `ObjectStore` which is very similar to `FileSystem` except the
> semantics are object store semantics and not filesystem semantics.

FWIW in the Arrow Rust ecosystem we only provide an object store 
abstraction, and this has served us very well. My 2 cents is that object 
store semantics are sufficient, if not superior [1], to filesystem-based 
interfaces for the vast majority of use cases. The few workloads that 
aren't sufficiently served require such close integration with often 
OS-specific filesystem APIs and behaviours as to make building a coherent 
abstraction extremely difficult.


Iceberg also took a similar approach with its File IO abstraction [2].

[1]: 
https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface

[2]: https://tabular.io/blog/iceberg-fileio-cloud-native-tables/
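
As a concrete illustration of the object-store semantics discussed above,
here is a minimal, hypothetical sketch (illustrative only; this is neither
the object_store crate API [1] nor an Arrow interface): a flat key space
with get/put/delete and list-by-prefix, in which "directories" exist only
as key prefixes and there is therefore no empty directory to create or mark.

// Hypothetical sketch of object-store semantics, for illustration only.
trait SimpleObjectStore {
    fn put(&self, key: &str, data: Vec<u8>) -> std::io::Result<()>;
    fn get(&self, key: &str) -> std::io::Result<Vec<u8>>;
    fn delete(&self, key: &str) -> std::io::Result<()>;
    // Listing by prefix replaces directory traversal; there is no empty
    // directory to create, preserve, or mark.
    fn list(&self, prefix: &str) -> std::io::Result<Vec<String>>;
}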

On 12/07/2024 22:05, Weston Pace wrote:

> The markers are necessary to offer file system semantics on top of object
> stores. You will get a ton of subtle bugs otherwise.

Yes, object stores and filesystems are different.  If you expect your
filesystem to act like a filesystem then these things need to be done in
order to avoid these bugs.

If an option modifies a filesystem to behave more like an object store then
I don't think it's necessarily a bad thing as long as it isn't the
default.  By turning on the option the user is intentionally altering the
behavior and should not have the same expectations.

On the other hand, there is another approach you could take.  Many people
are familiar with object stores these days.  You could create a new
abstraction `ObjectStore` which is very similar to `FileSystem` except the
semantics are object store semantics and not filesystem semantics.  I
believe most of our filesystem classes could implement both `ObjectStore`
and `FileSystem` abstractions without significant code duplication.

This way, if a user wants filesystem semantics, they use a `FileSystem` and
they pay the abstraction cost.  If a user is comfortable with `ObjectStore`
semantics they use `ObjectStore` and they don't have to pay the costs.

This would be more work than just allowing options to violate FileSystem
guarantees but it would provide a more clear distinction between the two.


On Fri, Jul 12, 2024 at 9:25 AM Aldrin  wrote:


Hello!

This may be naive, but why does the empty directory marker need to exist
on the S3 side at all? If a local directory is created (because filesystem
semantics), then I am not sure why a fake object needs to exist on the
object-store side.





# --
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene


On Friday, July 12th, 2024 at 08:35, Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:


Hi,

The markers are necessary to offer file system semantics on top of object
stores. You will get a ton of subtle bugs otherwise.

If instead of arrow::FileSystem, Arrow offered an arrow::ObjectStore
interface that wraps local filesystems and object stores with object-store
semantics (i.e. no concept of empty directory or atomic directory
deletion), then application developers would have more control of the
actions performed on the object store they are using. Cons would be slower
operations when working with a local filesystem and no concept of directory.

> 1. Add an Option: Introduce an option in S3Options to control
> whether empty directory markers are created, giving users the choice.

Then it wouldn't be an honest implementation of arrow::FileSystem for the
reasons listed above.

> 2. Change Default Behavior: Modify the default behavior to avoid
> creating empty directory markers when a file is deleted.

That would bring in the bugs because an arrow::FileSystem instance would
behave differently depending on what is backing it.

> 3. Smarter Directory Creation: Improve the implementation to check
> for other objects in the same path before creating an empty directory
> marker.

This might be a problem when more than one client or thread is mutating the
object store through the arrow::FileSystem. You can check now, but by the
time you're done deleting, all the other files you thought existed may have
been deleted as well. This is very likely if clients decide to implement
parallel deletion.

The existing solution of always creating a marker when done is not perfect
either, but less likely to break.

## Suggested Workaround

Avoid file-by-file operations so that internal functions can batch as
much as possible.

--
Felipe

On Fri, Jul 12, 2024 at 7:22 AM Hyunseok Seo hsseo0...@gmail.com wrote:


Hello, community!

I am currently working on addressing the issue described in "[C++] Add
option to not create parent directory with S3 delete_file". In this
process, I have found it necessary to gather feedback on how to best
resolve this issue.

Below is a summary and some questions I have for the community.

### 

RE: Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Raphael Taylor-Davies
I wonder if the DataFusion project might be a more natural home for this 
functionality? UDFs are more of a query engine concept, whereas arrow-rs is 
more focused on purely physical execution?

On 28 June 2024 19:41:39 BST, Runji Wang  wrote:
>Hi Felipe,
>
>Vectorization will be applied whenever possible. When all input and output 
>types of a function are primitive (int16, int32, int64, float32, float64) and 
>do not involve any Option or Result, the macro will automatically generate 
>code based on unary or binary kernels, 
>which potentially allows for vectorization.
>
>Both examples you showed are not vectorized. The `div` function is due to the 
>Result output, while `gcd` is due to the loop in its implementation. However, 
>if the function is simple enough, like an `add` function:
>
>#[function("add(int, int) -> int")]
>fn add(a: i32, b: i32) -> i32 {
>a + b
>}
>
>It can be auto-vectorized by llvm.
>
>Runji
>
>
>On 2024/06/28 17:13:16 Felipe Oliveira Carvalho wrote:
>> On Fri, Jun 28, 2024 at 11:07 AM Andrew Lamb  wrote:
>> >
>> > Hi Xuanwo,
>> >
>> > Sorry for the delay in responding. I think  the ability to easily write
>> > functions that "feel" like native functions in whatever language and be
>> > able to generate arrow / vectorized versions of them is quite valuable.
>> > This is my understanding of what this proposal is about.
>> 
>> My understanding is that it's not vectorized. From the examples in
>> risingwavelabs/arrow-udf,  it
>> looks like the macros generate code that gathers values from columns into
>> local scalars that are passed as scalar parameters to user functions. Is
>> the hope here that rustc/llvm will auto-vectorize the code?
>> 
>> #[function("gcd(int, int) -> int")]
>> fn gcd(mut a: i32, mut b: i32) -> i32 {
>> while b != 0 {
>> (a, b) = (b, a % b);
>> }
>> a
>> }
>> 
>> #[function("div(int, int) -> int")]
>> fn div(x: i32, y: i32) -> Result {
>> if y == 0 {
>> return Err("division by zero");
>> }
>> Ok(x / y)
>> }
>> 
>> > I left some additional comments on the markdown.
>> >
>> > One thing that might be worth doing is articulate some other potential
>> > locations for where the code might go. One option, as I think you propose,
>> > is to make its own repository.  Another option could be to donate the code
>> > and put the various language bindings in the same repo as the arrow
>> > language implementations (e.g arrow-rs, arrow for python, etc) which would
>> > likely make it easier to maintain and discover.
>> >
>> > I am curious about what other devs / users feel about this?
>> >
>> > Andrew
>> >
>> >
>> >
>> > On Thu, Jun 20, 2024 at 3:04 AM Xuanwo  wrote:
>> >
>> > > Hello, everyone.
>> > >
>> > > I start this thread to discuss the donation of a User-Defined Function
>> > > Framework for Apache Arrow.
>> > >
>> > > Feel free to review and leave your comments here. For live review,
>> > > please visit:
>> > >
>> > > https://hackmd.io/@xuanwo/apache-arrow-udf
>> > >
>> > > The original content also pasted here for a quick reading:
>> > >
>> > > --
>> > >
>> > > ## Abstract
>> > >
>> > > Arrow UDF is a User-Defined Function Framework for Apache Arrow.
>> > >
>> > > ## Proposal
>> > >
>> > > Arrow UDF allows users to easily create and run user-defined functions
>> > > (UDF) in Rust, Python, Java or JavaScript based on Apache Arrow. The
>> > > functions can be executed natively, or in WebAssembly, or in a remote
>> > > server via Arrow Flight.
>> > >
>> > > Arrow UDF was originally designed to be used by the RisingWave project
>> > > but is now being used by Databend and several database startups.
>> > >
>> > > We believe that the Arrow UDF project will provide diversity value to
>> > > the entire Arrow community.
>> > >
>> > > ## Background
>> > >
>> > > Arrow UDF has been developed by an open-source community from day one
>> > > and is owned by RisingWaveLabs. The project was launched in December 2023.
>> > >
>> > > ## Initial Goals
>> > >
>> > > By transferring ownership of the project to Apache Arrow, Arrow UDF
>> > > expects to ensure its neutrality and further encourage and facilitate
>> > > the adoption of Arrow UDF by the community.
>> > >
>> > > ## Current Status
>> > >
>> > > Contributors: 5
>> > >
>> > > Users:
>> > >
>> > > -   [RisingWave]: A Distributed SQL Database for Stream Processing.
>> > > -   [Databend]: An open-source cloud data warehouse that serves as a
>> > > cost-effective alternative to Snowflake.
>> > >
>> > > ## Documentation
>> > >
>> > > The documentation of Arrow UDF is hosted at
>> > > https://docs.rs/arrow-udf/latest/arrow_udf/.
>> > >
>> > > ## Initial Source
>> > >
>> > > The project currently holds a GitHub repository and multiple packages:
>> > >
>> > > - https://github.com/risingwavelabs/arrow-udf
>> 
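
To make the vectorization discussion above concrete (Runji's point that the
macro generates code based on unary/binary kernels when no Option or Result
is involved), here is a hedged sketch of the kind of columnar wrapper such a
macro might conceptually expand to for the `add` example. It is not the
actual arrow-udf macro output, and it assumes the arrow crate's
compute::kernels::arity::binary kernel:

use arrow::array::Int32Array;
use arrow::compute::kernels::arity::binary;
use arrow::error::ArrowError;

// Illustrative only: roughly the shape of a kernel-based expansion for
// `add(int, int) -> int`. Because the closure has no Option/Result
// handling, the element-wise loop inside `binary` is a good candidate
// for LLVM auto-vectorization.
fn add_columns(a: &Int32Array, b: &Int32Array) -> Result<Int32Array, ArrowError> {
    binary(a, b, |x, y| x + y)
}

fn main() -> Result<(), ArrowError> {
    let a = Int32Array::from(vec![1, 2, 3]);
    let b = Int32Array::from(vec![10, 20, 30]);
    let sum = add_columns(&a, &b)?;
    assert_eq!(sum.value(2), 33);
    Ok(())
}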

[RESULT][VOTE][RUST] Release Apache Arrow Rust 52.0.0 RC1

2024-06-06 Thread Raphael Taylor-Davies

With 8 +1 votes (7 binding), including myself, the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-52.0.0/


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 03/06/2024 17:03, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 52.0.0.


This release candidate is based on commit: 
f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

I vote +1 (binding) on this release

[1]: 
https://github.com/apache/arrow-rs/tree/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-52.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust 52.0.0 RC1

2024-06-03 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 52.0.0.


This release candidate is based on commit: 
f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

I vote +1 (binding) on this release

[1]: 
https://github.com/apache/arrow-rs/tree/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-52.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/f42218ae5d9c9f0b9ea3365f2b1e6025a43b8c71/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [DISCUSS] Statistics through the C data interface

2024-05-31 Thread Raphael Taylor-Davies
I'm likely missing something here, but why can't statistics be returned as 
arrow arrays encoded using the C data interface? My understanding of the C data 
interface is as a specification for exchanging arrow payloads, with it left to 
higher level protocols, such as ADBC, to assign semantic meaning to said 
payloads. It therefore seems odd to me to be shoehorning statistics information 
into it in this way? 

On 31 May 2024 18:05:19 BST, "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)" 
 wrote:
>Agreed, it doesn't seem like a good idea to require users of the
>C data interface to also depend on the IPC format. JSON sounds
>more reasonable in that case.
>
>Shoumyo
>
>From: dev@arrow.apache.org At: 05/29/24 02:02:23 UTC-4:00 To: dev@arrow.apache.org
>Subject: Re: [DISCUSS] Statistics through the C data interface
>
>>Hi,
>>
>>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>>   schema
>>>
>>> This would avoid having to define a separate "schema" for
>>> the JSON metadata
>>
>>Right. What I'm worried about with this approach is that
>>this may not match with the C data interface.
>>
>>In the C data interface, we don't use the IPC format. If we
>>want to transmit statistics with schema through the C data
>>interface, we need to mix the IPC format and the C data
>>interface. (This is why I used the address in my first
>>proposal.)
>>
>>Note that we can use separated API to transmit statistics
>>instead of embedding statistics into schema for this case.
>>
>>I thought using JSON is easier to use for both of the IPC
>>format and the C data interface. Statistics data will not be
>>large. So this will not affect performance.
>>
>>> If we do go down the JSON route, how about something like
>>> this to avoid defining the keys for all possible statistics up
>>> front:
>>>
>>>   Schema {
>>>     custom_metadata: {
>>>       "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>>>     }
>>>   }
>>>
>>> It's more verbose, but more closely mirrors the Arrow array
>>> schema defined for statistics getter APIs. This could make it
>>> easier to translate between the two.
>>
>>Thanks. I didn't think of it.
>>It makes sense.
>>
>>
>>Thanks,
>>--
>>kou
>>
>>In <665673b500015f5808ce0...@message.bloomberg.net>
>>  "Re: [DISCUSS] Statistics through the C data interface" on Wed, 29 May 2024
>>  00:15:49 -,
>>  "Shoumyo Chakravorti (BLOOMBERG/ 120 PARK)"  wrote:
>>
>>> Thanks for addressing the feedback! I didn't know that an
>>> Arrow IPC `Message` (not just Schema) could also contain
>>> `custom_metadata` -- thanks for pointing it out.
>>>
>>>> Based on the list, how about standardizing both of the
>>>> followings for statistics?
>>>>
>>>> 1. Apache Arrow schema for statistics that is used by
>>>>    separated statistics getter API
>>>> 2. "ARROW:statistics" metadata format that can be used in
>>>>    Apache Arrow schema metadata
>>>>
>>>> Users can use 1. and/or 2. based on their use cases.
>>>
>>> This sounds good to me. Using JSON to represent the metadata
>>> for #2 also sounds reasonable. I think elsewhere on this
>>> thread, Weston mentioned that we could alternatively use
>>> the schema defined for #1 and directly use that to encode
>>> the schema metadata as an Arrow IPC RecordBatch:
>>>
>>>> This has been something that has always been desired for the Arrow IPC
>>>> format too.
>>>>
>>>> My preference would be (apologies if this has been mentioned before):
>>>>
>>>> - Agree on how statistics should be encoded into an array (this is not
>>>>   hard, we just have to agree on the field order and the data type for
>>>>   null_count)
>>>> - If you need statistics in the schema then simply encode the 1-row batch
>>>>   into an IPC buffer (using the streaming format) or maybe just an IPC
>>>>   RecordBatch message since the schema is fixed and store those bytes in the
>>>>   schema
>>>
>>> This would avoid having to define a separate "schema" for
>>> the JSON metadata, but might be more effort to work with in
>>> certain contexts (e.g. a library that currently only needs the
>>> C data interface would now also have to learn how to parse
>>> Arrow IPC).
>>>
>>> If we do go down the JSON route, how about something like
>>> this to avoid defining the keys for all possible statistics up
>>> front:
>>>
>>>   Schema {
>>>     custom_metadata: {
>>>       "ARROW:statistics" => "[ { \"key\": \"row_count\", \"value\": 29, \"value_type\": \"uint64\", \"is_approximate\": false } ]"
>>>     }
>>>   }
>>>
>>> It's more verbose, but more 
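
As a hedged sketch of the JSON option discussed above: parsing the proposed
"ARROW:statistics" metadata value with serde and serde_json (both assumed
dependencies; the field names follow the example JSON in the quoted message
and are not a finalized specification):

use serde::Deserialize;

// Field names mirror the example JSON above; illustrative only.
#[derive(Debug, Deserialize)]
struct StatisticsEntry {
    key: String,
    value: serde_json::Value,
    value_type: String,
    is_approximate: bool,
}

fn parse_statistics(metadata_value: &str) -> Result<Vec<StatisticsEntry>, serde_json::Error> {
    serde_json::from_str(metadata_value)
}

fn main() -> Result<(), serde_json::Error> {
    let raw = r#"[{ "key": "row_count", "value": 29, "value_type": "uint64", "is_approximate": false }]"#;
    for entry in parse_statistics(raw)? {
        println!("{} = {} ({}, approximate: {})",
                 entry.key, entry.value, entry.value_type, entry.is_approximate);
    }
    Ok(())
}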

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Raphael Taylor-Davies

Hi,

One potential challenge with encoding statistics in the schema metadata 
is that some systems may consider this metadata as part of assessing 
schema equivalence.


However, I think the bigger question is what the intended use-case for 
these statistics is? Often query engines want to collect statistics from 
multiple containers in one go, as this allows for efficient vectorised 
pruning across multiple files, row groups, etc... I therefore wonder if 
the solution is simply to return separate arrays of min, max, etc... 
potentially even grouped together into a single StructArray?


This would have the benefit of not needing specification changes, whilst 
being significantly more efficient than an approach centered on scalar 
statistics. FWIW this is the approach taken by DataFusion for pruning 
statistics [1], and in arrow-rs we represent scalars as arrays to avoid 
needing to define a parallel serialization standard [2].


Kind Regards,

Raphael

[1]: 
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html

[2]: https://github.com/apache/arrow-rs/pull/4393
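
A minimal sketch of the columnar shape described above (illustrative only;
this is not DataFusion's PruningStatistics trait [1], just the general idea
of one array per statistic, with one row per container):

use arrow::array::{Array, Int64Array, UInt64Array};

// One row per container (file, row group, ...), so an engine can prune
// many containers in a single vectorised pass rather than handling
// per-container scalar statistics one at a time.
struct ColumnPruningStats {
    min: Int64Array,         // minimum value of the column per container
    max: Int64Array,         // maximum value of the column per container
    null_count: UInt64Array, // null count of the column per container
}

fn main() {
    let stats = ColumnPruningStats {
        min: Int64Array::from(vec![0, 100, 250]),
        max: Int64Array::from(vec![99, 249, 400]),
        null_count: UInt64Array::from(vec![0, 3, 0]),
    };
    // e.g. keep only containers whose max could satisfy `column >= 150`
    let keep: Vec<bool> = (0..stats.max.len())
        .map(|i| stats.max.value(i) >= 150)
        .collect();
    assert_eq!(keep, vec![false, true, true]);
}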

On 22/05/2024 03:37, Sutou Kouhei wrote:

Hi,

We're discussing how to provide statistics through the C
data interface at:
https://github.com/apache/arrow/issues/38837

If you're interested in this feature, could you share your
comments?


Motivation:

We can interchange Apache Arrow data by the C data interface
in the same process. For example, we can pass Apache Arrow
data read by Apache Arrow C++ (provider) to DuckDB
(consumer) through the C data interface.

A provider may know Apache Arrow data statistics. For
example, a provider can know statistics when it reads Apache
Parquet data because Apache Parquet may provide statistics.

But a consumer can't know statistics that are known by a
producer, because there isn't a standard way to provide
statistics through the C data interface. If a consumer can
know statistics, it can process Apache Arrow data faster
based on statistics.


Proposal:

https://github.com/apache/arrow/issues/38837#issuecomment-2123728784

How about providing statistics as metadata in ArrowSchema?

We reserve "ARROW" namespace for internal Apache Arrow use:

https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata


The ARROW pattern is a reserved namespace for internal
Arrow use in the custom_metadata fields. For example,
ARROW:extension:name.

So we can use "ARROW:statistics" for the metadata key.

We can represent statistics as an ArrowArray like ADBC does.

Here is an example ArrowSchema that is for a record batch
that has "int32 column1" and "string column2":

ArrowSchema {
  .format = "+siu",
  .metadata = {
    "ARROW:statistics" => ArrowArray*, /* table-level statistics such as row count */
  },
  .children = {
    ArrowSchema {
      .name = "column1",
      .format = "i",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
    ArrowSchema {
      .name = "column2",
      .format = "u",
      .metadata = {
        "ARROW:statistics" => ArrowArray*, /* column-level statistics such as count distinct */
      },
    },
  },
}

The metadata value (ArrowArray* part) of '"ARROW:statistics"
=> ArrowArray*' is a base 10 string of the address of the
ArrowArray, because we can only use strings for metadata
values. You can't release the statistics ArrowArray*. (Its
release is a no-op function.) It follows
https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
semantics. (The base ArrowSchema owns the statistics
ArrowArray*.)

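A hedged sketch of the consumer side of the address-encoded metadata value
described above, assuming a Rust consumer and arrow-rs's FFI_ArrowArray type
(the helper name is illustrative): parse the base-10 string back into a
pointer, and do not release it, since the base ArrowSchema owns it.

use arrow::ffi::FFI_ArrowArray;

// Illustrative only: decode the proposed "ARROW:statistics" metadata
// value, which is the base-10 address of an ArrowArray owned by the
// base ArrowSchema (so the consumer must not release it).
fn statistics_array_ptr(metadata_value: &str) -> Option<*const FFI_ArrowArray> {
    let address: usize = metadata_value.trim().parse().ok()?;
    Some(address as *const FFI_ArrowArray)
}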

The ArrowArray* for statistics uses the following schema:

| Field Name     | Field Type              | Comments |
|----------------|-------------------------|----------|
| key            | string not null         | (1)      |
| value          | `VALUE_SCHEMA` not null |          |
| is_approximate | bool not null           | (2)      |

1. We'll provide pre-defined keys such as "max", "min",
"byte_width" and "distinct_count" but users can also use
application specific keys.

2. If true, then the value is approximate or best-effort.

VALUE_SCHEMA is a dense union with members:

| Field Name | Field Type                                     | Comments |
|------------|------------------------------------------------|----------|
| int64      | int64                                          |          |
| uint64     | uint64                                         |          |
| float64    | float64                                        |          |
| value      | The same type as the ArrowSchema it belongs to | (3)      |

3. If the ArrowSchema's type is string, this type is also string.

TODO: Is "value" good name? If we refer it from the
top-level statistics schema, we need to use
"value.value". It's a bit strange...


What do you think about this proposal? Could you share your

[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.10.1 RC1

2024-05-14 Thread Raphael Taylor-Davies

With 3 +1 votes (3 binding), including myself, the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.10.1/


It has also been released to crates.io, and I have yanked 0.10.0

Thank you to everyone who helped verify this release

Raphael

On 10/05/2024 18:29, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.10.1.

This is primarily motivated by a major bug introduced by 0.10.0 [1]

This release candidate is based on commit: 
3d3ddb2108502854da98654ada85364d5627ef21 [2]


The proposed release tarball and signatures are hosted at [3].

The changelog is located at [4]

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [5] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: https://github.com/apache/arrow-rs/issues/5743
[2]: 
https://github.com/apache/arrow-rs/tree/3d3ddb2108502854da98654ada85364d5627ef21
[3]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.10.1-rc1
[4]: 
https://github.com/apache/arrow-rs/blob/3d3ddb2108502854da98654ada85364d5627ef21/object_store/CHANGELOG.md
[5]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust Object Store 0.10.1 RC1

2024-05-10 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.10.1.

This is primarily motivated by a major bug introduced by 0.10.0 [1]

This release candidate is based on commit: 
3d3ddb2108502854da98654ada85364d5627ef21 [2]


The proposed release tarball and signatures are hosted at [3].

The changelog is located at [4]

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [5] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: https://github.com/apache/arrow-rs/issues/5743
[2]: 
https://github.com/apache/arrow-rs/tree/3d3ddb2108502854da98654ada85364d5627ef21
[3]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.10.1-rc1
[4]: 
https://github.com/apache/arrow-rs/blob/3d3ddb2108502854da98654ada85364d5627ef21/object_store/CHANGELOG.md
[5]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.10.0 RC1

2024-04-22 Thread Raphael Taylor-Davies

With 4 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.10.0/


It has also been released to crates.io

Thank you to everyone who helped verify this release

On 18/04/2024 17:38, vin jake wrote:

+1(binding)

Verified on m1 macbook

Thanks Raphael

Raphael Taylor-Davies  于
2024年4月18日周四 下午6:55写道:


Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.10.0.

This release candidate is based on commit:
cd3331989d65f6d56830f9ffa758b4c96d10f4be [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

I vote +1 (binding) on this release

[1]:

https://github.com/apache/arrow-rs/tree/cd3331989d65f6d56830f9ffa758b4c96d10f4be
[2]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.10.0-rc1
[3]:

https://github.com/apache/arrow-rs/blob/cd3331989d65f6d56830f9ffa758b4c96d10f4be/object_store/CHANGELOG.md
[4]:

https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust Object Store 0.10.0 RC1

2024-04-18 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.10.0.

This release candidate is based on commit: 
cd3331989d65f6d56830f9ffa758b4c96d10f4be [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

I vote +1 (binding) on this release

[1]: 
https://github.com/apache/arrow-rs/tree/cd3331989d65f6d56830f9ffa758b4c96d10f4be
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.10.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/cd3331989d65f6d56830f9ffa758b4c96d10f4be/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust 51.0.0 RC1

2024-03-18 Thread Raphael Taylor-Davies

With 5 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-51.0.0


It has also been released to crates.io

Thank you to everyone who helped verify this release

On 3/15/24 20:40, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 51.0.0.


This release candidate is based on commit: 
ada986c7ec8f8fe4f94235c8aaeba4995392ee72 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/ada986c7ec8f8fe4f94235c8aaeba4995392ee72
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-51.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/ada986c7ec8f8fe4f94235c8aaeba4995392ee72/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust 51.0.0 RC1

2024-03-15 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 51.0.0.


This release candidate is based on commit: 
ada986c7ec8f8fe4f94235c8aaeba4995392ee72 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/ada986c7ec8f8fe4f94235c8aaeba4995392ee72

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-51.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/ada986c7ec8f8fe4f94235c8aaeba4995392ee72/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.9.1 RC1

2024-03-04 Thread Raphael Taylor-Davies

With 3 +1 votes (3 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.9.1 



It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 3/2/24 00:40, Andrew Lamb wrote:

+1 (binding)

Verified on M3 mac

Thank you Raphael

Andrew

On Fri, Mar 1, 2024 at 2:45 AM L. C. Hsieh  wrote:

+1 (binding)

Verified on M1 Mac.

Thanks Raphael.


On Thu, Feb 29, 2024 at 11:10 PM Raphael Taylor-Davies
 wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.9.1.

This release candidate is based on commit:
30151220c29fa5e01365c2a4e153de01d5d2c041 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]:
https://github.com/apache/arrow-rs/tree/30151220c29fa5e01365c2a4e153de01d5d2c041
[2]:
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.9.1-rc1
[3]:
https://github.com/apache/arrow-rs/blob/30151220c29fa5e01365c2a4e153de01d5d2c041/object_store/CHANGELOG.md
[4]:
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh



[VOTE][RUST] Release Apache Arrow Rust Object Store 0.9.1 RC1

2024-02-29 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.9.1.

This release candidate is based on commit: 
30151220c29fa5e01365c2a4e153de01d5d2c041 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/30151220c29fa5e01365c2a4e153de01d5d2c041
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.9.1-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/30151220c29fa5e01365c2a4e153de01d5d2c041/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[ANNOUNCE] New Arrow committer: Jeffrey Vo

2024-02-06 Thread Raphael Taylor-Davies
On behalf of the Arrow PMC, I am happy to announce that Jeffrey Vo has 
accepted an invitation to become a committer on Apache Arrow. Welcome, 
and thank you for your contributions!


Raphael Taylor-Davies



Re: [Python][Rust] Is Arrow Rust supposed to support S3-compatible storage with non-public certificates?

2024-01-25 Thread Raphael Taylor-Davies
On a related note, version 0.9.0 switched to using the system CAs by default 
[1], and so if you've added your private CA chain into there it should work.

[1]: https://github.com/apache/arrow-rs/pull/5056

On 25 January 2024 09:17:55 GMT, Raphael Taylor-Davies 
 wrote:
>The ticket for supporting self-signed certificates can be found here [1].
>
>If you switch the TLS backend to OpenSSL it may respect the SSL_CERT_FILE 
>environment variable, but I'm not very familiar with the particulars of that 
>library. This would require customising the Rust build, however, which may not 
>be possible if calling from python.
>
>Kind Regards,
>
>Raphael
>
>
>[1]: https://github.com/apache/arrow-rs/issues/5034
>
>On 25 January 2024 08:44:45 GMT, elveshoern32 
> wrote:
>>Since my question remained unanswered on the user list, I dare to ask again 
>>on the dev list:
>>
>>
>>While experimenting with polars [1] (which is based on arrow-rs) I found that 
>>it's not possible to read a single file from our on-prem S3-compatible 
>>storage.
>>
>>Any attempts result in SSL error messages:
>>
>>
>>
>>error trying to connect: invalid peer certificate: UnknownIssuer
>>
>>
>>
>>Such SSL errors are well-known to us and usually get fixed by setting the 
>>environment variable SSL_CERT_FILE (or something similar) pointing to our 
>>company's certstore.
>>
>>polars seems to ignore that env var.
>>
>>Now it's unclear to me whether this is an issue of polars or arrow-rs (or 
>>anything else).
>>
>>
>>
>>For more details see [2].
>>
>>
>>
>>[1] https://pola.rs/
>>
>>[2] https://github.com/pola-rs/polars/issues/13741 

Re: [Python][Rust] Is Arrow Rust supposed to support S3-compatible storage with non-public certificates?

2024-01-25 Thread Raphael Taylor-Davies
The ticket for supporting self-signed certificates can be found here [1].

If you switch the TLS backend to OpenSSL it may respect the SSL_CERT_FILE 
environment variable, but I'm not very familiar with the particulars of that 
library. This would require customising the Rust build, however, which may not 
be possible if calling from python.

Kind Regards,

Raphael


[1]: https://github.com/apache/arrow-rs/issues/5034

On 25 January 2024 08:44:45 GMT, elveshoern32 
 wrote:
>Since my question remained unanswered on the user list, I dare to ask again on 
>the dev list:
>
>
>While experimenting with polars [1] (which is based on arrow-rs) I found that 
>it's not possible to read a single file from our on-prem S3-compatible storage.
>
>Any attempts result in SSL error messages:
>
>
>
>error trying to connect: invalid peer certificate: UnknownIssuer
>
>
>
>Such SSL errors are well-known to us and usually get fixed by setting the 
>environment variable SSL_CERT_FILE (or something similar) pointing to our 
>company's certstore.
>
>polars seems to ignore that env var.
>
>Now it's unclear to me whether this is an issue of polars or arrow-rs (or 
>anything else).
>
>
>
>For more details see [2].
>
>
>
>[1] https://pola.rs/
>
>[2] https://github.com/pola-rs/polars/issues/13741 

[RESULT][VOTE][RUST] Release Apache Arrow Rust 50.0.0 RC1

2024-01-12 Thread Raphael Taylor-Davies

With 5 +1 votes (4 binding) the release is approved

The release is available here:

It has also been released to crates.io.

Thank you to everyone who helped verify this release

On 09/01/2024 10:24, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 50.0.0.


This release candidate is based on commit: 
db811083669df66992008c9409b743a2e365adb0 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/db811083669df66992008c9409b743a2e365adb0
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-50.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/db811083669df66992008c9409b743a2e365adb0/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust 50.0.0 RC1

2024-01-09 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 50.0.0.


This release candidate is based on commit: 
db811083669df66992008c9409b743a2e365adb0 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/db811083669df66992008c9409b743a2e365adb0

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-50.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/db811083669df66992008c9409b743a2e365adb0/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.9.0 RC1

2024-01-08 Thread Raphael Taylor-Davies

With 3 +1 votes (3 binding) the release is approved

The release is available here:

It has also been released to crates.io.

Thank you to everyone who helped verify this release

On 05/01/2024 13:29, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.9.0.

This release candidate is based on commit: 
cb16050ec732872d5995c7420cc6858749bbf743 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/cb16050ec732872d5995c7420cc6858749bbf743
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.9.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/cb16050ec732872d5995c7420cc6858749bbf743/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust Object Store 0.9.0 RC1

2024-01-05 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.9.0.

This release candidate is based on commit: 
cb16050ec732872d5995c7420cc6858749bbf743 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/cb16050ec732872d5995c7420cc6858749bbf743
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.9.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/cb16050ec732872d5995c7420cc6858749bbf743/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




Re: [DISCUSS] [DATAFUSION] PMC for new DataFusion top level project

2023-12-20 Thread Raphael Taylor-Davies
> thus not join the DataFusion PMC

This is correct, I don't currently have sufficient bandwidth to be able to 
perform such a role to the level that I would expect of others, in addition to 
my existing commitments.

Kind Regards,

Raphael

On 20 December 2023 20:19:56 GMT, Andy Grove  wrote:
>This list LGTM.
>
>On Wed, Dec 20, 2023 at 1:11 PM Andrew Lamb  wrote:
>
>> Hello,
>>
>> As we have discussed previously [1], we are planning to propose [2]
>> "graduating" the DataFusion to its own top level Apache project.
>>
>> I would like to discuss the initial PMC members for the new top level
>> project.  The suggestion in [1] is
>>
>> > All existing Arrow Committers and PMC members who so desired, would start
>> as committers or PMC members on the new DataFusion project (assuming this
>> is allowed by the process)
>>
>> From what I can tell, this means the PMC would be the following from the
>> current Arrow PMC [3]:
>>
>> Andy Grove (NVidia)
>> Andrew Lamb (InfluxData)
>> Daniël Heres (Coralogix)
>> Jie Wen (SelectDB)
>> Kun Liu (Ebay)
>> Liang-Chi Hsieh (Apple)
>> Qingping Hou (Scribd)
>> Will Jones (VoltronData)
>>
>> I think Raphael Taylor-Davies has told me offline he would prefer to focus
>> on Arrow and thus not join the DataFusion PMC, though it would be nice if
>> he could confirm.
>>
>> We also need to propose a chair of the new PMC -- I am happy to help anyone
>> who would like to do this role, or do it myself. Spending a year as the
>> Arrow PMC chair gave me sufficient experience to make sure the process is
>> smooth, in my opinion.
>>
>> If  the new project is approved and created then the initial PMC will
>> invite the relevant existing Arrow commiters as DataFusion committers.
>>
>> Please let me know your thoughts and if there are other existing Arrow PMC
>> members who should be included in the proposal for initial DataFusion PMC.
>>
>> Andrew
>>
>> p.s. As part of this process, I discovered Arrow's origin as a top level
>> project came from splitting off from the Apache Drill project, which I had
>> not previously known
>>
>>
>> [1]: https://github.com/apache/arrow-datafusion/discussions/6475
>> [2]: https://github.com/apache/arrow-datafusion/issues/8491
>> [3]: https://arrow.apache.org/committers/
>>


Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

2023-11-20 Thread Raphael Taylor-Davies
I really like the idea of leveraging the mature ecosystem support for 
IPC streams [1] to provide a set of conventions for sending and 
receiving arrow data over plain HTTP.


For context, my colleagues and I have run into a number of pain 
points whilst working on FlightSQL:

- The additional indirection via opaque Arrow Flight payloads somewhat 
undermines the value of using an IDL. In arrow-rs we've had to introduce 
custom abstractions [2] to work around this
- The gRPC-imposed message size limits are tricky to accommodate, and 
require non-trivial workarounds [3]
- The FlightData abstraction leaks a lot of IPC details into clients, 
which are fiddly to get correct. Again arrow-rs has added abstractions 
[4] to work around this
- HTTP/2 keep-alives don't work over reverse proxies, as PING frames are 
not associated with a particular stream [5][6]


I therefore think providing a set of conventions for designing protocols 
operating over plain HTTP would be very compelling, as such protocols 
wouldn't encounter these pain points. This would also open the door to 
protocols making use of HTTP/3 where supported, eliminating a number of 
the issues inherent to TCP.


Kind Regards,

Raphael Taylor-Davies

[1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
[2]: 
https://docs.rs/arrow-flight/latest/arrow_flight/sql/server/trait.FlightSqlService.html
[3]: 
https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoderBuilder.html#method.with_max_flight_data_size
[4]: 
https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoder.html
[5]: 
https://github.com/microsoft/reverse-proxy/issues/118#issuecomment-940191553
[6]: 
https://kubernetes.github.io/ingress-nginx/examples/grpc/#notes-on-using-responserequest-streams
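
As a hedged sketch of the "IPC streams over plain HTTP" idea above: encode a
RecordBatch into the Arrow IPC streaming format [1] in memory using arrow-rs;
the resulting bytes could then be served as the body of an ordinary HTTP
response (the choice of HTTP framework and of a content type are assumptions,
not part of the proposal):

use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

// Encode a single batch as an Arrow IPC stream into an in-memory buffer.
// A server could return this buffer directly as an HTTP response body.
fn encode_ipc_stream() -> Result<Vec<u8>, ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    let mut buffer = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut buffer, &schema)?;
        writer.write(&batch)?;
        writer.finish()?; // writes the end-of-stream marker
    }
    Ok(buffer)
}

fn main() -> Result<(), ArrowError> {
    let bytes = encode_ipc_stream()?;
    println!("IPC stream is {} bytes", bytes.len());
    Ok(())
}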


On 20/11/2023 14:23, David Li wrote:

I'm with Kou: what exactly are we trying to specify?

- The HTTP mapping of Flight RPC?
- A full, locked down RPC framework like Flight RPC, but otherwise unrelated?
- Something else?

I'd also ask: do we need to specify anything in the first place? What is stopping people 
from using Arrow in their REST APIs, and what kind of interoperability are we trying to 
achieve? I would say that Flight RPC effectively has no interoperability at all - each 
project using it has its own bespoke layers on top, and the "standardized" RPC 
methods just hinder the applications that would like more control and flexibility that 
Flight RPC does not provide. The recent additions to the Flight RPC spec speak to that: 
they were meant for Flight SQL, but needed to be implemented at the Flight RPC layer; 
there is not a real abstraction layer that Flight RPC really serves.


It could consist only of a specification for how to implement
support for exchanging Arrow-formatted data in an existing REST API.

I would say that this is the only part that might make sense: once a client has 
acquired an Arrow-aware endpoint, what should be the format of the Arrow data 
it gets (whether this is just the Arrow stream format, or something fancier 
like FlightData in Flight RPC).

Separately, it might make sense to define how GraphQL works with Arrow, or 
other specific, full protocols/APIs. But I'm not sure there's much room for a 
Flight RPC equivalent for HTTP/1, if Flight RPC on its own really ever made 
sense as a full framework/protocol in the first place.

On Sat, Nov 18, 2023, at 14:17, Gavin Ray wrote:

I know that myself and a number of folks I work with would be interested in
this.

gRPC is a bit of a barrier for a lot of services.
Having a spec for doing Arrow over HTTP API's would be solid.

In my opinion, it doesn't necessarily need to be REST-ful.
Something like JSON-RPC might fit well with the existing model for Arrow
over the wire that's been implemented in things like Flight/FlightSQL.

Something else I've been interested in (I think Matt Topol has done work in
this area) is Arrow over GraphQL, too:
GraphQL and Apache Arrow: A Match Made in Data (youtube.com)
<https://www.youtube.com/watch?v=5N97TzY_tis>

On Sat, Nov 18, 2023 at 1:52 PM Ian Cook  wrote:


Hi Kou,

I think it is too early to make a specific proposal. I hope to use this
discussion to collect more information about existing approaches. If
several viable approaches emerge from this discussion, then I think we
should make a document listing them, like you suggest.

Thank you for the information about Groonga. This type of straightforward
HTTP-based approach would work in the context of a REST API, as I
understand it.

But how is the performance? Have you measured the throughput of this
approach to see if it is comparable to using Flight SQL? Is this approach
able to saturate a fast network connection?

And what about the case in which the server wants to begin sending batches
to the client before the total number of result batches / records is known?
Would this approach work in that case? I think so but I am not sure.

If this

[RESULT][VOTE][RUST] Release Apache Arrow Rust 49.0.0 RC1

2023-11-13 Thread Raphael Taylor-Davies

With 6 +1 votes (4 binding) the release is approved

The release is available here:

It has also been released to crates.io. I opted to omit arrow-avro as it 
is still a work in progress.


Thank you to everyone who helped verify this release

On 09/11/2023 20:37, Andrew Lamb wrote:

+1 (binding) on Mac x86

Thank you Raphael

Andrew

On Thu, Nov 9, 2023 at 11:49 AM Chao Sun  wrote:


+1 (non-binding)

Verified on M1 Mac. Thanks Raphael.

On Thu, Nov 9, 2023 at 12:47 AM Wayne Xia  wrote:

+1 (non-binding)

Verified on Intel Linux

Thanks Raphael

On Wed, Nov 8, 2023 at 6:12 AM L. C. Hsieh  wrote:


+1 (binding)

Verified on Intel Mac.

Thanks Raphael.

On Tue, Nov 7, 2023 at 1:38 PM Andy Grove  wrote:

+1 (binding)

Verified on Ubuntu 22.04.3 LTS.

Thanks, Raphael.

On Tue, Nov 7, 2023 at 2:22 PM Raphael Taylor-Davies
 wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 49.0.0.

This release candidate is based on commit:
747dcbf0670aeab2ede474edb3c4f22028d6a7e6 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust because...

[1]: https://github.com/apache/arrow-rs/tree/747dcbf0670aeab2ede474edb3c4f22028d6a7e6
[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-49.0.0-rc1
[3]: https://github.com/apache/arrow-rs/blob/747dcbf0670aeab2ede474edb3c4f22028d6a7e6/CHANGELOG.md
[4]: https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [VOTE][RUST] Release Apache Arrow Rust 48.0.1 RC1

2023-11-09 Thread Raphael Taylor-Davies

+1 (binding)

Verified on x86_64 GNU/Linux

On 09/11/2023 20:31, Andrew Lamb wrote:

As discussed on [5], I would like to propose a patch release of Apache
Arrow Rust Implementation, version 48.0.1 to include two bug fixes.

This release candidate is based on commit:
b60fc7bb09ada1385d3542b784fff2915fbc9cff [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:
https://github.com/apache/arrow-rs/tree/b60fc7bb09ada1385d3542b784fff2915fbc9cff
[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-48.0.1-rc1
[3]:
https://github.com/apache/arrow-rs/blob/b60fc7bb09ada1385d3542b784fff2915fbc9cff/CHANGELOG.md
[4]:
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
[5]: https://github.com/apache/arrow-rs/issues/5050



Re: decimal64

2023-11-09 Thread Raphael Taylor-Davies
Perhaps my maths is incorrect, but a decimal64 would have a maximum 
precision of 18, not 19? log10(9223372036854775807) ≈ 18.96

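A quick check of the arithmetic above (a standalone snippet, not Arrow code):
the largest precision whose values all fit in 64 bits is 18, because a run of
19 nines already exceeds i64::MAX = 9223372036854775807.

fn main() {
    // i64::MAX has 19 decimal digits, but not every 19-digit value fits:
    assert_eq!(i64::MAX.to_string().len(), 19);
    assert!(10_i128.pow(19) - 1 > i64::MAX as i128); // 9,999,999,999,999,999,999 overflows i64
    assert!(10_i64.pow(18) - 1 < i64::MAX);          //   999,999,999,999,999,999 fits
    println!("maximum decimal64 precision: 18");
}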

On 09/11/2023 16:01, Curt Hagenlocher wrote:

Recently, someone opened an issue on GitHub ([C++] Decimal64/32 support? ·
Issue #38622 · apache/arrow) asking for support for
narrower decimal types. They were advised to start a thread on the mailing
list, and as they haven't done so yet I will start.

It's fairly common to store currency in databases in a type that's
compatible with decimal64. Both PostgreSQL and Microsoft SQL Server have
"money" data types; for Postgres, this is a decimal(19, 2) and for SQL
Server it's decimal(19, 4). Microsoft Analysis Services also uses
decimal(19, 4) as one of its core data types. If you search the internet
for suggestions on the database type to use for money, the vast majority
recommend a decimal type with a precision <= 19. Currency is something
stored very frequently as data, and it makes sense to have a type that's
optimized for this purpose. I submit that it's a far more common type than
float16, and that even if it's not as hip as the AI scenarios which
popularized float16, the ultimate goal of those scenarios is, after all, to
make more "money".

decimal64 is considerably easier to work with on modern CPUs and in common
programming languages than decimal128, and requires half the amount of
storage space. And while adding new types to Arrow obviously needs to be
done very sparingly, it's harder to imagine a new type for which support
would be easier to implement than this one.

I think decimal32 is much harder to justify. MS SQL Server has a
"smallmoney" (decimal(10, 4)), but I suspect it's not that heavily used.
Maybe others have more feedback on this one.


-Curt



[VOTE][RUST] Release Apache Arrow Rust 49.0.0 RC1

2023-11-07 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 49.0.0.


This release candidate is based on commit: 
747dcbf0670aeab2ede474edb3c4f22028d6a7e6 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/747dcbf0670aeab2ede474edb3c4f22028d6a7e6

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-49.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/747dcbf0670aeab2ede474edb3c4f22028d6a7e6/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-07 Thread Raphael Taylor-Davies

It will contain breaking dependency updates, including object_store.

I hope to cut it today.

On 07/11/2023 11:43, Andrew Lamb wrote:

If the release later in the week doesn't have any breaking API changes,
perhaps it can be 48.1.0 (and thus also get the bugfix to datafusion)

On Tue, Nov 7, 2023 at 6:41 AM Raphael Taylor-Davies
 wrote:


I intend to cut a new arrow release later this week, I would prefer we
wait for this.

On 07/11/2023 11:39, Andrew Lamb wrote:

Perhaps we can create an arrow 48.1.0 patch release to include the fix?

On Tue, Nov 7, 2023 at 12:48 AM Will Jones 

wrote:

Thanks for the clarification, Raphael. That likely narrows the scope of who
is affected. If this bug is present in DataFusion 33, then delta-rs will
likely skip upgrading until 34. If we're the only downstream project this
parsing issue affects, then I think it's fine to release.

On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
 wrote:


Hi,

To further clarify the bug concerns the serde compatibility feature that
allows converting a serde compatible data structure to arrow [1]. It will
not impact workloads reading JSON.

I am not sure this is a sufficiently fundamental bug to warrant special
concern, but happy to defer to others.

Kind Regards,

Raphael

[1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility

On 7 November 2023 03:20:59 GMT, Will Jones 
wrote:

Hello,

There is an upstream bug in arrow-json that can cause the JSON reader to
return incorrect data for large integers [1]. It was recently fixed by
Raphael within the last 24 hours, but is not included in any release. The
bug was introduced in Arrow 48, which this DataFusion release will expose
users to.

Not sure what the precedent here is, but I think either we should consider
either (a) seeing if we can release and upgrade Arrow to include the fix,
or else (b) calling out the regression as a known bug so downstream
projects can include the path in their applications.

Best,

Will Jones

[1] https://github.com/apache/arrow-rs/issues/5038
[2] https://github.com/apache/arrow-rs/pull/5042

On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb 

wrote:

+1 (the tests passed for me). I have left a comment on
https://github.com/apache/arrow-datafusion/issues/8069

On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 

wrote:

I filed https://github.com/apache/arrow-datafusion/issues/8069

On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 

wrote:

I see the same error when I run on my M1 Macbook Air with 16 GB RAM.

 aggregates::tests::run_first_last_multi_partitions stdout

Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
GroupedHashAggregateStream[0] with 1829 bytes already allocated -
maximum available is 605")

It worked fine on my workstation with 128 GB RAM.



On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 

wrote:

Hmm, ran verification script and got one failure:

failures:

 aggregates::tests::run_first_last_multi_partitions stdout



Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
GroupedHashAggregateStream[0] with 1829 bytes already allocated -
maximum available is 605")

failures:
    aggregates::tests::run_first_last_multi_partitions

test result: FAILED. 557 passed; 1 failed; 1 ignored; 0 measured; 0
filtered out; finished in 2.21s



On Mon, Nov 6, 2023 at 6:57 AM Andy Grove 
wrote:

Hi,

I would like to propose a release of Apache Arrow DataFusion Implementation,
version 33.0.0.

This release candidate is based on commit:
262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
The proposed release tarball and signatures are hosted at [2].
The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests, and
vote on the release. The vote will be open for at least 72 hours.

Only votes from PMC members are binding, but all members of the community
are encouraged to test the release and vote with "(non-binding)".

The standard verification procedure is documented at
https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates

[ ] +1 Release this as Apache Arrow DataFusion 33.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow DataFusion 33.0.0

because...

Here is my vote:

+1

[1]:


https://github.com/apache/arrow-datafusion/tree/262f08778b8ec231d96792c01fc3e051640eb5d4

[2]:


https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-33.0.0-rc1

[3]:


https://github.com/apache/arrow-datafusion/blob/262f08778b8ec231d96792c01fc3e051640eb5d4/CHANGELOG.md



Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-07 Thread Raphael Taylor-Davies
I intend to cut a new arrow release later this week, I would prefer we 
wait for this.


On 07/11/2023 11:39, Andrew Lamb wrote:

Perhaps we can create an arrow 48.1.0 patch release to include the fix?

On Tue, Nov 7, 2023 at 12:48 AM Will Jones  wrote:


Thanks for the clarification, Raphael. That likely narrows the scope of who
is affected. If this bug is present in DataFusion 33, then delta-rs will
likely skip upgrading until 34. If we're the only downstream project this
parsing issue affects, then I think it's fine to release.

On Mon, Nov 6, 2023 at 8:22 PM Raphael Taylor-Davies
 wrote:


Hi,

To further clarify the bug concerns the serde compatibility feature that
allows converting a serde compatible data structure to arrow [1]. It will
not impact workloads reading JSON.

I am not sure this is a sufficiently fundamental bug to warrant special
concern, but happy to defer to others.

Kind Regards,

Raphael

[1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility

On 7 November 2023 03:20:59 GMT, Will Jones 
wrote:

Hello,

There is an upstream bug in arrow-json that can cause the JSON reader to
return incorrect data for large integers [1]. It was recently fixed by
Raphael within the last 24 hours, but is not included in any release. The
bug was introduced in Arrow 48, which this DataFusion release will expose
users to.

Not sure what the precedent here is, but I think either we should consider
either (a) seeing if we can release and upgrade Arrow to include the fix,
or else (b) calling out the regression as a known bug so downstream
projects can include the path in their applications.

Best,

Will Jones

[1] https://github.com/apache/arrow-rs/issues/5038
[2] https://github.com/apache/arrow-rs/pull/5042

On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb 

wrote:

+1 (the tests passed for me). I have left a comment on
https://github.com/apache/arrow-datafusion/issues/8069

On Mon, Nov 6, 2023 at 2:02 PM Andy Grove 

wrote:

I filed https://github.com/apache/arrow-datafusion/issues/8069

On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 

wrote:

I see the same error when I run on my M1 Macbook Air with 16 GB RAM.

 aggregates::tests::run_first_last_multi_partitions stdout

Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
GroupedHashAggregateStream[0] with 1829 bytes already allocated -
maximum available is 605")

It worked fine on my workstation with 128 GB RAM.



On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh 

wrote:

Hmm, ran verification script and got one failure:

failures:

 aggregates::tests::run_first_last_multi_partitions stdout



Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
GroupedHashAggregateStream[0] with 1829 bytes already allocated -
maximum available is 605")

failures:
    aggregates::tests::run_first_last_multi_partitions

test result: FAILED. 557 passed; 1 failed; 1 ignored; 0 measured; 0
filtered out; finished in 2.21s



On Mon, Nov 6, 2023 at 6:57 AM Andy Grove 
wrote:

Hi,

I would like to propose a release of Apache Arrow DataFusion Implementation,
version 33.0.0.

This release candidate is based on commit:
262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
The proposed release tarball and signatures are hosted at [2].
The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests, and
vote on the release. The vote will be open for at least 72 hours.

Only votes from PMC members are binding, but all members of the community
are encouraged to test the release and vote with "(non-binding)".

The standard verification procedure is documented at
https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates

[ ] +1 Release this as Apache Arrow DataFusion 33.0.0
[ ] +0
[ ] -1 Do not release this as Apache Arrow DataFusion 33.0.0

because...

Here is my vote:

+1

[1]:


https://github.com/apache/arrow-datafusion/tree/262f08778b8ec231d96792c01fc3e051640eb5d4

[2]:


https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-33.0.0-rc1

[3]:


https://github.com/apache/arrow-datafusion/blob/262f08778b8ec231d96792c01fc3e051640eb5d4/CHANGELOG.md


Re: [VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 33.0.0 RC1

2023-11-06 Thread Raphael Taylor-Davies
Hi,

To further clarify the bug concerns the serde compatibility feature that allows 
converting a serde compatible data structure to arrow [1]. It will not impact 
workloads reading JSON. 

I am not sure this is a sufficiently fundamental bug to warrant special 
concern, but happy to defer to others.

Kind Regards,

Raphael

[1]: https://docs.rs/arrow/latest/arrow/#serde-compatibility
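
For reference, the feature converts serde-serializable Rust values straight
into a RecordBatch without going through JSON text. A rough sketch of that
usage follows; the exact builder and method names are as I recall them from
the linked docs [1], so treat them as an assumption rather than a guaranteed
API:

```
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::ReaderBuilder;
use serde::Serialize;

#[derive(Serialize)]
struct Row {
    id: i64,
    name: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));

    let rows = vec![
        Row { id: 1, name: "a".into() },
        Row { id: 2, name: "b".into() },
    ];

    // Serialize the serde-compatible rows directly into a RecordBatch,
    // bypassing JSON text entirely (the code path the bug concerned).
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
    decoder.serialize(&rows)?;
    let batch = decoder.flush()?.expect("at least one batch");
    assert_eq!(batch.num_rows(), 2);
    Ok(())
}
```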

On 7 November 2023 03:20:59 GMT, Will Jones  wrote:
>Hello,
>
>There is an upstream bug in arrow-json that can cause the JSON reader to
>return incorrect data for large integers [1]. It was recently fixed by
>Raphael within the last 24 hours, but is not included in any release. The
>bug was introduced in Arrow 48, which this DataFusion release will expose
>users to.
>
>Not sure what the precedent here is, but I think either we should consider
>either (a) seeing if we can release and upgrade Arrow to include the fix,
>or else (b) calling out the regression as a known bug so downstream
>projects can include the path in their applications.
>
>Best,
>
>Will Jones
>
>[1] https://github.com/apache/arrow-rs/issues/5038
>[2] https://github.com/apache/arrow-rs/pull/5042
>
>On Mon, Nov 6, 2023 at 12:25 PM Andrew Lamb  wrote:
>
>> +1 (the tests passed for me). I have left a comment on
>> https://github.com/apache/arrow-datafusion/issues/8069
>>
>> On Mon, Nov 6, 2023 at 2:02 PM Andy Grove  wrote:
>>
>> > I filed https://github.com/apache/arrow-datafusion/issues/8069
>> >
>> > On Mon, Nov 6, 2023 at 11:59 AM Andy Grove 
>> wrote:
>> >
>> > > I see the same error when I run on my M1 Macbook Air with 16 GB RAM.
>> > >
>> > >  aggregates::tests::run_first_last_multi_partitions stdout 
>> > > Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
>> > > GroupedHashAggregateStream[0] with 1829 bytes already allocated -
>> maximum
>> > > available is 605")
>> > >
>> > > It worked fine on my workstation with 128 GB RAM.
>> > >
>> > >
>> > >
>> > > On Mon, Nov 6, 2023 at 11:23 AM L. C. Hsieh  wrote:
>> > >
>> > >> Hmm, ran verification script and got one failure:
>> > >>
>> > >> failures:
>> > >>
>> > >>  aggregates::tests::run_first_last_multi_partitions stdout 
>> > >> Error: ResourcesExhausted("Failed to allocate additional 632 bytes for
>> > >> GroupedHashAggregateStream[0] with 1829 bytes already allocated -
>> > >> maximum available is 605")
>> > >>
>> > >> failures:
>> > >> aggregates::tests::run_first_last_multi_partitions
>> > >>
>> > >> test result: FAILED. 557 passed; 1 failed; 1 ignored; 0 measured; 0
>> > >> filtered out; finished in 2.21s
>> > >>
>> > >>
>> > >>
>> > >> On Mon, Nov 6, 2023 at 6:57 AM Andy Grove 
>> > wrote:
>> > >> >
>> > >> > Hi,
>> > >> >
>> > >> > I would like to propose a release of Apache Arrow DataFusion
>> > >> Implementation,
>> > >> > version 33.0.0.
>> > >> >
>> > >> > This release candidate is based on commit:
>> > >> > 262f08778b8ec231d96792c01fc3e051640eb5d4 [1]
>> > >> > The proposed release tarball and signatures are hosted at [2].
>> > >> > The changelog is located at [3].
>> > >> >
>> > >> > Please download, verify checksums and signatures, run the unit
>> tests,
>> > >> and
>> > >> > vote
>> > >> > on the release. The vote will be open for at least 72 hours.
>> > >> >
>> > >> > Only votes from PMC members are binding, but all members of the
>> > >> community
>> > >> > are
>> > >> > encouraged to test the release and vote with "(non-binding)".
>> > >> >
>> > >> > The standard verification procedure is documented at
>> > >> >
>> > >>
>> >
>> https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
>> > >> > .
>> > >> >
>> > >> > [ ] +1 Release this as Apache Arrow DataFusion 33.0.0
>> > >> > [ ] +0
>> > >> > [ ] -1 Do not release this as Apache Arrow DataFusion 33.0.0
>> > because...
>> > >> >
>> > >> > Here is my vote:
>> > >> >
>> > >> > +1
>> > >> >
>> > >> > [1]:
>> > >> >
>> > >>
>> >
>> https://github.com/apache/arrow-datafusion/tree/262f08778b8ec231d96792c01fc3e051640eb5d4
>> > >> > [2]:
>> > >> >
>> > >>
>> >
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-33.0.0-rc1
>> > >> > [3]:
>> > >> >
>> > >>
>> >
>> https://github.com/apache/arrow-datafusion/blob/262f08778b8ec231d96792c01fc3e051640eb5d4/CHANGELOG.md
>> > >>
>> > >
>> >
>>


[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.8.0 RC1

2023-11-06 Thread Raphael Taylor-Davies

With 3 +1 votes (3 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.8.0 



It has also been released to crates.io.

Thank you to everyone who helped verify this release

On 03/11/2023 15:44, Andrew Lamb wrote:

+1

I likewise commented out the -e in the verification script, and have
reviewed the test fix as well as reviewed the output (on mac x86).
Everything looks good to me.

Thank you Raphael,
Andrew


On Thu, Nov 2, 2023 at 5:29 PM L. C. Hsieh  wrote:


Tried the verification script after set -e removed.

Got:

failures:
 local::tests::invalid_path

test result: FAILED. 55 passed; 1 failed; 1 ignored; 0 measured; 0
filtered out; finished in 4.26s

After that, cargo publish --dry-run continued to be run without manual
intervention and got:

+ TEST_SUCCESS=yes
+ echo 'Release candidate looks good!'
Release candidate looks good!

So it looks good then.

+1 if test issues can be ignored.

On Thu, Nov 2, 2023 at 1:31 PM Raphael Taylor-Davies
 wrote:

Aah, because the release script bails out on the first error? You could
either:

- Remove set -e from the top of the script to have it continue on error
- Run the remaining check of `cargo publish --dry-run` manually

I can cut another RC tomorrow if neither of these is a satisfactory
solution, I just felt there was prior precedent for not holding releases
back on test issues.

On 02/11/2023 20:07, L. C. Hsieh wrote:

Hmm, I think we cannot run the verification script unless the issue is

fixed?

On Thu, Nov 2, 2023 at 6:40 AM Raphael Taylor-Davies
 wrote:

Aah, that was a mistake introduced in a test in [1], where it was
incorrectly relying on a particular ordering when listing a directory.
I've created a PR that should fix the issue [2].

As this is purely a testing oversight, I am not sure this warrants
cutting another RC unless you feel otherwise?

[1]:


https://github.com/apache/arrow-rs/pull/5020/files#diff-e0de0bcc9edd75e6cb6bb15dda180797a0c809b1e01b23cc2189f76409a1c2f5R1426

[2]: https://github.com/apache/arrow-rs/pull/5026

On 02/11/2023 13:31, Andrew Lamb wrote:

When I ran the verification script (on mac x86-46) it failed one of the
tests:

```
 local::tests::invalid_path stdout 
thread 'local::tests::invalid_path' panicked at src/local.rs:1389:9:
assertion `left == right` failed
 left: [Path { raw: "directory/child.txt" }, Path { raw: "" }]
right: [Path { raw: "" }, Path { raw: "directory/child.txt" }]
note: run with `RUST_BACKTRACE=1` environment variable to display a
backtrace


failures:
   local::tests::invalid_path
```



On Thu, Nov 2, 2023 at 7:48 AM Raphael Taylor-Davies
 wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.8.0.

This release candidate is based on commit:
ad211fe324d259bf9fea1c43a3a82b3c833f6d7a [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

!! Please note you will need to use the latest version of the
verification scripts on master !!

This is the result of a doctest issue when compiling with only the
default features [5]

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]:



https://github.com/apache/arrow-rs/tree/ad211fe324d259bf9fea1c43a3a82b3c833f6d7a

[2]:



https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.8.0-rc1

[3]:



https://github.com/apache/arrow-rs/blob/ad211fe324d259bf9fea1c43a3a82b3c833f6d7a/object_store/CHANGELOG.md

[4]:



https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh

[5]: https://github.com/apache/arrow-rs/issues/5025




Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.8.0 RC1

2023-11-02 Thread Raphael Taylor-Davies
Aah, because the release script bails out on the first error? You could 
either:


- Remove set -e from the top of the script to have it continue on error
- Run the remaining check of `cargo publish --dry-run` manually

I can cut another RC tomorrow if neither of these is a satisfactory 
solution, I just felt there was prior precedent for not holding releases 
back on test issues.


On 02/11/2023 20:07, L. C. Hsieh wrote:

Hmm, I think we cannot run the verification script unless the issue is fixed?

On Thu, Nov 2, 2023 at 6:40 AM Raphael Taylor-Davies
 wrote:

Aah, that was a mistake introduced in a test in [1], where it was
incorrectly relying on a particular ordering when listing a directory.
I've created a PR that should fix the issue [2].

As this is purely a testing oversight, I am not sure this warrants
cutting another RC unless you feel otherwise?

[1]:
https://github.com/apache/arrow-rs/pull/5020/files#diff-e0de0bcc9edd75e6cb6bb15dda180797a0c809b1e01b23cc2189f76409a1c2f5R1426
[2]: https://github.com/apache/arrow-rs/pull/5026

On 02/11/2023 13:31, Andrew Lamb wrote:

When I ran the verification script (on mac x86-46) it failed one of the
tests:

```
 local::tests::invalid_path stdout 
thread 'local::tests::invalid_path' panicked at src/local.rs:1389:9:
assertion `left == right` failed
left: [Path { raw: "directory/child.txt" }, Path { raw: "" }]
   right: [Path { raw: "" }, Path { raw: "directory/child.txt" }]
note: run with `RUST_BACKTRACE=1` environment variable to display a
backtrace


failures:
  local::tests::invalid_path
```



On Thu, Nov 2, 2023 at 7:48 AM Raphael Taylor-Davies
 wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.8.0.

This release candidate is based on commit:
ad211fe324d259bf9fea1c43a3a82b3c833f6d7a [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

!! Please note you will need to use the latest version of the
verification scripts on master !!

This is the result of a doctest issue when compiling with only the
default features [5]

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]:

https://github.com/apache/arrow-rs/tree/ad211fe324d259bf9fea1c43a3a82b3c833f6d7a
[2]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.8.0-rc1
[3]:

https://github.com/apache/arrow-rs/blob/ad211fe324d259bf9fea1c43a3a82b3c833f6d7a/object_store/CHANGELOG.md
[4]:

https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
[5]: https://github.com/apache/arrow-rs/issues/5025




Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.8.0 RC1

2023-11-02 Thread Raphael Taylor-Davies
Aah, that was a mistake introduced in a test in [1], where it was 
incorrectly relying on a particular ordering when listing a directory. 
I've created a PR that should fix the issue [2].


As this is purely a testing oversight, I am not sure this warrants 
cutting another RC unless you feel otherwise?


[1]: 
https://github.com/apache/arrow-rs/pull/5020/files#diff-e0de0bcc9edd75e6cb6bb15dda180797a0c809b1e01b23cc2189f76409a1c2f5R1426

[2]: https://github.com/apache/arrow-rs/pull/5026

On 02/11/2023 13:31, Andrew Lamb wrote:

When I ran the verification script (on mac x86-46) it failed one of the
tests:

```
 local::tests::invalid_path stdout 
thread 'local::tests::invalid_path' panicked at src/local.rs:1389:9:
assertion `left == right` failed
   left: [Path { raw: "directory/child.txt" }, Path { raw: "" }]
  right: [Path { raw: "" }, Path { raw: "directory/child.txt" }]
note: run with `RUST_BACKTRACE=1` environment variable to display a
backtrace


failures:
 local::tests::invalid_path
```



On Thu, Nov 2, 2023 at 7:48 AM Raphael Taylor-Davies
 wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.8.0.

This release candidate is based on commit:
ad211fe324d259bf9fea1c43a3a82b3c833f6d7a [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

!! Please note you will need to use the latest version of the
verification scripts on master !!

This is the result of a doctest issue when compiling with only the
default features [5]

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]:

https://github.com/apache/arrow-rs/tree/ad211fe324d259bf9fea1c43a3a82b3c833f6d7a
[2]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.8.0-rc1
[3]:

https://github.com/apache/arrow-rs/blob/ad211fe324d259bf9fea1c43a3a82b3c833f6d7a/object_store/CHANGELOG.md
[4]:

https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
[5]: https://github.com/apache/arrow-rs/issues/5025




[VOTE][RUST] Release Apache Arrow Rust Object Store 0.8.0 RC1

2023-11-02 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.8.0.

This release candidate is based on commit: 
ad211fe324d259bf9fea1c43a3a82b3c833f6d7a [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

!! Please note you will need to use the latest version of the 
verification scripts on master !!


This is the result of a doctest issue when compiling with only the 
default features [5]


The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/ad211fe324d259bf9fea1c43a3a82b3c833f6d7a
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.8.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/ad211fe324d259bf9fea1c43a3a82b3c833f6d7a/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh

[5]: https://github.com/apache/arrow-rs/issues/5025



[RESULT][VOTE][RUST] Release Apache Arrow Rust 48.0.0 RC2

2023-10-23 Thread Raphael Taylor-Davies

With 4 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-48.0.0/


It has also been released to crates.io. I opted to omit arrow-avro as it 
is still a work in progress.


Thank you to everyone who helped verify this release

On 18/10/2023 14:59, Raphael Taylor-Davies wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 48.0.0 *RC2*.


Please note that there were issues with the first release candidate 
that required cutting a second.


This release candidate is based on commit: 
51ac6fec8755147cd6b1dfe7d76bfdcfacad0463 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-48.0.0-rc2
[3]: 
https://github.com/apache/arrow-rs/blob/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


Re: Apache Arrow file format

2023-10-18 Thread Raphael Taylor-Davies
To further what others have already mentioned, the IPC file format is 
primarily optimised for IPC use-cases, that is, exchanging the entire 
contents between processes. It is relatively inexpensive to encode and 
decode, and supports all arrow datatypes, making it ideal for things 
like spill-to-disk processing, distributed shuffles, etc.


Parquet by comparison is a storage format, optimised for space 
efficiency and selective querying, with [1] containing an overview of 
the various techniques the format affords. It is comparatively expensive 
to encode and decode, and instead relies on index structures and 
statistics to accelerate access.


Both are therefore perfectly viable options depending on your particular 
use-case.


[1]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
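
To make the trade-off concrete, here is a minimal sketch using the Rust
arrow and parquet crates (the tiny schema and file paths are made up for the
example) that writes the same batch as an IPC file and as a Parquet file:

```
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )?;

    // Arrow IPC file: cheap to encode and decode, no statistics or
    // space-optimised encodings, ideal for spill/shuffle style workloads.
    let mut ipc = FileWriter::try_new(File::create("/tmp/data.arrow")?, &schema)?;
    ipc.write(&batch)?;
    ipc.finish()?;

    // Parquet: more expensive to encode, but a compact storage format whose
    // statistics and indexes allow selective reads.
    let mut pq = ArrowWriter::try_new(File::create("/tmp/data.parquet")?, schema, None)?;
    pq.write(&batch)?;
    pq.close()?;

    Ok(())
}
```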


On 18/10/2023 13:59, Dewey Dunnington wrote:

Plenty of opinions here already, but I happen to think that IPC
streams and/or Arrow File/Feather are wildly underutilized. For the
use-case where you're mostly just going to read an entire file into R
or Python it's a bit faster (and far superior to a CSV or pickling or
.rds files in R).


you're going to read all the columns for a record batch in the file, no matter 
what

The metadata for each every column in every record batch has to be
read, but there's nothing inherent about the format that prevents
selectively loading into memory only the required buffers. (I don't
know off the top of my head if any reader implementation actually does
this).

On Wed, Oct 18, 2023 at 12:02 AM wish maple  wrote:

Arrow IPC file is great, it focuses on in-memory representation and direct
computation.
Basically, it can support compression and dictionary encoding, and can
zero-copy
deserialize the file to memory Arrow format.

Parquet provides some strong functionality, like statistics, which can help
prune unnecessary data during scanning and avoid CPU and IO cost. And it has
highly efficient encoding, which can make the Parquet file smaller than the
Arrow IPC file for the same data. However, currently some Arrow data types
cannot be converted to corresponding Parquet types in the arrow-cpp
implementation. You can go to the Arrow documentation to take a look.

Adam Lippai wrote on Wednesday, 18 October 2023 at 10:50:


Also there is
https://github.com/lancedb/lance between the two formats. Depending on the
use case it can be a great choice.

Best regards
Adam Lippai

On Tue, Oct 17, 2023 at 22:44 Matt Topol  wrote:


One benefit of the feather format (i.e. Arrow IPC file format) is the
ability to mmap the file to easily handle reading sections of a larger than
memory file of data. Since, as Felipe mentioned, the format is focused on
in-memory representation, you can easily and simply mmap the file and use
the raw bytes directly. For a large file that you only want to read
sections of, this can be beneficial for IO and memory usage.

Unfortunately, you are correct that it doesn't allow for easy column
projecting (you're going to read all the columns for a record batch in the
file, no matter what). So it's going to be a trade off based on your needs
as to whether it makes sense, or if you should use a file format like
Parquet instead.

-Matt


On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
felipe...@gmail.com>
wrote:


It’s not the best since the format is really focused on in-memory
representation and direct computation, but you can do it:

https://arrow.apache.org/docs/python/feather.html

—
Felipe

On Tue, 17 Oct 2023 at 23:26 Nara 

wrote:

Hi,

Is it a good idea to use Apache Arrow as a file format? Looks like
projecting columns isn't available by default.

One of the benefits of Parquet file format is column projection, where the
IO is limited to just the columns projected.

Regards ,
Nara



[VOTE][RUST] Release Apache Arrow Rust 48.0.0 RC2

2023-10-18 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 48.0.0 *RC2*.


Please note that there were issues with the first release candidate that 
required cutting a second.


This release candidate is based on commit: 
51ac6fec8755147cd6b1dfe7d76bfdcfacad0463 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-48.0.0-rc2
[3]: 
https://github.com/apache/arrow-rs/blob/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


Re: [DISCUSS][Rust][DataFusion][HiveMetaStore] Possible Metastore integration with Data Fusion

2023-10-04 Thread Raphael Taylor-Davies

Hi,

I think [1] might be a good place to start and handle coordination for 
this undertaking. I suspect it would probably want to live under the 
datafusion-contrib organisation, similar to HDFS [2]


Kind Regards,

Raphael

[1]: https://github.com/apache/arrow-datafusion/issues/2209
[2]: https://github.com/datafusion-contrib/datafusion-objectstore-hdfs

On 04/10/2023 00:18, Kothapalli, Vamsi wrote:

Hi devs,

I would like to start a discussion about a possible integration of Hive
Metastore with DataFusion.

Similarly, along the lines of the Glue catalog integration in
https://github.com/apache/arrow-datafusion/issues/2206, can anyone suggest
what the code should look like if I or someone else would like to work on
building a Hive Metastore catalog provider feature in DataFusion?

Thanks,
Vamsi






Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Raphael Taylor-Davies
ast

```
--------------------------------------------------------------------------------
Benchmark                                      Time          CPU    Iterations UserCounters...
--------------------------------------------------------------------------------
ConvertViews<(from type), (to type), (length category)>
--------------------------------------------------------------------------------
ConvertViews<…>                            42877057 ns  42875885 ns     16  items_per_second=32.6081M/s
ConvertViews<…, kUsuallyInlineable>        34079672 ns  34075604 ns     21  items_per_second=30.772M/s
ConvertViews<…, kShortButNeverInlineable>  16044043 ns  16043702 ns     43  items_per_second=34.8573M/s
ConvertViews<…, kLongAndSeldomInlineable>   1717984 ns   1717955 ns    376  items_per_second=38.1477M/s
ConvertViews<…, kLongAndNeverInlineable>    1707074 ns   1706973 ns    413  items_per_second=38.3931M/s

ConvertViews<…>                            85538939 ns  85532072 ns      8  items_per_second=16.3459M/s
ConvertViews<…, kUsuallyInlineable>        66432452 ns  66417147 ns     10  items_per_second=15.7877M/s
ConvertViews<…, kShortButNeverInlineable>  36025089 ns  36021631 ns     19  items_per_second=15.5251M/s
ConvertViews<…, kLongAndSeldomInlineable>   8791312 ns   8789937 ns     80  items_per_second=7.4558M/s
ConvertViews<…, kLongAndNeverInlineable>    6272905 ns   6272238 ns    112  items_per_second=10.4486M/s

ConvertViews<…>                            15400749 ns  15400729 ns     45  items_per_second=90.7815M/s
ConvertViews<…, kUsuallyInlineable>        21527529 ns  21527622 ns     33  items_per_second=48.7084M/s
ConvertViews<…, kShortButNeverInlineable>  25101062 ns  25099755 ns     28  items_per_second=22.2807M/s
ConvertViews<…, kLongAndSeldomInlineable>   2665299 ns   2665111 ns    262  items_per_second=24.5903M/s
ConvertViews<…, kLongAndNeverInlineable>    2694563 ns   2694485 ns    260  items_per_second=24.3223M/s

ConvertViews<…>                            15359965 ns  15358626 ns     46  items_per_second=91.0303M/s
ConvertViews<…, kUsuallyInlineable>        13967232 ns  13967093 ns     50  items_per_second=75.0748M/s
ConvertViews<…, kShortButNeverInlineable>   7861021 ns   7860546 ns     89  items_per_second=71.1452M/s
ConvertViews<…, kLongAndSeldomInlineable>    729323 ns    729272 ns    969  items_per_second=89.865M/s
ConvertViews<…, kLongAndNeverInlineable>     709887 ns    709827 ns    965  items_per_second=92.3267M/s
```

Sincerely,
Ben Kietzman

[1]
https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings

On Mon, Oct 2, 2023 at 9:22 AM Andrew Lamb  wrote:


I don't think "we have to adjust the Arrow format so that existing
internal representations become Arrow-compliant without any
(re-)implementation effort" is a reasonable design principle.

I agree with this statement from Antoine -- given the Arrow community has
standardized an addition to the format with StringView, I think it would
help to get some input from those at DuckDB and Velox on their perspective

Andrew




On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
 wrote:


Oh I'm with you on it being a precedent we want to be very careful about
setting, but if there isn't a meaningful performance difference, we may
be able to sidestep that discussion entirely.

On 02/10/2023 14:11, Antoine Pitrou wrote:

Even if performance were significant better, I don't think it's a good
enough reason to add these representations to Arrow. By construction,
a standard cannot continuously chase the performance state of art, it
has to weigh the benefits of performance improvements against the
increased cost for the ecosystem (for example the cost of adapting to
frequent standard changes and a growing standard size).

We have extension types which could reasonably be used for
non-standard data types, especially the kind that are motivated by
leading-edge performance research and innovation and come with unusual
constraints (such as requiring trusting and dereferencing raw pointers
embedded in data buffers). There could even be an argument for making
some of them canonical extension types if there's enough anteriority
in favor.

Regards

Antoine.


Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Raphael Taylor-Davies
Oh I'm with you on it being a precedent we want to be very careful about 
setting, but if there isn't a meaningful performance difference, we may 
be able to sidestep that discussion entirely.


On 02/10/2023 14:11, Antoine Pitrou wrote:


Even if performance were significant better, I don't think it's a good 
enough reason to add these representations to Arrow. By construction, 
a standard cannot continuously chase the performance state of art, it 
has to weigh the benefits of performance improvements against the 
increased cost for the ecosystem (for example the cost of adapting to 
frequent standard changes and a growing standard size).


We have extension types which could reasonably be used for 
non-standard data types, especially the kind that are motivated by 
leading-edge performance research and innovation and come with unusual 
constraints (such as requiring trusting and dereferencing raw pointers 
embedded in data buffers). There could even be an argument for making 
some of them canonical extension types if there's enough anteriority 
in favor.


Regards

Antoine.


Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :

I think what would really help would be some concrete numbers, do we
have any numbers comparing the performance of the offset and pointer
based representations? If there isn't a significant performance
difference between them, would the systems that currently use a
pointer-based approach be willing to meet us in the middle and switch to
an offset based encoding? This to me feels like it would be the best
outcome for the ecosystem as a whole.

Kind Regards,

Raphael

On 02/10/2023 13:50, Antoine Pitrou wrote:


Le 01/10/2023 à 16:21, Micah Kornfield a écrit :


I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be 
implemented in

some Arrow libraries.


I've lost a little context but on all the concerns of adding raw
pointers
as an official option to the spec.  But I see making raw-pointer
variants
the best path forward.

Things captured from this thread or seem obvious at least to me:
1.  Divergence of IPC spec from in-memory/C-ABI spec?
2.  More parts of the spec to cover.
3.  In-compatibility with some languages
4.  Validation (in my mind different use-cases require different
levels of
validation, so this is a little bit less of a concern in my mind).

I think the broader issue is how we think about compatibility with 
other

systems.  For instance, what happens if Velox and DuckDb start adding
new
divergent memory layouts?  Are we expecting to add them to the spec?


This is a slippery slope. The more Arrow has a policy of integrating
existing practices simply because they exist, the more the Arrow
format will become _à la carte_, with different implementations
choosing to implement whatever they want to spend their engineering
effort on (you can see this occur, in part, on the Parquet format with
its many different encodings, compression algorithms and a 96-bit
timestamp type).

We _have_ to think carefully about the middle- and long-term future of
the format when adopting new features.

In this instance, we are doing a large part of the effort by adopting
a string view format with variadic buffers, inlined prefixes and
offset-based views into those buffers. But some implementations with
historically different internal representations will have to share
part of the effort to align with the newly standardized format.

I don't think "we have to adjust the Arrow format so that existing
internal representations become Arrow-compliant without any
(re-)implementation effort" is a reasonable design principle.

Regards

Antoine.


Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Raphael Taylor-Davies
I think what would really help would be some concrete numbers, do we 
have any numbers comparing the performance of the offset and pointer 
based representations? If there isn't a significant performance 
difference between them, would the systems that currently use a 
pointer-based approach be willing to meet us in the middle and switch to 
an offset based encoding? This to me feels like it would be the best 
outcome for the ecosystem as a whole.


Kind Regards,

Raphael

On 02/10/2023 13:50, Antoine Pitrou wrote:


Le 01/10/2023 à 16:21, Micah Kornfield a écrit :


I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented in
some Arrow libraries.


I've lost a little context but on all the concerns of adding raw 
pointers
as an official option to the spec.  But I see making raw-pointer 
variants

the best path forward.

Things captured from this thread or seem obvious at least to me:
1.  Divergence of IPC spec from in-memory/C-ABI spec?
2.  More parts of the spec to cover.
3.  In-compatibility with some languages
4.  Validation (in my mind different use-cases require different 
levels of

validation, so this is a little bit less of a concern in my mind).

I think the broader issue is how we think about compatibility with other
systems.  For instance, what happens if Velox and DuckDb start adding 
new

divergent memory layouts?  Are we expecting to add them to the spec?


This is a slippery slope. The more Arrow has a policy of integrating 
existing practices simply because they exist, the more the Arrow 
format will become _à la carte_, with different implementations 
choosing to implement whatever they want to spend their engineering 
effort on (you can see this occur, in part, on the Parquet format with 
its many different encodings, compression algorithms and a 96-bit 
timestamp type).


We _have_ to think carefully about the middle- and long-term future of 
the format when adopting new features.


In this instance, we are doing a large part of the effort by adopting 
a string view format with variadic buffers, inlined prefixes and 
offset-based views into those buffers. But some implementations with 
historically different internal representations will have to share 
part of the effort to align with the newly standardized format.


I don't think "we have to adjust the Arrow format so that existing 
internal representations become Arrow-compliant without any 
(re-)implementation effort" is a reasonable design principle.


Regards

Antoine.


Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-10-02 Thread Raphael Taylor-Davies

+1

On 02/10/2023 13:53, Antoine Pitrou wrote:


Hello,

+1 and thanks for working on this!

There'll probably be some minor comments to the format PR, but those 
don't deter from accepting these new layouts into the standard.


Regards

Antoine.


Le 29/09/2023 à 14:09, Felipe Oliveira Carvalho a écrit :

Hello,

I'd like to propose adding ListView and LargeListView arrays to the 
Arrow

format.
Previous discussion in [1][2], columnar format description and 
flatbuffers

changes in [3].

There are implementations available in both C++ [4] and Go [5]. I'm 
working
on the integration tests which I will push to one of the PR branches 
before
they are merged. I've made a graph illustrating how this addition 
affects,
in a backwards compatible way, the type predicates and inheritance 
chain on

the C++ implementation. [6]

The vote will be open for at least 72 hours not counting the weekend.

[ ] +1 add the proposed ListView and LargeListView types to the Apache
Arrow format
[ ] -1 do not add the proposed ListView and LargeListView types to the
Apache Arrow format
because...

Sincerely,
Felipe

[1] https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb
[2] https://lists.apache.org/thread/dcwdzhz15fftoyj6xp89ool9vdk3rh19
[3] https://github.com/apache/arrow/pull/37877
[4] https://github.com/apache/arrow/pull/35345
[5] https://github.com/apache/arrow/pull/37468
[6] https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db



Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1

2023-09-29 Thread Raphael Taylor-Davies

With 5 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.7.1/


It has also been released to crates.io

Thank you to everyone who helped verify this release

On 26/09/2023 17:01, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.7.1.

This release candidate is based on commit: 
4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-09-29 Thread Raphael Taylor-Davies

Hi Felipe,

Can I confirm that DuckDB and Velox use the same encoding for these 
types, and so we aren't going to run into similar issues as [1]?


Kind Regards,

Raphael Taylor-Davies

[1]: https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy

On 29/09/2023 13:09, Felipe Oliveira Carvalho wrote:

Hello,

I'd like to propose adding ListView and LargeListView arrays to the Arrow
format.
Previous discussion in [1][2], columnar format description and flatbuffers
changes in [3].

There are implementations available in both C++ [4] and Go [5]. I'm working
on the integration tests which I will push to one of the PR branches before
they are merged. I've made a graph illustrating how this addition affects,
in a backwards compatible way, the type predicates and inheritance chain on
the C++ implementation. [6]

The vote will be open for at least 72 hours not counting the weekend.

[ ] +1 add the proposed ListView and LargeListView types to the Apache
Arrow format
[ ] -1 do not add the proposed ListView and LargeListView types to the
Apache Arrow format
because...

Sincerely,
Felipe

[1] https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb
[2] https://lists.apache.org/thread/dcwdzhz15fftoyj6xp89ool9vdk3rh19
[3] https://github.com/apache/arrow/pull/37877
[4] https://github.com/apache/arrow/pull/35345
[5] https://github.com/apache/arrow/pull/37468
[6] https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db



Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Raphael Taylor-Davies
FWIW Rust wouldn't have issues using raw pointers, though I can't speak for 
other languages. They would be more expensive to validate, but validation is 
not going to be cheap regardless.

I could definitely see a world where view types use pointers and IPC coerces 
to/from the large non-view types. IPC has to copy the string data regardless 
and re-encoding would avoid encoding masked data.

The notion of supporting both is less of an exciting prospect... I'm also not 
sure if it is too late to make changes at this stage.
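
For anyone following along, a sketch of the two 16-byte view layouts under
discussion (field names are mine; the offset-based form follows the Utf8View
description in the columnar spec for strings longer than 12 bytes, the
pointer-based form is the DuckDB/Velox style):

```
/// Offset-based view (Arrow spec): the string bytes live in one of the
/// array's variadic data buffers, addressed by (buffer index, offset).
#[repr(C)]
struct OffsetView {
    len: u32,        // string length in bytes
    prefix: [u8; 4], // first four bytes, for cheap comparisons
    buffer: u32,     // index of the variadic data buffer
    offset: u32,     // byte offset of the string within that buffer
}

/// Raw-pointer view (DuckDB/Velox style): the last 8 bytes point directly
/// at the string bytes, so no buffer lookup is needed on access, but the
/// view cannot be validated without trusting the pointer.
#[repr(C)]
struct PointerView {
    len: u32,
    prefix: [u8; 4],
    ptr: *const u8,
}
```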

On 28 September 2023 15:26:57 BST, Wes McKinney  wrote:
>hi all,
>
>I'm just catching up on this thread after having taken a look at the format
>PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02
>from having spent a great deal less time on this project than others.
>
>The original motivation I had for bringing up the idea of adding the
>StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory
>format (though not necessarily the IPC format) was to provide a path for
>zero-copy interoperability in some cases with these systems when dealing
>with strings, and to enhance performance within Arrow-applications (setting
>aside the external interop goal) in scenarios where being able to point to
>external memory spaces could avoid a copy-and-repack step. I think it's
>useful to have an zero-copy IPC-compatible string format (i.e. what was
>proposed and merged into Columnar.rst) for that allows for out-of-order
>construction or arrays, reuse of memory (e.g. consider the case of decoding
>dictionary encoding Parquet data — not having to copy strings many times
>when rehydrating string arrays), and chunked allocation — all good things
>that the existing Arrow VarBinary layout does not provide for.
>
>For the in-memory side of things, I am somewhat more of Antoine's
>perspective that trying to have both in-memory (index+offset and raw
>pointers) creates a kind of uncanny valley situation that may confuse users
>and cause other problems (especially if the raw pointer version is only
>found in the C++ library). The raw pointer version also cannot be
>validated, but I see validation as less of a requirement and more of a
>"nice to have" (I realize others see validation as more of a requirement).
>
>* I see the raw-pointer type has having more net utility (going back to the
>original motivation), but I also see how it is problematic for some non-C++
>implementations.
>* The index-offset version is intrinsic value over the existing "dense"
>varbinary layout (per some of the benefits above) but does not satisfy the
>external interoperability goal with systems that are becoming more popular
>month over month
>* Incoming data from external systems that use the raw pointer model have
>to be serialized (and perhaps repacked) to the index-offset model. This
>isn't ideal — going the other way (from index-offset to raw pointer) is
>just a pointer swizzle, comparatively inexpensive.
>
>So it seems like we have several paths available, none of them wholly
>satisfactory:
>
>1. Essentially what's in the existing PR — the raw pointer variant which is
>"non-standard"
>2. Pick one and only one for in memory — I think the raw pointer version is
>more useful given that swizzling from index-offset is pretty cheap. But the
>raw pointer version can't be validated safely and is problematic for e.g.
>Rust. Picking the index-offset version means that the external ecosystem of
>columnar engines won't be that much closer aligned to Arrow than they are
>now.
>3. Implement the raw pointer variant as an extension type in C++ / C ABI.
>This seems potentially useful but given that it would likely be disfavored
>for data originating from Arrow-land, there would be fewer scenarios where
>zero-copy interop for strings is achieved
>
>This is difficult and I don't know what the best answer is, but personally
>my inclination has been toward choices that are utilitarian and help with
>alignment and cohesion in the open source ecosystem.
>
>- Wes
>
>On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou  wrote:
>
>>
>> To make things clear, any of the factory functions listed below create a
>> type that maps exactly onto an Arrow columnar layout:
>> https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
>>
>> For example, calling `arrow::dictionary` creates a dictionary type that
>> exactly represents the dictionary layout specified in
>>
>> https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
>>
>> Similarly, if you use any of the builders listed below, what you will
>> get at the end is data that complies with the Arrow columnar specification:
>> https://arrow.apache.org/docs/dev/cpp/api/builder.html
>>
>> All the core Arrow C++ APIs create and process data which complies with
>> the Arrow specification, and which is interoperable with other Arrow
>> implementations.
>>
>> Conversely, non-Arrow data such as CSV or Parquet (or Python lists,
>> etc.) goes through 

Re: [DISCUSS][C++] Raw pointer string views

2023-09-27 Thread Raphael Taylor-Davies
Do you have any benchmarks comparing kernels with native pointer array support, 
compared to those that must first convert to the offset representation? I think 
this would help ground this discussion empirically.

On 27 September 2023 22:25:02 BST, Benjamin Kietzman  
wrote:
>Hello all,
>
>@Gang
>> Could you please simply describe the layout of DuckDB and Velox
>
>Arrow represents long (>12 bytes) strings with a view which includes
>a buffer index (used to look up one of the variadic data buffers)
>and an offset (used to find the start of a string's bytes within the
>indicated buffer). DuckDB and Velox by contrast have a raw pointer
>directly to the start of the string's bytes. Since these occupy the
>same 8 bytes of a view, it's possible and fairly efficient to convert
>from one representation to the other by modifying those 8 bytes in place.
>
>@Raphael
>> Is the motivation here to avoid DuckDB and Velox having to duplicate the
>conversion logic from pointer-based to offset-based, or to allow
>arrow-cpp to operate directly on pointer-based arrays?
>
>It's more the latter; arrow C++ is intended to be useful as more than an IPC
>serializer/deserializer, so it is beneficial to be able to import arrays
>and also operate on them with no conversion cost. However it's also worth
>noting that the raw pointer representation is more efficient on access,
>albeit more expensive to validate along with a number of other tradeoffs.
>In order to progress this work, I took this hybrid approach in part to defer
>the question of which representation is preferred in which context. I would
>like to allow the C++ library freedom to extract as much performance from
>this type as possible, internally as well as when communicating with other
>engines.
>
>@Antoine
>> What this PR is creating is an "unofficial" Arrow format, with data
>types exposed in Arrow C++ that are not part of the Arrow standard, but
>are exposed as if they were.
>
>We already do this in every implementation of the arrow format I'm
>aware of: it's more convenient to consider dictionary as a data type
>even though the spec says that it is a field property. I don't think
>it's illegal or unreasonable for an implementation to diverge in their
>internal handling of arrow data (whether to achieve performance,
>consistency, or convenience).
>
>> I'm not sure how DuckDB and Velox data could be exposed, but it could be
>for example an extension type with a fixed_size_binary<16> storage type.
>
>This wouldn't allow for the transmission of the variadic data buffers
>which (even in the presence of raw pointer views) are necessary to
>guarantee the lifetime of string data in the vector. Alternatively we
>could use Utf8View with the high and low bits of the raw pointer
>packed into the index and offset, but I don't think this would be less
>tantamount to an unofficial arrow format.
>
>Sincerely,
>Ben Kietzman
>
>
>On Wed, Sep 27, 2023 at 2:51 AM Antoine Pitrou  wrote:
>
>>
>> Hello,
>>
>> What this PR is creating is an "unofficial" Arrow format, with data
>> types exposed in Arrow C++ that are not part of the Arrow standard, but
>> are exposed as if they were. Most users will probably not read the
>> official format spec, but will simply trust the official Arrow
>> implementations. So the official Arrow implementations have an
>> obligation to faithfully represent the Arrow format and not breed
>> confusion.
>>
>> So I'm -1 on the way the PR presents things currently.
>>
>> I'm not sure how DuckDB and Velox data could be exposed, but it could be
>> for example an extension type with a fixed_size_binary<16> storage type.
>>
>> Regards
>>
>> Antoine.
>>
>>
>>
>> Le 26/09/2023 à 22:34, Benjamin Kietzman a écrit :
>> > Hello all,
>> >
>> > In the PR to add support for Utf8View to the c++ implementation,
>> > I've taken the approach of allowing raw pointer views [1] alongside the
>> > index/offset views described in the spec [2]. This was done to ease
>> > communication with other engines such as DuckDB and Velox whose native
>> > string representation is the raw pointer view. In order to be usable
>> > as a utility for writing IPC files and other operations on arrow
>> > formatted data, it is useful for the library to be able to directly
>> > import raw pointer arrays even when immediately converting these to
>> > the index/offset representation.
>> >
>> > However there has been objection in review [3] since the raw pointer
>> > representation is not part of the official format. Since data visitation
>> > utilities are generic, IMHO this hybrid approach does not add
>> > significantly to the complexity of the C++ library, and I feel the
>> > aforementioned interoperability is a high priority when adding this
>> > feature to the C++ library. It's worth noting that this interoperability
>> > has been a stated goal of the Utf8Type since its original proposal [4]
>> > and throughout the discussion of its adoption [5].
>> >
>> > Sincerely,
>> > Ben Kietzman
>> >
>> > [1]:
>> >

Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Raphael Taylor-Davies
I'm confused why this would need to copy string data. Assuming the pointers are 
into defined memory regions, something the C data interface's ownership 
semantics require regardless, why can't these memory regions just be used as 
buffers as is? The conversion would then only require rewriting the views buffer to 
subtract the base pointer of the given buffer, which should be extremely fast?
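
A sketch of the rewrite I have in mind, assuming every non-inlined view points
into a single known data buffer (the view structs and field names are mine,
mirroring the spec's description):

```
#[repr(C)]
#[derive(Clone, Copy)]
struct PointerView { len: u32, prefix: [u8; 4], ptr: *const u8 }

#[repr(C)]
#[derive(Clone, Copy)]
struct OffsetView { len: u32, prefix: [u8; 4], buffer: u32, offset: u32 }

/// Convert raw-pointer views into offset-based views against `data` by
/// subtracting the buffer's base address. Only the fixed-width views buffer
/// is rewritten; the string bytes themselves are never copied. A real
/// implementation would skip inlined (<= 12 byte) views and handle multiple
/// data buffers.
fn swizzle(views: &[PointerView], data: &[u8]) -> Vec<OffsetView> {
    let base = data.as_ptr() as usize;
    views
        .iter()
        .map(|v| OffsetView {
            len: v.len,
            prefix: v.prefix,
            buffer: 0, // single data buffer in this sketch
            offset: (v.ptr as usize - base) as u32,
        })
        .collect()
}
```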

On 26 September 2023 23:34:54 BST, Matt Topol  wrote:
>I believe the motivation is to avoid the cost of the data copy that would
>have to happen to convert from a pointer based to offset based scenario.
>Allowing the pointer-based implementation will ensure that we can maintain
>zero-copy communication with both DuckDB and Velox in a common workflow
>scenario.
>
>Converting to the offset-based version would have a cost of having to copy
>strings from their locations to contiguous buffers which could end up being
>very significant depending on the shape and size of the data. The pointer
>-based solution wouldn't be allowed in IPC though, only across the C Data
>interface (correct me if I'm wrong).
>
>--Matt
>
>On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies
> wrote:
>
>> Hi,
>>
>> Is the motivation here to avoid DuckDB and Velox having to duplicate the
>> conversion logic from pointer-based to offset-based, or to allow
>> arrow-cpp to operate directly on pointer-based arrays?
>>
>> If it is the former, I personally wouldn't have thought the conversion
>> logic sufficiently complex to really warrant this?
>>
>> If it is the latter, I wonder if you have some benchmark numbers for
>> converting between and operating on the differing representations? In
>> the absence of a strong performance case, it's hard in my opinion to
>> justify adding what will be an arrow-cpp specific extension that isn't
>> part of the standard, with all the potential for confusion and
>> interoperability challenges that entails.
>>
>> Kind Regards,
>>
>> Raphael
>>
>> On 26/09/2023 21:34, Benjamin Kietzman wrote:
>> > Hello all,
>> >
>> > In the PR to add support for Utf8View to the c++ implementation,
>> > I've taken the approach of allowing raw pointer views [1] alongside the
>> > index/offset views described in the spec [2]. This was done to ease
>> > communication with other engines such as DuckDB and Velox whose native
>> > string representation is the raw pointer view. In order to be usable
>> > as a utility for writing IPC files and other operations on arrow
>> > formatted data, it is useful for the library to be able to directly
>> > import raw pointer arrays even when immediately converting these to
>> > the index/offset representation.
>> >
>> > However there has been objection in review [3] since the raw pointer
>> > representation is not part of the official format. Since data visitation
>> > utilities are generic, IMHO this hybrid approach does not add
>> > significantly to the complexity of the C++ library, and I feel the
>> > aforementioned interoperability is a high priority when adding this
>> > feature to the C++ library. It's worth noting that this interoperability
>> > has been a stated goal of the Utf8Type since its original proposal [4]
>> > and throughout the discussion of its adoption [5].
>> >
>> > Sincerely,
>> > Ben Kietzman
>> >
>> > [1]:
>> >
>> https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
>> > [2]:
>> >
>> https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
>> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
>> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
>> >
>>


Re: [DISCUSS][C++] Raw pointer string views

2023-09-26 Thread Raphael Taylor-Davies

Hi,

Is the motivation here to avoid DuckDB and Velox having to duplicate the 
conversion logic from pointer-based to offset-based, or to allow 
arrow-cpp to operate directly on pointer-based arrays?


If it is the former, I personally wouldn't have thought the conversion 
logic sufficiently complex to really warrant this?


If it is the latter, I wonder if you have some benchmark numbers for 
converting between and operating on the differing representations? In 
the absence of a strong performance case, it's hard in my opinion to 
justify adding what will be an arrow-cpp specific extension that isn't 
part of the standard, with all the potential for confusion and 
interoperability challenges that entails.


Kind Regards,

Raphael

On 26/09/2023 21:34, Benjamin Kietzman wrote:

Hello all,

In the PR to add support for Utf8View to the c++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. In order to be usable
as a utility for writing IPC files and other operations on arrow
formatted data, it is useful for the library to be able to directly
import raw pointer arrays even when immediately converting these to
the index/offset representation.

However there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8Type since its original proposal [4]
and throughout the discussion of its adoption [5].

Sincerely,
Ben Kietzman

[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4



[VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.1 RC1

2023-09-26 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.7.1.

This release candidate is based on commit: 
4ef7917bd57b701e30def8511b5fd8a7961f2fcf [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/4ef7917bd57b701e30def8511b5fd8a7961f2fcf
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.1-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/4ef7917bd57b701e30def8511b5fd8a7961f2fcf/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust 47.0.0 RC1

2023-09-22 Thread Raphael Taylor-Davies

With 4 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-47.0.0


It has also been released to crates.io

Thank you to everyone who helped verify this release

On 19/09/2023 14:20, Raphael Taylor-Davies wrote:

This time with the links..

I would like to propose a release of Apache Arrow Rust Implementation, 
version 47.0.0.


This release candidate is based on commit: 
1d6feeacebb8d0d659d493b783ba381940973745 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/1d6feeacebb8d0d659d493b783ba381940973745
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-47.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/1d6feeacebb8d0d659d493b783ba381940973745/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


On 19/09/2023 14:18, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust 
Implementation, version 47.0.0.


This release candidate is based on commit: 
1d6feeacebb8d0d659d493b783ba381940973745 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...



Re: [VOTE][RUST] Release Apache Arrow Rust 47.0.0 RC1

2023-09-19 Thread Raphael Taylor-Davies

This time with the links..

I would like to propose a release of Apache Arrow Rust Implementation, 
version 47.0.0.


This release candidate is based on commit: 
1d6feeacebb8d0d659d493b783ba381940973745 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/1d6feeacebb8d0d659d493b783ba381940973745

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-47.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/1d6feeacebb8d0d659d493b783ba381940973745/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


On 19/09/2023 14:18, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 47.0.0.


This release candidate is based on commit: 
1d6feeacebb8d0d659d493b783ba381940973745 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...



[VOTE][RUST] Release Apache Arrow Rust 47.0.0 RC1

2023-09-19 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 47.0.0.


This release candidate is based on commit: 
1d6feeacebb8d0d659d493b783ba381940973745 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...



Re: [FlightSQL] Supporting binding parameters to prepared statements with a stateless server

2023-09-14 Thread Raphael Taylor-Davies

Hi,

Thank you for starting this discussion. I think the decision to use gRPC 
and by extension HTTP certainly would encourage a design that explicitly 
doesn't rely on server-side state. Not only is it now uncommon for 
backend servers to have a unique globally-routable identity, but as 
these are request-oriented protocols, as opposed to connection-oriented 
protocols, maintaining session state server-side becomes very 
complicated as there is no unambiguous end-of-session signal.


I would very much encourage following an approach similar to web 
cookies, where state is instead managed by the clients and sent with 
each request. This sort of maps to the ticket notion already present in 
many of the APIs, but could perhaps be formalized.
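
As a purely illustrative sketch (hedged: this is not the FlightSQL protocol, and the 
field names are hypothetical), such a formalization could amount to an opaque, 
self-describing handle that the client echoes back with each request, so any 
stateless backend can service it:

// A minimal self-contained prepared-statement handle. In practice the
// server would also sign or encrypt these bytes before returning them.
struct StatementHandle {
    query: String,
    parameters: Vec<u8>, // e.g. encoded bound parameters (hypothetical field)
}

impl StatementHandle {
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(&(self.query.len() as u32).to_le_bytes());
        out.extend_from_slice(self.query.as_bytes());
        out.extend_from_slice(&self.parameters);
        out
    }

    fn decode(bytes: &[u8]) -> Option<Self> {
        let len = u32::from_le_bytes(bytes.get(..4)?.try_into().ok()?) as usize;
        let query = std::str::from_utf8(bytes.get(4..4 + len)?).ok()?.to_string();
        let parameters = bytes.get(4 + len..)?.to_vec();
        Some(Self { query, parameters })
    }
}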


Kind Regards,

Raphael Taylor-Davies

On 14/09/2023 11:52, Andrew Lamb wrote:

Hello,

As FlightSQL gets more widely adopted across the ecosystem, we hit some
issues trying to implement bind parameters in our stateless service. I
filed a ticket [1] that describes the issue as well as a potential
solution.

Please share your thoughts on the ticket

Andrew

[1] https://github.com/apache/arrow/issues/37720



[RESULT][VOTE][RUST] Release Apache Arrow Rust 46.0.0 RC1

2023-08-24 Thread Raphael Taylor-Davies

With 3 +1 votes (3 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-46.0.0/


It has also been released to crates.io

Thank you to everyone who helped verify this release

On 21/08/2023 16:40, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 46.0.0.


This release candidate is based on commit: 
90449ffb2ea6ceef43ce8fc97084b3373975f357 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/90449ffb2ea6ceef43ce8fc97084b3373975f357
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-46.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/90449ffb2ea6ceef43ce8fc97084b3373975f357/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust 46.0.0 RC1

2023-08-21 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 46.0.0.


This release candidate is based on commit: 
90449ffb2ea6ceef43ce8fc97084b3373975f357 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/90449ffb2ea6ceef43ce8fc97084b3373975f357

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-46.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/90449ffb2ea6ceef43ce8fc97084b3373975f357/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.0 RC1

2023-08-18 Thread Raphael Taylor-Davies

With 3 +1 votes (3 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.7.0/ 



It has also been released to crates.io

Thank you to everyone who helped verify this release

On 15/08/2023 10:57, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.7.0.

This release candidate is based on commit: 
77fe72ddd40c1d39068ad580b975504d57032060 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/77fe72ddd40c1d39068ad580b975504d57032060
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/77fe72ddd40c1d39068ad580b975504d57032060/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-18 Thread Raphael Taylor-Davies

+1 (binding)

Despite my earlier misgivings, I think this will be a valuable addition 
to the specification.


To clarify I've interpreted this as a vote on both Utf8View and 
BinaryView as in the linked PR.


On 28/06/2023 20:34, Benjamin Kietzman wrote:

Hello,

I'd like to propose adding Utf8View arrays to the arrow format.
Previous discussion in [1], columnar format description in [2],
flatbuffers changes in [3].

There are implementations available in both C++[4] and Go[5] which
exercise the new type over IPC. Utf8View format demonstrates[6]
significant performance benefits over Utf8 in common tasks.

The vote will be open for at least 72 hours.

[ ] +1 add the proposed Utf8View type to the Apache Arrow format
[ ] -1 do not add the proposed Utf8View type to the Apache Arrow format
because...

Sincerely,
Ben Kietzman

[1] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
[2]
https://github.com/apache/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/docs/source/format/Columnar.rst#variable-size-binary-view-layout
[3]
https://github.com/apache/arrow/pull/35628/files#diff-0623d567d0260222d5501b4e169141b5070eabc2ec09c3482da453a3346c5bf3
[4] https://github.com/apache/arrow/pull/35628
[5] https://github.com/apache/arrow/pull/35769
[6] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617



Re: [Vote][Format] C Data Interface Format string for REE

2023-08-16 Thread Raphael Taylor-Davies

+1 (binding)

On 16/08/2023 16:58, Matt Topol wrote:

It would be nice to get approval from authors of other implementations

such as Rust, C#, Javascript...

I'm hoping that some of them see this and participate in the vote. *crosses
fingers*

On Wed, Aug 16, 2023 at 11:10 AM Antoine Pitrou  wrote:


+1 from me (binding).

It would be nice to get approval from authors of other implementations
such as Rust, C#, Javascript...

Thanks for doing this!


Le 16/08/2023 à 16:16, Matt Topol a écrit :

Hey All,

As proposed by Felipe [1] I'm starting a vote on the proposed update to

the

Format Spec of adding "+r" as the format string for passing Run-End

Encoded

arrays through the Arrow C Data Interface.

A PR containing an update to the C++ Arrow implementation to add support
for this format string along with documentation updates can be found here
[2].

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of this new C Data Format string
[ ] +0
[ ] -1 - I'm against adding this new format string because

Thanks everyone!

--Matt

[1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
[2]: https://github.com/apache/arrow/pull/37174



[VOTE][RUST] Release Apache Arrow Rust Object Store 0.7.0 RC1

2023-08-15 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.7.0.

This release candidate is based on commit: 
77fe72ddd40c1d39068ad580b975504d57032060 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/77fe72ddd40c1d39068ad580b975504d57032060
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.7.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/77fe72ddd40c1d39068ad580b975504d57032060/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




Re: Null buffer spec incompatibility between arrow2 and arrow-C++

2023-08-02 Thread Raphael Taylor-Davies

Hi Anja,

I believe this was clarified in [1] with the addition of the following 
text to the standard


>  The buffer pointers MAY be null only in two situations:
>
>   1. for the null bitmap buffer, if :c:member:`ArrowArray.null_count` 
is 0;
>   2. for any buffer, if the size in bytes of the corresponding buffer 
would be 0.


This is supported by arrow-rs [2], and I believe arrow2 is incorrect in 
not supporting this. I would recommend taking this up with the arrow2 
maintainers [3].


Kind Regards,

Raphael Taylor-Davies

[1]: https://github.com/apache/arrow/pull/14808
[2]: https://github.com/apache/arrow-rs/pull/3276
[3]: https://github.com/jorgecarleitao/arrow2/pull/1476
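
For illustration, a minimal sketch of those two cases on the import side (hedged: 
this is not the arrow-rs FFI code, and it copies rather than borrows for brevity):

// Case 1: the validity pointer may be null only when null_count is 0.
unsafe fn import_validity(ptr: *const u8, len_bytes: usize, null_count: usize) -> Option<Vec<u8>> {
    if ptr.is_null() {
        assert_eq!(null_count, 0, "null validity pointer with non-zero null count");
        return None; // treat as "all values valid"
    }
    Some(std::slice::from_raw_parts(ptr, len_bytes).to_vec())
}

// Case 2: any buffer pointer may be null if its size in bytes would be 0.
unsafe fn import_buffer(ptr: *const u8, len_bytes: usize) -> Vec<u8> {
    if ptr.is_null() {
        assert_eq!(len_bytes, 0, "non-empty buffer must have a valid pointer");
        return Vec::new();
    }
    std::slice::from_raw_parts(ptr, len_bytes).to_vec()
}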

On 02/08/2023 17:43, Anja wrote:

Hi!

I opened two issues recently on a spec incompatibility between arrow2 and
arrow-C++: https://github.com/apache/arrow/issues/36960 and
.

Namely, there are situations where arrow-C++ will legally create a null
buffer, and arrow2 does not allow null buffers.

I am happy to provide code-effort, but I am looking for consensus from the
community on this issue. I think an interoperability issue like this is
important to resolve, especially with the plans to unify arrow2 and
arrow-rs: https://github.com/apache/arrow-rs/issues/1176

The `kNonNullFiller` (
https://github.com/apache/arrow/blob/a06b2618420ef89431373a9e8f07a5da64d546a5/cpp/src/arrow/util/ubsan.h#L33)
exists to be used in cases where null buffers can cause issues. I am
proposing changes such that buffers point to it when representing the reading
of an empty table.

Are there any thoughts?  I welcome engagement on either of the issues. This
problem can, in theory, be solved within either project.

~Anja



[RESULT][VOTE][RUST] Release Apache Arrow Rust 45.0.0 RC1

2023-08-02 Thread Raphael Taylor-Davies

With 5 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-45.0.0/


It has also been released to crates.io

Thank you to everyone who helped verify this release

On 01/08/2023 07:44, vin jake wrote:

+1 (binding)

Verified on my M1 Mac.

Thanks Raphael !

On Mon, Jul 31, 2023 at 12:32 AM Raphael Taylor-Davies
 wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 45.0.0.

This release candidate is based on commit:
16744e5ac08d9ead6c51ff6e08d8b91e87460c52 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:

https://github.com/apache/arrow-rs/tree/16744e5ac08d9ead6c51ff6e08d8b91e87460c52
[2]:
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-45.0.0-rc1
[3]:

https://github.com/apache/arrow-rs/blob/16744e5ac08d9ead6c51ff6e08d8b91e87460c52/CHANGELOG.md
[4]:

https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
-




Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-31 Thread Raphael Taylor-Davies

Hi All,

Having played around with various alternatives, I think we should move 
ahead with this proposal. I think Utf8View and BinaryView would be a 
valuable addition to the arrow specification, and would address many of 
the current pain points encountered when dealing with large string 
payloads or string dictionaries. In particular, they avoid the need to 
copy string data in all bar exceptional circumstances, whilst also 
supporting string interning in a manner that doesn't require recomputing 
dictionaries when combining data streams. Thank you all for the 
insightful discussion.


In that vein I have started implementing them within arrow-rs [1]. My 
only major comment so far is that views for null slots I think should be 
well defined, that is they shouldn't reference buffers or data that 
doesn't exist. This is very important for the Rust implementation to be 
able to provide efficient safe APIs.


Otherwise I don't see any major blockers towards getting these 
standardised, thank you for your continued efforts in this space.


Kind Regards,

Raphael Taylor-Davies

[1]: https://github.com/apache/arrow-rs/pull/4585

On 12/07/2023 17:35, Pedro Eugenio Rocha Pedreira wrote:

Hi all, this is Pedro from the Velox team at Meta. Chiming in here to add a bit 
more context from our side.


I'm not sure the problem here is a lack of understanding or maturity. In
fact, it would be much easier if this was just a problem of education but
it is not.

Adding to what Weston said, when Velox started the intent was to have it built on top of 
vanilla Arrow, if not directly using the Arrow C++ library, at least having an 
implementation conforming to the data layout. The fact that we decided to "extend" 
the format into what we internally call Velox Vectors was very much a deliberate 
decision. If Arrow had at the time support for StringView, ListViews (discussed recently 
in a separate thread, related to supporting out-of-order writes), and more encodings 
(specifically RLE and Constant, which can now be done through new REE), Velox would have 
used it from the beginning. The rationale behind these extensions has been discussed 
here, but there is more context in our paper last year [0] (check Section 4.2).


I can't speak for all query engines, but at least in the case of
DataFusion we exclusively use the Arrow format as the interchange format
between operators, including for UDFs. We have found that for most
operators operating directly on the Arrow format is sufficiently
performant to not represent a query bottleneck.

This always comes down to your workloads and operations you are optimizing for. 
We found that these extensions were crucial for our workloads, particularly for 
efficient vectorized expression evaluation. Other modern engines like DuckDB 
and Umbra had comparable findings and deviate from Arrow in similar ways.


Is Arrow meant to only be used in between systems (in this case query
engines) or is it also meant to be used in between components of a query
engine?

Our findings were that if you really care about performance, Arrow today is 
(unfortunately) not sufficient to support the needs of a state-of-the-art 
execution engine. Raphael raised a good point above about Arrow potentially 
favoring interoperability over performance. In that case, I believe Arrow's 
usage would be restricted to communication between systems, which is about what 
we see in practice today.

However, even between system boundaries this is becoming a hurdle. A recent 
example is a new project called Gluten [1], which integrates Velox into Spark 
(akin to Databrick's Photon). The JNI data communication between Velox and Java 
is done using the Arrow C ABI. We found that the incompatibilities that result in 
non-zero-copy transfer (back to StringViews) show up as a large enough 
bottleneck that the team is considering using Velox Vectors as the exchange 
format to get around this limitation. We urged the Gluten team to stick 
with Arrow in the hope that we could work with the community to get these 
extensions incorporated into the standard. Another example is a bridge between 
Velox and DuckDB built inside Velox that converts these formats directly from 
one another. We don't use Arrow there for similar reasons. There are other 
similar examples between PyTorch's pythonic ecosystem and Velox.


I therefore also wonder about the possibility of always having a single
backing buffer that stores the character data, including potentially a copy
of the prefix.

There are quite a few use cases for this within Velox and other modern engines, 
but AFAIK they boil down to two reasons:

1. When you're creating string buffers you usually don't know the size of each 
string beforehand. This allows us to allocate the buffer page-by-page as we 
go, then capture these buffers without having to reallocate and copy them. 
The simplest use case here is a vectorized function that generates a string as 
output.
2. It allows us

[VOTE][RUST] Release Apache Arrow Rust 45.0.0 RC1

2023-07-30 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 45.0.0.


This release candidate is based on commit: 
16744e5ac08d9ead6c51ff6e08d8b91e87460c52 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/16744e5ac08d9ead6c51ff6e08d8b91e87460c52

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-45.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/16744e5ac08d9ead6c51ff6e08d8b91e87460c52/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh

-



[RESULT][VOTE][RUST] Release Apache Arrow Rust 44.0.0 RC1

2023-07-18 Thread Raphael Taylor-Davies

With 3 +1 votes (3 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-44.0.0/


It has also been released to crates.io

Thank you to everyone who helped verify this release

On 14/07/2023 15:20, Andrew Lamb wrote:

+1 (binding)

Verified on x86_64 mac

Looks like another very nice release. Thanks for keeping the train moving

Andrew

On Fri, Jul 14, 2023 at 2:05 PM L. C. Hsieh  wrote:


+1 (binding)

Verified on M1 Mac.

Thanks Raphael.

On Fri, Jul 14, 2023 at 10:44 AM Raphael Taylor-Davies
 wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 44.0.0.

This release candidate is based on commit:
8f44472e5c773f0daec1965253143e94d14c55e5 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:


https://github.com/apache/arrow-rs/tree/8f44472e5c773f0daec1965253143e94d14c55e5

[2]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-44.0.0-rc1

[3]:


https://github.com/apache/arrow-rs/blob/8f44472e5c773f0daec1965253143e94d14c55e5/CHANGELOG.md

[4]:


https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


[VOTE][RUST] Release Apache Arrow Rust 44.0.0 RC1

2023-07-14 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 44.0.0.


This release candidate is based on commit: 
8f44472e5c773f0daec1965253143e94d14c55e5 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/8f44472e5c773f0daec1965253143e94d14c55e5

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-44.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/8f44472e5c773f0daec1965253143e94d14c55e5/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Raphael Taylor-Davies
use a different tech stack (e.g. rust vs C++ vs
go).


[1]:
https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing




# --

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


--- Original Message ---
On Thursday, July 13th, 2023 at 10:49, Dane Pitkin
  wrote:



I am in favor of this proposal. IMO the Arrow project is the right place to
standardize both the interoperability and operability of columnar data
layouts. Data engines are a core component of the Arrow ecosystem and the
project should be able to grow with these data engines as they converge on
new layouts. Since columnar data is ubiquitous in analytical workloads, we
are seeing a natural progression into optimizing those workloads. This
includes new lossless compression schemes for columnar data that allow
engines to operate directly on the compressed data (e.g. RLE). If we can't
reliably support the growing needs of the broader data engine ecosystem in
a timely manner, then I also fear Arrow might lose relevancy over time.

On Thu, Jul 13, 2023 at 11:59 AM Ian Cook ianmc...@apache.org wrote:


Thank you Weston for proposing this solution and Neal for describing
its context and implications. I agree with the other replies here—this
seems like an elegant solution to a growing need that could, if left
unaddressed, increase the fragmentation of the ecosystem and reduce
the centrality of the Arrow format.

Greater diversity of layouts is happening. Whether it happens inside
of Arrow or outside of Arrow is up to us. I think we all would like to
see it happen inside of Arrow. This proposal allows for that, while
striking a balance as Raphael describes.

However I think there is still some ambiguity about exactly how an
Arrow implementation that is consuming/producing data would negotiate
with an Arrow implementation or other component that is
producing/consuming data to determine whether an alternative layout is
supported. This was discussed briefly in [5] but I am interested to
see how this negotiation would be implemented in practice in the C
data interface, IPC, Flight, etc.

Ian

[5]https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2

On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
r.taylordav...@googlemail.com.invalid  wrote:


I like this proposal, I think it strikes a pragmatic balance between
preserving interoperability whilst still allowing new ideas to be
incorporated into the standard. Thank you for writing this up.

On 13/07/2023 10:22, Matt Topol wrote:


I don't have much to add but I do want to second Jacob's comments. I agree
that this is a good way to avoid the fragmentation while keeping Arrow
relevant, and likely something we need to do so that we can ensure Arrow
remains the way to do this data integration and interoperability.

On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
ja...@voltrondata.com.invalid  wrote:

Hello Everyone,

Thanks for this comprehensive but concise write up Neal! I think this
proposal is a good way to avoid both fragmentation of the arrow ecosystem
as well as its obsolescence. In my opinion of these two problems the
obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
already (close to) being relegated to the sidelines in eco-system defining
projects.

Jacob

On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I understand
it, is to be able to support memory layouts that some (but perhaps not all)
applications of Arrow find valuable, so that these nearly Arrow systems can
be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to the
core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

  * There are one or more primary layouts
    * Existing layouts are automatically considered primary layouts, even
      if they wouldn't have been primary layouts initially (e.g. large list)
  * A new layout, if it is semantically equivalent to another, is
    considered an alternative layout
  * An alternative layout still has the same requirements for adoption
    (two implementations and a vote)
    * An implementation should not feel pressured to rush and implement the
      new layout. It would be good if they contribute in the discussion and
      consider the layout and vote if they feel it would be an acceptable design.
  * We can define and vote and approve as many canonical alternative
    layouts as we want:
    * A canonical alternative layout should, at a minimum, have some
      reasonable justification, such as improved performance for algorithm X
  * Arrow implementations MUST support the primary layouts

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Raphael Taylor-Davies
I like this proposal, I think it strikes a pragmatic balance between 
preserving interoperability whilst still allowing new ideas to be 
incorporated into the standard. Thank you for writing this up.


On 13/07/2023 10:22, Matt Topol wrote:

I don't have much to add but I do want to second Jacob's comments. I agree
that this is a good way to avoid the fragmentation while keeping Arrow
relevant, and likely something we need to do so that we can ensure Arrow
remains the way to do this data integration and interoperability.

On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
 wrote:


Hello Everyone,

Thanks for this comprehensive but concise write up Neal! I think this
proposal is a good way to avoid both fragmentation of the arrow ecosystem
as well as its obsolescence. In my opinion of these two problems the
obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
already (close to) being relegated to the sidelines in eco-system defining
projects.

Jacob

On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
neal.p.richard...@gmail.com> wrote:


Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I

understand

it, is to be able to support memory layouts that some (but perhaps not

all)

applications of Arrow find valuable, so that these nearly Arrow systems

can

be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to

the

core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

  * There are one or more primary layouts
    * Existing layouts are automatically considered primary layouts, even
      if they wouldn't have been primary layouts initially (e.g. large list)
  * A new layout, if it is semantically equivalent to another, is
    considered an alternative layout
  * An alternative layout still has the same requirements for adoption
    (two implementations and a vote)
    * An implementation should not feel pressured to rush and implement the
      new layout. It would be good if they contribute in the discussion and
      consider the layout and vote if they feel it would be an acceptable design.
  * We can define and vote and approve as many canonical alternative
    layouts as we want:
    * A canonical alternative layout should, at a minimum, have some
      reasonable justification, such as improved performance for algorithm X
  * Arrow implementations MUST support the primary layouts
  * An Arrow implementation MAY support a canonical alternative, however:
    * An Arrow implementation MUST first support the primary layout
    * An Arrow implementation MUST support conversion to/from the primary
      and canonical layout
    * An Arrow implementation's APIs MUST only provide data in the
      alternative layout if it is explicitly asked for (e.g. schema inference
      should prefer the primary layout).
  * We can still vote for new primary layouts (e.g. promoting a canonical
    alternative) but, in these votes we don't only consider the value (e.g.
    performance) of the layout but also the interoperability. In other words,
    a layout can only become a primary layout if there is significant evidence
    that most implementations plan to adopt it.


To summarize some of the arguments against the proposal from the previous
threads, there are concerns about increasing the complexity of the Arrow
specification and the cost/burden of updating all of the Arrow
specifications to support them.

Where these discussions, both about several proposed new types and this
layout proposal, get to the core of Arrow is well expressed in the comments
on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
matters to people more, interoperability or best-in-class performance?" And
Pedro notes that because the overhead of converting these not-yet-Arrow
types to the Arrow C ABI is high enough, they've considered abandoning
Arrow as their interchange format. So: on the one hand, we're kinda
choosing which quality we're optimizing for, but on the other,
interoperability and performance are dependent on each other.

What I see that we're trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity or
burden to implement. Expand too much and you get a fragmented ecosystem
where everyone is writing subsets of the Arrow standard and so nothing is
fully compatible and the whole premise is undermined. But expand too little
and projects will abandon the standard and we've also failed.

I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
and see if this helps us reason about the various proposals on the table. I
wonder if the alternative layout proposal is the happy medium that adds
some complexity to 

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-10 Thread Raphael Taylor-Davies

For example, if someone (datafusion, velox, etc.) were to come up with a
framework for UDFs then would batches be passed in and out of those UDFs in
the Arrow format?

Yes, I think the arrow format is a perfect fit for this

Is Arrow meant to only be used in between systems (in this case query
engines) or is it also meant to be used in between components of a query
engine?


I can't speak for all query engines, but at least in the case of 
DataFusion we exclusively use the Arrow format as the interchange format 
between operators, including for UDFs. We have found that for most 
operators operating directly on the Arrow format is sufficiently 
performant to not represent a query bottleneck. For others, such as 
joins, sorts and aggregates, we do make use of bespoke data structures 
and formats internally, e.g. hash tables, row formats, etc..., but the 
operator's public APIs are still in terms of arrow RecordBatch. We have 
found this approach to perform very well, whilst also providing very 
good modularity and composability.


In fact we are actually currently in the process of migrating the 
aggregation logic away from a bespoke mutable row representation to the 
Arrow model, and are already seeing significant performance 
improvements, not to mention a significant reduction in code complexity 
and improved composability [1].



If every engine has its own bespoke formats internally
then it seems we are placing a limit on how far things can be decomposed.
Agreed, if engines choose to implement operations on bespoke formats, 
these operations will likely not be as interoperable as those 
implemented using Arrow. To what extent an engine favours their own 
format(s) over Arrow will be an engineering trade-off they will have to 
make, but DataFusion has found exclusively using Arrow as the 
interchange format between operators to work well.



There are now multiple implementations of a query
engine and I think we are seeing just the edges of this query engine
decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or consuming
a velox task as a record batch stream into a different system) and these
sorts of challenges are in the forefront.
I agree 100% that this sort of interoperability is what makes Arrow so 
compelling and something we should work very hard to preserve. This is 
the crux of my concern with standardising alternative layouts. I 
definitely hope that with time Arrow will penetrate deeper into these 
engines, perhaps in a similar manner to DataFusion, as opposed to 
primarily existing at the surface-level.


[1]: https://github.com/apache/arrow-datafusion/pull/6800

On 10/07/2023 11:38, Weston Pace wrote:

The point I was trying to make, albeit very badly, was that these
operations are typically implemented using some sort of row format [1]
[2], and therefore their performance is not impacted by the array
representations. I think it is both inevitable, and in fact something to
be encouraged, that query engines will implement their own in-memory
layouts and data structures outside of the arrow specification for
specific operators, workloads, hardware, etc... This allows them to make
trade-offs based on their specific application domain, whilst also
ensuring that new ideas and approaches can continue to be incorporated
and adopted in the broader ecosystem. However, to then seek to
standardise these layouts seems to be both potentially unbounded scope
creep, and also somewhat counter productive if the goal of
standardisation is improved interoperability?

FWIW, I believe these formats are very friendly for a row representation as
well, especially when stored as a payload (e.g. in a join).

For your more general point though I will ask the same question I asked on
the ArrayView discussion:

Is Arrow meant to only be used in between systems (in this case query
engines) or is it also meant to be used in between components of a query
engine?

For example, if someone (datafusion, velox, etc.) were to come up with a
framework for UDFs then would batches be passed in and out of those UDFs in
the Arrow format?  If every engine has its own bespoke formats internally
then it seems we are placing a limit on how far things can be decomposed.
 From the C++ perspective, I would personally like to see Arrow be usable
within components.  There are now multiple implementations of a query
engine and I think we are seeing just the edges of this query engine
decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or consuming
a velox task as a record batch stream into a different system) and these
sorts of challenges are in the forefront.

On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies
 wrote:


Thus the approach you
describe for validating an entire character buffer as UTF-8 then checking
offsets will be just as valid for Utf8View arrays as for Utf8 arrays.

The difference here is that it is perhaps expected for Utf8View to have
gaps in the underlying data that are not referenced as part of any

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-07 Thread Raphael Taylor-Davies
 column sorting and aggregations would as well.
If there are any other benchmarks which would help to justify Utf8View in
your
mind, I'd be happy to try writing them.


UTF-8 validation for StringArray can be done very efficiently by first

verifying the entire buffer, and then verifying the offsets correspond to
the start of a UTF-8 codepoint

For non-inlined strings, the character buffers do always contain the entire
string's data and not just the last `len - 4` bytes. Thus the approach you
describe for validating an entire character buffer as UTF-8 then checking
offsets will be just as valid for Utf8View arrays as for Utf8 arrays.


it does seem inconsistent to use unsigned types

It is indeed more typical for the arrow format to use signed integers for
offsets and other quantities. In this case there is prior art in other
engines with which we can remain compatible by using unsigned integers
instead. Since this is only a break with convention within the format and
shouldn't be difficult for any implementation to accommodate, I would argue
that it's worthwhile to avoid pushing change onto existing implementers.


I presume that StringView will behave similarly to dictionaries in that

the selection kernels will not recompute the underlying value buffers.

The Utf8View format itself is not prescriptive of selection operations on
the
array; kernels are free to reuse character buffers (which produces an
implicit
selection vector) or to recompute them. Furthermore unlike an explicit
selection vector a kernel may decide to copy and densify dynamically if it
detects that output is getting sparse or fragmented. It's also worth noting
that unlike an explicit selection vector a Utf8View array (however sparse or
fragmented) will still benefit from the prefix comparison fast path.
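
For illustration, a minimal sketch of that fast path (hedged: not an arrow-rs
kernel; View is a simplified stand-in for the 16-byte view, and resolve fetches the
full bytes from the character buffers only when actually needed):

use std::cmp::Ordering;

struct View {
    length: u32,
    prefix: [u8; 4], // first four bytes of the string, zero padded
}

// Order two views, touching the character buffers only when the prefixes tie
// and both strings are longer than four bytes.
fn compare_views(a: &View, b: &View, resolve: impl Fn(&View) -> Vec<u8>) -> Ordering {
    match a.prefix.cmp(&b.prefix) {
        Ordering::Equal if a.length.min(b.length) > 4 => resolve(a).cmp(&resolve(b)),
        Ordering::Equal => a.length.cmp(&b.length), // shorter string is a prefix of the longer
        ord => ord, // decided from the views alone
    }
}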

Sincerely,
Ben Kietzman

On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies
  wrote:


I would be interested in hearing some input from the Rust community.

  A couple of thoughts:

The variable number of buffers would definitely pose some challenges for
the Rust implementation, the closest thing we currently have is possibly
UnionArray, but even then the number of buffers is still determined
statically by the DataType. I therefore also wonder about the possibility
of always having a single backing buffer that stores the character data,
including potentially a copy of the prefix. This would also avoid forcing a
branch on access, which I would have expected to hurt performance for some
kernels quite significantly.

Whilst not really a concern for Rust, which supports unsigned types, it
does seem inconsistent to use unsigned types where the rest of the format
encourages the use of signed offsets, etc...

It isn't clearly specified whether a null should have a valid set of
offsets, etc... I think it is an important property of the current array
layouts that, with exception to dictionaries, the data in null slots is
arbitrary, i.e. can take any value, but not undefined. This allows for
separate handling of the null mask and values, which can be important for
some kernels and APIs.

More an observation than an issue, but UTF-8 validation for StringArray
can be done very efficiently by first verifying the entire buffer, and then
verifying the offsets correspond to the start of a UTF-8 codepoint. This
same approach would not be possible for StringView, which would need to
verify individual values and would therefore be significantly more
expensive. As it is UB for a Rust string to contain non-UTF-8 data, this
validation is perhaps more important for Rust than for other languages.

I presume that StringView will behave similarly to dictionaries in that
the selection kernels will not recompute the underlying value buffers. I
think this is fine, but it is perhaps worth noting this has caused
confusion in the past, as people somewhat reasonably expect an array
post-selection to have memory usage reflecting the smaller selection. This
is then especially noticeable if the data is written out to IPC, and still
contains data that was supposedly filtered out. My 2 cents is that explicit
selection vectors are a less surprising way to defer selection than baking
it into the array, but I also don't have any workloads where this is the
major bottleneck so can't speak authoritatively here.

Which leads on to my major concern with this proposal, that it adds
complexity and cognitive load to the specification and implementations,
whilst not meaningfully improving the performance of the operators that I
commonly encounter as performance bottlenecks, which are multi-column sorts
and aggregations, or the expensive string operations such as matching or
parsing. If we didn't already have a string representation I would be more
onboard, but as it stands I'm definitely on the fence, especially given
selection performance can be improved in less intrusive ways using
dictionaries or selection vectors.

Kind Regards,

Raphael Taylor-Davies

On 02/07/2023 11:46, Andrew

[RESULT][VOTE][RUST] Release Apache Arrow Rust 43.0.0 RC1

2023-07-03 Thread Raphael Taylor-Davies

With 5 +1 votes (5 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-43.0.0


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 30/06/2023 16:26, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 43.0.0.


This release candidate is based on commit: 
414235e7630d05cccf0b9f5032ebfc0858b8ae5b [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/414235e7630d05cccf0b9f5032ebfc0858b8ae5b
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-43.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/414235e7630d05cccf0b9f5032ebfc0858b8ae5b/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [DISCUSS] UTF-8 validation

2023-07-02 Thread Raphael Taylor-Davies
For better or for worse the Rust implementation requires the underlying buffer 
is UTF-8 including null slots, as this allows returning the buffer as a native 
string type, which in turn allows kernels to use Rust's native string 
functionality. Whilst I agree the specification is ambiguous on this note, this 
interpretation doesn't appear to have caused issues so far.

On 2 July 2023 13:07:20 BST, Antoine Pitrou  wrote:
>
>
>Le 02/07/2023 à 14:00, Raphael Taylor-Davies a écrit :
>> 
>> More an observation than an issue, but UTF-8 validation for StringArray can 
>> be done very efficiently by first verifying the entire buffer, and then 
>> verifying the offsets correspond to the start of a UTF-8 codepoint.
>
>Caveat: null slots could potentially contain invalid UTF-8 data. Not likely of 
>course, but it should probably not be an error.
>
>That said, yes, it is a smart strategy for the common case!
>
>Regards
>
>Antoine.
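
For illustration, a minimal sketch of the strategy described above (hedged: not an
arrow-rs kernel; it assumes the offsets are already known to be monotonic and in
bounds, and per the caveat above it will also reject invalid bytes sitting in null
slots):

// Validate the whole character buffer once, then only check that every
// offset lands on a UTF-8 codepoint boundary.
fn validate_utf8_array(values: &[u8], offsets: &[i32]) -> Result<(), String> {
    let s = std::str::from_utf8(values).map_err(|e| e.to_string())?;
    for &o in offsets {
        if !s.is_char_boundary(o as usize) {
            return Err(format!("offset {o} is not on a char boundary"));
        }
    }
    Ok(())
}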


Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-02 Thread Raphael Taylor-Davies
> I would be interested in hearing some input from the Rust community.

 A couple of thoughts:

The variable number of buffers would definitely pose some challenges for the 
Rust implementation, the closest thing we currently have is possibly 
UnionArray, but even then the number of buffers is still determined statically 
by the DataType. I therefore also wonder about the possibility of always having 
a single backing buffer that stores the character data, including potentially a 
copy of the prefix. This would also avoid forcing a branch on access, which I 
would have expected to hurt performance for some kernels quite significantly.
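
For reference, a minimal sketch of the 16-byte view being discussed (hedged: type and
field names here are illustrative, not the proposal's exact wording): strings of 12
bytes or fewer are stored inline, while longer strings carry a prefix plus a (buffer
index, offset) pair into one of the variadic character buffers.

#[repr(C)]
#[derive(Clone, Copy)]
struct InlineView {
    length: u32,    // <= 12
    data: [u8; 12], // the string bytes themselves, zero padded
}

#[repr(C)]
#[derive(Clone, Copy)]
struct ReferenceView {
    length: u32,       // > 12
    prefix: [u8; 4],   // first four bytes, enabling comparison fast paths
    buffer_index: u32, // which character buffer holds the data
    offset: u32,       // byte offset of the string within that buffer
}

#[repr(C)]
union Utf8View {
    inline: InlineView,
    reference: ReferenceView,
}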

Whilst not really a concern for Rust, which supports unsigned types, it does 
seem inconsistent to use unsigned types where the rest of the format encourages 
the use of signed offsets, etc...

It isn't clearly specified whether a null should have a valid set of offsets, 
etc... I think it is an important property of the current array layouts that, 
with exception to dictionaries, the data in null slots is arbitrary, i.e. can 
take any value, but not undefined. This allows for separate handling of the 
null mask and values, which can be important for some kernels and APIs.

More an observation than an issue, but UTF-8 validation for StringArray can be 
done very efficiently by first verifying the entire buffer, and then verifying 
the offsets correspond to the start of a UTF-8 codepoint. This same approach 
would not be possible for StringView, which would need to verify individual 
values and would therefore be significantly more expensive. As it is UB for a 
Rust string to contain non-UTF-8 data, this validation is perhaps more 
important for Rust than for other languages.

I presume that StringView will behave similarly to dictionaries in that the 
selection kernels will not recompute the underlying value buffers. I think this 
is fine, but it is perhaps worth noting this has caused confusion in the past, 
as people somewhat reasonably expect an array post-selection to have memory 
usage reflecting the smaller selection. This is then especially noticeable if 
the data is written out to IPC, and still contains data that was supposedly 
filtered out. My 2 cents is that explicit selection vectors are a less 
surprising way to defer selection than baking it into the array, but I also 
don't have any workloads where this is the major bottleneck so can't speak 
authoritatively here.

Which leads on to my major concern with this proposal, that it adds complexity 
and cognitive load to the specification and implementations, whilst not 
meaningfully improving the performance of the operators that I commonly 
encounter as performance bottlenecks, which are multi-column sorts and 
aggregations, or the expensive string operations such as matching or parsing. 
If we didn't already have a string representation I would be more onboard, but 
as it stands I'm definitely on the fence, especially given selection 
performance can be improved in less intrusive ways using dictionaries or 
selection vectors.

Kind Regards,

Raphael Taylor-Davies

On 02/07/2023 11:46, Andrew Lamb wrote:

> * This is the first layout where the number of buffers depends on the
> data and not the schema. I think this is the most architecturally
> significant fact.

I have spent some time reading the initial proposal -- thank you for that. I 
now understand what Weston was saying about the "variable numbers of buffers". 
I wonder if you considered restricting such arrays to a single buffer (so as to 
make them more similar to other arrow array types that have a fixed number of 
buffers)?

On Tue, Jun 20, 2023 at 11:33 AM Weston Pace wrote:

Before I say anything else I'll say that I am in favor of this new layout. 
There is some existing literature on the idea (e.g. Umbra) and your benchmarks 
show some nice improvements. Compared to some of the other layouts we've 
discussed recently (REE, list view) I do think this layout is more unique and 
fundamentally different. Perhaps most fundamentally different:

* This is the first layout where the number of buffers depends on the data and 
not the schema. I think this is the most architecturally significant fact. It 
does require a (backwards compatible) change to the IPC format itself, beyond 
just adding new type codes. It also poses challenges in places where we've 
assumed there will be at most 3 buffers (e.g. in ArraySpan, though, as you have 
shown, we can work around this using a raw pointers representation internally 
in those spots). I think you've done some great work to integrate this well 
with Arrow-C++ and I'm convinced it can work. I would be interested in hearing 
some input from the Rust community.

Ben, at one point there was some discussion that this might be a c-data only 
type. However, I believe that was based on the raw pointers representation. 
What you've proposed here, if I understand corr

[VOTE][RUST] Release Apache Arrow Rust 43.0.0 RC1

2023-06-30 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 43.0.0.


This release candidate is based on commit: 
414235e7630d05cccf0b9f5032ebfc0858b8ae5b [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/414235e7630d05cccf0b9f5032ebfc0858b8ae5b

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-43.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/414235e7630d05cccf0b9f5032ebfc0858b8ae5b/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Raphael Taylor-Davies

Hi All,

I might be missing something, but rather than opening the can of worms 
of alternative layouts, etc... perhaps we could support this use-case as 
a canonical extension type over dictionary encoded, variable-sized 
arrays. I'll try to explain my reasoning below, but the major advantage 
would be that this would not require any modifications to the existing 
specification or codebases, and I think would support all the use-cases 
I've seen articulated.


Taking a step back:

- Variable-sized arrays encode value positions in a list of 
monotonically increasing integer offsets, with each slice of values 
identified by the two consecutive offsets

- Dictionaries encode value positions in a list of integers in any order
- Views encode value positions as a pair of start and end integer offsets

If you squint, dictionary encoding a variable-sized array removes the 
ordering restriction on the value offsets, resulting in a very similar 
construction to the proposed view arrays. The major differences that 
come to mind are:


- Views can contain partially overlapping value ranges, whereas 
dictionaries can only encode disjoint or identical ranges
- Views require one less addition operation to access a value, and may 
therefore have slightly different performance characteristics
- A null in a view takes up two offsets, a null in a dictionary takes 
one key
- Selection/take/filter on a dictionary is the same as for a primitive 
array, and should be ~2x faster than a view
- The dictionary could use a smaller key size than the offset size of 
its child list, potentially saving space compared to a view
- Unreferenced values in a view are free, whereas they take up an offset 
in the child list

- Kernels may be optimised assuming small and/or unique dictionaries
- IPC files currently only support a single dictionary per column, 
something I suspect we may want to revisit regardless
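
As a concrete (if simplistic) illustration of the construction above, here is 
a dependency-free sketch; the names are illustrative and not arrow-rs APIs. 
The child array stores each distinct value once behind monotonic offsets, and 
the dictionary keys then reference those slots in any order, giving view-like 
random access with one extra indirection.

```rust
// Sketch only: dictionary keys over a variable-sized (string) child array.
struct DictEncodedStrings {
    values: String,      // concatenated distinct values (the child array)
    offsets: Vec<usize>, // monotonic, len = number of distinct values + 1
    keys: Vec<u32>,      // one key per logical row, any order, repeats allowed
}

impl DictEncodedStrings {
    fn value(&self, row: usize) -> &str {
        // One extra hop compared to a view: row -> key -> offset pair.
        let k = self.keys[row] as usize;
        &self.values[self.offsets[k]..self.offsets[k + 1]]
    }
}

fn main() {
    // Distinct values "hello" and "world"; logical rows out of order, with repeats.
    let arr = DictEncodedStrings {
        values: "helloworld".to_string(),
        offsets: vec![0, 5, 10],
        keys: vec![1, 0, 0, 1],
    };
    let rows: Vec<&str> = (0..arr.keys.len()).map(|i| arr.value(i)).collect();
    assert_eq!(rows, ["world", "hello", "hello", "world"]);
}
```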


I do think using dictionaries in this way is perhaps a little confusing 
to users, and a canonical extension type will only go so far to address 
this, but I personally think this is a reasonable price to pay for 
avoiding the ecosystem fragmentation that would potentially result from 
introducing alternative layouts for common types such as strings, whilst 
also avoiding the need to reimplement a load of kernels.


What do people think?

Kind Regards,

Raphael Taylor-Davies

On 14/06/2023 06:34, Will Jones wrote:

Hello Arrow devs,

Just a quick note. To answer one of my earlier questions:

1. Is this array type currently only used in Velox? (not DuckDB like some

of the other new types?) What evidence do we have that it will become used
outside of Velox?


This type is also used by DuckDB. Found discussion today in a talk from
Mark Raasveldt [1]. That does improve the case for adding this type in my
eyes.

Best,

Will Jones

[1] https://youtu.be/bZOvAKGkzpQ?t=1570



On Tue, Jun 6, 2023 at 7:40 PM Weston Pace  wrote:


This implies that each canonical alternative layout would codify a
primary layout as its "fallback."

Yes, that was part of my proposal:


  * A new layout, if it is semantically equivalent to another, is

considered an alternative layout

Or, to phrase it another way.  If there is not a "fallback" then it is not
an alternative layout.  It's a brand new primary layout.  I'd expect this
to be quite rare.  I can't really even hypothesize any examples.  I think
the only truly atomic layouts are fixed-width, list, and struct.


This seems reasonable but it opens
up some cans of worms, such as how two components communicating
through an Arrow interface would negotiate which layout is supported

Most APIs that I'm aware of already do this.  For example,
pyarrow.parquet.read_table has a "read_dictionary" property that can be
used to control whether or not a column is returned with the dictionary
encoding.  There is no way (that I'm aware of) to get a column in REE
encoding today without explicitly requesting it.  In fact, this could be as
simple as a boolean "use_advanced_features" flag although I would
discourage something so simplistic.  The point is that arrow-compatible
software should, by default, emit types that are supported by all arrow
implementations.

Of course, there is no way to enforce this, it's just a guideline / strong
recommendation on how software should behave if it wants to state "arrow
compatible" as a feature.

On Tue, Jun 6, 2023 at 3:33 PM Ian Cook  wrote:


Thanks Weston. That all sounds reasonable to me.


  with the caveat that the primary layout must be emitted if the user

does not specifically request the alternative layout

This implies that each canonical alternative layout would codify a
primary layout as its "fallback." This seems reasonable but it opens
up some cans of worms, such as how two components communicating
through an Arrow interface would negotiate which layout is supported.
I suppose such details should be discuss

Scalars in Apache Arrow Rust

2023-06-09 Thread Raphael Taylor-Davies

Hi All,

Currently the Rust implementation of arrow lacks a consistent story for 
supporting scalars. Whilst there are some binary kernels that support 
scalar values [1] [2], the way this is encoded is not consistent [3], 
requires type-dispatch logic in downstreams like DataFusion [4], and has 
no mechanism to preserve type metadata such as timestamp timezones, or 
decimal precision [5].


I would therefore like to draw attention to a proposal [6] to address 
this. As this will necessarily have downstream ramifications, and is 
likely a problem that other implementations of arrow have already 
grappled with, I would very much appreciate any feedback the community 
can give. To avoid bifurcating the discussion, please comment on the 
GitHub PR.


I look forward to hearing your thoughts.

Kind Regards,

Raphael Taylor-Davies

[1]: 
https://docs.rs/arrow-arith/latest/arrow_arith/arithmetic/fn.add_scalar.html
[2]: 
https://docs.rs/arrow-ord/latest/arrow_ord/comparison/fn.eq_dyn_scalar.html

[3]: https://github.com/apache/arrow-rs/issues/2837
[4]: 
https://github.com/apache/arrow-datafusion/blob/d9e91d187c8af7f3f8b1a11d53383826a471/datafusion/physical-expr/src/expressions/binary.rs

[5]: https://github.com/apache/arrow-rs/issues/3999
[6]: https://github.com/apache/arrow-rs/pull/4393



[RESULT][VOTE][RUST] Release Apache Arrow Rust 41.0.0 RC1

2023-06-06 Thread Raphael Taylor-Davies

With 4 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-41.0.0/


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 05/06/2023 16:25, Will Jones wrote:

+1 (binding). Verified on Ubuntu 22 x86_64. Thanks, Raphael!

On Fri, Jun 2, 2023 at 12:47 PM Andrew Lamb  wrote:


+1 (binding)
Verified on x86_64 mac

The content of this release looks very good 

Thank you Raphael

Andrew

On Fri, Jun 2, 2023 at 2:59 PM L. C. Hsieh  wrote:


+1 (binding)

Verified on M1 Mac.

Thanks Raphael.

On Fri, Jun 2, 2023 at 11:55 AM Raphael Taylor-Davies
 wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 41.0.0.

This release candidate is based on commit:
e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:


https://github.com/apache/arrow-rs/tree/e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb

[2]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-41.0.0-rc1

[3]:


https://github.com/apache/arrow-rs/blob/e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb/CHANGELOG.md

[4]:


https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.6.1 RC1

2023-06-06 Thread Raphael Taylor-Davies

With 4 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.6.1/ 



It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 05/06/2023 16:15, Will Jones wrote:

+1 (binding), verified on M1 MacOS. Thanks Raphael!

On Fri, Jun 2, 2023 at 11:56 AM L. C. Hsieh  wrote:


+1 (binding)

Verified on M1 Mac.

Thanks Raphael.

On Fri, Jun 2, 2023 at 11:38 AM Andrew Lamb  wrote:

+1 (binding)

I verified the signature and ran the verification script on mac x86_64
Thank you Raphael

On Fri, Jun 2, 2023 at 2:23 PM Raphael Taylor-Davies
  wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.6.1.

This release candidate is based on commit:
f323097584eaa8edb1193b4fb67bccadd39594f6 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]:



https://github.com/apache/arrow-rs/tree/f323097584eaa8edb1193b4fb67bccadd39594f6

[2]:



https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.6.1-rc1

[3]:



https://github.com/apache/arrow-rs/blob/f323097584eaa8edb1193b4fb67bccadd39594f6/object_store/CHANGELOG.md

[4]:



https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh


[VOTE][RUST] Release Apache Arrow Rust 41.0.0 RC1

2023-06-02 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 41.0.0.


This release candidate is based on commit: 
e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-41.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/e1badc0542ca82e2304cc3f51a9d25ea2dbb74eb/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust Object Store 0.6.1 RC1

2023-06-02 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.6.1.

This release candidate is based on commit: 
f323097584eaa8edb1193b4fb67bccadd39594f6 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/f323097584eaa8edb1193b4fb67bccadd39594f6
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.6.1-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/f323097584eaa8edb1193b4fb67bccadd39594f6/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust 40.0.0 RC1

2023-05-22 Thread Raphael Taylor-Davies

With 4 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-40.0.0/


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 20/05/2023 20:47, Will Jones wrote:

+1 (binding)

Verified on Ubuntu 22.04. Thanks Raphael!

On Fri, May 19, 2023 at 10:05 AM L. C. Hsieh  wrote:


+1 (binding)

Verified on M1 Mac.

Thanks Raphael

On Fri, May 19, 2023 at 6:37 AM Andrew Lamb  wrote:

+1 (binding)

Verified on mac osx x86_64

Thank you Raphael

On Fri, May 19, 2023 at 8:49 AM Raphael Taylor-Davies
 wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 40.0.0.

This release candidate is based on commit:
25bfccca58ff219d9f59ba9f4d75550493238a4f [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:



https://github.com/apache/arrow-rs/tree/25bfccca58ff219d9f59ba9f4d75550493238a4f

[2]:


https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-40.0.0-rc1

[3]:



https://github.com/apache/arrow-rs/blob/25bfccca58ff219d9f59ba9f4d75550493238a4f/CHANGELOG.md

[4]:



https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.6.0 RC1

2023-05-22 Thread Raphael Taylor-Davies

With 4 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.6.0/ 



It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 18/05/2023 10:25, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.6.0.

This release candidate is based on commit: 
ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.6.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust 40.0.0 RC1

2023-05-19 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 40.0.0.


This release candidate is based on commit: 
25bfccca58ff219d9f59ba9f4d75550493238a4f [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/25bfccca58ff219d9f59ba9f4d75550493238a4f

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-40.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/25bfccca58ff219d9f59ba9f4d75550493238a4f/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust Object Store 0.6.0 RC1

2023-05-18 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.6.0.

This release candidate is based on commit: 
ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.6.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/ec7706c1f2aeef5a289e46d1df7785e5c93e6bfb/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Raphael Taylor-Davies

Hi All,


> if we added this, do we think many Arrow and query
> engine implementations (for example, DataFusion) will be eager to add full
> support for the type, including compute kernels? Or are they likely to just
> convert this type to ListArray at import boundaries?

I can't speak for query engines in general, but at least for arrow-rs 
and by extension DataFusion, and based on my current understanding of 
the use-cases I would be rather hesitant to add support to the kernels 
for this array type, definitely instead favouring conversion at the 
edges. We already have issues with the amount of code generation 
resulting in binary bloat and long compile times, and I worry this would 
worsen this situation whilst not really providing compelling advantages 
for the vast majority of workloads that don't interact with Velox. 
Whilst I can definitely see that the ListView representation is probably 
a better way to represent variable length lists than what arrow settled 
upon, I'm not yet convinced it is sufficiently better to incentivise 
broad ecosystem adoption.


Kind Regards,

Raphael Taylor-Davies

On 11/05/2023 21:20, Will Jones wrote:

Hi Felipe,

Thanks for the additional details.



Velox kernels benefit from being able to append data to the array from
different threads without care for strict ordering. Only the offsets array
has to be written according to logical order but that is potentially a much
smaller buffer than the values buffer.


It still seems to me like applications are still pretty niche, as I suspect
in most cases the benefits are outweighed by the costs. The benefit here
seems pretty limited: if you are trying to split work between threads,
usually you will have other levels such as array chunks to parallelize. And
if you have an incoming stream of row data, you'll want to append in
predictable order to match the order of the other arrays. Am I missing
something?

And, IIUC, the cost of using ListView with out-of-order values over
ListArray is you lose memory locality; the values of element 2 are no
longer adjacent to the values of element 1. What do you think about that
tradeoff?

I don't mean to be difficult about this. I'm excited for both the REE and
StringView arrays, but this one I'm not so sure about yet. I suppose what I
am trying to ask is, if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be eager to add full
support for the type, including compute kernels? Or are they likely to just
convert this type to ListArray at import boundaries?

Because if it turns out to be the latter, then we might as well ask Velox
to export this type as ListArray and save the rest of the ecosystem some
work.

Best,

Will Jones

On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:


Initial reason for ListView arrays in Arrow is zero-copy compatibility with
Velox which uses this format.

Velox kernels benefit from being able to append data to the array from
different threads without care for strict ordering. Only the offsets array
has to be written according to logical order but that is potentially a much
smaller buffer than the values buffer.

Acero kernels could take advantage of that in the future.

In implementing ListViewArray/Type I was able to reuse some C++ templates
used for ListArray which can reduce some of the burden on kernel
implementations that aim to work with all the types.

I can fix Acero kernels for working with ListView. This is similar to the
work I’ve been doing in kernels dealing with run-end encoded arrays.

—
Felipe


On Wed, 26 Apr 2023 at 01:03 Will Jones  wrote:


I suppose one common use case is materializing list columns after some
expanding operation like a join or unnest. That's a case where I could
imagine a lot of repetition of values. Haven't yet thought of common cases
where there is overlap but not full duplication, but am eager to hear any.

The dictionary encoding point Raphael makes is interesting, especially
given the existence of LargeList and FixedSizeList. For many operations, it
might make more sense to just compose those existing types.

IIUC the operations that would be unique to the ArrayView are ones altering
the shape. One could truncate each array to a certain length cheaply simply
by replacing the sizes buffer. Or perhaps there are interesting operations
on tensors that would benefit.

On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
 wrote:


Unless I am missing something, I think the selection use-case could be
equally well served by a dictionary-encoded BinaryArray/ListArray, and would
have the benefit of not requiring any modifications to the existing format
or kernels.

The major additional flexibility of the proposed encoding would be
permitting disjoint or overlapping ranges; are these common enough in
practice to represent a meaningful bottleneck?


On 26 April 2023 01:40:14 BST, David Li  wrote:

Is there a need for a 64-bit o

[RESULT][VOTE][RUST] Release Apache Arrow Rust 39.0.0 RC1

2023-05-09 Thread Raphael Taylor-Davies

With 4 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-39.0.0


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 05/05/2023 15:45, Raphael Taylor-Davies wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 39.0.0.


This release candidate is based on commit: 
575a199fa669d75833c13a2a69d71255b9a9f2e6 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/575a199fa669d75833c13a2a69d71255b9a9f2e6
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-39.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/575a199fa669d75833c13a2a69d71255b9a9f2e6/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust 39.0.0 RC1

2023-05-05 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 39.0.0.


This release candidate is based on commit: 
575a199fa669d75833c13a2a69d71255b9a9f2e6 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/575a199fa669d75833c13a2a69d71255b9a9f2e6

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-39.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/575a199fa669d75833c13a2a69d71255b9a9f2e6/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Raphael Taylor-Davies
Unless I am missing something, I think the selection use-case could be equally 
well served by a dictionary-encoded BinaryArray/ListArray, and would have the 
benefit of not requiring any modifications to the existing format or kernels.

The major additional flexibility of the proposed encoding would be permitting 
disjoint or overlapping ranges; are these common enough in practice to 
represent a meaningful bottleneck?


On 26 April 2023 01:40:14 BST, David Li  wrote:
>Is there a need for a 64-bit offsets version the same way we have List and 
>LargeList?
>
>And just to be clear, the difference with List is that the lists don't have to 
>be stored in their logical order (or in other words, offsets do not have to be 
>nondecreasing and so we also need sizes)?
>
>On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
>> For context, there was some discussion on this back in [1].  At that time
>> this was called "sequence view" but I do not like that name.  However,
>> array-view array is a little confusing.  Given this is similar to list can
>> we go with list-view array?
>>
>>> Thanks for the introduction. I'd be interested to hear about the
>>> applications Velox has found for these vectors, and in what situations
>> they
>>> are useful. This could be contrasted with the current ListArray
>>> implementations.
>>
>> I believe one significant benefit is that take (and by proxy, filter) and
>> sort are O(# of items) with the proposed format and O(# of bytes) with the
>> current format.  Jorge did some profiling to this effect in [1].
>>
>> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>>
>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones  wrote:
>>
>>> Hi Felipe,
>>>
>>> Thanks for the introduction. I'd be interested to hear about the
>>> applications Velox has found for these vectors, and in what situations they
>>> are useful. This could be contrasted with the current ListArray
>>> implementations.
>>>
>>> IIUC it would be fairly cheap to transform a ListArray to an ArrayView, but
>>> expensive to go the other way.
>>>
>>> Best,
>>>
>>> Will Jones
>>>
>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
>>> felipe...@gmail.com> wrote:
>>>
>>> > Hi folks,
>>> >
>>> > I would like to start a public discussion on the inclusion of a new array
>>> > format to Arrow — array-view array. The name is also up for debate.
>>> >
>>> > This format is inspired by Velox's ArrayVector format [1]. Logically,
>>> this
>>> > array represents an array of arrays. Each element is an array-view
>>> (offset
>>> > and size pair) that points to a range within a nested "values" array
>>> > (called "elements" in Velox docs). The nested array can be of any type,
>>> > which makes this format very flexible and powerful.
>>> >
>>> > [image: ../_images/array-vector.png]
>>> > 
>>> >
>>> > I'm currently working on a C++ implementation and plan to work on a Go
>>> > implementation to fulfill the two-implementations requirement for format
>>> > changes.
>>> >
>>> > The draft design:
>>> >
>>> > - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
>>> > - 1 child array: "values" as an array of the type parameter
>>> >
>>> > validity_bitmap is used to differentiate between empty array views
>>> > (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
>>> >
>>> > When the validity_bitmap[i] is 0, both sizes and offsets are undefined
>>> (as
>>> > usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
>>> > if setting a value is not an issue to the system producing the arrays.
>>> >
>>> > offsets buffer is not required to be ordered and views don't have to be
>>> > disjoint.
>>> >
>>> > [1]
>>> >
>>> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>>> >
>>> > Thanks,
>>> > Felipe O. Carvalho
>>> >
>>>
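
(For readers skimming the quoted draft design above, a tiny dependency-free 
sketch of how an element of such a list-view array would be resolved; the 
function and buffer names are illustrative only.)

```rust
// Sketch only: offsets need not be ordered and views need not be disjoint.
fn list_view_value<'a>(
    validity: &[bool],
    offsets: &[i32],
    sizes: &[i32],
    values: &'a [i32], // the child "values" array, i32 for simplicity
    i: usize,
) -> Option<&'a [i32]> {
    if !validity[i] {
        return None; // null array view: offsets[i] and sizes[i] are undefined
    }
    let start = offsets[i] as usize;
    let len = sizes[i] as usize;
    Some(&values[start..start + len])
}

fn main() {
    // Two views over the same child values, out of order and overlapping.
    let values = [1, 2, 3, 4, 5];
    let offsets = [2, 0];
    let sizes = [3, 4];
    let validity = [true, true];
    assert_eq!(
        list_view_value(&validity, &offsets, &sizes, &values, 0),
        Some(&values[2..5])
    );
    assert_eq!(
        list_view_value(&validity, &offsets, &sizes, &values, 1),
        Some(&values[0..4])
    );
}
```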


[RESULT][VOTE][RUST] Release Apache Arrow Rust 38.0.0 RC1

2023-04-25 Thread Raphael Taylor-Davies

With 5 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-38.0.0


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 21/04/2023 20:10, Andrew Lamb wrote:

+1 (binding)

Verified on x86 mac

Thank you Raphael

On Fri, Apr 21, 2023 at 2:00 PM L. C. Hsieh  wrote:


+1 (binding)

Verified on M1 Mac.

Thanks Raphael.

On Fri, Apr 21, 2023 at 7:47 AM Raphael Taylor-Davies
 wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 38.0.0.

This release candidate is based on commit:
bbd57c615213bc6e80fb0192674942f688e5f6a8 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:


https://github.com/apache/arrow-rs/tree/bbd57c615213bc6e80fb0192674942f688e5f6a8

[2]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-38.0.0-rc1

[3]:


https://github.com/apache/arrow-rs/blob/bbd57c615213bc6e80fb0192674942f688e5f6a8/CHANGELOG.md

[4]:


https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


[VOTE][RUST] Release Apache Arrow Rust 38.0.0 RC1

2023-04-21 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 38.0.0.


This release candidate is based on commit: 
bbd57c615213bc6e80fb0192674942f688e5f6a8 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/bbd57c615213bc6e80fb0192674942f688e5f6a8

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-38.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/bbd57c615213bc6e80fb0192674942f688e5f6a8/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




Re: [VOTE][RUST] Release Apache Arrow Rust 37.0.0 RC2

2023-04-10 Thread Raphael Taylor-Davies

With 4 +1 votes (3 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-37.0.0


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 10/04/2023 07:46, Wayne Xia wrote:

+1
verified on x86 linux. Thanks Raphael

On Sat, Apr 8, 2023 at 8:42 PM Andrew Lamb  wrote:


+1   (binding)
verified on x86 mac

Thank you Raphael

On Fri, Apr 7, 2023 at 1:42 PM L. C. Hsieh  wrote:


+1 (binding)

Verified on M1 Mac.

Thanks Raphael.

On Fri, Apr 7, 2023 at 9:27 AM Raphael Taylor-Davies
 wrote:

Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 37.0.0.

There were some complications with the first release candidate, so
please make sure you are verifying RC2 with the latest version of the
verification scripts.

This release candidate is based on commit:
6e9751f6b33e17cb811bd89ed94f29b92707e248 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:


https://github.com/apache/arrow-rs/tree/6e9751f6b33e17cb811bd89ed94f29b92707e248

[2]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-37.0.0-rc2

[3]:


https://github.com/apache/arrow-rs/blob/6e9751f6b33e17cb811bd89ed94f29b92707e248/CHANGELOG.md

[4]:


https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


[VOTE][RUST] Release Apache Arrow Rust 37.0.0 RC2

2023-04-07 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 37.0.0.


There were some complications with the first release candidate, so 
please make sure you are verifying RC2 with the latest version of the 
verification scripts.


This release candidate is based on commit: 
6e9751f6b33e17cb811bd89ed94f29b92707e248 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/6e9751f6b33e17cb811bd89ed94f29b92707e248

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-37.0.0-rc2
[3]: 
https://github.com/apache/arrow-rs/blob/6e9751f6b33e17cb811bd89ed94f29b92707e248/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[RESULT][VOTE][RUST] Release Apache Arrow Rust Object Store 0.5.6 RC2

2023-04-03 Thread Raphael Taylor-Davies

With 6 +1 votes (4 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-object-store-rs-0.5.6 



It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 01/04/2023 14:48, Metehan Yıldırım wrote:

+1 (non-binding) verified on mac m1.

On Sat, Apr 1, 2023 at 4:05 AM Jacob Wujciak 
wrote:


+1 (non-binding) verified on manjaro

On Fri, Mar 31, 2023 at 10:00 PM Raphael Taylor-Davies
 wrote:


Thank you for double checking; ApplicationDefaultCredentialsFile is in a
crate-private module and isn't re-exported, so we should be good.



On 31 March 2023 20:16:29 BST, Andrew Lamb  wrote:

+1 (verified on mac x86)

I took a look through the PRs -- I think [1] is actually SemVer but I
wanted to double check (to see if we needed to make the version 0.6.0)

The ::new() constructor is the same, but ApplicationDefaultCredentialsFile
went from a struct to enum

[1] https://github.com/apache/arrow-rs/pull/3799/files

On Fri, Mar 31, 2023 at 1:47 PM L. C. Hsieh  wrote:


Oh, I verified RC1 wrongly. Then for RC2,

+1 (binding)

Verified on M1 Mac.

Thanks Raphael.



On Fri, Mar 31, 2023 at 10:34 AM Raphael Taylor-Davies
 wrote:

Are you verifying RC1 or RC2? RC1 had the issue you mention

On 31/03/2023 18:17, L. C. Hsieh wrote:

Hmm, I got an error like following on both M1 and Intel Macs.

```
+ cargo build

error: failed to parse manifest at


`/private/var/folders/zq/2tdnn5955wvdcw7qfy6qhvk0gn/T/arrow-0.5.6.X.VnNkJPdg/apache-arrow-object-store-rs-0.5.6/Cargo.toml`


Caused by:

error inheriting `edition` from workspace root manifest's
`workspace.package.edition`


Caused by:

failed to find a workspace root
```

On Fri, Mar 31, 2023 at 8:06 AM Will Jones <will.jones...@gmail.com> wrote:

+1
Verified on M1 MacOS.


On Fri, Mar 31, 2023 at 4:29 AM Raphael Taylor-Davies
 wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.5.6.

This release candidate is based on commit:
234b7847ecb737e96df3f4623df7b330b34b3d1b [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]:



https://github.com/apache/arrow-rs/tree/234b7847ecb737e96df3f4623df7b330b34b3d1b

[2]:



https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.5.6-rc2

[3]:



https://github.com/apache/arrow-rs/blob/234b7847ecb737e96df3f4623df7b330b34b3d1b/object_store/CHANGELOG.md

[4]:



https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.5.6 RC2

2023-03-31 Thread Raphael Taylor-Davies
Thank you for double checking; ApplicationDefaultCredentialsFile is in a 
crate-private module and isn't re-exported, so we should be good.



On 31 March 2023 20:16:29 BST, Andrew Lamb  wrote:
>+1 (verified on mac x86)
>
>I took a look through the PRs -- I think [1] is actually SemVer but I
>wanted to double check (to see if we needed to make the version 0.6.0)
>
>The ::new() constructor is the same, but ApplicationDefaultCredentialsFile
>went from a struct to enum
>
>[1] https://github.com/apache/arrow-rs/pull/3799/files
>
>On Fri, Mar 31, 2023 at 1:47 PM L. C. Hsieh  wrote:
>
>> Oh, I verified RC1 wrongly. Then for RC2,
>>
>> +1 (binding)
>>
>> Verified on M1 Mac.
>>
>> Thanks Raphael.
>>
>>
>>
>> On Fri, Mar 31, 2023 at 10:34 AM Raphael Taylor-Davies
>>  wrote:
>> >
>> > Are you verifying RC1 or RC2? RC1 had the issue you mention
>> >
>> > On 31/03/2023 18:17, L. C. Hsieh wrote:
>> > > Hmm, I got an error like following on both M1 and Intel Macs.
>> > >
>> > > ```
>> > > + cargo build
>> > >
>> > > error: failed to parse manifest at
>> > >
>> `/private/var/folders/zq/2tdnn5955wvdcw7qfy6qhvk0gn/T/arrow-0.5.6.X.VnNkJPdg/apache-arrow-object-store-rs-0.5.6/Cargo.toml`
>> > >
>> > >
>> > > Caused by:
>> > >
>> > >error inheriting `edition` from workspace root manifest's
>> > > `workspace.package.edition`
>> > >
>> > >
>> > > Caused by:
>> > >
>> > >failed to find a workspace root
>> > > ```
>> > >
>> > > On Fri, Mar 31, 2023 at 8:06 AM Will Jones 
>> wrote:
>> > >> +1
>> > >> Verified on M1 MacOS.
>> > >>
>> > >>
>> > >> On Fri, Mar 31, 2023 at 4:29 AM Raphael Taylor-Davies
>> > >>  wrote:
>> > >>
>> > >>> Hi,
>> > >>>
>> > >>> I would like to propose a release of Apache Arrow Rust Object
>> > >>> Store Implementation, version 0.5.6.
>> > >>>
>> > >>> This release candidate is based on commit:
>> > >>> 234b7847ecb737e96df3f4623df7b330b34b3d1b [1]
>> > >>>
>> > >>> The proposed release tarball and signatures are hosted at [2].
>> > >>>
>> > >>> The changelog is located at [3].
>> > >>>
>> > >>> Please download, verify checksums and signatures, run the unit tests,
>> > >>> and vote on the release. There is a script [4] that automates some of
>> > >>> the verification.
>> > >>>
>> > >>> The vote will be open for at least 72 hours.
>> > >>>
>> > >>> [ ] +1 Release this as Apache Arrow Rust Object Store
>> > >>> [ ] +0
>> > >>> [ ] -1 Do not release this as Apache Arrow Rust Object Store
>> because...
>> > >>>
>> > >>> [1]:
>> > >>>
>> > >>>
>> https://github.com/apache/arrow-rs/tree/234b7847ecb737e96df3f4623df7b330b34b3d1b
>> > >>> [2]:
>> > >>>
>> > >>>
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.5.6-rc2
>> > >>> [3]:
>> > >>>
>> > >>>
>> https://github.com/apache/arrow-rs/blob/234b7847ecb737e96df3f4623df7b330b34b3d1b/object_store/CHANGELOG.md
>> > >>> [4]:
>> > >>>
>> > >>>
>> https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh
>> > >>>
>> > >>>
>>


Re: [VOTE][RUST] Release Apache Arrow Rust Object Store 0.5.6 RC2

2023-03-31 Thread Raphael Taylor-Davies

Are you verifying RC1 or RC2? RC1 had the issue you mention

On 31/03/2023 18:17, L. C. Hsieh wrote:

Hmm, I got an error like following on both M1 and Intel Macs.

```
+ cargo build

error: failed to parse manifest at
`/private/var/folders/zq/2tdnn5955wvdcw7qfy6qhvk0gn/T/arrow-0.5.6.X.VnNkJPdg/apache-arrow-object-store-rs-0.5.6/Cargo.toml`


Caused by:

   error inheriting `edition` from workspace root manifest's
`workspace.package.edition`


Caused by:

   failed to find a workspace root
```

On Fri, Mar 31, 2023 at 8:06 AM Will Jones  wrote:

+1
Verified on M1 MacOS.


On Fri, Mar 31, 2023 at 4:29 AM Raphael Taylor-Davies
 wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.5.6.

This release candidate is based on commit:
234b7847ecb737e96df3f4623df7b330b34b3d1b [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]:

https://github.com/apache/arrow-rs/tree/234b7847ecb737e96df3f4623df7b330b34b3d1b
[2]:

https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.5.6-rc2
[3]:

https://github.com/apache/arrow-rs/blob/234b7847ecb737e96df3f4623df7b330b34b3d1b/object_store/CHANGELOG.md
[4]:

https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust Object Store 0.5.6 RC2

2023-03-31 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Object
Store Implementation, version 0.5.6.

This release candidate is based on commit: 
234b7847ecb737e96df3f4623df7b330b34b3d1b [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust Object Store
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust Object Store because...

[1]: 
https://github.com/apache/arrow-rs/tree/234b7847ecb737e96df3f4623df7b330b34b3d1b
[2]: 
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-object-store-rs-0.5.6-rc2
[3]: 
https://github.com/apache/arrow-rs/blob/234b7847ecb737e96df3f4623df7b330b34b3d1b/object_store/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/object_store/dev/release/verify-release-candidate.sh




Re: OpenTelemetry + Arrow

2023-03-30 Thread Raphael Taylor-Davies

Hi Laurent,

I gave the first blog post a read and I also really like it and would be 
+1 on publishing it, nice work.


I would also like to echo Will's sentiment that getting real-world case 
studies for the more complex Arrow schemas is invaluable and will help 
drive improvements in this space, so thank you for driving this forward.


Kind Regards,

Raphael

On 30/03/2023 19:52, Will Jones wrote:

Hi Laurent,

I have read the first post and I really like it. I'd be +1 on publishing
these to the blog. I'm interested to read the second one when it's finished.

IMO the blog could use more examples of using Arrow that's not building a
data frame library / query engine, and I appreciate that this blog provides
advice for some of the trickier parts of working with complex Arrow
schemas. I think this will also provide a good concrete use case for us to
think about improving the ecosystem's support for nested data.

Best,

Will Jones

On Thu, Mar 30, 2023 at 10:56 AM Laurent Quérel 
wrote:


Hello everyone,

I was wondering if the Apache Arrow community would be interested in
featuring a two-part article series on their blog, discussing the
experiences and insights gained from an experimental version of the
OpenTelemetry protocol (OTLP) utilizing Apache Arrow. As the main author of
the OTLP Arrow specification
<https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>,

the reference implementation otlp-arrow-adapter
<https://github.com/f5/otel-arrow-adapter>, and the two articles (see
links
below), I believe that fostering collaboration between open-source projects
like these is essential and mutually beneficial.

These articles would serve as a fitting complement to the three
introductory articles that Andrew Lamb and Raphael Taylor-Davies
co-authored. They delve into the practical aspects of integrating Apache
Arrow into an existing project, as well as the process of converting a
hierarchical data model into its Arrow representation. The first article
examines various mapping techniques for aligning an existing data model
with the corresponding Arrow representation, while the second article
explores an adaptive schema technique that I implemented in the library's
final version in greater depth. Although the second article is still under
development, the core framework description is already in place.

What are your thoughts on this proposal?

Article 1:

https://docs.google.com/document/d/11lG7Go2IgKOyW-RReBRW6r7HIdV1X7lu5WrDGlW5LbQ/edit?usp=sharing

Article 2 (WIP):

https://docs.google.com/document/d/1K2CqAtF4pZjpiVts8BOcq34sOcNgozvZ9ZZw-_zTv6I/edit?usp=sharing


Best regards,

Laurent Quérel

--
Laurent Quérel



[DISCUSS][RUST]: Breaking Change to Schema Representation

2023-03-29 Thread Raphael Taylor-Davies

Hi All,

The Rust Arrow implementation stores metadata for child arrays in a 
struct called Field [1]. This encodes the name, nullability, datatype, 
and other metadata about that array. Currently the various schema 
representations, such as DataType [2] and Schema [3], store uniquely 
owned Field, as either Vec<Field> or Box<Field>. This poses several 
challenges:


1. Nested schema will encode the same Field, including separate 
allocations for the name and any metadata, in multiple redundant 
allocations at every level in the hierarchy
2. The above nested schema will then be duplicated for every instance of 
a nested array
3. Projecting or cloning schema results in large amounts of cloning of 
Field names and metadata
4. Looking up a Field by name requires a linear search through a 
Vec<Field>, which becomes O(n^2) when resolving many fields by name

5. No cheap way to compare schema for pointer equality

Together these result in inefficient CPU and memory utilisation [4] [5], 
especially for tables with wide or nested schemas.


The proposed fix [6] for this is:

- Replace Box<Field> with Arc<Field> within the various schema 
representations
- Replace Vec<Field> with an opaque Fields type that approximates 
Arc<[Arc<Field>]>, see [7] for rationale and implementation
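
For those less familiar with the Rust side, a rough dependency-free sketch of 
the shape of this change follows (hypothetical types, not the actual 
arrow-schema API): once Field sits behind an Arc, projecting or cloning a 
schema only copies pointers, and pointer equality becomes meaningful.

```rust
use std::sync::Arc;

// Sketch only: a drastically simplified Field/Schema pair.
#[derive(Debug)]
struct Field {
    name: String,
    nullable: bool,
}

type FieldRef = Arc<Field>;
type Fields = Arc<[FieldRef]>; // roughly what the opaque `Fields` type wraps

#[derive(Debug, Clone)]
struct Schema {
    fields: Fields,
}

impl Schema {
    /// Projection clones Arc pointers, never the Field names or metadata.
    fn project(&self, indices: &[usize]) -> Schema {
        let fields: Fields = indices.iter().map(|&i| self.fields[i].clone()).collect();
        Schema { fields }
    }
}

fn main() {
    let fields: Fields = vec![
        Arc::new(Field { name: "a".into(), nullable: false }),
        Arc::new(Field { name: "b".into(), nullable: true }),
    ]
    .into();
    let schema = Schema { fields };
    let projected = schema.project(&[1]);
    // The projected schema shares the same Field allocation: cheap clones and
    // meaningful pointer equality.
    assert!(Arc::ptr_eq(&schema.fields[1], &projected.fields[0]));
    println!("{projected:?}");
}
```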


As this is necessarily a breaking change with downstream implications, I 
wanted to solicit opinions on this approach, and would welcome any feedback


Kind Regards,

Raphael

[1]: https://docs.rs/arrow-schema/latest/arrow_schema/struct.Field.html
[2]: https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html
[3]: https://docs.rs/arrow-schema/latest/arrow_schema/struct.Schema.html
[4]: https://github.com/apache/arrow-datafusion/issues/5157
[5]: https://github.com/influxdata/influxdb_iox/issues/5202
[6]: https://github.com/apache/arrow-rs/issues/3955
[7]: https://github.com/apache/arrow-rs/pull/3965




[RESULT][VOTE][RUST] Release Apache Arrow Rust 36.0.0 RC1

2023-03-28 Thread Raphael Taylor-Davies

With 6 +1 votes (3 binding) the release is approved

The release is available here: 
https://dist.apache.org/repos/dist/release/arrow/arrow-rs-36.0.0


It has also been released to crates.io

Thank you to everyone who helped verify this release

Raphael

On 27/03/2023 12:05, Patrick Horan wrote:

+1 (non-binding)

Verified on Mac M1

On Mon, Mar 27, 2023, at 2:13 AM, Ian Joiner wrote:

+1 (Non-binding)

Ian

Verified on my HP / Ubuntu 22.04 / AMD64


On Fri, Mar 24, 2023 at 3:24 PM Raphael Taylor-Davies
 wrote:


Hi,

I would like to propose a release of Apache Arrow Rust Implementation,
version 36.0.0.

This release candidate is based on commit:
71ecc39f36c8f38a5fc93bc3878a607c831b2f12 [1]

The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]:

https://github.com/apache/arrow-rs/tree/71ecc39f36c8f38a5fc93bc3878a607c831b2f12
[2]:
https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-36.0.0-rc1
[3]:

https://github.com/apache/arrow-rs/blob/71ecc39f36c8f38a5fc93bc3878a607c831b2f12/CHANGELOG.md
[4]:

https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh




[VOTE][RUST] Release Apache Arrow Rust 36.0.0 RC1

2023-03-24 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 36.0.0.


This release candidate is based on commit: 
71ecc39f36c8f38a5fc93bc3878a607c831b2f12 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/71ecc39f36c8f38a5fc93bc3878a607c831b2f12

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-36.0.0-rc1
[3]: 
https://github.com/apache/arrow-rs/blob/71ecc39f36c8f38a5fc93bc3878a607c831b2f12/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh



