Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-15 Thread Andrew Lamb
Given the constraints of not changing the existing struct definitions, adding a new buffer seems like the only way forward from what I understand. It is unfortunate that each array now needs a new allocation (the buffer lengths) when passing via FFI, but I don't have any other suggestions

Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-07 Thread Weston Pace
+1 for the original proposal as well. --- The (minor) problem I see with flags is that there isn't much point to this feature if you are gating on a flag. I'm assuming the goal is what Dewey originally mentioned which is making buffer calculations easier. However, if you're gating the feature

Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-07 Thread Will Jones
I agree with the approach originally proposed by Ben. It seems like the most straightforward way to implement within the current protocol.
On Sun, Oct 29, 2023 at 4:59 PM Dewey Dunnington wrote:
> In the absence of a general solution to the C data interface omitting buffer sizes, I think the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-29 Thread Dewey Dunnington
In the absence of a general solution to the C data interface omitting buffer sizes, I think the original proposal is the best way forward...this is the first type to be added whose buffer sizes cannot be calculated without looping over every element of the array; the buffer sizes are needed to
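
To make the "looping over every element" point concrete, here is a minimal sketch in C of what a consumer would have to do without an explicit lengths buffer. It assumes the 16-byte view layout from the Arrow columnar format (a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix, a 4-byte buffer index and a 4-byte offset). The helper name `min_variadic_sizes` is hypothetical and not part of nanoarrow or Arrow C++, and the sketch ignores the validity bitmap and the array offset for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VIEW_SIZE 16  /* bytes per view */
#define INLINE_MAX 12 /* strings up to this length are stored inside the view */

/* Hypothetical helper: scan every view and record, for each variadic data
 * buffer, the smallest size that covers all strings stored in it.
 * `min_sizes` must have room for `n_data_buffers` entries.
 * Returns 0 on success, -1 if a view names a buffer that does not exist. */
static int min_variadic_sizes(const uint8_t* views, int64_t n_views,
                              int64_t n_data_buffers, int64_t* min_sizes) {
  assert(views != NULL || n_views == 0);
  for (int64_t b = 0; b < n_data_buffers; b++) min_sizes[b] = 0;

  for (int64_t i = 0; i < n_views; i++) {
    const uint8_t* v = views + i * VIEW_SIZE;
    int32_t length, buffer_index, offset;
    memcpy(&length, v, 4);
    if (length <= INLINE_MAX) continue; /* inline string: no data buffer used */
    memcpy(&buffer_index, v + 8, 4);
    memcpy(&offset, v + 12, 4);
    if (buffer_index < 0 || buffer_index >= n_data_buffers) return -1;
    int64_t end = (int64_t)offset + length;
    if (end > min_sizes[buffer_index]) min_sizes[buffer_index] = end;
  }
  return 0;
}
```

Note that the result is only a lower bound on each buffer's size, not the size the producer actually allocated, and obtaining it costs a full pass over the views.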

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-27 Thread Benjamin Kietzman
> This begs the question of what happens if a consumer receives an unknown flag value.
It seems to me that ignoring unknown flags is the primary case to consider at this point, since consumers may ignore unknown flags. Since that is the case, it seems adding any flag which would break such a

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
> This begs the question of what happens if a consumer receives an unknown flag value
That's a great point...I might be the only person who has implemented a deep copy of an ArrowSchema in C, but it does blindly pass along a schema's flag value (which in the scenario I proposed could lead to a

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
I'm afraid I've derailed the discussion into solving a bigger problem than strictly necessary. I don't think this is the time to solve the general problem of the C data interface having no way to communicate buffer sizes, particularly since there's no immediate agreement on its utility or

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 20:02, Benjamin Kietzman wrote:
> > Is this buffer lengths buffer only present if the array type is Utf8View?
> IIUC, the proposal would add the buffer lengths buffer for all types if the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Benjamin Kietzman
> Is this buffer lengths buffer only present if the array type is Utf8View?
IIUC, the proposal would add the buffer lengths buffer for all types if the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid the special case and that `n_buffers` would continue to be
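
For readers following the flag-based alternative, a rough sketch of the consumer side might look like the following. This is an illustration of the idea discussed in the thread, not an existing API: the thread names `ARROW_FLAG_BUFFER_LENGTHS` but never assigns it a value, so the bit used here (8, the next free bit after the three flags the C data interface currently defines) is an assumption, as is the helper name `buffer_lengths_or_null`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: hypothetical bit value, chosen only for illustration. */
#define ARROW_FLAG_BUFFER_LENGTHS 8

/* ArrowArray as defined by the Arrow C data interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical helper: return the buffer-lengths buffer if the producer
 * advertised one in the schema flags, otherwise NULL. Under this alternative
 * `n_buffers` is unchanged and the lengths sit one slot past the end, at
 * buffers[n_buffers]. */
static const int64_t* buffer_lengths_or_null(int64_t schema_flags,
                                             const struct ArrowArray* array) {
  assert(array != NULL && array->n_buffers >= 0);
  if ((schema_flags & ARROW_FLAG_BUFFER_LENGTHS) == 0) return NULL;
  return (const int64_t*)array->buffers[array->n_buffers];
}
```

A consumer that predates the flag would never read the extra slot, which is harmless, but one that copies `flags` blindly without copying the extra slot would advertise lengths it does not carry.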

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Weston Pace
Is this buffer lengths buffer only present if the array type is Utf8View? Or are you suggesting that other types might want to adopt this as well?
On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington wrote:
> > I expect C code to not be much longer than this :-)
>
> nanoarrow's

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 18:59, Dewey Dunnington wrote:
That sounds a bit hackish to me. Including only *some* buffer sizes in array->buffers[array->n_buffers] special-cased for only two types (or altering the number of buffers required by the IPC format vs. the number of buffers required by the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
> I expect C code to not be much longer than this :-)
nanoarrow's buffer-length-calculation and validation concepts are (perhaps inadvisably) intertwined...even with both it is not that much code (perhaps I was remembering how much time it took me to figure out which 35 lines to write :-))
>

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 17:45, Dewey Dunnington wrote:
> The lack of buffer sizes is something that has come up for me a few times working with nanoarrow (which dedicates a significant amount of code to calculating buffer sizes, which it uses to do validation and more efficient copying). By the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
On 26/10/2023 at 17:45, Dewey Dunnington wrote:
> A potential alternative might be to allow any ArrowArray to declare its buffer sizes in array->buffers[array->n_buffers], perhaps with a new flag in schema->flags to advertise that capability.
That sounds a bit hackish to me. I'd rather

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
Ben kindly explained to me offline that the need for the buffer sizes is because when Arrow C++ imports an Array it creates Buffer class wrappers around the imported pointers. Arrow C++ does not have a notion of a buffer of unknown size to my knowledge, which leaves two undesirable alternatives:

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-25 Thread Antoine Pitrou
Hello,
We might want to keep the variadic buffers at the end and instead export the buffer sizes as buffer #2? Though that's mostly stylistic...
Regards
Antoine.
On 25/10/2023 at 18:36, Benjamin Kietzman wrote:
> Hello all, The C ABI does not store buffer lengths explicitly, which

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-25 Thread Benjamin Kietzman
Worth noting: the C data interface explicitly forbids adding new members [1] to its structs, so simply adding ArrowArray::buffer_sizes is not viable.
[1] https://github.com/bkietz/arrow/blob/0afb739a16672483b69894c6fe3f5ece5cfc79d8/docs/source/format/CDataInterface.rst?plain=1#L984-L986
On Wed,

[DISCUSS][Format] C data interface for Utf8View

2023-10-25 Thread Benjamin Kietzman
Hello all, The C ABI does not store buffer lengths explicitly, which presents a problem for Utf8View since buffer lengths are not trivially extractable from other data in the array. A potential solution is to store the lengths in an extra buffer after the variadic data buffers. I've adopted this
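
As a sketch of what the proposed layout would mean for a consumer, the C fragment below reads the lengths from the last buffer. It assumes a Utf8View array exports its validity bitmap as `buffers[0]`, the views as `buffers[1]`, the variadic data buffers next, and one trailing buffer of `int64_t` lengths (one per data buffer). The accessor names are hypothetical and do not come from any Arrow library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ArrowArray as defined by the Arrow C data interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical accessor: number of variadic data buffers, i.e. everything
 * except validity, views, and the trailing lengths buffer. */
static int64_t utf8view_n_data_buffers(const struct ArrowArray* array) {
  assert(array->n_buffers >= 3);
  return array->n_buffers - 3;
}

/* Hypothetical accessor: the trailing buffer, holding one int64 length per
 * variadic data buffer. */
static const int64_t* utf8view_data_buffer_sizes(const struct ArrowArray* array) {
  assert(array->n_buffers >= 3);
  return (const int64_t*)array->buffers[array->n_buffers - 1];
}
```

With these two values an importer can wrap data buffer `i` (at `buffers[2 + i]`) with a known size without scanning the views.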