Re: [DISCUSS][Format] C data interface for Utf8View

Weston Pace Tue, 07 Nov 2023 14:46:15 -0800

+1 for the original proposal as well.

---


The (minor) problem I see with flags is that there isn't much point to this
feature if you are gating on a flag.  I'm assuming the goal is what Dewey
originally mentioned which is making buffer calculations easier.  However,
if you're gating the feature with a flag then you are either:

 * Rejecting input from producers that don't support this feature
(undesirable, better to align on one use model if we can)
 * Doing all the work anyways to handle producers that don't support the
feature

Maybe it makes sense for a long term migration (e.g. we all agree this is
something we want to move towards but we need to handle old producers in
the meantime) but we can always discuss that separately and I don't think
the benefit here is worth the confusion.

On Tue, Nov 7, 2023 at 7:46 AM Will Jones <[email protected]> wrote:

> I agree with the approach originally proposed by Ben. It seems like the
> most straightforward way to implement within the current protocol.
>
> On Sun, Oct 29, 2023 at 4:59 PM Dewey Dunnington
> <[email protected]> wrote:
>
> > In the absence of a general solution to the C data interface omitting
> > buffer sizes, I think the original proposal is the best way
> > forward...this is the first type to be added whose buffer sizes cannot
> > be calculated without looping over every element of the array; the
> > buffer sizes are needed to efficiently serialize the imported array to
> > IPC if imported by a consumer that cares about buffer sizes.
> >
> > Using a schema's flags to indicate something about a specific paired
> > array (particularly one that, if misinterpreted, would lead to a
> > crash) is a precedent that is probably not worth introducing for just
> > one type. Currently a schema is completely independent of any
> > particular ArrowArray, and I think that is a feature that is worth
> > preserving. My gripes about not having buffer sizes on the CPU to more
> > efficiently copy between devices is a concept almost certainly better
> > suited to the ArrowDeviceArray struct.
> >
> > On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman <[email protected]>
> > wrote:
> > >
> > > > This begs the question of what happens if a consumer receives an
> > unknown
> > > > flag value.
> > >
> > > It seems to me that ignoring unknown flags is the primary case to
> > consider
> > > at
> > > this point, since consumers may ignore unknown flags. Since that is the
> > > case,
> > > it seems adding any flag which would break such a consumer would be
> > > tantamount to an ABI breakage. I don't think this can be averted unless
> > all
> > > consumers are required to error out on unknown flag values.
> > >
> > > In the specific case of Utf8View it seems certain that consumers would
> > add
> > > support for the buffer sizes flag simultaneously with adding support
> for
> > the
> > > new type (since Utf8View is difficult to import otherwise), so any
> > consumer
> > > which would error out on the new flag would already be erroring out on
> an
> > > unsupported data type.
> > >
> > > > I might be the only person who has implemented
> > > > a deep copy of an ArrowSchema in C, but it does blindly pass along a
> > > > schema's flag value
> > >
> > > I think passing a schema's flag value including unknown flags is an
> > error.
> > > The ABI defines moving structures but does not define deep copying. I
> > think
> > > in order to copy deeply in terms of operations which *are* specified:
> we
> > > import then export the schema. Since this includes an export step, it
> > > should not
> > > include flags which are not supported by the exporter.
> > >
> > > On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou <[email protected]>
> > wrote:
> > >
> > > >
> > > > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > > > >> Is this buffer lengths buffer only present if the array type is
> > > > Utf8View?
> > > > >
> > > > > IIUC, the proposal would add the buffer lengths buffer for all
> types
> > if
> > > > the
> > > > > schema's
> > > > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to
> > avoid
> > > > > the special case and that `n_buffers` would continue to be
> consistent
> > > > with
> > > > > IPC.
> > > >
> > > > This begs the question of what happens if a consumer receives an
> > unknown
> > > > flag value. We haven't specified that unknown flag values should be
> > > > ignored, so a consumer could judiciously choose to error out instead
> of
> > > > potentially misinterpreting the data.
> > > >
> > > > All in all, personally I'd rather we make a special case for Utf8View
> > > > instead of adding a flag that can lead to worse interoperability.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> >
>

Re: [DISCUSS][Format] C data interface for Utf8View

Reply via email to